I want to do some "near real-time" data analysis (OLAP-style) on data stored in HDFS.
My research showed that the three frameworks in question (Impala, Drill, and Shark) report significant performance gains over Apache Hive. Does anyone have practical experience with any of them, not only regarding performance but also regarding stability?

1 Answer


Hive was not developed for real-time, in-memory processing; it is based on MapReduce and was built for offline batch processing.

These tools, on the other hand, were developed with low latency in mind. Go for them when you need to query, in (near) real time, data that is not too huge and can fit into memory. I'm not saying you can't run queries on your big data using these tools, but you would be pushing their limits if you ran real-time queries over PBs of data, IMHO.

Now, coming to your question: the goals behind developing Hive and these tools were different.

For example,

Impala is an MPP (Massively Parallel Processing) SQL query engine used to process huge amounts of data stored in a Hadoop cluster. It is open-source software, written in C++ and Java, and it provides higher performance and lower latency than other SQL engines for Hadoop.

  • With Impala, users can query data in HBase or HDFS using SQL, faster than with other SQL engines such as Hive.

  • Impala can read almost all of the file formats used in Hadoop, such as Parquet, Avro, and RCFile.

Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface as Apache Hive, which lets it offer a familiar and unified platform for both batch-oriented and real-time queries.

Unlike Hive, Impala is not based on MapReduce. It implements a distributed architecture built on daemon processes that run on the cluster's machines and are responsible for all aspects of query execution.
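As a minimal sketch of what this looks like in practice (the table name, schema, and HDFS path below are hypothetical), you can point Impala at Parquet data already sitting in HDFS and query it interactively, e.g. from impala-shell:

    -- Register existing Parquet files in HDFS as an external table.
    -- Because Impala shares the Hive metastore, Hive sees this table too.
    CREATE EXTERNAL TABLE page_views (
        user_id   BIGINT,
        url       STRING,
        view_time TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION '/data/page_views';

    -- An OLAP-style aggregation; Impala's daemons execute it directly,
    -- with no MapReduce jobs launched.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;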

Drill, by contrast, was developed not just as a Hadoop project but to provide distributed query capabilities across multiple big data platforms, including MongoDB, Cassandra, Riak, and Splunk.

Apache Drill (inspired by Google's Dremel) is the first distributed SQL query engine to offer a schema-free JSON document model like Elasticsearch or MongoDB. With Drill, data can be queried simply by mentioning the path to a NoSQL database, an Amazon S3 bucket, or a Hadoop directory in the SQL query itself. Drill discovers the schema on the fly, so users can query the data directly, unlike traditional SQL engines where the schema has to be defined first. Developers do not have to code and build applications, as with Hive, to extract data; ordinary SQL queries let the user fetch data from any supported data source in any supported format.
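For instance (the file path below is hypothetical), Drill can query a raw JSON file in place, inferring the schema as it reads:

    -- Query a JSON file directly by path; no table definition or
    -- schema declaration is needed beforehand.
    SELECT t.user_id, t.event
    FROM dfs.`/data/events/2015-01-01.json` AS t
    WHERE t.event = 'click'
    LIMIT 10;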

Shark (built on top of Apache Spark) is a distributed query engine, mainly used for Hadoop data, that provides enhanced performance and rich analytics to Hive users. It is compatible with Apache Hive, which means you can query it using the same HiveQL statements you would use with Hive. The difference is that Shark can return results up to 30 times faster than Hive.
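As a sketch (the table names here are hypothetical), the same HiveQL runs unchanged on Shark; the main addition is Shark's naming convention of caching tables in cluster memory, which is where much of the speedup comes from:

    -- Shark keeps tables whose names end in "_cached" in cluster memory.
    CREATE TABLE page_views_cached AS SELECT * FROM page_views;

    -- Plain HiveQL, now answered from the in-memory copy:
    SELECT url, COUNT(*) AS views
    FROM page_views_cached
    GROUP BY url;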
