in Big Data Hadoop & Spark by (11.5k points)

I have recently started querying large sets of CSV data stored on HDFS using Hive and Impala. As expected, I get better response times with Impala than with Hive for the queries I have used so far.

I am wondering if there are some types of queries/use cases that still need Hive and where Impala is not a good fit.

How does Impala provide faster query response compared to Hive for the same data on HDFS?

1 Answer

by (24.8k points)

Impala runs its own long-lived daemons on every node of the cluster. These daemons read the data in HDFS directly (and can benefit from data that is already cached), so they can return results quickly without launching a whole MapReduce job.

So, while processing SQL-like queries, Impala generally does not write intermediate results to disk. Instead, it does the full SQL processing in memory, which is a large part of why its daemons can return results so quickly.

Hive, on the other hand, traditionally runs on top of the MapReduce architecture: each query is turned into one or more MapReduce jobs, which adds an extra layer to go through and writes intermediate results to disk between stages. This is the main reason Impala has an edge over Hive in processing speed.
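As a rough illustration of the difference between staged and in-memory execution, here is a toy Python sketch. It is not how Hive or Impala are actually implemented; the function names (`staged_sum`, `pipelined_sum`) and the sample rows are made up for the example. Both functions compute the same filtered sum per key, but the first one writes its intermediate result to a file between stages, while the second streams rows straight from the filter into the aggregation.

```python
import csv
import os
import tempfile

# Hypothetical sample data: (key, value) pairs standing in for CSV rows.
ROWS = [("a", 3), ("b", 5), ("a", 7), ("c", 1), ("b", 2)]


def staged_sum(rows, min_value):
    """MapReduce-style toy: stage 1 writes to disk, stage 2 reads it back."""
    with tempfile.TemporaryDirectory() as tmp:
        stage1_path = os.path.join(tmp, "filtered.csv")

        # Stage 1: filter the rows and materialize the result on disk.
        with open(stage1_path, "w", newline="") as f:
            writer = csv.writer(f)
            for key, value in rows:
                if value >= min_value:
                    writer.writerow([key, value])

        # Stage 2: read the intermediate file back and aggregate per key.
        totals = {}
        with open(stage1_path, newline="") as f:
            for key, value in csv.reader(f):
                totals[key] = totals.get(key, 0) + int(value)
        return totals


def pipelined_sum(rows, min_value):
    """In-memory toy: rows flow from the filter directly into the aggregation."""
    filtered = ((key, value) for key, value in rows if value >= min_value)
    totals = {}
    for key, value in filtered:
        totals[key] = totals.get(key, 0) + value
    return totals


if __name__ == "__main__":
    print(staged_sum(ROWS, 2))
    print(pipelined_sum(ROWS, 2))
```

Both calls give the same answer; the staged version simply pays for an extra write and read of the intermediate data, which is the kind of overhead that adds up across many stages and many nodes.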

But Impala is less suited to heavy, long-running analysis of very large datasets. It is meant for running queries directly against data in HDFS and Apache HBase, since it does not require the data to be transformed first.

It can be a great tool for small ad-hoc queries, but when it comes to data-intensive tasks where you need to analyze and process a large dataset, Hive is your guy. Hive greatly simplifies data processing at scale.


 


...