Hadoop Hive
Apache Hive is an open-source data warehouse system that has been built on top of Hadoop. You can use Hive for analyzing and querying large datasets that are stored in Hadoop files. Processing structured and semi-structured data can be done by using Hive.
Let’s look at the agenda for this section first:
Now, let’s start with this Apache Hive tutorial.
What is Hive in Hadoop?
Don’t you think writing MapReduce jobs is tedious work? Well, with Hadoop Hive, you can just go ahead and submit SQL queries and perform MapReduce jobs. So, if you are comfortable with SQL, then Hive is the right tool for you as you will be able to work on MapReduce tasks efficiently. Similar to Pig, Hive has its own language, called HiveQL (HQL). It is similar to SQL. HQL translates SQL-like queries into MapReduce jobs, like what Pig Latin does. The best part is that you don’t need to learn Java to work with Hadoop Hive.
Watch this video on HIVE by Intellipaat:
Hadoop Hive runs on our system and converts SQL queries into a set of jobs for execution on a Hadoop cluster. Basically, Hadoop Hive classifies data into tables providing a method for attaching the structure to data stores in HDFS.
Facebook uses Hive to address its various requirements, like running thousands of tasks on the cluster, along with thousands of users for a huge variety of applications. Since Facebook has a huge amount of raw data, i.e., 2 PB, Hadoop Hive is used for storing this voluminous data. It regularly loads around 15 TB of data on a daily basis. Now, many companies, such as IBM, Amazon, Yahoo!, and many others, are also using and developing Hive.
Why do we need Hadoop Hive?
Let’s now talk about the need for Hive. To understand that, let’s see what Facebook did with its big data.
Basically, there were a lot of challenges faced by Facebook before they had finally implemented Apache Hive. One of those challenges was the size of data that has been generated on a daily basis. Traditional databases, such as RDBMS and SQL, weren’t able to handle the pressure of such a huge amount of data. Because of this, Facebook was looking for better options. It started using MapReduce in the beginning to overcome this problem. But, it was very difficult to work with MapReduce as it needed mandatory programming expertise in SQL. Later on, Facebook realized that Hadoop Hive had the potential to actually overcome the challenges it faced.
Apache Hive helps developers get away with writing complex MapReduce tasks. Hadoop Hive is extremely fast, scalable, and extensible. Since Apache Hive is comparable to SQL, it is easy for the SQL developers as well to implement Hive queries.
Additionally, the Hive is capable of decreasing the complexity of MapReduce by providing an interface wherein a user can submit various SQL queries. So, technically, you don’t need to learn Java for working with Apache Hive.
Hive Architecture
Let’s now talk about the Hadoop Hive architecture and the major working force behind Apache Hive.
The components of Apache Hive are as follows:
-
- Driver: The driver acts as a controller receiving HiveQL statements. It begins the execution of statements by creating sessions. It is responsible for monitoring the life cycle and the progress of the execution. Along with that, it also saves the important metadata that has been generated during the execution of the HiveQL statement.
- Metastore: A metastore stores metadata of all tables. Since Hive includes partition metadata, it helps the driver in tracking the progress of various datasets that have been distributed across a cluster, hence keeping track of data. In a metastore, the data is saved in an RDBMS format.
- Compiler: The compiler performs the compilation of a HiveQL query. It transforms the query into an execution plan that contains tasks.
- Optimizer: An optimizer performs many transformations on the execution plan for providing an optimized DAG. An optimizer aggregates several transformations together like converting a pipeline of joins to a single join. It can also split the tasks for providing better performance.
- Executor: After the processes of compilation and optimization are completed, the execution of the task is done by the executor. It is responsible for pipelining the tasks.
Get 100% Hike!
Master Most in Demand Skills Now!
Differences Between Hive and Pig
Hive |
Pig |
Used for data analysis |
Pig is used for data and programs |
Used for processing the structured data |
It is used for the semi-structured data |
Has HiveQL |
Has Pig Latin |
Used for creating reports |
Used for programming |
Works on the server side |
Works on the client side |
Does not support Avro |
Supports Avro |
Features of Apache Hive
Let’s now look at the features of Apache Hive:
- Hive provides easy data summarization and analysis and query support.
- Hive supports external tables, making it feasible to process data without having to store it into HDFS.
- Since Hadoop has a low-level interface, Hive fits in here properly.
- Hive supports the partitioning of data at the data level for better performance.
- There is a rule-based optimizer present in Hive responsible for optimizing logical plans.
- Hadoop can process external data using Hive.
Limitations of Apache Hive
Though Hive is a progressive tool, it has some limitations as well.
- Apache Hive doesn’t offer any real-time queries.
- Online transaction processing is not well-supported by Apache Hive.
- There can be a delay while performing Hive queries.
That is all for this Apache Hive tutorial. In this section about Apache Hive, you learned about Hive that is present on top of Hadoop and is used for data analysis. It uses a language called HiveQL that translates SQL-like queries into relevant MapReduce jobs. In the upcoming section of this Hadoop tutorial, you will be learning about Hadoop clusters.