Apache Hive is an open-source data warehouse system that has been built on top of Hadoop. It is used majorly for analyzing and querying large datasets that have been stored in Hadoop files. Hadoop Hive is used for processing structured and semi-structured data.
Let’s look at the agenda for this section first:
- Hadoop Hive
- What is Hive in Hadoop?
- Why do we need Hadoop Hive?
- Hive Architecture
- Differences Between Hive and Pig
- Features of Apache Hive
- Limitations of Apache Hive
Now, let’s start with this Apache Hive tutorial.
What is Hive in Hadoop?
Don’t you think writing MapReduce jobs is tedious work? Well, with Hadoop Hive, we can just go ahead and submit SQL queries and perform MapReduce jobs. So, if we are comfortable with SQL, then Hive is the right tool for us as we will be able to work on MapReduce tasks efficiently. Similar to Pig, Hive has its own language, called HiveQL (HQL). It is similar to SQL. HQL translates SQL-like queries into MapReduce jobs, like what Pig Latin does. The best part is that we don’t need to learn Java to work with Hadoop Hive.
Hadoop Hive runs on our system and converts SQL queries into a set of jobs for execution on a Hadoop cluster. Basically, Hadoop Hive classifies data into tables providing a method for attaching the structure to data stores in HDFS.
Facebook uses Hive to address its various requirements, like running thousands of tasks on the cluster, along with thousands of users for a huge variety of applications. Since Facebook has a huge amount of raw data, around 2 PB, Hadoop Hive is used there for storing this voluminous data. It regularly loads around 15 TB of data on a daily basis. Now, many companies, such as IBM, Amazon, Yahoo!, and many others, are also using and developing Hive.
Why do we need Hadoop Hive?
Let’s now talk about the need for Hive. To understand that, let’s see what Facebook did with its big data.
Basically, there were a lot of challenges faced by Facebook before they had finally implemented Apache Hive. One of those challenges was the size of data that has been generated on a daily basis. Traditional databases, such as RDBMS and SQL, weren’t able to handle the pressure of such a huge amount of data. Because of this, Facebook was looking for better options. It started using MapReduce in the beginning to overcome this problem. But, it was very difficult to work with MapReduce as it needed mandatory programming expertise in SQL. Later on, Facebook realized that Hadoop Hive had the potential to actually overcome the challenges it faced.
Apache Hive helps developers get away with writing complex MapReduce tasks. Hadoop Hive is extremely fast, scalable, and extensible. Since Apache Hive is comparable to SQL, it is easy for the SQL developers as well to implement Hive queries.
Additionally, the Hive is capable of decreasing the complexity of MapReduce by providing an interface wherein a user can submit various SQL queries. So, technically, we don’t need to learn Java for working with Apache Hive.
Watch this video on HIVE by Intellipaat:
Let’s now talk about the Hadoop Hive architecture and the major working force behind Apache Hive.
The components of Apache Hive are as follows:
- Driver: The driver acts as a controller receiving HiveQL statements. It begins the execution of statements by creating sessions. It is responsible for monitoring the life cycle and the progress of the execution. Along with that, it also saves the important metadata that has been generated during the execution of the HiveQL statement.
- Metastore: A metastore stores metadata of all tables. Since Hive includes partition metadata, it helps the driver in tracking the progress of various datasets that have been distributed across a cluster, hence keeping track of data. In a metastore, the data is saved in an RDBMS format.
- Compiler: The compiler performs the compilation of a HiveQL query. It transforms the query into an execution plan that contains tasks.
- Optimizer: An optimizer performs many transformations on the execution plan for providing an optimized DAG. An optimizer aggregates several transformations together like converting a pipeline of joins to a single join. It can also split the tasks for providing better performance.
- Executor: After the processes of compilation and optimization are completed, the execution of the task is done by the executor. It is responsible for pipelining the tasks.
Differences Between Hive and Pig
|Used for data analysis||Used for data and programs|
|Used for processing the structured data||Used for the semi-structured data|
|Has HiveQL||Has Pig Latin|
|Used for creating reports||Used for programming|
|Works on the server side||Works on the client side|
|Does not support Avro||Supports Avro|
Features of Apache Hive
Let’s now look at the features of Apache Hive:
- Easy data summarization and analysis and query support are provided by Hive.
- External tables are supported by Hive, making it feasible to process data without having to store it into HDFS.
- Since Hadoop has a low-level interface, Hive fits in here properly.
- Hive supports the partitioning of data at the data level for better performance.
- There is a rule-based optimizer present in Hive responsible for optimizing logical plans.
- The structured data can be processed in Hadoop using Hive.
- In Hive, ad-hoc queries can also be run for data analysis.
Limitations of Apache Hive
Though Hive is a progressive tool, it has some limitations as well.
- Apache Hive doesn’t offer any real-time queries.
- Online transaction processing is not well-supported by Apache Hive.
- Hive queries can sometimes be delayed.
That is all for this Apache Hive tutorial. In this section about Apache Hive, we learned that Hive is built on top of Hadoop to be used for data analysis. It uses a language called HiveQL that translates SQL-like queries into relevant MapReduce jobs. In the upcoming section of this Hadoop tutorial, we will be learning about Hadoop clusters.