Pig and Hive are open source platforms that largely serve the same purpose: they ease the complexity of writing difficult Java-based MapReduce programs. Hive acts as a data warehouse that uses MapReduce to analyze data stored on HDFS. It provides a query language called HiveQL that closely resembles standard Structured Query Language (SQL). Hive originated at Facebook, where it was created for analysts who had strong SQL skills but few Java programming skills and who needed to run queries on the large volumes of data that Facebook stored in HDFS. Apache Pig and Hive are both considered part of the topmost layer of the Hadoop stack, and each provides a higher-level language for using Hadoop's MapReduce library.
Hive offers a query language based on standard SQL rather than requiring the hand development of map and reduce tasks. Hive takes HiveQL statements and automatically transforms each query into one or more MapReduce jobs. It then runs the resulting MapReduce program and returns the output to the user. Whereas Hadoop streaming only shortens the mandatory code, compile, and submit cycle, Hive removes it completely and requires only the composition of HiveQL statements.
This interface to Hadoop not only reduces the time required to produce results from data analysis but also significantly broadens the range of people for whom Hadoop and MapReduce are useful.
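As a rough illustration of a HiveQL statement standing in for a hand-written MapReduce job, the sketch below aggregates rows of the employee table used elsewhere in this article. The department column is an assumption made for the example; it is not defined anywhere in the text.

```sql
-- Count employees per department. Hive compiles this statement into
-- one or more MapReduce jobs; no Java code is written by the user.
-- NOTE: the "department" column is hypothetical.
SELECT department, COUNT(*) AS headcount
FROM employee
GROUP BY department;

-- EXPLAIN prints the execution plan (the stages Hive would run)
-- instead of executing the query.
EXPLAIN
SELECT department, COUNT(*) AS headcount
FROM employee
GROUP BY department;
```

Running the EXPLAIN form first is a convenient way to see how a query is broken into stages before submitting it to the cluster.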
What makes Hive Hadoop popular?
- It provides users with powerful statistics functions.
- It is similar to SQL, so its concepts are easy to understand.
- It can be combined with HBase to query the data stored in HBase. Pig has no equivalent query integration; its HBaseStorage() function is mainly used for loading data from HBase.
- It is supported by Hue.
- It is used by a variety of organizations, such as CNET, Last.fm, Facebook, and Digg.
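The HBase integration mentioned in the list above is usually set up by declaring a Hive table backed by Hive's HBase storage handler. The following is a minimal sketch: the table name hbase_employee, the column family info, and the underlying HBase table name employee are all hypothetical, and the Hive HBase handler libraries are assumed to be available to the session.

```sql
-- Map an existing HBase table onto a Hive table so it can be queried
-- with HiveQL. All table and column names here are illustrative.
CREATE EXTERNAL TABLE hbase_employee (
  id   INT,
  name STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- :key is the HBase row key; info:name is column family:qualifier
  "hbase.columns.mapping" = ":key,info:name"
)
TBLPROPERTIES ("hbase.table.name" = "employee");

-- Once mapped, the HBase data is queried like any other Hive table.
SELECT id, name FROM hbase_employee WHERE id = 42;
```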
Difference between Hive and Pig
|Hive||Pig|
|Used for data analysis||Used for data and programs|
|Works with structured data||Works with semi-structured data|
|Uses HiveQL||Uses Pig Latin|
|Used for creating reports||Used for programming|
|Works on the server side||Works on the client side|
|Hive supports Avro through the AvroSerDe||Pig supports Avro through AvroStorage|
hive> SELECT * FROM employee;
hive> DESCRIBE employee;
- Apache Hive is data warehouse software that lets you read, write, and manage large volumes of data stored in a distributed environment using SQL. Structure can be projected onto data that is already in storage. Users can connect to Hive using a JDBC driver or a command line tool.
- Hive is an open source system. Use it for querying and analyzing large datasets stored in Hadoop files. It is similar to SQL programming. The current version of Hive is 0.13.1.
- Hive supports ACID transactions: Atomicity, Consistency, Isolation, and Durability. ACID semantics are provided at the row level for Insert, Update, and Delete operations.
- Hive is not a complete database. The design constraints of Hadoop and HDFS restrict what Hive can do.
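To make the row-level ACID point above more concrete, here is a hedged sketch of a transactional table. It assumes Hive 0.14 or later (where row-level INSERT, UPDATE, and DELETE are available), a metastore configured for transactions, and ORC storage. The table employee_txn and its columns are hypothetical, and older releases may need further settings beyond those shown.

```sql
-- Session settings commonly required for transactional tables
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical table: transactional tables are stored as ORC, and in
-- older releases they must also be bucketed.
CREATE TABLE employee_txn (
  id     INT,
  name   STRING,
  salary DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level operations
INSERT INTO TABLE employee_txn VALUES (1, 'Asha', 50000.0);
UPDATE employee_txn SET salary = 55000.0 WHERE id = 1;
DELETE FROM employee_txn WHERE id = 1;
```

Even with these operations available, each statement still runs as a batch job, so this is not a substitute for an OLTP database.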
Hive is most suitable for the following kinds of data warehouse applications:
- Analyzing static data
- Workloads where fast response times are not required
- Datasets that do not change rapidly
Hive doesn’t provide the fundamental features required for OLTP (Online Transaction Processing). It is best suited to data warehouse applications over large data sets.
Hive has two types of tables:
- Managed table
- External table
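The practical difference between the two table types shows up when a table is dropped: Hive owns the data of a managed table, whereas an external table only points at data that lives elsewhere. The sketch below uses hypothetical table names and a hypothetical HDFS path.

```sql
-- Managed table: Hive stores the data under its warehouse directory
-- and controls its lifecycle.
CREATE TABLE employee_managed (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive records only the metadata; the files stay at
-- the given location (the path here is illustrative).
CREATE EXTERNAL TABLE employee_external (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employee';

DROP TABLE employee_managed;   -- removes the metadata and the data
DROP TABLE employee_external;  -- removes the metadata only
```

An external table is usually the safer choice when other tools also read or write the same files.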
We can change settings within a Hive session using the SET command. It is used to change Hive job settings for a query so that it behaves as intended.
Example: the following command ensures that buckets are populated according to the table definition.
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with just the property name. SET on its own lists all the properties whose values have been set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
That list does not include the Hadoop defaults. To see them as well, use the following:
hive> SET -v;
It lists all the properties in the system, including the Hadoop defaults.
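As a sketch of where the hive.enforce.bucketing setting comes into play, the example below defines a bucketed table and populates it from the employee table. The table employee_bucketed, its columns, and the bucket count are assumptions made for illustration. Newer Hive releases always enforce bucketing, so the setting may be unnecessary there.

```sql
-- Hypothetical bucketed table: rows are distributed into 4 buckets
-- according to the hash of the id column.
CREATE TABLE employee_bucketed (
  id         INT,
  name       STRING,
  department STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Ask Hive to honour the bucket definition when writing the data
SET hive.enforce.bucketing=true;

-- Populate the table; the source columns are assumed to exist
INSERT OVERWRITE TABLE employee_bucketed
SELECT id, name, department
FROM employee;
```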