Difference Between Pig, Hive, and Sqoop

This part of the tutorial introduces Hadoop components such as Pig, Hive, and Sqoop, covering the details of each component along with their functions, features, and other important aspects.

Apache Pig

Apache Pig is a platform for analyzing large data sets. It provides a high-level language for writing data analysis programs, along with the infrastructure for evaluating those programs. The advantage of Pig is that it easily handles parallel processing of very large amounts of data. Programming on this platform is done in the textual language Pig Latin.

Pig Latin features

  • Simple programming: it is easy to code, execute, and manage
  • Better optimization: the system can automatically optimize the execution
  • Extensibility: it can be extended to perform highly specific processing tasks

Pig can be used for the following purposes:

  • ETL data pipelines
  • Research on raw data
  • Iterative processing

The scalar data types in Pig are int, float, double, long, chararray, and bytearray. The complex data types in Pig are map, tuple, and bag.

Map: A set of key-value pairs, where the key is a chararray and the value can be any Pig data type, including a complex type.

Example: ['city'#'bang', 'pin'#560001]

Here, city and pin are keys mapping to the values bang and 560001.

Tuple: It is a fixed-length collection of fields. A tuple has multiple fields, and these fields are ordered.

Bag: It is a collection of tuples, but it is unordered; the tuples in a bag are separated by commas.

Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}

LOAD function:

The LOAD function loads data from the file system. It is a relational operator. The first step in a Pig data-flow script is to specify the input, which is done with the LOAD keyword.

The LOAD syntax is

LOAD 'mydata' [USING function] [AS schema];

Example: A = LOAD 'intellipaat.txt';

A = LOAD 'intellipaat.txt' USING PigStorage('\t');

The relational operators in Pig are:

foreach, order by, filter, group, distinct, join, and limit.

foreach: It takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.

A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);

B = foreach A generate emp_name, emp_id;

filter: It contains a predicate and lets us select which records are retained in the data pipeline.

Syntax: alias = FILTER alias BY expression;

Here, alias is the name of the relation, BY is a required keyword, and the expression is a Boolean condition.

Example: M = FILTER N BY F5 == 4;
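
Putting LOAD, FILTER, and foreach together, here is a minimal Pig Latin sketch of a data pipeline. The file name employees.txt, the delimiter, and the threshold 1000 are hypothetical, chosen only to match the schema used above:

-- Load tab-separated employee records (hypothetical file and schema)
A = LOAD 'employees.txt' USING PigStorage('\t') AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray);
-- Keep only the records whose emp_id is greater than 1000
B = FILTER A BY emp_id > 1000;
-- Project just the name and id fields
C = FOREACH B GENERATE emp_name, emp_id;
-- Group by name and count the records in each group
D = GROUP C BY emp_name;
E = FOREACH D GENERATE group, COUNT(C);
DUMP E;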

Apache Sqoop

Apache Sqoop is a tool that is extensively used to transfer large amounts of data between Hadoop and relational database servers. Sqoop can import various types of data from Oracle, MySQL, and other databases.

Important Sqoop control commands to import RDBMS data

  • Append: appends data to an existing dataset in HDFS. --append
  • Columns: specifies the columns to import from the table. --columns <col,col,…>
  • Where: specifies the WHERE clause to use during import. --where <where clause>

The common large object types in Sqoop are BLOB and CLOB. If an object is less than 16 MB, it is stored inline with the rest of the data. Larger objects are temporarily stored in a subdirectory named _lobs, and the data is then materialized in memory for processing. If the LOB limit is set to zero (0), large objects are always stored in external storage.
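
As a hedged sketch, the inline storage threshold can be changed with the --inline-lob-limit option; the connection string and table name reuse the import example shown below:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --inline-lob-limit 0

Setting the limit to 0 forces all large objects into external storage, as described above.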

Sqoop can export and import data from a table based on a WHERE clause. The relevant options are:

--columns <col1,col2,…>

--where <condition>

--query <SQL query>

Examples:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"

sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"
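
Sqoop export works in the opposite direction, from HDFS back into the database. A minimal hedged sketch follows; the HDFS directory /user/hadoop/intellipaat_emp is hypothetical, and the connection details reuse the examples above:

sqoop export --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --export-dir /user/hadoop/intellipaat_emp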

Sqoop supports importing data into services such as HDFS, Hive, and HBase.

Sqoop needs a connector to connect to the different relational databases. Almost all database vendors make a JDBC connector available that is specific to their database, so Sqoop also needs the JDBC driver of that database for interaction. A JDBC driver alone is not enough: Sqoop needs both the JDBC driver and a connector to connect to a database.

Sqoop command to control the number of mappers

We can control the number of mappers by passing the --num-mappers parameter in the sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase it gradually, because setting the number of mappers too high can degrade performance on the database side.

Syntax:   -m, --num-mappers <n>
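
For example, a hedged sketch that runs the import with four parallel map tasks (the connection details reuse the earlier example):

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --num-mappers 4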

Sqoop command to show all the databases in MySQL server

$ sqoop list-databases --connect jdbc:mysql://database.test.com/

Sqoop metastore

It is a tool that hosts a shared metadata repository. Multiple users and remote users can define and execute saved jobs stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
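
As a hedged sketch, a saved job can be defined against a shared metastore and executed later; the metastore host and port, the job name, and the connection details below are hypothetical:

sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop --create intellipaat_import -- import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP

sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop --exec intellipaat_import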

The purpose of sqoop-merge is:

This tool combines two datasets: entries in the newer dataset overwrite entries in the older dataset, preserving only the newest version of the records across both datasets.
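
A hedged sketch of sqoop-merge; the directory names, generated class and jar, and key column are hypothetical:

sqoop merge --new-data /user/hadoop/emp_new --onto /user/hadoop/emp_old --target-dir /user/hadoop/emp_merged --jar-file INTELLIPAAT_EMP.jar --class-name INTELLIPAAT_EMP --merge-key emp_id

Here --merge-key names the key column used to decide which records in the older dataset are replaced.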

Apache Hive

Apache Hive is data warehouse software that lets you read, write, and manage huge volumes of data stored in a distributed environment using SQL. It is possible to project structure onto data that is already in storage. Users can connect to Hive using a JDBC driver or a command-line tool.

Hive is open source. We can use Hive to analyze and query large datasets stored in Hadoop files, using a language similar to SQL. At the time of writing, the current version of Hive is 0.13.1.

Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID guarantees are provided at the row level through INSERT, UPDATE, and DELETE operations.
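
As a hedged HiveQL sketch, a table must typically be stored as ORC, bucketed, and marked transactional before row-level UPDATE and DELETE work; the table name, columns, and bucket count below are hypothetical and assume a Hive version and configuration with transactions enabled:

-- Transactional table definition (hypothetical schema)
CREATE TABLE intellipaat_emp_acid (emp_id INT, emp_name STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level ACID operations
UPDATE intellipaat_emp_acid SET emp_name = 'John' WHERE emp_id = 101;
DELETE FROM intellipaat_emp_acid WHERE emp_id = 102;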

Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put restrictions on what Hive can do.

Hive is most suitable for the following data warehouse applications:

  • Analyzing relatively static data
  • Applications that do not require fast response times
  • Data that does not change rapidly

Hive does not provide the fundamental features required for OLTP (Online Transaction Processing). It is suitable for data warehouse applications over large data sets.

The two types of tables in Hive (example definitions follow the list)

  • Managed table
  • External table
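
A minimal HiveQL sketch of each table type; the table names, columns, delimiter, and HDFS location are hypothetical:

-- Managed table: Hive owns the data; DROP TABLE removes both metadata and data
CREATE TABLE managed_emp (emp_id INT, emp_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: Hive tracks only the metadata; DROP TABLE leaves the files in place
CREATE EXTERNAL TABLE external_emp (emp_id INT, emp_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hadoop/external_emp';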

We can change settings within a Hive session using the SET command. It is useful for changing Hive job settings for a particular query.

Example: The following command ensures that buckets are populated according to the table definition:

hive> SET hive.enforce.bucketing=true;

We can see the current value of any property by using SET with the property name. SET on its own will list all the properties with the values set by Hive.

hive> SET hive.enforce.bucketing;

hive.enforce.bucketing=true

This list does not include the Hadoop defaults, so we should instead use

SET -v

which lists all the properties in the system, including the Hadoop defaults.

Add a new node with the following steps:

1) Take a new system and create a new username and password.

2) Install SSH and set up SSH connections with the master node.

3) Add the SSH public key (id_rsa.pub) to the authorized_keys file.

4) Add the new DataNode's hostname, IP address, and other details to /etc/hosts and to the slaves file:

192.168.1.102 slave3.in slave3

5) Start the DataNode on the new node.

6) Log in to the new node, for example with su hadoop or ssh -X [email protected]

7) Start HDFS on the newly added slave node by using the following command:

./bin/hadoop-daemon.sh start datanode

8) Check the output of the jps command on the new node.
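
A hedged consolidation of the key commands from these steps, assuming the hadoop user and the slave3.in host from the example above:

# On the master node: copy the SSH public key to the new node
ssh-copy-id hadoop@slave3.in
# On the new node: start the DataNode daemon and verify it with jps
ssh hadoop@slave3.in
./bin/hadoop-daemon.sh start datanode
jps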
