
Difference Between Pig, Hive, and Sqoop

Apache Pig

Apache Pig is a platform for managing large data sets; it provides a high-level language for analyzing the data along with the infrastructure to evaluate those programs. The advantage of Pig programming is that it easily handles parallel processing of very large amounts of data. Programming on this platform is done using the textual language Pig Latin.

Pig Latin features

  • Simple programming: It is easy to code, execute, and manage
  • Better optimization: The system can automatically optimize the execution
  • Extensibility: It can be extended to achieve highly specific processing tasks


Pig can be used for the following purposes:

  • ETL data pipeline
  • Research on raw data
  • Iterative processing.

The scalar data types in Pig are int, float, double, long, chararray, and bytearray. The complex data types in Pig are map, tuple, and bag.

Map: A set of key-value pairs where the key has the data type chararray and the value can be any Pig data type, including a complex type.

Example- ['city'#'bang', 'pin'#560001]

Here, city and pin are the keys, mapping to the values 'bang' and 560001.

Tuple: It is a fixed-length, ordered collection of fields; a tuple can have multiple fields of different data types.
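
Example: ('Bangalore', 560001) is a tuple with a city name and its pin code.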

Bag: It is an unordered collection of tuples; the tuples in a bag are separated by commas.

Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}

LOAD function:

The LOAD function loads data from the file system; it is a relational operator. The first step in a Pig data-flow script is to specify the input, which is done with the LOAD keyword.

The LOAD syntax is

LOAD 'mydata' [USING function] [AS schema];

Example- A = LOAD 'intellipaat.txt';

A = LOAD 'intellipaat.txt' USING PigStorage('\t');
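
The optional AS clause attaches a schema while loading. A minimal sketch, with an illustrative file name and field names:

A = LOAD 'intellipaat_emp.txt' USING PigStorage('\t') AS (emp_name:chararray, emp_id:long);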

The relational operators in Pig:

foreach, order by, filters, group, distinct, join, limit.

foreach: It takes a set of expressions and applies them to every record in the data pipeline, passing the results to the next operator.

A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);

B = foreach A generate emp_name, emp_id;

filter: It contains a predicate and allows us to select which records are retained in the data pipeline.

Syntax: alias = FILTER alias BY expression;

Here, alias is the name of the relation, BY is the required keyword, and expression is a Boolean expression.

Example: M = FILTER N BY F5 == 4;
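
The remaining operators from the list above follow the same pattern. A minimal sketch using the relations A and B defined earlier (the field choices are illustrative):

C = GROUP A BY emp_add;
D = ORDER A BY emp_id;
E = DISTINCT A;
F = JOIN A BY emp_id, B BY emp_id;
G = LIMIT A 10;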


Apache Sqoop

Apache Sqoop is a tool that is extensively used to transfer large amounts of data from Hadoop to the relational database servers and vice-versa. Sqoop can be used to import the various types of data from Oracle, MySQL, and other databases.

Important Sqoop control commands to import RDBMS data

  • Append: Append data to an existing dataset in HDFS. --append
  • Columns: Columns to import from the table. --columns <col,col……>
  • Where: WHERE clause to use during import. --where <where clause>
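
For example, these arguments can be combined in a single import (the connection string, table, and column names are illustrative):

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --columns "emp_id,emp_name" --where "emp_id > 100" --append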

The common large-object types in Sqoop are BLOB and CLOB. If the object is less than 16 MB, it is stored inline with the rest of the data and materialized in memory for processing. Larger objects are stored in files in the _lobs subdirectory of the import target directory. If we set the LOB limit to zero (0), all large objects are placed in external storage.
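
The inline threshold can be tuned with the --inline-lob-limit argument (in bytes); the connection details below are illustrative:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --inline-lob-limit 0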

Sqoop allows us to export and import data from a database table based on a WHERE clause. The syntax is:

--columns <col1,col2……>

--where <condition>

--query <SQL query>

Example:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"

sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"

Sqoop supports importing data into HDFS, Hive, HBase, and Accumulo.

Sqoop needs a connector to connect to different relational databases. Almost all database vendors provide a JDBC connector specific to their database, and Sqoop needs the JDBC driver of that database for interaction. A JDBC driver alone is not enough: Sqoop needs both the JDBC driver and a connector to connect to a database.

Sqoop command to control the number of mappers

We can control the number of mappers by passing the --num-mappers argument (or -m) in the sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase it gradually, because choosing too many mappers can degrade performance on the database side.

Syntax:   -m, --num-mappers <n>
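
For example, to run the import with four parallel map tasks (the connection details are illustrative):

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP -m 4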

Sqoop command to show all the databases in MySQL server

$ sqoop list-databases --connect jdbc:mysql://database.test.com/

Sqoop metastore

It is a tool that hosts a shared metadata repository. Multiple users and remote users can define and execute saved jobs stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
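
For example, a saved job can be created against a shared metastore roughly like this (the metastore host and job name are illustrative):

sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop --create intellipaat_import -- import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP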

The purpose of sqoop-merge is:

This tool combines two datasets, where entries in the newer dataset overwrite entries of the older dataset, preserving only the newest version of each record.
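
A typical invocation looks roughly like this (the directory names, record class, and merge key are illustrative):

sqoop merge --new-data newer --onto older --target-dir merged --jar-file datatypes.jar --class-name Foo --merge-key id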


Apache Hive

The Apache Hive is a data warehouse software that lets you read, write and manage huge volumes of datasets that are stored in a distributed environment using SQL. It is possible to project structure onto data that is in storage. Users can connect to Hive using a JDBC driver and a command-line tool.

Hive is an open-source system. We can use Hive for analyzing and querying large datasets stored in Hadoop files. Its query language is similar to SQL. The present version of Hive is 0.13.1.

Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions are provided at the row level, so Hive supports INSERT, DELETE, and UPDATE operations.
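
A minimal sketch of a transactional table, assuming an ACID-enabled Hive configuration (the table and column names are illustrative):

CREATE TABLE intellipaat_emp_txn (emp_id INT, emp_name STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE intellipaat_emp_txn SET emp_name = 'abc' WHERE emp_id = 1;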

Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put restrictions on what Hive can do.

Hive is most suitable for the following data warehouse applications:

  • Analyzing relatively static data
  • Applications where fast response times are not required
  • Data that does not change rapidly

Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing). Hive is suitable for data warehouse applications on large data sets.


The two types of tables in Hive

  • Managed table
  • External table
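
A minimal HiveQL sketch of the two, with illustrative names and an assumed HDFS path:

-- Managed table: Hive owns both the metadata and the data
CREATE TABLE emp_managed (emp_id INT, emp_name STRING);

-- External table: Hive owns only the metadata; dropping it leaves the files in place
CREATE EXTERNAL TABLE emp_external (emp_id INT, emp_name STRING)
LOCATION '/user/hive/external/emp';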

We can change the settings within a Hive session using the SET command. It helps to change Hive job settings for a particular query.

Example: The following command ensures that buckets are populated according to the table definition.

hive> SET hive.enforce.bucketing=true;

We can see the current value of any property by using SET with the property name. SET on its own will list all the properties with their values set by Hive.

hive> SET hive.enforce.bucketing;

hive.enforce.bucketing=true

This list will not include the Hadoop defaults, so we should use the following instead:

SET -v

It will list all the properties including the Hadoop defaults in the system.

Add a new node with the following steps

1) Take a new system and create a new username and password

2) Install SSH and set up SSH connections with the master node

3) Add the SSH public key (id_rsa.pub) to the authorized_keys file

4) Add the new DataNode's hostname, IP address, and other details to /etc/hosts and the slaves file, for example:

192.168.1.102 slave3.in slave3

5) Start the DataNode on the new node

6) Log in to the new node, for example with su hadoop or via ssh -X <user>@<new-node-ip>

7) Start HDFS on the newly added slave node by using the following command

./bin/hadoop-daemon.sh start datanode

8) Check the output of the jps command on the new node.


