
Introduction to Pig, Sqoop, and Hive

Apache Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure to evaluate those programs. The main advantage of Pig is that its programs parallelize easily, which lets it handle very large amounts of data. Programs on this platform are written in the textual language Pig Latin.

Pig Latin comes with the following features:

  • Simple programming: Pig Latin is easy to write, execute, and maintain
  • Better optimization: the system can automatically optimize the execution of a program
  • Extensibility: users can write their own functions to carry out specialized processing tasks

Pig can be used for the following purposes:

  • ETL data pipelines
  • Research on raw data
  • Iterative processing

The scalar data types in pig are int, float, double, long, chararray, and bytearray. The complex data types in Pig are map, tuple, and bag.

Map: a set of key-value pairs. The keys are of type chararray, and the values can be of any Pig data type, including complex types.

Example: ['city'#'bang', 'pin'#560001]

Here city and pin are keys mapping to the values 'bang' and 560001.

Tuple: an ordered, fixed-length collection of fields. A tuple can hold multiple fields, and each field can be of any data type.

Bag: an unordered collection of tuples; the tuples in a bag are separated by commas.

Example: {('Bangalore', 560001), ('Mysore', 570001), ('Mumbai', 400001)}
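These complex types can appear together in one relation. A minimal Pig Latin sketch of how they combine (the file name and field names here are hypothetical):

```
-- each record has two scalar fields and a map of extra attributes
cities  = LOAD 'cities.txt' AS (city:chararray, pin:int, extra:map[]);
-- grouping produces, for each key, a bag holding the original tuples
by_city = GROUP cities BY city;
DUMP by_city;
```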

LOAD function:

The LOAD function loads data from the file system; it is a relational operator. The first step in a Pig Latin data flow is to specify the input, which is done with the LOAD keyword.

The LOAD syntax is

LOAD 'mydata' [USING function] [AS schema];

Example: A = LOAD 'intellipaat.txt';

A = LOAD 'intellipaat.txt' USING PigStorage('\t');

The relational operators in Pig include:

foreach, order by, filter, group, distinct, join, limit.

foreach: takes a set of expressions and applies them to every record in the data pipeline, passing the results on to the next operator.

A = LOAD 'input' AS (emp_name:chararray, emp_id:long, emp_add:chararray, phone:chararray, preferences:map[]);

B = foreach A generate emp_name, emp_id;

filter: contains a predicate and allows us to select which records are retained in the data pipeline.

Syntax: alias = FILTER alias BY expression;

Here alias is the name of a relation, BY is a required keyword, and the expression must evaluate to a Boolean.

Example: M = FILTER N BY F5 == 4;
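Putting LOAD, filter, and foreach together, a complete script might look like the following sketch (the input file, delimiter, and field names are hypothetical):

```
-- load tab-separated employee records with an explicit schema
emps  = LOAD 'employees.txt' USING PigStorage('\t')
        AS (emp_name:chararray, emp_id:long, city:chararray);
-- the predicate keeps only records for one city
blr   = FILTER emps BY city == 'Bangalore';
-- project just the two fields we need
names = FOREACH blr GENERATE emp_name, emp_id;
DUMP names;
```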

Apache Sqoop

Apache Sqoop is a tool used extensively to transfer large amounts of data between Hadoop and relational database servers. Sqoop can import various types of data from databases such as Oracle and MySQL.

Important Sqoop control commands to import RDBMS data

  • Append: appends data to an existing dataset in HDFS. --append
  • Columns: specifies the columns to import from the table. --columns <col,col,…>
  • Where: specifies a WHERE clause to use during the import. --where <where clause>

The common large object types in Sqoop are BLOB and CLOB. If an object is less than 16 MB, it is stored inline with the rest of the data. Bigger objects are temporarily stored in a subdirectory named _lobs and are then materialized in memory for processing. If the LOB limit is set to zero (0), the objects are always stored in external storage.

Sqoop allows us to export and import data from a table based on a WHERE clause. The relevant options are:

--columns <col1,col2,…>

--where <condition>

--query <SQL query>

Example:

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"

sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"

sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"

Sqoop supports importing data into the following services: HDFS, Hive, HBase, and Accumulo.

Sqoop needs a connector to connect to different relational databases. Almost every database vendor makes a JDBC connector available that is specific to that database, and Sqoop needs the database's JDBC driver for the interaction.

A JDBC driver alone is not enough: Sqoop needs both the JDBC driver and a connector to connect to a database.

Sqoop command to control the number of mappers

We can control the number of mappers by passing the --num-mappers parameter in the sqoop command. The --num-mappers argument controls the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks and increase gradually, because starting with a high number of mappers may degrade performance on the database side.

Syntax: -m, --num-mappers <n>
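For example, a sketch of an import that uses four parallel map tasks (the connection details and table name are hypothetical):

```
sqoop import --connect jdbc:mysql://db.test.com/corp \
  --table INTELLIPAAT_EMP \
  --num-mappers 4
```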

Sqoop command to show all the databases in MySQL server

$ sqoop list-databases --connect jdbc:mysql://database.test.com/

Sqoop metastore

It is a tool for hosting a shared metadata repository. Multiple local and remote users can define and execute saved jobs stored in the metastore. Clients are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.

The purpose of sqoop-merge is:

This tool combines two datasets, where entries in one dataset overwrite entries of an older dataset, preserving only the newest version of each record across both datasets.
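A sketch of a merge invocation (the HDFS paths, jar, class name, and key column are hypothetical); --merge-key names the column used to match records across the two datasets:

```
sqoop merge --new-data /user/hadoop/emp_new \
  --onto /user/hadoop/emp_old \
  --target-dir /user/hadoop/emp_merged \
  --jar-file emp.jar --class-name emp \
  --merge-key emp_id
```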

Apache Hive

Apache Hive is data warehouse software that lets you read, write, and manage huge datasets stored in a distributed environment using SQL. It is possible to project structure onto data that is already in storage. Users can connect to Hive using a JDBC driver or a command-line tool.

Hive is open source. We can use Hive for analyzing and querying large datasets in Hadoop files, using a language similar to SQL. At the time of writing, the current version of Hive was 0.13.1.

Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID semantics are provided at the row level through INSERT, UPDATE, and DELETE operations.
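To use these row-level operations, a table generally has to be bucketed, stored as ORC, and marked transactional. A HiveQL sketch (the table and column names are hypothetical):

```
CREATE TABLE emp_txn (emp_id INT, emp_name STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- row-level update is possible because the table is transactional
UPDATE emp_txn SET emp_name = 'Ravi' WHERE emp_id = 1;
```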

Hive is not considered a full database. The design rules and restrictions of Hadoop and HDFS limit what Hive can do.

Hive is most suitable for data warehouse applications that:

  • Analyze relatively static data
  • Do not require fast response times
  • Do not have rapidly changing data

Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing). Hive is best suited to data warehouse applications over large datasets.

The two types of tables in Hive

  1. Managed table
  2. External table
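The difference shows up in the DDL: Hive owns (and on DROP deletes) a managed table's data, while an external table only points at files that outlive the table. A sketch with hypothetical names and paths:

```
-- Managed: dropping the table also deletes its data
CREATE TABLE emp_managed (emp_id INT, emp_name STRING);

-- External: dropping the table leaves the files in place
CREATE EXTERNAL TABLE emp_external (emp_id INT, emp_name STRING)
LOCATION '/user/hadoop/emp_data';
```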

We can change settings within a Hive session using the SET command. This is useful for changing Hive job settings for a particular query.

Example: the following command ensures that buckets are populated according to the table definition.

hive> SET hive.enforce.bucketing=true;

We can see the current value of any property by using SET with the property name. SET on its own lists all the properties whose values have been set by Hive.

hive> SET hive.enforce.bucketing;

hive.enforce.bucketing=true

This list will not include the Hadoop defaults, so we should use:

SET -v

It will list all the properties including the Hadoop defaults in the system.

Add a new node with the following steps:

1) Take a new system: create a new username and password.

2) Install SSH and set up SSH connections with the master node.

3) Add the SSH public RSA key to the authorized_keys file.

4) Add the new DataNode's hostname, IP address, and other details to the /etc/hosts and slaves files:

192.168.1.102 slave3.in slave3

5) Start the DataNode on the new node.

6) Log in to the new node, e.g. su hadoop or ssh -X hadoop@192.168.1.103

7) Start HDFS on the newly added slave node by using the following command:

./bin/hadoop-daemon.sh start datanode

8) Check the output of the jps command on the new node.

"0 Responses on Introduction to Pig, Sqoop, and Hive"

Leave a Message

Your email address will not be published.

Training in Cities

Bangalore, Hyderabad, Chennai, Delhi, Kolkata, UK, London, Chicago, San Francisco, Dallas, Washington, New York, Orlando, Boston

100% Secure Payments. All major credit & debit cards accepted Or Pay by Paypal.

top

Sales Offer

  • To avail this offer, enroll before 09th December 2016.
  • This offer cannot be combined with any other offer.
  • This offer is valid on selected courses only.
  • Please use coupon codes mentioned below to avail the offer
offer-june

Sign Up or Login to view the Free Introduction to Pig, Sqoop, and Hive.