What is Pig in Hadoop?
Pig Hadoop is basically a high-level programming language that is helpful for the analysis of huge datasets. Pig Hadoop was developed by Yahoo! and is generally used with Hadoop to perform a lot of data administration operations.
For writing data analysis programs, Pig renders a high-level programming language called Pig Latin. Several operators are provided by Pig Latin using which personalized functions for writing, reading, and processing of data can be developed by programmers.
For analyzing data through Apache Pig, we need to write scripts using Pig Latin. Then, these scripts need to be transformed into MapReduce tasks. This is achieved with the help of Pig Engine.
Watch this video on ‘Apache Pig Tutorial’:
Why Apache Pig?
By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the Java programming language. Now, the question that arises in our minds is ‘Why Pig?’ The need for Apache Pig came up when many programmers weren’t comfortable with Java and were facing a lot of struggle working with Hadoop, especially, when MapReduce tasks had to be performed. Apache Pig came into the Hadoop world as a boon for all such programmers.
- After the introduction of Pig Latin, now, programmers are able to work on MapReduce tasks without the use of complicated codes as in Java.
- To reduce the length of codes, the multi-query approach is used by Apache Pig, which results in reduced development time by 16 folds.
- Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we have little knowledge of SQL.
For supporting data operations such as filters, joins, ordering, etc., Apache Pig provides several in-built operations.
Features of Pig Hadoop
There are several features of Apache Pig:
- In-built operators: Apache Pig provides a very good set of operators for performing several data operations like sort, join, filter, etc.
- Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to write a Pig script.
- Automatic optimization: The tasks in Apache Pig are automatically optimized. This makes the programmers concentrate only on the semantics of the language.
- Handles all kinds of data: Apache Pig can analyze both structured and unstructured data and store the results in HDFS.
Get 100% Hike!
Master Most in Demand Skills Now!
Apache Pig Architecture
The main reason why programmers have started using Hadoop Pig is that it converts the scripts into a series of MapReduce tasks making their job easy. Below is the architecture of Pig Hadoop:
Pig Hadoop framework has four main components:
- Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser is responsible for checking the syntax of the script, along with other miscellaneous checks. Parser gives an output in the form of a Directed Acyclic Graph (DAG) that contains Pig Latin statements, together with other logical operators represented as nodes.
- Optimizer: After the output from the parser is retrieved, a logical plan for DAG is passed to a logical optimizer. The optimizer is responsible for carrying out the logical optimizations.
- Compiler: The role of the compiler comes in when the output from the optimizer is received. The compiler compiles the logical plan sent by the optimizing The logical plan is then converted into a series of MapReduce tasks or jobs.
- Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs are sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop for yielding the desired result.
Downloading and Installing Pig Hadoop
Follow the below steps for the Apache Pig installation. These steps are for Linux/CentOS/Windows (using VM/Ubuntu/Cloudera). In this tutorial section on ‘Pig Hadoop’, we are using CentOS.
Step 1: Download the Pig.tar file by writing the following command on your Terminal:
wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 2: Extract the tar file (you downloaded in the previous step) using the following command:
tar -xzf pig-0.16.0.tar.gz
Your tar file gets extracted automatically from this command. To check whether your file is extracted, write the command ls for displaying the contents of the file. If you see the below output, the Pig file has been successfully extracted.
Step 3: Edit the .bashrc file for updating the Apache Pig environment variables. It is required to set this up in order to access Pig from a directory instead of going to the Pig directory for executing Pig commands. Other applications can also access Pig using this path of Apache Pig from this file
A new window will open up wherein you need to add a few commands.
When the above window pops up, write down the following commands at the end of the file:
# Set PIG_HOME
export PIG_HOME=/home/training/pig-0.16.0
export PATH=$PATH:/home/training/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
You then need to save this file in:
File > Save
You have to close this window and then, on your terminal, enter the following command for getting the changes updated:
source .bashrc
Step 4: Check the Pig Version. To check whether Pig is successfully installed, you can run the following command:
pig -version
Step 5: Start the Grunt Shell (used to run Pig Latin scripts) by running the command: Pig
By default, Pig Hadoop chooses to run MapReduce jobs in which access is required to the Hadoop cluster and the HDFS installation. But there is another mode, i.e., a local mode in which all the files are installed and run using a local host, along with the file system. You can run the localhost mode using the command: pig -x local
I hope you were able to successfully install Apache Pig. In this section on Apache Pig, we learned ‘What is Apache Pig?’, why we need Pig, its features, architecture, and finally the installation of Apache Pig.
In the next section of this tutorial, we will learn about Apache Hive.