
What is Apache Pig?

In this age, everything is digital. Since companies and industries run digitally, all their data is digital too, and processing such large volumes of data poses real challenges. Apache Pig is a platform for analyzing large sets of data. It allows developers to create query execution routines to analyze large, distributed datasets, saving them from doing low-level work in MapReduce. The language upon which this platform operates is Pig Latin.

Apache Pig: Definition

Apache Pig is a platform for analyzing large datasets. It consists of a high-level language for expressing data analysis programs, along with the infrastructure for evaluating these programs. Pig programs can be heavily parallelized, which is what allows them to handle very large datasets.

Pig was initially developed by Yahoo! for its data scientists who were using Hadoop. It was created so that they could focus mainly on analyzing large datasets rather than on writing mapper and reducer functions. This lets users concentrate on what they want to do rather than on how it is done. On top of this, Pig gives you the facility to write functions in other languages like Java and Python. Applications built on Pig Latin can be custom-built for different companies to serve different data management tasks. Pig organizes all the branches of data and relates them so that, when the time comes, filtering and searching the data can be done efficiently and quickly.


Scripts written in Pig Latin are transformed into MapReduce applications that process data in parallel, increasing the amount of data processed at a time. This also provides redundancy, meaning that vital parts of the job are duplicated so the system does not depend on a single path to reach its goal. Using user-defined functions (UDFs), Pig can take in data from all sorts of sources, including data files and streams, and add it to its system for future use.

Differences Between MapReduce and Pig

As said earlier, Pig was built on top of MapReduce, but Pig is far easier to work with than raw MapReduce. Here are some interesting facts about the Pig tool, followed by a comparison table:

  1. 20 lines of Pig code = 400 lines of MapReduce code
  2. Pig needs only 1/16th of the development time as compared to MapReduce
| | Pig | MapReduce |
|---|---|---|
| Type of programming language | Procedural dataflow language | Data processing paradigm |
| Level | A high-level language | Low-level and rigid |
| Join operation | Performing a join operation is fairly simple | It is difficult to perform join operations between datasets |
| Skills needed | A good knowledge of SQL is sufficient | The developer needs a robust knowledge of Java |
| Code length | Due to the multi-query approach, the code length is greatly reduced | MapReduce requires about 20 times more code to accomplish the same task |
| Compilation | No compilation step for the user, as every Pig operator is converted internally into MapReduce jobs | MapReduce jobs have a prolonged compilation process |
| Nested data types | Present in Pig | Not present in MapReduce |
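To make the code-length claim concrete, here is a minimal word count sketch in Pig Latin; the input path input.txt and the alias names are illustrative, not from the original article. The equivalent hand-written MapReduce program in Java typically runs to well over a hundred lines.

-- Load each line of the input file as a single chararray field
lines = LOAD 'input.txt' AS (line:chararray);
-- Split every line into words and flatten the bag into one word per row
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP counts;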
 

Pig architecture

Pig Latin is the language used to analyze data in Hadoop with Pig. It comes with a rich set of data types and operators for performing various data operations. To run a task in Pig, programmers write a Pig script in this language and execute it through one of several execution mechanisms, such as the Grunt shell, UDFs, or embedded mode. To produce the intended output, these scripts go through a series of transformations applied by Pig.


Pig internally converts these scripts into a series of MapReduce jobs, which makes the programmer's work much easier. The architecture of Pig consists of the following components:

Parser – It handles the Pig scripts: it checks the script's syntax, performs type checking, and carries out other checks. The parser's output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators: the logical operators of the script appear as nodes, and the data flows between them as edges.

Optimizer – The DAG produced by the parser is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

Compiler – It compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine – Finally, the MapReduce jobs are submitted to Hadoop in sorted order. These jobs then run on Hadoop to produce the intended results.
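You can watch these stages at work with Pig's EXPLAIN operator, which prints the logical, physical, and MapReduce plans for a relation. A minimal sketch, assuming a hypothetical comma-separated file employees.txt:

-- Load a hypothetical comma-separated file and apply a simple filter
emp = LOAD 'employees.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER emp BY age >= 18;
-- Print the logical, physical, and MapReduce execution plans for 'adults'
EXPLAIN adults;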

Pig execution modes

There are two execution modes in Pig. The choice depends on where the Pig script is to run and on where the data resides: on a single machine or across a cluster.

Pig local mode – Pig runs in a single JVM and accesses the local file system. This mode is ideal for dealing with smaller datasets. The user passes -x local to enter Pig's local mode of execution. To load data, Pig looks for paths on the local file system.

Pig MapReduce mode – This mode requires a proper Hadoop cluster setup and installation. The user passes the -x mapreduce option to enter it; since it is also the default mode, just entering 'pig' in the shell works as well. The queries that Pig translates into MapReduce jobs are run on top of the Hadoop cluster, and LOAD and STORE statements read from and write to the HDFS file system.
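As a quick illustration, the mode is chosen on the command line when launching Pig; wordcount.pig below is a hypothetical script name:

pig -x local wordcount.pig
pig -x mapreduce wordcount.pig

The first command runs the script in a single JVM against the local file system; the second, which is also what plain 'pig' defaults to, submits the generated jobs to the Hadoop cluster.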


Pig schema

In this section of the Pig tutorial, we will discuss the nuances of schemas in the Pig language. A schema assigns a name to each field and declares its data type. Defining a schema in Pig Latin is optional, but it is good practice to do so: error checking becomes more effective, resulting in smoother program execution. Schemas can be declared with both simple and complex data types. We expand on Pig schemas through the following points (see the sketch after this list):

  • Bytearray is the default data type assigned when the schema specifies only the field name.
  • If you have assigned a name to a field, that field can be accessed both by positional notation and by name. If no field name is provided, the field can be accessed only through positional notation, e.g. $x, where x is the index at which the field resides.
  • Just as anything multiplied by zero is zero, if you perform an operation that combines relations and the schema is missing on any of them, the resulting relation will have a null schema too.
  • If the schema is null, Pig treats the field as bytearray; the real data type of the field is then determined dynamically.
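Here is a minimal sketch of these points: a schema declared in a LOAD statement, with one field read by name and another by positional notation (students.txt and the field names are hypothetical):

-- Declare a name and type for each field while loading
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- 'name' is accessed by its name; $1 accesses 'age' by position
pairs = FOREACH students GENERATE name, $1;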

Features of Pig

Let’s examine the features that the Pig tool actually provides.

  • Operator set – Pig offers many built-in operators for operations such as join, filter, and sort.
  • Programming ease – Pig Latin closely resembles SQL, so it is easy to write a Pig script if you’re good at SQL.
  • User-defined functions – Developers can create UDFs in other programming languages such as Java and invoke them in Pig scripts (see the sketch after this list).
  • Extensibility – Developers can develop their own functions to read, process and write data.
  • Optimization opportunities – Pig optimizes the execution of tasks automatically, so programmers only need to focus on the semantics of the language.
  • The animal pig eats almost anything it gets its mouth on. Apache Pig is named for the same trait: it processes all kinds of data, whether structured, semi-structured, or unstructured, and stores the results in HDFS.
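For illustration, here is a minimal sketch of invoking a Java UDF from a Pig script; the jar myudfs.jar and the class myudfs.UPPER are hypothetical names, not a real library:

-- Register the jar that contains the (hypothetical) Java UDF class
REGISTER myudfs.jar;
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- Invoke the UDF exactly like a built-in function
upper_names = FOREACH students GENERATE myudfs.UPPER(name);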

Differences between Pig and Hive

These two tools are not the same, even though from a non-technical perspective they might seem to be. Let’s see how they differ in seven ways.

| | Pig | Hive |
|---|---|---|
| Type of programming language | Procedural dataflow language | Declarative SQL-like language |
| Used for | Programming | Creating reports |
| Used by | Researchers and programmers | Data analysts |
| Operating domain | Client side of the cluster | Server side of the cluster |
| Database | Does not rely on a dedicated metadata database; schemas are defined within the scripts themselves | Uses a dedicated variation of SQL DDL, defining tables in advance |
| Relationship with SQL | It is like SQL but varies to a great extent | Uses SQL, and hence it is easy to learn for database experts |
| Avro file format | Pig supports it | Hive doesn’t support it |


Pig Latin data model

This data model is fully nested, and the language allows complex non-atomic data types such as map and tuple. This makes Pig well suited to handling very large datasets.

Atom

An atom is any single value in this language, regardless of its data type. It is stored as a string and can be used both as a number and as a string. Pig’s atomic values are int, long, float, double, chararray, and bytearray. A field is a piece of data, i.e. a simple atomic value. Ex: ‘26’ or ‘avi’

Tuple

A tuple is an ordered set of fields, where the fields can be of any type. It is similar to a row in an RDBMS table.

Ex: (avi, 26)


Bag

An unordered set of tuples is known as a bag; put another way, a bag is a collection of tuples, possibly with duplicates. Its schema is flexible, since every tuple can have any number of fields. A bag is represented by ‘{}’. It mirrors an RDBMS table, except that it is not necessary for every tuple to contain the same number of fields, or for fields in the same position to have the same type.

Ex : {(avi,26), (Nithin, 30)}

A bag can be in a field in a relation in which case it is known as an inner bag.

Ex : (avi, 26, {(9739724366, [email protected])})

Map

A set of key-value pairs is known as a map. The key should be of type chararray and must be unique; the value can be of any type. A map is represented by ‘[]’.

Ex : [name#avi, age#26]

Relation

A bag of tuples is known as a relation. The tuples in a relation are unordered.
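To tie these types together, here is a minimal sketch that declares a relation whose tuples contain an atom, an inner bag, and a map; people.txt and its layout are hypothetical:

-- Each tuple holds an atom (name), a bag of phone tuples, and a map of extra info
people = LOAD 'people.txt' AS (name:chararray, phones:bag{t:tuple(phone:chararray)}, info:map[]);
-- Print the declared schema; the relation itself is a bag of these tuples
DESCRIBE people;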


Pig commands – the developer’s ammunition!

Pig provides developers with two types of commands, namely shell commands and utility commands.

Pig Shell commands

fs – This command is used to invoke any FsShell command from within a Pig script or the Grunt shell.

sh – This command is used to invoke any sh shell command from within a Pig script or the Grunt shell.

Pig Utility commands      

clear – Clears the Pig Grunt shell screen and positions the cursor at the top, similar to clrscr() in the C language.

exec – Runs a Pig script in batch mode.

help – Prints a list of Pig commands, similar to Unix’s help command.

history – Prints the list of statements run so far.

kill – This command kills a job.

quit – Exits the Pig Grunt shell.

run – Runs a Pig script; unlike exec, the script’s aliases remain available in the Grunt shell afterwards.

set – Assigns values to configuration keys, such as the job name or priority.
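A quick sketch of a few of these commands in a Grunt session; /scripts/load_data.pig is a hypothetical script path, and job.name is a standard configuration key:

grunt> set job.name 'my-pig-job'
grunt> exec /scripts/load_data.pig
grunt> history

The set line names the job before exec runs the script in batch mode, and history then lists the statements entered in the session.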

Pig script example

Here is how the fs command works inside a Pig script:

fs -mkdir /temp (the directory is created)

fs -copyFromLocal file-a file-b (the local file is copied into HDFS)

fs -ls file-b (the file is listed)

Data movement through Pig operators

We know Pig processes data very efficiently, but let’s see through which mechanisms data moves between Pig and HDFS or the local file system.

Pig Load operator

Through the LOAD operator, data can be loaded into Pig from HDFS or the local file system.

The_Relation = LOAD 'input file path' USING function AS schema;

The ‘=’ operator is crucial here: the left-hand side names the relation in which we want to store the data, while the right-hand side defines where the data comes from and how it is loaded.
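A minimal concrete sketch of LOAD, assuming a hypothetical comma-separated HDFS file /data/students.txt:

-- Read a comma-separated HDFS file into the relation 'students'
students = LOAD '/data/students.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
-- Print the loaded tuples to the console
DUMP students;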

Pig Store operator

STORE The_Relation INTO 'required directory path' [USING function];

Here the relation The_Relation is written to the required directory path, serialized in the manner defined by the function.
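And a matching sketch of STORE, writing the relation loaded above into a hypothetical output directory with a pipe delimiter:

-- Write the 'students' relation to HDFS, one pipe-delimited record per tuple
STORE students INTO '/output/students_out' USING PigStorage('|');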

How to download Pig?

1. Go to the Apache Pig website. Under the News section, click on the release page link.


2. On doing this you will be redirected to the Apache Pig Releases page. Under the Download section, you will see a ‘download a release now’ link. Click it.


3. You will be taken to a page of mirror sites; follow one of the mirror links.


4. The link will take you to the Pig Releases page, which lists various versions of Apache Pig. Opt for the latest version among them, which currently is pig-0.17.0.


5. Within these folders you will find the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.


Applications of Pig

  • Processing of web logs (e.g. error logs)
  • Data processing for search platforms – If you want to search across multiple sets of data, Pig can be used for the purpose.
  • Pig can be used across large datasets when you need support for ad hoc queries.
  • Quick prototyping of algorithms for processing large datasets; data scientists commonly use Pig for this.

Use cases of Pig

Many organizations and applications use Pig technology today. These include big companies like LinkedIn and Mendeley, which use Pig for finding or matching people from different places on their websites; people can also use this technology to match places and find jobs. Search engines and question-answer engines use Apache Pig as well, training their employees on Hadoop so that they can later apply this knowledge to maintaining the company’s servers.

Other big names in the social media world, such as Yahoo! and Twitter, also use Pig. Twitter uses it for log analysis and for mining tweet data, while Yahoo! uses it for matching and finding relevant data. Stanford University has dedicated a good portion of its research department to Pig technology and Hadoop training, as it understands the power, need, and use of this technology as a tool in the business world and in social media.

Pig is also used in the de-identification of personal health information. People who volunteer for medical tests often want privacy, so their data has to be de-identified, meaning the identifying details must be removed. The challenges here are that a huge amount of data flows into the system regularly and that there are multiple data sources from which the data must be aggregated. Processing this data and de-identifying it has its own problems.


Conclusion

Pig is the technology that builds a bridge between Hadoop, Hive, and other data management technologies, and it can be used to tackle problems in managing data of any size. Raw MapReduce has gone out of style, partly because it is far easier to write code in Pig and partly because of Apache Spark. On top of that, Pig is more effective too: as we have seen, roughly 20 lines of Pig code can do the work of 400 lines of MapReduce code.

About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.