Pig Basics User Handbook
Are you a developer looking for a high-level scripting language to work on Hadoop? If so, Apache Pig is worth your consideration. If you have already started learning query languages such as SQL and are using Pig as a tool, this cheat sheet will serve as a handy reference. And don't worry if you are a beginner with no idea of how Pig works: this cheat sheet will give you a quick overview of the basics you need to get started.
You can also download the printable PDF of Pig Built-in Functions Cheat Sheet.
Apache Pig:
Apache Pig is a high-level platform for creating programs that run on Hadoop; its scripting language is known as Pig Latin. Apache Pig can execute its Hadoop jobs in MapReduce or Apache Tez.
Data types:
A particular kind of data defined by the values it can take
Simple data types:
- int – a signed 32-bit integer
- long – a signed 64-bit integer
- float – a 32-bit floating-point number
- double – a 64-bit floating-point number
- chararray – a character array in UTF-8 format
- bytearray – a byte array (blob)
- boolean – true or false
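For example, these types can be declared in a LOAD schema. A minimal sketch, assuming a comma-separated input file with hypothetical field names:

users = LOAD 'users.txt' USING PigStorage(',')
        AS (id:int, visits:long, score:float, ratio:double,
            name:chararray, payload:bytearray, active:boolean);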
Complex data types:
- tuple – an ordered set of fields
- bag – a collection of tuples
- map – a set of key-value pairs
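A minimal sketch of how the complex types appear in a schema (the file name and fields are hypothetical):

events = LOAD 'events.txt'
         AS (point:tuple(x:int, y:int),
             pages:bag{t:tuple(url:chararray)},
             props:map[chararray]);
-- Access them as point.x (tuple field), pages.url (projects the bag), and props#'key' (map lookup)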
Apache Pig Components:
- Parser: checks the syntax of the script
- Optimizer: performs logical optimizations such as projection and pushdown
- Compiler: compiles the optimized logical plan into a series of MapReduce jobs
- Execution engine: submits the MapReduce jobs to Hadoop, where they are executed to produce the desired results
Pig execution modes:
- Grunt mode: an interactive shell, useful for syntax checking and ad hoc data exploration
- Script mode: runs a set of instructions from a file
- Embedded mode: executes Pig programs from a Java program
- Local mode: the entire Pig job runs as a single JVM process
- MapReduce mode: Pig runs the job as a series of MapReduce jobs
- Tez mode: Pig runs the job as a series of Tez jobs
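The execution type is selected with the -x flag when launching Pig, and running Pig without a script opens the interactive Grunt shell. A sketch of the usual invocations (the script name is hypothetical):

pig -x local myscript.pig       # local mode: runs as a single JVM process
pig -x mapreduce myscript.pig   # MapReduce mode (the default)
pig -x tez myscript.pig         # Tez mode
pig                             # no script: starts the interactive Grunt shell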
Pig commands equivalent to SQL functions:
Functions | Pig commands |
SELECT | FOREACH alias GENERATE column_name, column_name; |
SELECT * | FOREACH alias GENERATE *; |
DISTINCT | DISTINCT(FOREACH alias GENERATE column_name, column_name); |
WHERE | FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name; |
AND/OR | FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3; |
ORDER BY | ORDER alias BY column_name ASC|DESC, column_name ASC|DESC; |
TOP/LIMIT | FOREACH (GROUP alias BY column_name) GENERATE LIMIT alias number; or TOP(number, column_index, alias); |
GROUP BY | FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name); |
LIKE | FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL; |
IN | FILTER alias BY column_name IN (value1, value2,…); |
JOIN | FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s); |
LEFT/RIGHT/FULL OUTER JOIN | FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s); |
UNION ALL | UNION alias1, alias2; |
AVG | FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name); |
COUNT | FOREACH (GROUP alias ALL) GENERATE COUNT(alias); |
COUNT DISTINCT | FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); }; |
MAX | FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name); |
MIN | FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name); |
SUM | FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name); |
HAVING | FILTER alias BY aggregate_function(column_name) operator value; |
UCASE/UPPER | FOREACH alias GENERATE UPPER(column_name); |
LCASE/LOWER | FOREACH alias GENERATE LOWER(column_name); |
SUBSTRING | FOREACH alias GENERATE SUBSTRING(column_name, start, start+length) AS some_name; |
LEN | FOREACH alias GENERATE SIZE(column_name); |
ROUND | FOREACH alias GENERATE ROUND(column_name); |
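To see several of these equivalents together, here is a minimal sketch of a Pig script for the SQL query SELECT dept, AVG(salary) FROM emp WHERE salary > 1000 GROUP BY dept ORDER BY dept (the file and field names are hypothetical):

emp      = LOAD 'emp.csv' USING PigStorage(',')
           AS (name:chararray, dept:chararray, salary:double);
filtered = FILTER emp BY salary > 1000;                     -- WHERE
grouped  = GROUP filtered BY dept;                          -- GROUP BY
averaged = FOREACH grouped GENERATE group AS dept,
                                    AVG(filtered.salary);   -- AVG
ordered  = ORDER averaged BY dept ASC;                      -- ORDER BY
DUMP ordered;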
Operators:
Pig Operators:
Type | Command | Description |
Loading and storing | LOAD | Loads data from the file system into a relation |
Loading and storing | DUMP | Dumps the data to the console |
Loading and storing | STORE | Stores the data in a given location |
Grouping and joining | GROUP | Groups the data in a single relation based on a key |
Grouping and joining | COGROUP | Groups the data from multiple relations |
Grouping and joining | CROSS | Computes the cross product of two or more relations |
Grouping and joining | JOIN | Joins two or more relations |
Sorting | LIMIT | Limits the number of results |
Sorting | ORDER | Sorts a relation by one or more fields |
Data sets | UNION | Combines multiple relations |
Data sets | SPLIT | Splits a relation into two or more relations |
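A short sketch tying these commands together, assuming hypothetical log files and fields:

logs_2022 = LOAD 'logs_2022.csv' USING PigStorage(',')
            AS (ts:long, level:chararray, msg:chararray);
logs_2023 = LOAD 'logs_2023.csv' USING PigStorage(',')
            AS (ts:long, level:chararray, msg:chararray);
all_logs  = UNION logs_2022, logs_2023;             -- combine the relations
SPLIT all_logs INTO errors IF level == 'ERROR',
                    others IF level != 'ERROR';     -- partition on a condition
sorted = ORDER errors BY ts DESC;                   -- sort by a field
latest = LIMIT sorted 10;                           -- keep the first 10 rows
STORE latest INTO 'latest_errors';                  -- save to the file system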
Basic Operators:
Operators | Description |
Arithmetic operators | +, -, *, /, %, and the bincond operator ?: |
Boolean operators | and, or, not |
Casting operators | Cast from one data type to another |
Comparison operators | ==, !=, >, <, >=, <=, matches |
Construction operators | Used to construct a tuple (), bag {}, or map [] |
Dereference operators | Used to dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)), and maps (map#'key') |
Disambiguate operator | (::) Used to identify field names after the JOIN, COGROUP, CROSS, or FLATTEN operators |
Flatten operator | Un-nests tuples and bags |
Null operators | is null, is not null |
Sign operators | + has no effect; - changes the sign of a number |
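A sketch showing a few of these operators in use, on hypothetical user data:

users   = LOAD 'users.txt'
          AS (name:chararray, age:int,
              tags:bag{t:tuple(tag:chararray)}, props:map[chararray]);
adults  = FILTER users BY age >= 18 AND name matches 'A.*';    -- comparison + matches
labeled = FOREACH users GENERATE name,
              (age >= 18 ? 'adult' : 'minor') AS bracket,      -- bincond ?:
              props#'city' AS city;                            -- map dereference
flat    = FOREACH users GENERATE name, FLATTEN(tags);          -- un-nest the bag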
Relational Operators:
Operators | Description |
GROUP/COGROUP | Groups the data in one or more relations; the COGROUP operator groups together tuples from multiple relations that have the same group key |
CROSS | This operator is used to compute the cross product of two or more relations |
DEFINE | This operator assigns an alias to a UDF or a streaming command |
DISTINCT | This operator will remove the duplicate tuples from a relation |
FILTER | It selects tuples from a relation based on a specified condition |
FOREACH | It generates data transformations for each row of a relation |
IMPORT | This operator imports macros defined in a separate file |
JOIN | This operator performs the inner join of two or more relations based on common field values |
LOAD | This operator loads the data from a file system |
MAPREDUCE | This operator executes the native MapReduce jobs in a Pig script |
ORDER BY | This will sort the relation based on one or more fields |
SAMPLE | This selects a random sample of data based on the specified sample size |
SPLIT | This will partition the relation based on some conditions or expressions as specified |
STORE | This will store or save the result in a file system |
STREAM | This operator sends the data to an external script or program |
UNION | This operator is used to compute the unions of two or more relations |
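A sketch of a few relational operators side by side, with hypothetical orders and customers data:

orders    = LOAD 'orders.csv'    USING PigStorage(',')
            AS (cust_id:int, amount:double);
customers = LOAD 'customers.csv' USING PigStorage(',')
            AS (id:int, name:chararray);
joined  = JOIN orders BY cust_id, customers BY id;     -- inner join on the common field
by_cust = COGROUP orders BY cust_id, customers BY id;  -- tuples with the same key grouped together
tenpct  = SAMPLE orders 0.1;                           -- roughly a 10% random sample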
Diagnostic Operators:
Operator | Description |
Describe | Returns the schema of the relation |
Dump | It will dump or display the result on the screen |
Explain | Displays execution plans |
Illustrate | It displays the step-by-step execution of a sequence of statements |
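Each diagnostic operator takes a relation alias. A minimal sketch against a hypothetical emp relation:

emp = LOAD 'emp.csv' USING PigStorage(',')
      AS (name:chararray, dept:chararray, salary:double);
DESCRIBE emp;    -- prints the schema of emp
EXPLAIN emp;     -- displays the logical, physical, and execution plans
ILLUSTRATE emp;  -- shows a step-by-step trace with sample rows
DUMP emp;        -- executes the script and prints the tuples to the console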
Download a Printable PDF of this Cheat Sheet
We have covered all the basics of Pig in this cheat sheet. If you want to start learning Pig in depth, check out the Hadoop Certification by Intellipaat.
Not only will you learn and implement Pig with step-by-step guidance and support from us, but you will also get 24/7 technical support for any and all of your queries from the experts in the respective technologies here at Intellipaat throughout the certification period. So, why wait? Check out the training program and enroll today!