Pig Basics User Handbook

Are you a developer looking for a high-level scripting language to work on Hadoop? If so, consider Apache Pig. This Pig cheat sheet is designed for readers who have already started learning a language such as SQL and are using Pig as a tool; for them it will be a handy reference. If you are a beginner with no idea how Pig works, don't worry: this cheat sheet gives you a quick reference to the basics you need to get started.

Apache Pig:

Apache Pig is a high-level platform for creating programs that run on Hadoop; its language is known as Pig Latin. Pig can execute its Hadoop jobs in MapReduce or Tez.

Data types:

A data type is a particular kind of data, defined by the values it can take.

  • Simple data types:

    • Int: A signed 32-bit integer
    • Long: A signed 64-bit integer
    • Float: A 32-bit floating point number
    • Double: A 64-bit floating point number
    • Chararray: A character array in UTF-8 format
    • Bytearray: A byte array (blob)
    • Boolean: True or false
  • Complex data types:

    • Tuple: An ordered set of fields
    • Bag: A collection of tuples
    • Map: A set of key-value pairs
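
The types above usually appear in the schema of a LOAD statement. The following is a minimal sketch, assuming a hypothetical tab-delimited file `students.txt`; the relation and field names are illustrative and not part of the original cheat sheet.

```pig
-- Hypothetical input file and field names, for illustration only.
-- The default loader (PigStorage) splits fields on tabs.
students = LOAD 'students.txt' AS (
    id:int,                                              -- simple type
    name:chararray,                                      -- simple type
    gpa:double,                                          -- simple type
    address:tuple(city:chararray, zip:chararray),        -- complex type: tuple
    courses:bag{t:tuple(title:chararray, credits:int)},  -- complex type: bag of tuples
    attrs:map[chararray]                                 -- complex type: map with chararray values
);

-- Print the schema that Pig has assigned to the relation.
DESCRIBE students;
```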

Apache Pig Components:

  • Parser: Checks the syntax of the scripts
  • Optimizer: Performs logical optimizations such as projection and pushdown
  • Compiler: Compiles the optimized logical plan into a series of MapReduce jobs
  • Execution engine: Executes the MapReduce jobs on Hadoop to produce the desired results


Pig execution modes:

  • Grunt mode: An interactive shell, useful for syntax checking and ad hoc data exploration
  • Script mode: Runs a set of instructions from a file
  • Embedded mode: Executes Pig programs from within a Java program
  • Local mode: The entire Pig job runs as a single JVM process
  • MapReduce mode: Pig runs the job as a series of MapReduce jobs
  • Tez mode: Pig runs the job as a series of Tez jobs


Pig commands equivalent to the SQL functions:

SQL function: Pig command

  • SELECT: FOREACH alias GENERATE column_name, column_name;
  • DISTINCT: DISTINCT (FOREACH alias GENERATE column_name, column_name);
  • WHERE: FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name;
  • AND/OR: FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3;
  • ORDER BY: ORDER alias BY column_name ASC|DESC, column_name ASC|DESC;
  • TOP/LIMIT: LIMIT alias number; or, per group, FOREACH (GROUP alias BY column_name) GENERATE TOP(number, column_index, alias);
  • GROUP BY: FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name);
  • LIKE: FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL;
  • IN: FILTER alias BY column_name IN (value1, value2, …);
  • JOIN: FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s);
  • LEFT/RIGHT/FULL OUTER JOIN: FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s);
  • UNION ALL: UNION alias1, alias2;
  • AVG: FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name);
  • COUNT DISTINCT: FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); };
  • MAX: FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name);
  • MIN: FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name);
  • SUM: FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name);
  • HAVING: FILTER alias BY aggregate_function(column_name) operator value;
  • SUBSTRING: FOREACH alias GENERATE SUBSTRING(column_name, start, start+length) AS some_name;
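
Several rows of the mapping above can be combined to translate a complete SQL query. The following is a minimal sketch, assuming a hypothetical comma-delimited file `employees.csv` with the fields shown; all file, relation, and field names are illustrative.

```pig
-- SQL being translated (hypothetical table and columns):
--   SELECT dept, AVG(salary) FROM employees
--   WHERE salary > 1000 GROUP BY dept ORDER BY dept;

employees = LOAD 'employees.csv' USING PigStorage(',')
            AS (name:chararray, dept:chararray, salary:double);

-- WHERE
well_paid = FILTER employees BY salary > 1000.0;

-- GROUP BY: each output tuple holds the key ("group") and a bag of matching rows
by_dept = GROUP well_paid BY dept;

-- SELECT with an aggregate over the grouped bag
avg_salary = FOREACH by_dept GENERATE group AS dept, AVG(well_paid.salary) AS avg_salary;

-- ORDER BY
sorted = ORDER avg_salary BY dept ASC;

DUMP sorted;
```

Written as one statement per step, the script is longer than the nested one-line forms in the mapping, but each intermediate relation can be inspected on its own.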


Pig Operators:

  • Loading and storing:

    • LOAD: Loads data into a relation
    • DUMP: Dumps the data to the console
    • STORE: Stores data in a given location
  • Grouping data and joining:

    • GROUP: Groups the data based on a key
    • COGROUP: Groups the data from multiple relations
    • CROSS: Cross joins two or more relations
    • LIMIT: Limits the number of results
    • ORDER: Sorts by categories or fields
  • Data sets:

    • UNION: Combines multiple relations
    • SPLIT: Splits a relation

Basic Operators:

  • Arithmetic operators: +, -, *, /, %, ?:
  • Boolean operators: and, or, not
  • Casting operators: Cast from one data type to another
  • Comparison operators: ==, !=, >, <, >=, <=, matches
  • Construction operators: Used to construct a tuple (), bag {}, or map []
  • Dereference operators: Used to dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)), and maps (map#'key')
  • Disambiguate operator (::): Used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators
  • Flatten operator: Un-nests tuples as well as bags
  • Null operators: is null, is not null
  • Sign operators: + has no effect; - changes the sign of a positive or negative number
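
The operators listed above can be seen together in a short script. This is a minimal sketch under stated assumptions: the files `orders.txt` and `customers.txt`, and every relation and field name, are hypothetical.

```pig
-- Hypothetical inputs, for illustration only.
orders = LOAD 'orders.txt' AS (
    id:int,
    customer:chararray,
    amount:bytearray,
    tags:map[],
    items:bag{t:tuple(sku:chararray, qty:int)}
);
customers = LOAD 'customers.txt' AS (customer:chararray, city:chararray);

flat = FOREACH orders GENERATE
    id,
    customer,
    (double)amount AS amount,                        -- casting operator
    tags#'priority' AS priority,                     -- map dereference
    FLATTEN(items) AS (sku:chararray, qty:int);      -- flatten un-nests the bag

labeled = FOREACH flat GENERATE
    id,
    sku,
    (qty > 10 ? 'bulk' : 'single') AS order_kind,    -- comparison plus ?: (bincond)
    -amount AS negated_amount;                       -- sign operator

-- Both inputs have a "customer" field, so :: disambiguates after the JOIN.
joined = JOIN flat BY customer, customers BY customer;
result = FOREACH joined GENERATE flat::customer, customers::city;

-- Null operator
with_priority = FILTER flat BY priority IS NOT NULL;
```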

Relational Operators:

  • COGROUP/GROUP: Groups the data in one or more relations; COGROUP groups together the tuples that have the same group key
  • CROSS: Computes the cross product of two or more relations
  • DEFINE: Assigns an alias to a UDF or a streaming command
  • DISTINCT: Removes the duplicate tuples from a relation
  • FILTER: Selects the tuples from a relation based on the specified condition
  • FOREACH: Generates the specified data transformation for each tuple in a relation
  • IMPORT: Imports macros defined in a separate file
  • JOIN: Performs an inner or outer join of two or more relations based on common field values
  • LOAD: Loads the data from a file system
  • MAPREDUCE: Executes native MapReduce jobs in a Pig script
  • ORDER BY: Sorts the relation based on one or more fields
  • SAMPLE: Selects a random data sample based on a specified sample size
  • SPLIT: Partitions the relation into two or more relations based on the specified conditions or expressions
  • STORE: Stores or saves the result in a file system
  • STREAM: Sends the data to an external script or program
  • UNION: Computes the union of two or more relations
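
Several of the relational operators above can be chained in one script. The following is a minimal sketch, assuming a hypothetical tab-delimited log file; the file paths, relation names, and field names are illustrative.

```pig
-- Hypothetical input, for illustration only.
logs = LOAD 'logs.txt' AS (username:chararray, status:int, nbytes:long);

-- SPLIT partitions one relation into several based on conditions.
SPLIT logs INTO ok IF status < 400, failed IF status >= 400;

-- DISTINCT removes duplicate tuples.
failed_users = DISTINCT (FOREACH failed GENERATE username);

-- SAMPLE keeps roughly 10% of the tuples, chosen at random.
ok_sample = SAMPLE ok 0.1;

-- COGROUP groups both relations by the same key; each output tuple has
-- the key ("group") plus one bag per input relation.
by_user = COGROUP ok BY username, failed BY username;
counts = FOREACH by_user GENERATE
    group AS username,
    COUNT(ok) AS ok_count,
    COUNT(failed) AS failed_count;

-- STORE writes the result to the file system.
STORE counts INTO 'output/counts' USING PigStorage(',');
```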


Diagnostic Operators:

  • DESCRIBE: Returns the schema of the relation
  • DUMP: Dumps or displays the result on screen
  • EXPLAIN: Displays execution plans
  • ILLUSTRATE: Displays the step-by-step execution of a sequence of statements
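
The diagnostic operators are typically run from the Grunt shell against an alias that is still being developed. This is a minimal sketch, assuming a hypothetical file `numbers.txt` with one integer per line.

```pig
-- Hypothetical input, for illustration only.
nums = LOAD 'numbers.txt' AS (n:int);
evens = FILTER nums BY n % 2 == 0;

DESCRIBE evens;     -- prints the schema of the relation
EXPLAIN evens;      -- prints the logical, physical, and execution plans
ILLUSTRATE evens;   -- walks a small data sample through each statement
DUMP evens;         -- runs the job and prints the tuples to the console
```

DESCRIBE and EXPLAIN only inspect the plan, whereas DUMP triggers a full run, so it is usually worth checking the first two before dumping a large relation.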

