Pig Basics User Handbook
Are you a developer looking for a high-level scripting language to work on Hadoop? If so, Apache Pig is worth your consideration. If you have already started learning query languages such as SQL and are using Pig as a tool, this cheat sheet will serve as a handy reference. And don't worry if you are a beginner with no idea of how Pig works: this cheat sheet will give you a quick overview of the basics you need to get started.
You can also download the printable PDF of Pig Built-in Functions Cheat Sheet.
Apache Pig:
Apache Pig is a high-level platform for creating programs that run on Hadoop; its scripting language is known as Pig Latin. Apache Pig can execute its Hadoop jobs in MapReduce or Apache Tez.
Data types:
A particular kind of data defined by the values it can take
Simple data types:
- int – a signed 32-bit integer
- long – a signed 64-bit integer
- float – a 32-bit floating-point number
- double – a 64-bit floating-point number
- chararray – a character array in UTF-8 format
- bytearray – a byte array (blob)
- boolean – true or false
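For example, these types can be declared in a LOAD schema. A minimal sketch, assuming a comma-separated input file with hypothetical field names:

users = LOAD 'users.txt' USING PigStorage(',')
        AS (id:int, visits:long, score:float, ratio:double,
            name:chararray, payload:bytearray, active:boolean);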
Complex data types:
- tuple – an ordered set of fields
- bag – a collection of tuples
- map – a set of key-value pairs
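A minimal sketch of how the complex types appear in a schema (the file name and fields are hypothetical):

events = LOAD 'events.txt'
         AS (point:tuple(x:int, y:int),
             pages:bag{t:tuple(url:chararray)},
             props:map[chararray]);
-- Access them as point.x (tuple field), pages.url (projects the bag), and props#'key' (map lookup)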
Apache Pig Components:
- Parser: checks the syntax of the script
- Optimizer: performs logical optimizations such as projection and pushdown
- Compiler: compiles the optimized logical plan into a series of MapReduce jobs
- Execution engine: submits the MapReduce jobs to Hadoop, where they are executed to produce the desired results
Pig execution modes:
- Grunt mode: an interactive shell, useful for syntax checking and ad hoc data exploration
- Script mode: runs a set of instructions from a file
- Embedded mode: executes Pig programs from a Java program
- Local mode: the entire Pig job runs as a single JVM process
- MapReduce mode: Pig runs the job as a series of MapReduce jobs
- Tez mode: Pig runs the job as a series of Tez jobs
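The execution type is selected with the -x flag when launching Pig, and running Pig without a script opens the interactive Grunt shell. A sketch of the usual invocations (the script name is hypothetical):

pig -x local myscript.pig       # local mode: runs as a single JVM process
pig -x mapreduce myscript.pig   # MapReduce mode (the default)
pig -x tez myscript.pig         # Tez mode
pig                             # no script: starts the interactive Grunt shell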
Pig commands equivalent to SQL functions:
Functions | Pig commands |
SELECT | FOREACH alias GENERATE column_name, column_name; |
SELECT * | FOREACH alias GENERATE *; |
DISTINCT | DISTINCT(FOREACH alias GENERATE column_name, column_name); |
WHERE | FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name; |
AND/OR | FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3; |
ORDER BY | ORDER alias BY column_name ASC|DESC, column_name ASC|DESC; |
TOP/LIMIT | FOREACH (GROUP alias BY column_name) GENERATE LIMIT alias number; or TOP(number, column_index, alias); |
GROUP BY | FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name); |
LIKE | FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL; |
IN | FILTER alias BY column_name IN (value1, value2,…); |
JOIN | FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s); |
LEFT/RIGHT/FULL OUTER JOIN | FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s); |
UNION ALL | UNION alias1, alias2; |
AVG | FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name); |
COUNT | FOREACH (GROUP alias ALL) GENERATE COUNT(alias); |
COUNT DISTINCT | FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); }; |
MAX | FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name); |
MIN | FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name); |
SUM | FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name); |
HAVING | FILTER alias BY aggregate_function(column_name) operator value; |
UCASE/UPPER | FOREACH alias GENERATE UPPER(column_name); |
LCASE/LOWER | FOREACH alias GENERATE LOWER(column_name); |
SUBSTRING | FOREACH alias GENERATE SUBSTRING(column_name, start, start+length) AS some_name; |
LEN | FOREACH alias GENERATE SIZE(column_name); |
ROUND | FOREACH alias GENERATE ROUND(column_name); |
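To see several of these equivalents together, here is a minimal sketch of a Pig script for the SQL query SELECT dept, AVG(salary) FROM emp WHERE salary > 1000 GROUP BY dept ORDER BY dept (the file and field names are hypothetical):

emp      = LOAD 'emp.csv' USING PigStorage(',')
           AS (name:chararray, dept:chararray, salary:double);
filtered = FILTER emp BY salary > 1000;                     -- WHERE
grouped  = GROUP filtered BY dept;                          -- GROUP BY
averaged = FOREACH grouped GENERATE group AS dept,
                                    AVG(filtered.salary);   -- AVG
ordered  = ORDER averaged BY dept ASC;                      -- ORDER BY
DUMP ordered;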
Operators:
Pig Operators:
Type | Command | Description |
Loading and storing | LOAD | Loads data from the file system into a relation |
Loading and storing | DUMP | Dumps the data to the console |
Loading and storing | STORE | Stores the data in a given location |
Grouping and joining | GROUP | Groups the data in a single relation based on a key |
Grouping and joining | COGROUP | Groups the data from multiple relations |
Grouping and joining | CROSS | Computes the cross product of two or more relations |
Grouping and joining | JOIN | Joins two or more relations |
Sorting | LIMIT | Limits the number of results |
Sorting | ORDER | Sorts a relation by one or more fields |
Data sets | UNION | Combines multiple relations |
Data sets | SPLIT | Splits a relation into two or more relations |
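A short sketch tying these commands together, assuming hypothetical log files and fields:

logs_2022 = LOAD 'logs_2022.csv' USING PigStorage(',')
            AS (ts:long, level:chararray, msg:chararray);
logs_2023 = LOAD 'logs_2023.csv' USING PigStorage(',')
            AS (ts:long, level:chararray, msg:chararray);
all_logs  = UNION logs_2022, logs_2023;             -- combine the relations
SPLIT all_logs INTO errors IF level == 'ERROR',
                    others IF level != 'ERROR';     -- partition on a condition
sorted = ORDER errors BY ts DESC;                   -- sort by a field
latest = LIMIT sorted 10;                           -- keep the first 10 rows
STORE latest INTO 'latest_errors';                  -- save to the file system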
Basic Operators:
Operators | Description |
Arithmetic operators | +, -, *, /, %, and the bincond operator ?: |
Boolean operators | and, or, not |
Casting operators | Cast from one data type to another |
Comparison operators | ==, !=, >, <, >=, <=, matches |
Construction operators | Used to construct a tuple (), bag {}, or map [] |
Dereference operators | Used to dereference tuples (tuple.id or tuple.(id,…)), bags (bag.id or bag.(id,…)), and maps (map#'key') |
Disambiguate operator | (::) Used to identify field names after the JOIN, COGROUP, CROSS, or FLATTEN operators |
Flatten operator | Un-nests tuples and bags |
Null operators | is null, is not null |
Sign operators | + has no effect; - changes the sign of a number |
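A sketch showing a few of these operators in use, on hypothetical user data:

users   = LOAD 'users.txt'
          AS (name:chararray, age:int,
              tags:bag{t:tuple(tag:chararray)}, props:map[chararray]);
adults  = FILTER users BY age >= 18 AND name matches 'A.*';    -- comparison + matches
labeled = FOREACH users GENERATE name,
              (age >= 18 ? 'adult' : 'minor') AS bracket,      -- bincond ?:
              props#'city' AS city;                            -- map dereference
flat    = FOREACH users GENERATE name, FLATTEN(tags);          -- un-nest the bag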
Relational Operators:
Operators | Description |
GROUP/COGROUP | Groups the data in one or more relations; the COGROUP operator groups together tuples from multiple relations that have the same group key |
CROSS | This operator is used to compute the cross product of two or more relations |
DEFINE | This operator assigns an alias to a UDF or a streaming command |
DISTINCT | This operator will remove the duplicate tuples from a relation |
FILTER | It selects tuples from a relation based on a specified condition |
FOREACH | It generates data transformations for each row of a relation |
IMPORT | This operator imports macros defined in a separate file |
JOIN | This operator performs the inner join of two or more relations based on common field values |
LOAD | This operator loads the data from a file system |
MAPREDUCE | This operator executes the native MapReduce jobs in a Pig script |
ORDER BY | This will sort the relation based on one or more fields |
SAMPLE | This selects a random sample of data based on the specified sample size |
SPLIT | This will partition the relation based on some conditions or expressions as specified |
STORE | This will store or save the result in a file system |
STREAM | This operator sends the data to an external script or program |
UNION | This operator is used to compute the unions of two or more relations |
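A sketch of a few relational operators side by side, with hypothetical orders and customers data:

orders    = LOAD 'orders.csv'    USING PigStorage(',')
            AS (cust_id:int, amount:double);
customers = LOAD 'customers.csv' USING PigStorage(',')
            AS (id:int, name:chararray);
joined  = JOIN orders BY cust_id, customers BY id;     -- inner join on the common field
by_cust = COGROUP orders BY cust_id, customers BY id;  -- tuples with the same key grouped together
tenpct  = SAMPLE orders 0.1;                           -- roughly a 10% random sample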
Diagnostic Operators:
Operator | Description |
Describe | Returns the schema of the relation |
Dump | It will dump or display the result on the screen |
Explain | Displays execution plans |
Illustrate | It displays the step-by-step execution of a sequence of statements |
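Each diagnostic operator takes a relation alias. A minimal sketch against a hypothetical emp relation:

emp = LOAD 'emp.csv' USING PigStorage(',')
      AS (name:chararray, dept:chararray, salary:double);
DESCRIBE emp;    -- prints the schema of emp
EXPLAIN emp;     -- displays the logical, physical, and execution plans
ILLUSTRATE emp;  -- shows a step-by-step trace with sample rows
DUMP emp;        -- executes the script and prints the tuples to the console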
Download a Printable PDF of this Cheat Sheet
We have covered all the basics of Pig in this cheat sheet. If you want to start learning Pig in depth, check out the Hadoop Certification by Intellipaat.
Not only will you learn and implement Pig with step-by-step guidance and support from us, but you will also get 24/7 technical support for any and all of your queries from the experts in the respective technologies here at Intellipaat throughout the certification period. So, why wait? Check out the training program and enroll today!