
PIG Basics Cheat Sheet


Are you a developer looking for a high-level scripting language to work with on Hadoop? If yes, then you must take Apache Pig into consideration. If you have already started learning scripting languages like SQL and are using Pig as a tool, this cheat sheet will be a handy reference. Don’t worry if you are a beginner with no idea of how Pig works; this cheat sheet gives you a quick reference to the basics you must know to get started.
You can also download the printable PDF of the Pig Built-in Functions Cheat Sheet.

Apache Pig:

Apache Pig is a high-level platform for creating programs that run on Hadoop; its scripting language is known as Pig Latin. Apache Pig can execute its Hadoop jobs in MapReduce or Apache Tez.

Data types:

A data type is a particular kind of data defined by the values it can take.

  • Simple data types:

    • int – a signed 32-bit integer
    • long – a signed 64-bit integer
    • float – a 32-bit floating-point number
    • double – a 64-bit floating-point number
    • chararray – a character array in UTF-8 format
    • bytearray – a byte array (blob)
    • boolean – true or false
  • Complex data types:

    • tuple – an ordered set of fields
    • bag – a collection of tuples
    • map – a set of key-value pairs
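
For a quick illustration, here is how these types can appear in a LOAD schema. This is a minimal sketch; the file names and field names are hypothetical:

  -- Hypothetical files and fields, shown only to illustrate the type syntax
  students = LOAD 'students.txt' USING PigStorage(',')
             AS (id:int, views:long, gpa:float, score:double,
                 name:chararray, raw:bytearray, active:boolean);

  -- Complex types: a tuple, a bag of tuples, and a map
  records = LOAD 'records.txt'
            AS (point:tuple(x:int, y:int),
                visits:bag{v:tuple(site:chararray)},
                props:map[]);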

Apache Pig Components:

  • Parser: It checks the syntax of the script and builds a logical plan.
  • Optimizer: It carries out logical optimizations such as projection and pushdown.
  • Compiler: It compiles the optimized logical plan into a series of MapReduce jobs.
  • Execution engine: It submits the MapReduce jobs to Hadoop, where they are executed to produce the desired results.


Pig execution modes:

  • Grunt mode: An interactive shell; very useful for syntax checking and ad hoc data exploration
  • Script mode: Runs a set of instructions from a file
  • Embedded mode: Executes Pig programs from within a Java program
  • Local mode: The entire Pig job runs as a single JVM process
  • MapReduce mode: Pig runs the jobs as a series of MapReduce jobs
  • Tez mode: Pig runs the jobs as a series of Tez jobs
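
The execution mode is typically selected with the -x flag when launching Pig. Here is a minimal sketch of the common invocations (the script name is hypothetical):

  pig -x local               # Grunt shell in local mode (single JVM)
  pig -x mapreduce           # Grunt shell in MapReduce mode (the default)
  pig -x tez                 # Grunt shell in Tez mode
  pig -x local myscript.pig  # Script mode: run the statements in a file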

[Image: Pig execution modes]

[Image: Apache Pig Architecture]


Pig commands equivalent to SQL functions:

SQL function                 Pig command
SELECT                       FOREACH alias GENERATE column_name, column_name;
SELECT *                     FOREACH alias GENERATE *;
DISTINCT                     DISTINCT (FOREACH alias GENERATE column_name, column_name);
WHERE                        FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name;
AND/OR                       FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3;
ORDER BY                     ORDER alias BY column_name ASC|DESC, column_name ASC|DESC;
TOP/LIMIT                    FOREACH (GROUP alias BY column_name) GENERATE LIMIT alias number; or TOP(number, column_index, alias);
GROUP BY                     FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name);
LIKE                         FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL;
IN                           FILTER alias BY column_name IN (value1, value2, …);
JOIN                         FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s);
LEFT/RIGHT/FULL OUTER JOIN   FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s);
UNION ALL                    UNION alias1, alias2;
AVG                          FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name);
COUNT                        FOREACH (GROUP alias ALL) GENERATE COUNT(alias);
COUNT DISTINCT               FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); };
MAX                          FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name);
MIN                          FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name);
SUM                          FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name);
HAVING                       FILTER alias BY aggregate_function(column_name) operator value;
UCASE/UPPER                  FOREACH alias GENERATE UPPER(column_name);
LCASE/LOWER                  FOREACH alias GENERATE LOWER(column_name);
SUBSTRING                    FOREACH alias GENERATE SUBSTRING(column_name, start, start+length) AS some_name;
LEN                          FOREACH alias GENERATE SIZE(column_name);
ROUND                        FOREACH alias GENERATE ROUND(column_name);
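
To see several of these equivalences in one place, here is a sketch that translates a simple SQL aggregate query into Pig Latin. The relation and field names (employees, dept, salary) are hypothetical:

  -- SQL: SELECT dept, AVG(salary) FROM employees WHERE salary > 1000 GROUP BY dept;
  employees = LOAD 'employees.txt' USING PigStorage(',')
              AS (name:chararray, dept:chararray, salary:int);
  filtered  = FILTER employees BY salary > 1000;   -- WHERE
  grouped   = GROUP filtered BY dept;              -- GROUP BY
  averages  = FOREACH grouped GENERATE group AS dept, AVG(filtered.salary);
  DUMP averages;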


Pig Operators:

Type                      Command   Description
Loading and storing       LOAD      Loads data into a relation
                          DUMP      Dumps the data to the console
                          STORE     Stores data in a given location
Grouping and joining      GROUP     Groups the data in a single relation based on a key
                          COGROUP   Groups the data from two or more relations based on a key
                          CROSS     Computes the cross product of two or more relations
                          JOIN      Joins two or more relations
Sorting and limiting      LIMIT     Limits the number of results
                          ORDER     Sorts a relation by one or more fields
Combining and splitting   UNION     Combines two or more relations
                          SPLIT     Splits a relation into two or more relations
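
Here is a short sketch of the grouping and data-set operators in action, using hypothetical relations a and b with matching schemas:

  a = LOAD 'a.txt' AS (id:int, name:chararray);
  b = LOAD 'b.txt' AS (id:int, name:chararray);

  combined = UNION a, b;                                    -- combine two relations
  SPLIT combined INTO small IF id < 100, big IF id >= 100;  -- split by condition
  grouped  = COGROUP a BY id, b BY id;                      -- group two relations by key
  first10  = LIMIT combined 10;                             -- keep only 10 tuples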

Make yourself job-ready with these Pig Interview Questions and Answers today!

Basic Operators:

Operators                Description
Arithmetic operators     +, -, *, /, %, and the bincond operator ?:
Boolean operators        AND, OR, NOT
Casting operators        Cast a value from one data type to another
Comparison operators     ==, !=, >, <, >=, <=, matches
Construction operators   Construct a tuple (), a bag {}, or a map []
Dereference operators    Dereference tuples (tuple.id or tuple.(id, …)), bags (bag.id or bag.(id, …)), and maps (map#'key')
Disambiguate operator    :: identifies field names after the JOIN, COGROUP, CROSS, or FLATTEN operators
Flatten operator         FLATTEN un-nests tuples and bags
Null operators           is null, is not null
Sign operators           + has no effect; - changes the sign of a number
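
The sketch below combines a few of these operators; the relation and field names are hypothetical:

  data = LOAD 'data.txt' AS (id:int, score:int, tags:bag{t:tuple(tag:chararray)});

  labeled = FOREACH data GENERATE id,
            (score >= 50 ? 'pass' : 'fail') AS result,   -- bincond operator ?:
            FLATTEN(tags) AS tag;                         -- FLATTEN un-nests the bag
  nonnull = FILTER labeled BY tag IS NOT NULL;            -- null operator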


Relational Operators:

Operator        Description
COGROUP/GROUP   Groups the data in one or more relations; COGROUP groups together the tuples that have the same group key
CROSS           Computes the cross product of two or more relations
DEFINE          Assigns an alias to a UDF or a streaming command
DISTINCT        Removes duplicate tuples from a relation
FILTER          Selects the tuples from a relation that satisfy a specified condition
FOREACH         Generates the specified data transformation for each tuple of a relation
IMPORT          Imports macros defined in a separate file
JOIN            Performs an inner join of two or more relations based on common field values
LOAD            Loads data from the file system
MAPREDUCE       Executes native MapReduce jobs inside a Pig script
ORDER BY        Sorts a relation based on one or more fields
SAMPLE          Selects a random sample of data based on the specified sample size
SPLIT           Partitions a relation into two or more relations based on specified conditions or expressions
STORE           Stores or saves the result to the file system
STREAM          Sends data to an external script or program
UNION           Computes the union of two or more relations
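
As a sketch of a few relational operators chained together (the relations orders and customers are hypothetical):

  orders    = LOAD 'orders.txt'    AS (order_id:int, cust_id:int, total:double);
  customers = LOAD 'customers.txt' AS (cust_id:int, name:chararray);

  joined  = JOIN orders BY cust_id, customers BY cust_id;  -- inner join
  sorted  = ORDER joined BY total DESC;                    -- sort by a field
  sampled = SAMPLE orders 0.1;                             -- ~10% random sample
  STORE sorted INTO 'output' USING PigStorage(',');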


Diagnostic Operators:

Operator     Description
DESCRIBE     Returns the schema of a relation
DUMP         Dumps or displays the results on the screen
EXPLAIN      Displays the logical, physical, and MapReduce execution plans
ILLUSTRATE   Displays the step-by-step execution of a sequence of statements
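
A minimal sketch of the diagnostic operators applied to a hypothetical relation:

  data = LOAD 'data.txt' AS (id:int, name:chararray);

  DESCRIBE data;     -- print the schema of the relation
  EXPLAIN data;      -- show the execution plans
  ILLUSTRATE data;   -- step through the statements on sample data
  DUMP data;         -- display the contents on the screen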

Download a Printable PDF of this Cheat Sheet

We have covered all the basics of Apache Pig in this cheat sheet. If you want to start learning Pig in depth, check out the Hadoop Certification by Intellipaat.
Not only will you learn and implement Pig with step-by-step guidance and support from us, but you will also get 24/7 technical support for any and all of your queries from experts in the respective technologies at Intellipaat throughout the certification period. So, why wait? Check out the training program and enroll today!


About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has over four years of experience in the Big Data domain and provides consultancy services to several Fortune 500 companies. His expertise lies in breaking down highly technical concepts into easy-to-understand content.