Pig Built-in Functions Cheat Sheet

Pig Built-in Functions User Handbook

Are you a developer looking for a high-level scripting language to work with Hadoop? If so, Apache Pig is worth considering. This cheat sheet is meant for readers who already know a query language such as SQL and have started using Pig; for them it serves as a handy reference. If you are a beginner with no idea how Pig works, it also gives you a quick overview of the basics you need to get started.
You can also download the printable PDF of Pig Built-in Functions Cheat Sheet.

Pig built-in functions:

Type	Examples
EVAL functions	AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE, etc.
LOAD or STORE functions	PigStorage, TextLoader, HBaseStorage, JsonLoader, JsonStorage, etc.
Math functions	ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM, etc.
String functions	TRIM, RTRIM, SUBSTRING, LOWER, UPPER, etc.
DateTime functions	GetDay, GetHour, GetYear, ToUnixTime, ToString, etc.

Eval functions:

AVG(col): Computes the average of the numeric values in a single-column bag
CONCAT(expression1, expression2): Concatenates two expressions of identical type
COUNT(DataBag bag): Computes the number of elements in a bag, ignoring tuples whose first field is null
COUNT_STAR(DataBag bag): Computes the number of elements in a bag, including null values
DIFF(DataBag bag1, DataBag bag2): Compares two bags and returns, in a bag, any elements that are present in one bag but not in the other
IsEmpty(DataBag bag), IsEmpty(Map map): Checks whether a bag or map is empty
MAX(col): Computes the maximum of the numeric values or chararrays in a single-column bag
MIN(col): Computes the minimum of the numeric values or chararrays in a single-column bag
DEFINE pluck PluckTuple(expression1): Lets the user specify a string prefix and keeps only the columns whose names begin with that prefix
SIZE(expression): Computes the number of elements based on any Pig data type
SUBTRACT(DataBag bag1, DataBag bag2): Returns a bag containing the tuples of bag1 that are not in bag2
SUM(col): Computes the sum of the numeric values in a single-column bag
TOKENIZE(expression [, 'field_delimiter']): Splits a string and outputs a bag of words
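The aggregate functions above operate on bags, so they are normally applied after a GROUP. The following is a minimal sketch of that pattern; the file name students.txt, its comma-separated layout and the aliases are assumptions made for illustration only.

```pig
-- Hypothetical input: students.txt with lines such as "alice,3.9"
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, gpa:float);

-- GROUP ... ALL collects every tuple into one bag named after the relation
grouped = GROUP students ALL;

-- COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple
stats = FOREACH grouped GENERATE
    COUNT(students)      AS counted_rows,
    COUNT_STAR(students) AS all_rows,
    AVG(students.gpa)    AS avg_gpa,
    MIN(students.gpa)    AS min_gpa,
    MAX(students.gpa)    AS max_gpa,
    SUM(students.gpa)    AS total_gpa;

DUMP stats;
```

Replacing GROUP students ALL with GROUP students BY name would produce one row of statistics per name instead of a single overall row.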


Load or Store Functions:

PigStorage():

Syntax: PigStorage(field_delimiter)
A = LOAD 'Employee' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
Loads and stores data as structured text files
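PigStorage can also be used on the STORE side with a different delimiter. A small sketch, assuming the same Employee input as above; the output directory name employee_csv is hypothetical.

```pig
-- Load tab-separated data, then write it back out comma-separated
A = LOAD 'Employee' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);

-- 'employee_csv' is a hypothetical output directory
STORE A INTO 'employee_csv' USING PigStorage(',');
```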

TextLoader():

Syntax: A = LOAD 'data' USING TextLoader();
Loads unstructured data in UTF-8 format

BinStorage():

Syntax: A = LOAD 'data' USING BinStorage();
Loads and stores data in machine-readable format

Handling compression:

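Compression support depends on the load/store function. PigStorage and TextLoader can read and write gzip and bzip2 compressed files; the format is selected by the file extension (.gz, .bz or .bz2)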
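As a rough illustration of compressed input and output with PigStorage, the sketch below relies on the file extension to select the codec. The file names and the two-field schema are assumptions for illustration.

```pig
-- Hypothetical gzip-compressed, tab-separated input
logs = LOAD 'logs.gz' USING PigStorage('\t') AS (ts:chararray, msg:chararray);

-- The .bz2 suffix on the output location requests bzip2-compressed output
STORE logs INTO 'logs_out.bz2' USING PigStorage('\t');
```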

JsonLoader, JsonStorage:

Syntax: A = LOAD 'a.json' USING JsonLoader();
JsonLoader loads JSON data and JsonStorage stores it
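A short sketch of writing and reading JSON follows. The file names and fields are hypothetical; the schema string passed to JsonLoader is optional and is shown here only to make the expected fields explicit.

```pig
-- Script 1: convert hypothetical tab-separated data to JSON
A = LOAD 'employee.tsv' USING PigStorage('\t') AS (name:chararray, age:int);
STORE A INTO 'employee_json' USING JsonStorage();

-- Script 2 (run afterwards): read the JSON back with an explicit schema
-- B = LOAD 'employee_json' USING JsonLoader('name:chararray, age:int');
-- DUMP B;
```

The second part is commented out so that the store and the load run as separate scripts rather than depending on execution order within one script.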

PigDump():

Syntax: STORE X INTO 'output' USING PigDump();
Stores data in UTF-8 format

Math functions:

ABS:

Syntax: ABS(expression)
It returns the absolute value of an expression

COS:

Syntax: COS(expression)
It returns the trigonometric cosine of an expression.

SIN:

Syntax: SIN(expression)
It returns the trigonometric sine of an expression.

CEIL:

Syntax: CEIL(expression)
It is used to return the value of an expression rounded up to the nearest integer

TAN:

Syntax: TAN(expression)
It returns the trigonometric tangent of an angle.

ROUND:

Syntax: ROUND(expression)
It returns the value of an expression rounded to an integer (if the result type is float) or long (if the result type is double)

RANDOM:

Syntax: RANDOM()
It returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0

FLOOR:

Syntax: FLOOR(expression)
Returns the value of an expression rounded down to the nearest integer.

CBRT:

Syntax: CBRT(expression)
It returns the cube root of an expression

EXP:

Syntax: EXP(expression)
Returns Euler's number e raised to the power of the expression.
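The math functions are typically applied per row inside a FOREACH ... GENERATE. A minimal sketch, assuming a hypothetical file numbers.txt with one numeric value per line:

```pig
-- Hypothetical input: one double per line
nums = LOAD 'numbers.txt' AS (x:double);

-- Apply several math functions to each value
result = FOREACH nums GENERATE
    x,
    ABS(x)   AS abs_x,
    CEIL(x)  AS ceil_x,
    FLOOR(x) AS floor_x,
    ROUND(x) AS round_x,
    CBRT(x)  AS cbrt_x,
    EXP(x)   AS exp_x;

DUMP result;
```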

String Functions:

INDEXOF:

Syntax: INDEXOF(string, 'character', startIndex)
It returns the index of the first occurrence of a character in a string, searching forward from startIndex

LAST_INDEX_OF:

Syntax: LAST_INDEX_OF(string, 'character')
It returns the index of the last occurrence of a character in a string

TRIM:

Syntax: TRIM(expression)
It returns a copy of the string with leading and trailing whitespace removed

SUBSTRING:

Syntax: SUBSTRING(string, startIndex, stopIndex)
It returns the substring that starts at startIndex (zero-based) and ends just before stopIndex

UCFIRST:

Syntax: UCFIRST(expression)
It returns a string with the first character changed to uppercase

LOWER:

Syntax: LOWER(expression)
Converts all characters in a string to lowercase

UPPER:

Syntax: UPPER(expression)
Converts all characters in a string to uppercase
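The string functions listed above can be combined in a single FOREACH. The sketch below assumes a hypothetical file names.txt with one name per line; the search character 'a' is chosen only for illustration.

```pig
-- Hypothetical input: one name per line
names = LOAD 'names.txt' AS (name:chararray);

cleaned = FOREACH names GENERATE
    TRIM(name)               AS trimmed,
    UPPER(name)              AS upper_name,
    LOWER(name)              AS lower_name,
    UCFIRST(name)            AS capitalized,
    -- characters at positions 0, 1 and 2 (stopIndex is exclusive)
    SUBSTRING(name, 0, 3)    AS first_three,
    -- search for 'a' starting at index 0
    INDEXOF(name, 'a', 0)    AS first_a,
    LAST_INDEX_OF(name, 'a') AS last_a;

DUMP cleaned;
```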

Tuple, Bag and Map functions:

Function	Syntax	Description
TOTUPLE	TOTUPLE(expression [, expression …])	Converts one or more expressions to the type Tuple
TOBAG	TOBAG(expression [, expression …])	Converts one or more expressions to individual tuples, which are then placed in a bag
TOMAP	TOMAP(key-expression, value-expression [, key-expression, value-expression …])	Converts key/value expression pairs to a Map
TOP	TOP(topN, column, relation)	Returns the top-n tuples from a bag of tuples, ranked by the given column
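To show how these functions fit into a script, here is a minimal sketch. The input file students.txt, its schema and all aliases are assumptions for illustration; the column argument of TOP is a zero-based field index, so 2 refers to gpa here.

```pig
-- Hypothetical input: name,age,gpa per line
A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);

-- Build a tuple, a bag and a map from the fields of each row
B = FOREACH A GENERATE
    TOTUPLE(name, age) AS name_age,
    TOBAG(age, gpa)    AS values_bag,
    TOMAP(name, gpa)   AS gpa_by_name;

-- TOP works on a bag, so group first; keep the 2 tuples with the highest gpa
G = GROUP A ALL;
top2 = FOREACH G GENERATE FLATTEN(TOP(2, 2, A));

DUMP B;
DUMP top2;
```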


We have covered the basics of Pig built-in functions in this cheat sheet. If you want to learn them in more depth, check out the Hadoop Certification by Intellipaat.
Along with step-by-step guidance on learning and implementing Pig built-in functions, you get 24/7 technical support from Intellipaat's experts in the respective technologies throughout the certification period. So, why wait? Check out the training program and enroll today!