Pig Built-in Functions User Handbook
Are you a developer looking for a high-level scripting language to work with Hadoop? If so, Apache Pig is worth considering. This cheat sheet is a handy reference for readers who already know a query language such as SQL and have started using Pig. If you are a beginner with no idea how Pig works, it also gives you a quick reference to the basics you need to get started.
Pig built-in functions:
Type | Examples |
EVAL functions | AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE, etc. |
LOAD or STORE functions | PigStorage, TextLoader, HBaseStorage, JsonLoader, JsonStorage, etc. |
Math functions | ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM, etc. |
String functions | TRIM, RTRIM, SUBSTRING, LOWER, UPPER, etc. |
DateTime functions | GetDay, GetHour, GetYear, ToUnixTime, ToString, etc. |
Eval functions:
- AVG(col): Computes the average of the numeric values in a single-column bag
- CONCAT(expression1, expression2): Concatenates two expressions of identical type
- COUNT(DataBag bag): Computes the number of elements in a bag, excluding null values
- COUNT_STAR(DataBag bag): Computes the number of elements in a bag, including null values
- DIFF(DataBag bag1, DataBag bag2): Compares two bags; any elements found in one bag but not in the other are returned in a bag
- IsEmpty(DataBag bag), IsEmpty(Map map): Checks whether a bag or map is empty
- MAX(col): Computes the maximum of the numeric values or chararrays in a single-column bag
- MIN(col): Computes the minimum of the numeric values or chararrays in a single-column bag
- DEFINE pluck PluckTuple(expression1): Lets the user specify a string prefix and keeps only the columns whose names begin with that prefix
- SIZE(expression): Computes the number of elements based on any Pig data type
- SUBTRACT(DataBag bag1, DataBag bag2): Returns a bag containing the elements of bag1 that are not in bag2
- SUM(col): Computes the sum of the numeric values in a single-column bag
- TOKENIZE(String expression [, 'field_delimiter']): Splits a string and outputs a bag of words
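The aggregate functions in this list are normally applied to the bag produced by a GROUP. Here is a minimal sketch, assuming a hypothetical tab-separated file students.tsv with the fields name, dept and gpa (none of these names come from the cheat sheet itself):

```pig
-- Hypothetical input: one student per line; gpa may be empty, which Pig reads as null
students = LOAD 'students.tsv' USING PigStorage('\t')
           AS (name:chararray, dept:chararray, gpa:float);

-- GROUP yields one tuple per dept, holding a bag named "students"
by_dept = GROUP students BY dept;

stats = FOREACH by_dept GENERATE
    group                    AS dept,
    COUNT(students.gpa)      AS graded,    -- skips tuples whose gpa is null
    COUNT_STAR(students.gpa) AS enrolled,  -- counts every tuple, null or not
    AVG(students.gpa)        AS avg_gpa,
    MAX(students.gpa)        AS max_gpa,
    MIN(students.gpa)        AS min_gpa,
    SUM(students.gpa)        AS total_gpa;

DUMP stats;
```

When a department has students without a recorded gpa, the graded and enrolled columns differ, which is the practical difference between COUNT and COUNT_STAR.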
Load or Store Functions:
Syntax: PigStorage(field_delimiter)
A = LOAD 'Employee' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
Loads and stores data as structured text files
Syntax: A = LOAD 'data' USING TextLoader();
Loads unstructured data in UTF-8 format
Syntax: A = LOAD 'data' USING BinStorage();
Loads and stores data in machine-readable format
It loads and stores compressed data in Pig
Syntax: A = LOAD 'a.json' USING JsonLoader();
Loads JSON data (the matching JsonStorage stores it)
Syntax: STORE X INTO 'output' USING PigDump();
Stores data in UTF-8 format
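Load and store functions are usually paired in one script: one reads the input, another writes the result. Here is a minimal sketch, assuming the hypothetical input 'Employee' with the fields used in the PigStorage example, and output paths invented purely for illustration:

```pig
-- Read tab-separated text with a declared schema
A = LOAD 'Employee' USING PigStorage('\t')
    AS (name:chararray, age:int, gpa:float);

-- Keep a subset of the rows (the threshold is arbitrary)
adults = FILTER A BY age >= 18;

-- Write the same relation twice: comma-separated text and JSON
STORE adults INTO 'output/adults_csv'  USING PigStorage(',');
STORE adults INTO 'output/adults_json' USING JsonStorage();
```

The delimiter passed to PigStorage on STORE does not have to match the one used on LOAD, so the same pattern also converts between text formats.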
Math functions:
Syntax: ABS(expression)
It returns the absolute value of an expression
Syntax: COS(expression)
It returns the trigonometric cosine of an expression.
Syntax: SIN(expression)
It returns the sine of an expression.
Syntax: CEIL(expression)
It is used to return the value of an expression rounded up to the nearest integer
Syntax: TAN(expression)
It is used to return the trigonometric tangent of an angle.
Syntax: ROUND(expression)
It returns the value of an expression rounded to an integer (if the result type is float) or long (if the result type is double)
Syntax: RANDOM()
It returns a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0
Syntax: FLOOR(expression)
Returns the value of an expression rounded down to the nearest integer.
Syntax: CBRT(expression)
It returns the cube root of an expression
Syntax: EXP(expression)
Returns Euler's number e raised to the power of the expression.
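The math functions are applied per row inside a FOREACH. Here is a minimal sketch, assuming a hypothetical file numbers.txt with one double per line:

```pig
-- Hypothetical input: one numeric value per line
nums = LOAD 'numbers.txt' AS (x:double);

result = FOREACH nums GENERATE
    x,
    ABS(x)   AS abs_x,
    CEIL(x)  AS ceil_x,    -- rounded up
    FLOOR(x) AS floor_x,   -- rounded down
    ROUND(x) AS round_x,   -- a long here, because x is a double
    CBRT(x)  AS cbrt_x,
    EXP(x)   AS exp_x;

-- RANDOM() returns a double in [0.0, 1.0), so this keeps roughly 10% of the rows
sampled = FILTER result BY RANDOM() < 0.1;

DUMP sampled;
```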
String Functions:
Syntax: INDEXOF(string, 'character', startIndex)
It returns the index of the first occurrence of a character in a string, searching forward from startIndex
Syntax: LAST_INDEX_OF(string, 'character')
It returns the index of the last occurrence of a character in a string
Syntax: TRIM(expression)
It returns a copy of the string with leading and trailing whitespaces removed
Syntax: SUBSTRING(string, startIndex, stopIndex)
It will return a substring from a given string
Syntax: UCFIRST(expression)
It will return a string with the first character changed to uppercase
Syntax: LOWER(expression)
Converts all characters in a string to lowercase
Syntax: UPPER(expression)
Converts all characters in a string to uppercase
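The string functions can be chained in a FOREACH to clean and inspect text fields. Here is a minimal sketch, assuming a hypothetical file names.txt with one name per line:

```pig
-- Hypothetical input: one raw string per line, possibly padded with spaces
names = LOAD 'names.txt' AS (raw:chararray);

-- Strip leading and trailing whitespace first
cleaned = FOREACH names GENERATE TRIM(raw) AS name;

result = FOREACH cleaned GENERATE
    name,
    UPPER(name)              AS upper_name,
    LOWER(name)              AS lower_name,
    UCFIRST(name)            AS capitalized,
    SUBSTRING(name, 0, 3)    AS prefix,   -- characters at indices 0, 1 and 2
    INDEXOF(name, 'a', 0)    AS first_a,  -- -1 when 'a' does not occur
    LAST_INDEX_OF(name, 'a') AS last_a;

DUMP result;
```

Indices are zero-based, and the stopIndex of SUBSTRING is exclusive.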
Tuple, Bag and Map functions:
Function | Syntax | Description |
TOTUPLE | TOTUPLE(expression [, expression …]) | Converts one or more expressions to the type Tuple |
TOBAG | TOBAG(expression [, expression …]) | Converts one or more expressions to individual tuples, which are then placed in a bag |
TOMAP | TOMAP(key-expression, value-expression [, key-expression, value-expression …]) | Converts key/value expression pairs to a Map |
TOP | TOP(topN, column, relation) | Returns the top-n tuples from a bag of tuples |
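Here is a minimal sketch of these four functions, assuming a hypothetical tab-separated file scores.tsv with the fields name, math and physics:

```pig
-- Hypothetical input: (name, math, physics), tab-separated
scores = LOAD 'scores.tsv' AS (name:chararray, math:int, physics:int);

packed = FOREACH scores GENERATE
    TOTUPLE(name, math, physics)      AS t,  -- one tuple with three fields
    TOBAG(math, physics)              AS b,  -- a bag of two single-field tuples
    TOMAP('name', name, 'math', math) AS m;  -- keys must be chararrays

-- TOP works on a bag, so collect all rows into a single bag first
grouped = GROUP scores ALL;

-- Column index 1 is "math" (indices start at 0); keep the 3 highest math scores
top3 = FOREACH grouped GENERATE FLATTEN(TOP(3, 1, scores));

DUMP packed;
DUMP top3;
```

To take the top rows per group instead of overall, replace GROUP ... ALL with GROUP ... BY a key.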
This cheat sheet has covered the basics of Pig built-in functions. To learn them in more depth, check out the Hadoop Certification by Intellipaat.