PIG Built-in Functions Cheat Sheet

Pig Built-in Functions User Handbook

Are you a developer looking for a high-level scripting language to work on Hadoop? If so, consider Apache Pig. This cheat sheet is a handy reference for readers who already know a query language such as SQL and have started using Pig as a tool. If you are a beginner with no idea how Pig works, it also gives you a quick reference to the basics you need to get started.

Pig built-in functions:

Type | Examples
EVAL functions | AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE, etc.
LOAD or STORE functions | PigStorage, TextLoader, HBaseStorage, JsonLoader, JsonStorage, etc.
Math functions | ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM, etc.
String functions | TRIM, RTRIM, SUBSTRING, LOWER, UPPER, etc.
DateTime functions | GetDay, GetHour, GetYear, ToUnixTime, ToString, etc.

Eval functions:

  • AVG(col): Computes the average of the numeric values in a single-column bag
  • CONCAT(expression1, expression2): Concatenates two expressions of identical type
  • COUNT(DataBag bag): Computes the number of elements in a bag, excluding null values
  • COUNT_STAR(DataBag bag): Computes the number of elements in a bag, including null values
  • DIFF(DataBag bag1, DataBag bag2): Compares two bags and returns, in a bag, any elements that are present in one bag but not in the other
  • IsEmpty(DataBag bag), IsEmpty(Map map): Checks whether a bag or map is empty
  • MAX(col): Computes the maximum of the numeric values or chararrays in a single-column bag
  • MIN(col): Computes the minimum of the numeric values or chararrays in a single-column bag
  • DEFINE pluck PluckTuple(expression1): Lets the user specify a string prefix and keeps only the columns whose names begin with that prefix
  • SIZE(expression): Computes the number of elements based on any Pig data type
  • SUBTRACT(DataBag bag1, DataBag bag2): Returns a bag containing the elements of bag1 that are not in bag2
  • SUM(col): Computes the sum of the numeric values in a single-column bag
  • TOKENIZE(String expression [, 'field_delimiter']): Splits a string and outputs a bag of words
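
The aggregate functions in this list operate on bags, so in practice they usually follow a GROUP. The following is a minimal sketch, assuming a hypothetical tab-separated file 'employees.tsv' with name, dept, and salary fields; the file name, aliases, and schema are illustrative only.

```pig
-- Hypothetical input: tab-separated lines of (name, dept, salary).
emp = LOAD 'employees.tsv' USING PigStorage('\t')
      AS (name:chararray, dept:chararray, salary:double);

-- Aggregate functions take a bag, so group the relation first.
by_dept = GROUP emp BY dept;

-- COUNT excludes nulls, COUNT_STAR includes them.
stats = FOREACH by_dept GENERATE
          group           AS dept,
          COUNT(emp)      AS non_null_rows,
          COUNT_STAR(emp) AS all_rows,
          AVG(emp.salary) AS avg_salary,
          MAX(emp.salary) AS max_salary,
          SUM(emp.salary) AS total_salary;

-- TOKENIZE returns a bag of words; FLATTEN turns it into one row per word.
words = FOREACH emp GENERATE FLATTEN(TOKENIZE(name)) AS word;

DUMP stats;
```

Projecting a single column with emp.salary produces the single-column bag that AVG, MAX, and SUM expect.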


Load or Store Functions:

  • PigStorage():

Syntax: PigStorage(field_delimiter)
A = LOAD 'Employee' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
Loads and stores data as structured text files

  • TextLoader():

Syntax: A = LOAD 'data' USING TextLoader();
Loads unstructured data in UTF-8 format

  • BinStorage():

Syntax: A = LOAD 'data' USING BinStorage();
Loads and stores data in machine-readable format

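  • Handling compression:

Support for compression depends on the load/store function. PigStorage and TextLoader can load and store gzip and bzip compressed files, recognized by the file extension (.gz, .bz, or .bz2); BinStorage does not support compression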
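
As a rough illustration of compressed input and output with PigStorage, the sketch below assumes that the compression format is chosen from the file extension; the paths and field names are hypothetical.

```pig
-- Hypothetical path; the .gz extension marks the input as gzip-compressed.
A = LOAD 'logs/access.tsv.gz' USING PigStorage('\t')
    AS (ip:chararray, url:chararray);

-- Storing to a location that ends in .bz2 requests bzip2-compressed output.
STORE A INTO 'out/access.bz2' USING PigStorage('\t');
```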

  • JsonLoader, JsonStorage:

Syntax: A = LOAD 'a.json' USING JsonLoader();
It loads and stores JSON data
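
A short sketch of a JSON round trip follows. It assumes a hypothetical 'a.json' whose records have name, age, and gpa fields; the schema string passed to JsonLoader and the output directory name are illustrative.

```pig
-- Hypothetical file and schema; JsonLoader accepts the schema as a string.
A = LOAD 'a.json' USING JsonLoader('name:chararray, age:int, gpa:float');

-- Keep only the records of interest before writing them back out.
B = FILTER A BY age >= 18;

-- JsonStorage writes each tuple as a JSON record.
STORE B INTO 'adults_json' USING JsonStorage();
```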

  • PigDump:

Syntax: STORE X INTO 'output' USING PigDump();
Stores data in UTF-8 format


Math functions:

  • ABS:

Syntax: ABS(expression)
It returns the absolute value of an expression

  • COS:

Syntax: COS(expression)
It returns the trigonometric cosine of an expression.

  • SIN:

Syntax: SIN(expression)
It returns the sine of an expression.

  • CEIL:

Syntax: CEIL(expression)
It is used to return the value of an expression rounded up to the nearest integer

  • TAN:

Syntax: TAN(expression)
It is used to return the trigonometric tangent of an angle.

  • ROUND:

Syntax: ROUND(expression)
It returns the value of an expression rounded to an integer (if the result type is float) or long (if the result type is double)

  • RANDOM:

Syntax: RANDOM()
It returns a pseudo-random number (type double) greater than or equal to 0.0 and less than 1.0

  • FLOOR:

Syntax: FLOOR(expression)
Returns the value of an expression rounded down to the nearest integer.

  • CBRT:

Syntax: CBRT(expression)
It returns the cube root of an expression

  • EXP:

Syntax: EXP(expression)
Returns Euler’s number e raised to the power of the expression.
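
The math functions are typically applied per row inside a FOREACH. The sketch below assumes a hypothetical file 'numbers.txt' containing one double value per line; the aliases are illustrative.

```pig
-- Hypothetical input: one numeric value per line.
nums = LOAD 'numbers.txt' AS (x:double);

-- Apply several math functions to the same field.
out = FOREACH nums GENERATE
        x,
        ABS(x)   AS abs_x,
        CEIL(x)  AS ceil_x,
        FLOOR(x) AS floor_x,
        ROUND(x) AS round_x,
        CBRT(x)  AS cbrt_x,
        EXP(x)   AS exp_x,
        RANDOM() AS r;

DUMP out;
```

For an input of 2.4, for instance, CEIL gives 3.0 and FLOOR gives 2.0.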

String Functions:

  • INDEXOF:

Syntax: INDEXOF(string, 'character', startIndex)
It returns an index of the first occurrence of a character in a string

  • LAST_INDEX_OF:

Syntax: LAST_INDEX_OF(string, 'character')
It returns an index of the last occurrence of a character in a string

  • TRIM:

Syntax: TRIM(expression)
It returns a copy of the string with leading and trailing whitespace removed

  • SUBSTRING:

Syntax: SUBSTRING(string, startIndex, stopIndex)
It returns the substring that begins at startIndex and ends just before stopIndex; indexes start at 0

  • UCFIRST:

Syntax: UCFIRST(expression)
It will return a string with the first character changed to the upper case

  • LOWER:

Syntax: LOWER(expression)
Converts all characters in a string to lowercase

  • UPPER:

Syntax: UPPER(expression)
Converts all characters in a string to uppercase
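
The string functions can be combined in a single FOREACH, as in this minimal sketch. It assumes a hypothetical file 'names.txt' with one string per line; the aliases and the searched character are illustrative.

```pig
-- Hypothetical input: one string per line.
names = LOAD 'names.txt' AS (raw:chararray);

-- Each function returns a new value; the input field is not modified.
clean = FOREACH names GENERATE
          TRIM(raw)               AS trimmed,
          UPPER(raw)              AS upper_case,
          LOWER(raw)              AS lower_case,
          UCFIRST(raw)            AS capitalized,
          SUBSTRING(raw, 0, 3)    AS first_three,
          INDEXOF(raw, 'a', 0)    AS first_a,
          LAST_INDEX_OF(raw, 'a') AS last_a;

DUMP clean;
```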


Tuple, Bag and Map functions:

Function | Syntax | Description
TOTUPLE | TOTUPLE(expression [, expression …]) | Converts one or more expressions to the type Tuple
TOBAG | TOBAG(expression [, expression …]) | Converts one or more expressions to individual tuples, which are then placed in a bag
TOMAP | TOMAP(key-expression, value-expression [, key-expression, value-expression …]) | Converts key/value expression pairs to a Map
TOP | TOP(topN, column, relation) | Returns the top-n tuples from a bag of tuples
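
The following sketch shows these four functions together. It assumes the same kind of hypothetical 'employees.tsv' input with name, dept, and salary fields; all aliases are illustrative.

```pig
-- Hypothetical input: tab-separated lines of (name, dept, salary).
emp = LOAD 'employees.tsv' USING PigStorage('\t')
      AS (name:chararray, dept:chararray, salary:double);

-- Build a tuple, a bag, and a map from the same fields.
packed = FOREACH emp GENERATE
           TOTUPLE(name, dept)               AS t,
           TOBAG(name, dept)                 AS b,
           TOMAP('name', name, 'dept', dept) AS m;

-- TOP needs a bag, so group first. The second argument is the
-- 0-based index of the column to compare; index 2 is salary here.
by_dept = GROUP emp BY dept;
top2 = FOREACH by_dept GENERATE group AS dept, TOP(2, 2, emp) AS best_paid;

DUMP top2;
```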




About the Author

Technical Research Analyst - Big Data Engineering

Abhijit is a Technical Research Analyst specialising in Big Data and Azure Data Engineering. He has 4+ years of experience in the Big data domain and provides consultancy services to several Fortune 500 companies. His expertise includes breaking down highly technical concepts into easy-to-understand content.