Pig raises the level of abstraction for processing large datasets. It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. It is open source: it was originally developed at Yahoo and is now an Apache project.
Advantages of Pig
- Code reuse
- Faster development
- Fewer lines of code
- Schema and type checking
Pig is made up of two pieces:
- The first is the language used to express data flows, known as Pig Latin.
- The second is the execution environment that runs Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
A Pig Latin program is a series of operations, or transformations, that are applied to the input data to produce output. Taken together, the operations describe a data flow, which the Pig execution environment translates into an executable representation and then runs.
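As a rough illustration of such a data flow, here is a minimal sketch with one input, one transformation and one output. The file paths, the relation names and the fields (user, bytes) are hypothetical and chosen only for this example.

```pig
-- Hypothetical input file and field names, for illustration only
Logs  = LOAD 'input/access.log' AS (user:chararray, bytes:long);  -- input
Large = FILTER Logs BY bytes > 1024;                              -- transformation
STORE Large INTO 'output/large_requests';                         -- output
```

The LOAD and FILTER statements only describe the flow. Pig builds a plan from them and starts executing when it reaches an output statement such as STORE or DUMP.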
What makes Pig popular on Hadoop?
- It is easy to learn, read and write if you know SQL.
- It uses a multi-query approach: a single script can contain several queries, which Pig can execute together to reduce the number of passes over the data.
- It provides nested data types such as Maps, Tuples and Bags, which are not available in MapReduce, along with data operations such as Filters, Ordering and Joins.
- It has a broad user base. Reportedly, about 90% of Yahoo's MapReduce jobs and 80% of Twitter's are written in Pig, and companies such as Salesforce, LinkedIn and Nokia also use it.
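To make the nested data types mentioned above more concrete, the sketch below declares a tuple, a bag and a map in a LOAD schema and then reads from each of them. The input file and every field name are assumptions made up for this example.

```pig
-- Hypothetical input file and field names, for illustration only
Students = LOAD 'input/students.txt'
    AS (name:chararray,
        address:tuple(city:chararray, zip:chararray),
        courses:bag{c:tuple(title:chararray, grade:int)},
        contacts:map[]);

-- tuple field via '.', map value via '#', bag passed to an aggregate function
Info = FOREACH Students GENERATE name,
                                 address.city AS city,
                                 contacts#'email' AS email,
                                 COUNT(courses) AS num_courses;

-- FILTER keeps only the records that satisfy the condition
Busy = FILTER Info BY num_courses >= 3;
```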
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
- Script: Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.
- Grunt: Grunt is an interactive shell for running Pig commands. It starts when no file is specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.
- Embedded: You can run Pig programs from Java using the PigServer class, much as you can use JDBC to run SQL programs from Java.
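For the Grunt route, a session might look like the sketch below. It assumes a local file named script.pig that defines an alias called Top5. Both names are hypothetical.

```pig
-- exec runs the script in its own context; its aliases are not visible afterwards
grunt> exec script.pig
-- run executes the script as if it were typed at the prompt, so its aliases stay available
grunt> run script.pig
-- assumes script.pig defines an alias named Top5
grunt> DESCRIBE Top5;
```

The practical difference is that run is convenient when you want to keep exploring the script's relations interactively, while exec keeps the shell's namespace clean.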
Example: Word count in Pig
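-- Load each line of the log as a single chararray field
Lines = LOAD 'input/hadoop.log' AS (line:chararray);
-- Split each line into words, one word per record
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count the occurrences of each
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group AS word, COUNT(Words) AS total;
-- Sort by frequency and keep the five most common words
Results = ORDER Counts BY total DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';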