+18 votes
2 views
in Big Data Hadoop & Spark by (10.5k points)

I have just started with Hadoop. Using Cloudera's Hadoop VM, I worked with Hive, Pig and Hadoop. As I worked, I realized that Pig and Hive seem to do the same job and serve the same purpose. So what is the difference between them, and why do we need both?

2 Answers

+12 votes
by (13.2k points)

I think your question can be answered in two ways.

The shorter explanation:

I believe that they are (or were) independent projects with no centrally coordinated goal. They were distinct when first introduced, but as they evolved they seem to have grown to overlap, which invites comparison.

The detailed explanation:

Pig's programming language, Pig Latin, provides a high degree of abstraction over MapReduce programming, but it is procedural in nature, not declarative. Pig Latin can be extended through user-defined functions (UDFs) written in Python, Java, Groovy, JavaScript, or Ruby. Pig has tools for data storage, data execution and data manipulation. In Hive, on the other hand, tables and databases are created first and data is then loaded into those tables. It is more structured and resembles SQL.
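To make the procedural vs. declarative distinction concrete, here is a minimal sketch in plain Python. It is not Pig or Hive code: the first function mimics a Pig-style dataflow with named intermediate steps, and the second hands one SQL statement to SQLite (standing in for Hive) and lets the engine decide how to run it. The sample data, table name and function names are all made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (username, bytes transferred)
ROWS = [("alice", 120), ("bob", 80), ("alice", 30), ("carol", 200), ("bob", 20)]


def procedural_totals(rows, min_total):
    """Pig-style dataflow: every step produces a named intermediate result."""
    # Step 1: group the rows by username
    grouped = defaultdict(list)
    for username, nbytes in rows:
        grouped[username].append(nbytes)
    # Step 2: for each group, generate (username, sum of bytes)
    totals = [(username, sum(values)) for username, values in grouped.items()]
    # Step 3: filter out groups below the threshold
    filtered = [t for t in totals if t[1] >= min_total]
    # Step 4: order by total, largest first
    return sorted(filtered, key=lambda t: t[1], reverse=True)


def declarative_totals(rows, min_total):
    """Hive-style: create a table, load it, then describe the result in SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE traffic (username TEXT, nbytes INTEGER)")
        conn.executemany("INSERT INTO traffic VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT username, SUM(nbytes) AS total FROM traffic "
            "GROUP BY username HAVING SUM(nbytes) >= ? ORDER BY total DESC",
            (min_total,),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Both functions return the same rows; the difference is that the first spells out how to get there, while the second only states what is wanted.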

Some of the comparison points are given below:

  • Apache Pig is a scripting language and Hive is a SQL-like query language.

  • Hive typically requires fewer lines of code than Pig because of its SQL-like syntax.

  • Pig has issues handling unstructured data like pictures, videos, audio, text that is ambiguously delimited, log data, etc.

  • Pig is faster at data import but slower in actual execution compared to a language like Hive.

  • Pig has no built-in metadata support (it is optional, for example through HCatalog integration). Hive stores its tables' metadata in a database.

So to conclude, the two serve different purposes, but under the hood both do the same thing: they convert your code into MapReduce programs.
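As a rough idea of what "convert to MapReduce" means, here is a toy single-process sketch of the map, shuffle and reduce phases that a grouped sum could be turned into. It is not what Pig or Hive actually generate; the function names and record layout are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for username, nbytes in records:
        yield username, nbytes


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one bucket."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets


def reduce_phase(buckets):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in buckets.items()}


def run_job(records):
    """Chain the three phases, the way a grouped SUM might be planned."""
    return reduce_phase(shuffle(map_phase(records)))
```

A Pig script with GROUP and SUM, and a Hive query with GROUP BY and SUM, both describe work that fits this same shape.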

0 votes
ago by (42.2k points)

Hive offers better support than Pig for partitions, a server, a web interface, and JDBC/ODBC connectivity.

Some differences are as follows:

  1. Hive is most suitable for structured data & Pig is most suitable for semi-structured data
  2. Hive is used for reporting & Pig for programming
  3. Hive is used as a declarative SQL-like language & Pig is used as a procedural language
  4. Hive supports partitions & Pig does not
  5. Hive can start an optional Thrift-based server & Pig cannot
  6. Hive defines tables beforehand (schema) and stores schema information in a database & Pig has no dedicated metadata database
  7. Both Hive and Pig support Avro. In Hive, specify the SerDe as org.apache.hadoop.hive.serde2.avro.AvroSerDe
  8. Pig also provides the COGROUP operator, which can be used to perform outer joins; Hive has no COGROUP equivalent, although it supports outer joins directly. Both Hive & Pig can dynamically join, order & sort.
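For anyone unfamiliar with COGROUP, here is a small Python sketch of the idea (plain Python, not Pig Latin): group two relations by key while keeping each side's values in its own bag, then build a full outer join from those bags. The helper names and the use of None for a missing side are my own assumptions for illustration.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two relations of (key, value) pairs by key, one bag per side."""
    bags = defaultdict(lambda: ([], []))
    for key, value in left:
        bags[key][0].append(value)
    for key, value in right:
        bags[key][1].append(value)
    return dict(bags)


def full_outer_join(left, right):
    """Build a full outer join from the cogrouped bags.

    An empty bag is replaced by [None] so that keys present on only
    one side still produce a row.
    """
    rows = []
    for key, (left_bag, right_bag) in sorted(cogroup(left, right).items()):
        for left_value in left_bag or [None]:
            for right_value in right_bag or [None]:
                rows.append((key, left_value, right_value))
    return rows
```

With left = [(1, "a"), (2, "b")] and right = [(2, "x"), (3, "y")], keys 1 and 3 appear on only one side, so their rows carry None for the missing side.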


...