+22 votes
2 views
in Big Data Hadoop & Spark by (10.5k points)

What is the difference between Hadoop, HBase, Hive and Pig? I know the basic definitions of all these terms, but I want to know the major differences between them and where each of them can be used.

2 Answers

+12 votes
by (13.2k points)

Hadoop

Hadoop is an open-source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005 to support distribution for Nutch, the text search engine. Hadoop builds on Google's MapReduce and Google File System technologies as its foundation. Some of the major features of Hadoop are given below:

  1. Hadoop is easily scalable: new nodes can be added to an existing cluster without much effort, so storage and processing capacity can grow along with the data.

  2. Hadoop is fault tolerant: the data is stored in HDFS, where it is automatically replicated to other nodes.

  3. It is good at fast data processing, thanks to its ability to process data in parallel. By the author's estimate, Hadoop can run batch jobs about ten times faster than a single-threaded server or a mainframe.
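To make the parallel-processing idea more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using a word count as the example. It is plain Python, not the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the map and reduce calls would be spread across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In Hadoop each line (or input split) could be mapped on a different
    # node; here everything runs in one process purely for illustration.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["pig hive hbase", "hive pig", "hive"]))
```

Because every line is mapped independently and every key is reduced independently, both steps can be run on separate machines, which is where the speed-up for batch jobs comes from.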

Comparison

Coming to the comparison, both Pig and Hive are high-level languages that compile to MapReduce. HBase is quite different: it allows Hadoop to support lookups/transactions on key/value pairs. HBase permits

1. fast random lookups, rather than scanning all of the data sequentially,

2. insert/update/delete in the middle of the data, not just simple add/append.
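The difference between the two access patterns can be sketched in a few lines of Python. This is only a toy model, not the HBase client API: `scan_lookup` and `KeyValueTable` are hypothetical names, and an in-memory dictionary stands in for the keyed storage.

```python
def scan_lookup(records, key):
    """Sequential access: read records in order until the key is found.

    Returns the value and the number of records that had to be read.
    """
    reads = 0
    for record_key, value in records:
        reads += 1
        if record_key == key:
            return value, reads
    return None, reads


class KeyValueTable:
    """Toy keyed table: put/get/delete by row key (illustration only)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # Inserts a new row or updates an existing one in place.
        self._rows[key] = value

    def get(self, key):
        # Direct lookup by key, no scan over the other rows.
        return self._rows.get(key)

    def delete(self, key):
        self._rows.pop(key, None)


if __name__ == "__main__":
    records = [(f"user{i}", i) for i in range(1000)]
    print(scan_lookup(records, "user999"))  # the scan reads all 1000 records

    table = KeyValueTable()
    for key, value in records:
        table.put(key, value)
    table.put("user500", -1)   # update a row in the middle
    table.delete("user0")      # delete a row
    print(table.get("user999"), table.get("user500"), table.get("user0"))
```

The scan has to touch every record before the last one, while the keyed table goes straight to the row it needs and can change or remove individual rows.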

Now, coming to Pig and Hive:

  1. Pig does not need any underlying structure in the data, whereas Hive imposes structure via a metastore. This makes Pig more suitable for ETL tasks. On the other hand, Hive's metastore provides a data dictionary that lets you see more easily what data is available.

  2. Hive usually requires fewer lines of code than Pig because of its resemblance to SQL; it is essentially a subset of SQL with a few simple variations to enable MapReduce-like computation.

  3. Pig is faster at data import but slower in actual execution than a language like Hive.

0 votes
by (25.6k points)

Hadoop is an open-source distributed framework for storing and processing large datasets. It mainly comprises two layers:

  • HDFS (storage layer)

  • MapReduce (processing layer)

HDFS is a file system that lets Hadoop store huge amounts of data in a fault-tolerant manner: each block of data is replicated on several nodes across the cluster, so losing a single node does not mean losing data. But since it is a file system, it falls short when data has to be accessed randomly from somewhere inside a file. Hadoop on its own can only perform batch processing, and data can only be accessed sequentially. This is where HBase comes into play.
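The replication idea can be sketched as follows. This is a simplified model under stated assumptions: the helper names `place_blocks` and `readable_after_failure` are hypothetical, the tiny block size is chosen only for the demo, and replicas are placed round-robin, whereas real HDFS uses a rack-aware placement policy.

```python
def place_blocks(data, block_size, nodes, replication=3):
    """Split data into blocks and assign each block to distinct nodes.

    Returns the list of blocks and a mapping of block index -> replica nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for index in range(len(blocks)):
        # Round-robin placement for illustration; HDFS itself is rack-aware.
        placement[index] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return blocks, placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks, placement = place_blocks(b"abcdefghij", 4, ["n1", "n2", "n3", "n4"])
    print(blocks)
    print(placement)
    print(readable_after_failure(placement, "n1"))
```

With three replicas per block, any single node can fail and every block is still readable from another node. With a replication factor of 1 that guarantee disappears.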

HBase is a horizontally scalable, column-oriented database built on top of Hadoop. It provides random read-write access to the data stored in HDFS. It stores data as key-value pairs and is modeled after Google's Bigtable.

The Hadoop ecosystem contains various tools such as Sqoop, Pig, Hive, etc.

Now, Hive is a data warehouse tool that sits on top of Hadoop and is used to process structured data. It is used to query data in HDFS using SQL-like scripts written in the Hive Query Language (HiveQL). It is designed for OLAP (Online Analytical Processing). You can map your existing HBase tables to Hive and use them. You can also create separate tables in Hive and store data in them for easier access, and you can perform any Hive operation on such tables.
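To give a feel for the SQL-style approach, here is a small sketch that runs an aggregate query from Python. Note the assumption: an in-memory SQLite database stands in for Hive so the example runs anywhere, and the `sales` table and the `sales_by_region` function are invented for illustration. A similar SELECT ... GROUP BY statement could be written in HiveQL, where Hive would execute it as distributed jobs over data in HDFS.

```python
import sqlite3


def sales_by_region(rows):
    """Total the amount per region with a declarative SQL query.

    SQLite is used here only as a local stand-in for Hive.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(sales_by_region([("east", 10.0), ("west", 5.0), ("east", 2.5)]))
```

The query states what result is wanted and leaves the execution plan to the engine, which is why this style tends to need so little code.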

Last but not least, Apache Pig is a data flow tool that lets users read and process data from one or more input sources and then store the results as one or more outputs.

It comes with a high-level language, Pig Latin, for writing data analysis programs as Pig scripts. The language provides various operators, and programmers can also develop their own functions to read, write and process data. In short, Pig can be used to perform data manipulation operations on data in Hadoop.
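The data flow style can be illustrated with a small Python pipeline in which each step feeds the next, loosely mirroring Pig Latin operators such as LOAD, FILTER and GROUP. This is plain Python rather than Pig Latin, and the function names and the (user, action, amount) record layout are assumptions made for the example.

```python
def load(lines, delimiter=","):
    """LOAD-like step: turn delimited lines into (user, action, amount) tuples."""
    for line in lines:
        user, action, amount = line.split(delimiter)
        yield user, action, float(amount)


def filter_rows(rows, action):
    """FILTER-like step: keep only the rows with the given action."""
    return (row for row in rows if row[1] == action)


def group_and_sum(rows):
    """GROUP-then-SUM-like step: total amount per user."""
    totals = {}
    for user, _action, amount in rows:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


def run_pipeline(lines):
    # Each step consumes the output of the previous one, like a Pig script.
    return group_and_sum(filter_rows(load(lines), "buy"))


if __name__ == "__main__":
    print(run_pipeline(["ann,buy,3", "bob,view,0", "ann,buy,4.5", "bob,buy,1"]))
```

Unlike a SQL query, the script spells out the order of the transformations, which is convenient for ETL-style work where data passes through several stages.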

You can run Apache Pig in two main modes:

  • Local mode

  • MapReduce mode (referred to here as HDFS mode)

In local mode, all files are read from and written to your local file system; this is mainly used for testing purposes. In MapReduce (HDFS) mode, we load and process data stored in HDFS directly using Pig.

Now, MapReduce and HBase do not have much to do with each other, although both run on Hadoop. MapReduce is a Java-based programming model and framework for processing and computing data, whereas HBase sits on top of Hadoop to give fast, random, real-time read-write access to huge amounts of data, thanks to its column-oriented design.

As for Hive and Pig, both of them translate their work into MapReduce operations on Hadoop. Pig is a data flow language that performs data manipulation operations for Hadoop and analyzes huge amounts of data efficiently using Pig Latin scripts.

Hive, in contrast, provides a SQL-like language, the Hive Query Language, for querying and processing data. It usually requires far fewer lines of code than hand-written Hadoop MapReduce, and often fewer than Pig.


...