Hadoop is an open-source distributed framework for storing and processing large datasets. It comprises two main layers: HDFS for storage and MapReduce for processing.
HDFS is a file system that gives Hadoop the ability to store huge amounts of data in a fault-tolerant manner: it replicates each block of data across different nodes in the cluster, so there is little risk of data loss. But because it is a file system built for batch processing, it is weak at random access; data can only be read sequentially from a file. This is where HBase comes into action.
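As a concrete sketch, these are the kinds of HDFS shell commands involved (they require a running Hadoop cluster; the paths and replication factor below are illustrative assumptions):

```sh
# Copy a local file into HDFS and control its replication.
hdfs dfs -mkdir -p /user/demo                 # create a directory in HDFS
hdfs dfs -put sales.csv /user/demo/           # upload a local file
hdfs dfs -setrep -w 3 /user/demo/sales.csv    # replicate each block 3 times
hdfs dfs -cat /user/demo/sales.csv            # read the file back, sequentially
```

Note that reading is sequential and whole-file; there is no command to fetch or update a single record in the middle of a file, which is exactly the gap HBase fills.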
HBase is a column-oriented database built on top of Hadoop and is horizontally scalable. This column-oriented design provides random read-write access to the data stored in the Hadoop file system. It stores data as key-value pairs and is modeled after Google's Bigtable.
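A minimal HBase shell session gives a feel for the key-value, column-family model (the table and column names here are illustrative assumptions):

```
create 'users', 'info'                      # a table with one column family
put 'users', 'row1', 'info:name', 'Asha'    # write one cell: (row key, column) -> value
get 'users', 'row1'                         # random read of a single row by key
scan 'users'                                # full sequential scan of the table
```

Note the `get`: a single row is fetched directly by its key, which is precisely the random access that plain HDFS files lack.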
The Hadoop ecosystem contains various tools such as Sqoop, Pig, Hive, etc.
Now, Hive is a data warehouse tool that sits on top of Hadoop and is used to process structured data. It extracts data from HDFS using SQL-like scripts written in Hive Query Language (HiveQL), and it is designed for OLAP (Online Analytical Processing). You can map your existing HBase tables into Hive and use them, or create separate tables in Hive and store data in them for easier access. You can perform any Hive operation on such tables.
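For example, mapping an existing HBase table into Hive might look like the following HiveQL sketch (the table and column names are illustrative assumptions):

```sql
-- Expose an HBase table 'users' (column family 'info') as a Hive table.
CREATE EXTERNAL TABLE hive_users (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
TBLPROPERTIES ('hbase.table.name' = 'users');

-- Ordinary HiveQL now works against the HBase data.
SELECT name FROM hive_users WHERE rowkey = 'row1';
```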
Last but not least, Apache Pig is a data flow language that lets users read and process data from one or more input sources and then store the results as one or more outputs.
It comes with a high-level language, Pig Latin, for writing data analysis programs as Pig scripts. The language provides a rich set of operators, and programmers can also develop their own functions to read, write, and process data, which makes Pig a convenient way to perform data manipulation in Hadoop.
You can run Apache Pig in two modes. In local mode, all files are read from and written to your local file system; this mode is mainly used for testing. In MapReduce (HDFS) mode, Pig loads and processes data stored in HDFS directly.
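A short Pig Latin script illustrates the data-flow style (the file name and field layout are illustrative assumptions):

```
-- Count page hits per user from a tab-separated log file.
logs    = LOAD 'access_log.txt' USING PigStorage('\t')
          AS (user:chararray, url:chararray);
grouped = GROUP logs BY user;
counts  = FOREACH grouped GENERATE group AS user, COUNT(logs) AS hits;
STORE counts INTO 'hits_per_user';
```

The same script runs in either mode: `pig -x local hits.pig` reads `access_log.txt` from the local file system, while `pig -x mapreduce hits.pig` reads it from HDFS.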
Now, MapReduce and HBase do not have much to do with each other; they both simply run on Hadoop. MapReduce is a Java-based programming model for batch processing and computing data, whereas HBase sits on top of Hadoop to give fast access to huge amounts of data, with random real-time read-write access, thanks to its column-oriented design.
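The MapReduce model itself is easy to sketch in plain Java, with no Hadoop dependency: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count sketch below is only a standalone illustration of the model, not Hadoop's actual API:

```java
import java.util.*;
import java.util.stream.*;

public class WordCount {
    // Map phase: split one input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce: group the pairs by word, then sum the values per word.
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                    .flatMap(WordCount::map)
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                             Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("hadoop stores data", "hadoop processes data");
        Map<String, Integer> counts = run(input);
        System.out.println(counts.get("hadoop")); // 2
        System.out.println(counts.get("data"));   // 2
    }
}
```

In real Hadoop MapReduce the same logic is spread across Mapper and Reducer classes plus job configuration, which is why it takes far more code than the Hive and Pig versions of the same task.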
Talking about Hive and Pig, both are ultimately executed as MapReduce jobs in Hadoop. Pig is a data flow language that performs data manipulation operations for Hadoop and analyzes huge amounts of data efficiently using Pig Latin scripts.
Hive, on the other hand, provides an SQL-like language, Hive Query Language, for easier querying and processing of data. It requires far fewer lines of code than Hadoop MapReduce or Pig for the same task.
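To make the line-count comparison concrete: an aggregation that takes several Pig Latin statements, and dozens of lines of raw MapReduce Java, is a single HiveQL statement (the table and column names are illustrative assumptions):

```sql
-- Hits per user, in one statement.
SELECT user, COUNT(*) AS hits
FROM logs
GROUP BY user;
```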