- Updated on: 06th Jan, 14
- 3795 Views
Apache Hadoop software library is basically a framework. It allows distributed processing, especially for the large data sets stored across the multiple clusters of computers. All it gets processed with the simple programming models. Several machines are functioned together here with the objective of analysis of the large sets of data. The concept of Hadoop was basically developed based on whitepapers released by Google.
Components of Hadoop:
Hadoop has basically two core components.
- HDFS or Hadoop Distributed File System
Hadoop Cluster – A Hadoop cluster is a set of machines which run these two components called HDFS and MapReduce. It consists of one to multiple nodes, which are also understood as the individual machines.
The structure consists of Master Node and Slave nodes. Master node consists of NameNode and Job Trackers. The slave nodes consist of DataNode and Task Tracker.
Working of Hadoop:
Initially, data is passed from the client to the Hadoop. The n it would be distributed to the NameNodes. Then the program is run on Hadoop and process the data.
Step 1: Initially the data is broken into the blocks of 64 Mb or 128 Mb and then are moved to the nodes.
Step 2:Then the program is passed by the Hadoop frameworks to run.
Step 3:The programs are then scheduled on the individual nodes by the Job Tracker.
Step 4:Once the program is executed, the output is returned.
To know more about structure and working of Hadoop watch the best hadoop tutorial of Intellipaat.
During the process of Hadoop, the data is loaded onto the Hadoop Filesystem called HDFS or Hadoop Distributed File System. This file system is based on the Google’s GFS. It is responsible for storage of data in the clusters. Blocks of 64 Mb or 128 Mb are replicated thrice to ensure that data is safe and not lost. This number of replication can be configured.
HDFS is the best suitable for the larger files in a small number. It is taken care by the NameNodes and DataNodes.
Working of HDFS:
The data replication into three nodes can be seen in the above picture.
– For processing data in the Hadoop system, MapReduce data processing is used
– It is a data processing component used by Hadoop
– The data processing task is attained by task distribution across the nodes
– This process is enabled in two phases called:
– In between these two phases Map and Reduce, there is another phase would come into picture called, Shuffle and Sort
Process of Mapping
In this process, the text in the text file is given as the input to it and output is given accordingly.
Process of Shuffle and Sort
In the process of Shuffle and Sort, all of the values are considered and brought together to shuffle and sort. So, the output is shows would be the keys and number of instances in the text file.
Process of Reducer:
The reducer process is executed according to the program written by you. For example, it sorts and provides the final reducer output formulated by Emit (k, sum)
Finally, the entire job process of Hadoop can be understood by the following flowchart.
Note: Click To enlarge Images