The highlights of Hadoop MapReduce
MapReduce is a framework for processing large amounts of data on clusters of commodity hardware. It is especially effective when a very large number of nodes are connected to the cluster. As the name suggests, the MapReduce algorithm consists of two main tasks: Map and Reduce.
The Map task takes a large set of data and converts it into another set of data in which the individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output of the Map task as its input and combines those tuples into a smaller set of tuples. The Reduce task always follows the Map task.
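The Map and Reduce steps described above can be illustrated with a word count. The following is a minimal in-memory sketch in plain Python, not the Hadoop API; the function names `map_task`, `shuffle`, `reduce_task` and `word_count` are hypothetical and chosen only for this illustration. The grouping step between Map and Reduce is handled by the framework in a real cluster and is written out here so the example runs on its own.

```python
from collections import defaultdict


def map_task(line):
    """Map: turn one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: collapse all values for one key into a single, smaller tuple."""
    return (key, sum(values))


def word_count(lines):
    # Run the Map task over every input line and collect the intermediate pairs.
    intermediate = [pair for line in lines for pair in map_task(line)]
    # Run the Reduce task once per distinct key.
    return dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["the quick fox", "the lazy dog", "The fox"])
print(counts)
```

Here eight input words are reduced to five key/value pairs, which is the "smaller set of tuples" the Reduce task produces.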
The biggest strength of the MapReduce framework is scalability. Once a MapReduce program is written, it can be scaled to run on a cluster with hundreds or even thousands of nodes. In this framework, computation is sent to where the data resides, rather than the data being moved to the computation.
The common terminology used in the MapReduce framework is as follows:
- PayLoad: the applications that implement the Map and Reduce functions, which form the core of a job
- Mapper: maps the input key/value pairs to a set of intermediate key/value pairs
- NameNode: the node that manages the Hadoop Distributed File System (HDFS)
- DataNode: the node where the data resides before any processing takes place
- MasterNode: this is the node that takes job requests from the client and it is where the JobTracker runs
- SlaveNode: this is the node where both the Map and the Reduce tasks are run
- JobTracker: schedules jobs and tracks the tasks assigned to the TaskTrackers
- TaskTracker: tracks its tasks and reports their status to the JobTracker
- Task: it is the execution of the Mapper or the Reducer on a set of data
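To make the roles above more concrete, here is a toy sketch of how a scheduler in the JobTracker role might hand tasks to nodes while preferring the node that already stores the data. It is a simplified model under stated assumptions, not Hadoop's actual scheduling logic; the function `assign_tasks`, the block ids and the node names are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each input block to a node, preferring nodes that hold the block.

    block_locations: dict of block id -> list of node names storing that block
    free_slots: dict of node name -> number of tasks the node can still accept
    Returns a dict of block id -> (node name, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that stores the block and still has a free slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with capacity; the data must then cross the network.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


plan = assign_tasks(
    {"block-1": ["node-a", "node-b"], "block-2": ["node-a"], "block-3": ["node-a"]},
    {"node-a": 2, "node-b": 1},
)
print(plan)
```

In this run the first two blocks are processed on `node-a`, where they are stored. The third block falls back to `node-b` because `node-a` has no free slot left, which is the case where data has to move to the computation instead.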