MapReduce is the heart of the Hadoop programming environment. It is what gives a Hadoop cluster, built from many servers, its massive scalability. Readers already familiar with clustered, scale-out data processing will find the MapReduce concepts in Hadoop easy to grasp.
Newcomers may find it less intuitive, but a short discussion of how Hadoop MapReduce behaves is usually enough to understand what it is and how it works.
MapReduce is essentially a combination of two distinct tasks carried out by the Hadoop framework. The Map job takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples, that is, key/value pairs.
The Reduce job takes the output of the Map job as its input and combines those tuples into a smaller set of tuples. The Reduce job always follows the Map job.
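In the notation often used to describe the MapReduce model, the two steps can be summarised roughly as follows (a conceptual sketch of the data flow, not Hadoop syntax):

map:    (input key, input value)             ->  list of (intermediate key, intermediate value)
reduce: (intermediate key, list of values)   ->  list of (output key, output value)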
How MapReduce functions can be understood with a simple example of files and columns. Suppose a user has several files in the Hadoop cluster, each with two columns: one holding the key and the other the value. Real-life tasks are rarely this tidy, but this is how Hadoop MapReduce works.
Real-world MapReduce jobs may involve millions of rows that are not particularly well formatted, but the basic principle always remains the same. In this example the city is the key and the rainfall figure is the value, which together form the key/value pair that MapReduce operates on.
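As a purely hypothetical illustration (the city names and figures below are invented), the input records and the key/value pairs emitted from them might look like this:

Input record        Emitted key/value pair
Mumbai,882          (Mumbai, 882)
Pune,115            (Pune, 115)
Mumbai,943          (Mumbai, 943)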
From the data collected and stored, MapReduce can work out the maximum and minimum rainfall per city. If there are ten input files, the job is broken down into ten Map tasks. Each mapper works on one of the files, scans its records and emits the maximum rainfall (or the minimum, whichever is required) for each city it encounters.
The output streamed from all ten mappers is then fed into the Reduce task. The Reduce function combines these intermediate results and produces a single value per city, so that one final result sheet is generated showing the rainfall in each city.
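A minimal sketch of such a mapper in Hadoop's Java MapReduce API is shown below. The class name MaxRainfallMapper and the assumption that each input line has the form "city,rainfall" are illustrative choices, not taken from any particular data set.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: assumes each input line looks like "city,rainfall".
public class MaxRainfallMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text city = new Text();
    private final IntWritable rainfall = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length == 2) {
            city.set(fields[0].trim());
            rainfall.set(Integer.parseInt(fields[1].trim()));
            context.write(city, rainfall);   // emit one (city, rainfall) pair per record
        }
    }
}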
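The matching reducer, again only a sketch under the same assumptions, receives every rainfall value emitted for a given city and keeps the largest one:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: keeps the maximum rainfall seen for each city.
public class MaxRainfallReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable max = new IntWritable();

    @Override
    protected void reduce(Text city, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int best = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            best = Math.max(best, value.get());
        }
        max.set(best);
        context.write(city, max);   // one (city, maximum rainfall) line per city
    }
}

Registering the two classes on a Hadoop Job with setMapperClass() and setReducerClass() and pointing the job at the input directory would then produce one output line per city.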
The process is simple. It resembles the way such tasks were handled before computers and information technology had evolved this far, when everything was done manually. Surveyors were sent to different places to collect data and then returned to submit it to their head office. That is exactly how the Map function in MapReduce works.
At the head office, the figures gathered by each surveyor were then reduced to a single count, which determined the overall result of the survey. That is exactly how the Reduce function in MapReduce works.
Used properly, MapReduce is extremely efficient at processing large data sets and turning key/value pairs into results. That is why it is considered the heart of Hadoop programming; without MapReduce, Hadoop would not be what it is.