
One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.

To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quick, bubble, ...) simply does this in a smart way.

In my mind splitting the dataset into many pieces means you can sort a single piece and then you still have to integrate these pieces into the 'complete' fully sorted dataset. Given the terabyte dataset distributed over thousands of systems I expect this to be a huge task.

So how is this really done? How does this MapReduce sorting algorithm work?

1 Answer


Hadoop provides a MapReduce implementation that manages distributed computation.

The MapReduce algorithm consists of two main tasks:

Map Task

Reduce Task

Mapping and reducing are implemented by the Mapper and Reducer classes, respectively.

The Mapper class takes the input, tokenizes it, maps it, and finally sorts it. The Mapper's output is then passed as input to the Reducer class, which searches for matching pairs and reduces them.
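That map → shuffle/sort → reduce flow can be sketched in plain Java. This is a toy word-count simulation, not Hadoop API code; the class and method names are illustrative, and a `TreeMap` stands in for the sort Hadoop performs on intermediate keys:

```java
import java.util.*;

public class WordCountSketch {
    // Map: tokenize each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) pairs.add(Map.entry(token, 1));
        }
        return pairs;
    }

    // Shuffle: group values by key; the TreeMap keeps keys sorted,
    // mirroring Hadoop's sort of intermediate keys.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the value list for each key.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"the quick fox", "the lazy dog"}) {
            pairs.addAll(map(line));
        }
        // Keys come out sorted: {dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(reduce(shuffle(pairs)));
    }
}
```

Note that the reducer never has to sort anything itself: by the time `reduce` runs, the grouping step has already arranged the keys in order.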

Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm that automatically sorts the output key-value pairs from the mapper by their keys.

  • Sorting methods are implemented in the mapper class itself.

  • In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (a user-defined class) collects the keys with matching values as a collection.

  • The RawComparator class helps the Mapper class to collect similar key-value pairs (intermediate keys) and sort them.

  • The set of intermediate key-value pairs destined for a given Reducer is automatically sorted by Hadoop to form (K2, {V2, V2, …}) pairs before the final values reach the Reducer.
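This is also why splitting the data does not force a full re-sort at the end, which was the concern in the question: each mapper's output arrives at a reducer already sorted by key, so the reducer only has to merge sorted runs. A minimal sketch of that merge in plain Java (again illustrative, not Hadoop's actual implementation):

```java
import java.util.*;

public class MergeSortedRuns {
    // Each mapper's output arrives at the reducer already sorted by key.
    // The reducer merges these sorted runs with a priority queue (a k-way
    // merge), never re-sorting everything from scratch.
    static List<String> merge(List<List<String>> runs) {
        // Heap entries: {run index, position within that run},
        // ordered by the key at that position.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            List<String> run = runs.get(e[0]);
            merged.add(run.get(e[1]));
            if (e[1] + 1 < run.size()) heap.add(new int[]{e[0], e[1] + 1});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> out = merge(List.of(
            List.of("apple", "melon"),   // sorted output of mapper 1
            List.of("banana", "zebra"),  // sorted output of mapper 2
            List.of("cherry")));         // sorted output of mapper 3
        System.out.println(out); // [apple, banana, cherry, melon, zebra]
    }
}
```

Merging k sorted runs of n total elements costs O(n log k), far cheaper than re-sorting the combined data.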

TeraSort Benchmark - It is used to sort the given data as fast as possible to benchmark the performance of the MapReduce framework.

TeraSort is a standard MapReduce sort. It reads the input data and uses MapReduce to sort it. The following command sorts the data generated by TeraGen (another MapReduce program that generates large datasets for sorting):


mrsh jar $PMR_HOME/version/os_type/samples/hadoop-0.20.2-examples.jar terasort
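The trick that makes the final result globally sorted is range partitioning: TeraSort samples the input keys to pick split points, routes each key range to exactly one reducer, and lets each reducer sort only its own range, so simply concatenating the reducer outputs in partition order yields one fully sorted dataset. A toy simulation of that idea in plain Java (illustrative only; real TeraSort samples a subset of keys and works on byte-string keys):

```java
import java.util.*;

public class RangePartitionSort {
    // Pick split points from a sample of the keys (TeraSort builds such
    // a partition list by sampling the input before the job starts).
    static List<Integer> splitPoints(List<Integer> sample, int partitions) {
        List<Integer> s = new ArrayList<>(sample);
        Collections.sort(s);
        List<Integer> splits = new ArrayList<>();
        for (int i = 1; i < partitions; i++) {
            splits.add(s.get(i * s.size() / partitions));
        }
        return splits;
    }

    // Route a key to the partition (reducer) that owns its range.
    static int partitionFor(int key, List<Integer> splits) {
        int p = 0;
        while (p < splits.size() && key >= splits.get(p)) p++;
        return p;
    }

    static List<Integer> sort(List<Integer> data, int partitions) {
        List<Integer> splits = splitPoints(data, partitions);
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) buckets.add(new ArrayList<>());
        for (int key : data) buckets.get(partitionFor(key, splits)).add(key);
        // Each "reducer" sorts only its own range; concatenating the
        // partition outputs in order is already globally sorted.
        List<Integer> out = new ArrayList<>();
        for (List<Integer> b : buckets) {
            Collections.sort(b);
            out.addAll(b);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(new ArrayList<>(List.of(42, 7, 99, 3, 56, 18, 75, 31)), 3));
        // [3, 7, 18, 31, 42, 56, 75, 99]
    }
}
```

Because every key in partition i is smaller than every key in partition i + 1, no cross-partition merge is needed at the end — which is why sorting a terabyte across thousands of machines does not require the "huge integration task" the question worries about.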

...