We have a large dataset to analyze with multiple reduce functions.

All reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset repeatedly is too costly; it would be better to read it only once and pass the mapped data to multiple reduce functions.

Can I do this with Hadoop? I've searched the examples and the web, but I could not find any solutions.

1 Answer


If you expect every reducer to work on exactly the same mapped data, then at least the map output keys must differ so the records can be routed to different reducers.

In the mapper, emit each record once per reducer, using a composite key of the form ($i, $key), where $i selects the i-th reducer and $key is your original key. Then add a "Partitioner" to make sure these n copies are distributed among reducers based on $i, and a "GroupingComparator" to group records by the original $key.
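As a rough, self-contained sketch of those two rules (plain Java, no Hadoop dependencies; the class and method names are made up for illustration, and a real job would put this logic in a Partitioner and a grouping WritableComparator):

```java
// Sketch of the partition/group rules for a composite key of the form "i:key".
// Hypothetical helper names, for illustration only.
public class CompositeKeyRules {

    // Partition on the reducer index $i (the part before the colon),
    // so copies tagged for reducer i all land on the same reducer.
    static int partitionFor(String compositeKey, int numReducers) {
        int i = Integer.parseInt(compositeKey.split(":", 2)[0]);
        return i % numReducers;
    }

    // Group on the original $key (the part after the colon), so each
    // reducer sees all values for a given original key in one reduce call.
    static boolean sameGroup(String a, String b) {
        return a.split(":", 2)[1].equals(b.split(":", 2)[1]);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("1:user42", 2)); // prints 1
        System.out.println(sameGroup("0:user42", "1:user42")); // prints true
    }
}
```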

Put simply, define your map output key as a composite key that combines the metric type with the actual key for that metric. Records are then grouped by metric type and key, and the reducer can invoke a different reduction method depending on the metric type.

Let's say you need two kinds of reducers, 'R1' and 'R2'. Add their ids as a prefix to your output keys in the mapper, so a key 'K' becomes 'R1:K' or 'R2:K'.

Then, in the reducer, pass values to implementations of R1 or R2 based on the prefix.
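To make the prefix-dispatch idea concrete, here is a minimal in-memory simulation (not the real Hadoop Mapper/Reducer API; the sums and maxes are just example metrics): the "map" phase reads the input once and emits each record under both prefixed keys, and the "reduce" phase dispatches on the prefix.

```java
import java.util.*;

// In-memory sketch of the prefix-dispatch pattern. The input is read once;
// each record is emitted under "R1:" and "R2:" prefixed keys, and the
// reduce step picks a reduction (sum vs. max) based on the prefix.
public class PrefixDispatchSketch {

    // Stand-in for the shuffle phase: group emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (var e : mapped)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return grouped;
    }

    public static void main(String[] args) {
        // "Map" phase: one pass over the input, two emissions per record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (var rec : List.of(Map.entry("K", 3), Map.entry("K", 5))) {
            mapped.add(Map.entry("R1:" + rec.getKey(), rec.getValue())); // for reducer R1
            mapped.add(Map.entry("R2:" + rec.getKey(), rec.getValue())); // for reducer R2
        }

        // "Reduce" phase: dispatch on the prefix.
        for (var e : shuffle(mapped).entrySet()) {
            String key = e.getKey();
            int result = key.startsWith("R1:")
                ? e.getValue().stream().mapToInt(Integer::intValue).sum()            // R1: sum
                : e.getValue().stream().mapToInt(Integer::intValue).max().orElse(0); // R2: max
            System.out.println(key + " -> " + result); // prints "R1:K -> 8" then "R2:K -> 5"
        }
    }
}
```

In a real job, the dispatch in the reduce phase would instead call two reducer implementations after stripping the prefix from the key.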

By default each reducer generates a separate output file (part-r-00000, part-r-00001, and so on), and this output is stored in HDFS. To merge all the reducer outputs into a single file, you can either write your own merge code or use the hadoop fs -getmerge command.
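For reference, the merge step looks like this (the HDFS output directory and local filename below are example placeholders):

```shell
# Concatenate all part-r-* files from the job's HDFS output directory
# into a single local file. /user/me/output and merged.txt are examples.
hadoop fs -getmerge /user/me/output merged.txt
```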

...