+13 votes
in Big Data Hadoop & Spark by (1.5k points)

Hi, I am using MapReduce, and my algorithm involves many steps. I only need the output of the last reduce step.

Can anyone suggest a method for chaining multiple MapReduce jobs in Hadoop? Please give an example (if possible) for cleanup of these

P.S. Thank you in advance.

2 Answers

+13 votes
by (13.2k points)

There are several ways to do this; I'll explain one of them here. You can chain jobs together by writing multiple driver methods, one for each job. First, call the first driver method, which uses JobClient.runJob() to run the job and wait for its completion. When that job has completed, call the next driver method, which creates a new JobConf object configured for the next job, and so on.

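Create the JobConf object "one" for the first job and set its input and output paths.

Execute this job with JobClient.runJob(one), which blocks until the job finishes.

Then, create another JobConf object "two" for the next job, using the first job's output path as its input path.

Execute this job with JobClient.runJob(two).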
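
The steps above can be sketched with the older org.apache.hadoop.mapred API roughly as follows. This is a minimal sketch, not the original poster's code: the class name ChainedJobsDriver, the method names, the three command-line paths, and the use of IdentityMapper and IdentityReducer as stand-ins for real job logic are all assumptions made for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver class: one driver method per job, called in sequence.
public class ChainedJobsDriver {

    // Driver method for the first job: raw input -> intermediate output.
    static void runFirstJob(Path input, Path intermediate) throws IOException {
        JobConf one = new JobConf(ChainedJobsDriver.class);
        one.setJobName("one");
        one.setMapperClass(IdentityMapper.class);   // placeholder for your mapper
        one.setReducerClass(IdentityReducer.class); // placeholder for your reducer
        FileInputFormat.setInputPaths(one, input);
        FileOutputFormat.setOutputPath(one, intermediate);
        // Blocks until the job finishes; throws IOException if it fails.
        JobClient.runJob(one);
    }

    // Driver method for the second job: intermediate output -> final output.
    static void runSecondJob(Path intermediate, Path output) throws IOException {
        JobConf two = new JobConf(ChainedJobsDriver.class);
        two.setJobName("two");
        two.setMapperClass(IdentityMapper.class);   // placeholder for your mapper
        two.setReducerClass(IdentityReducer.class); // placeholder for your reducer
        FileInputFormat.setInputPaths(two, intermediate);
        FileOutputFormat.setOutputPath(two, output);
        JobClient.runJob(two);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: ChainedJobsDriver <input> <intermediate> <output>");
            System.exit(2);
        }
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        runFirstJob(input, intermediate);
        // Only reached if the first job succeeded.
        runSecondJob(intermediate, output);
    }
}
```

Because JobClient.runJob() throws when a job fails, the second job never starts on incomplete intermediate data.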

0 votes
by (32.3k points)

You can use JobClient.runJob(). The output path of the first job becomes the input path of the second job. Pass these paths to your driver as parameters, and add code to parse them and configure each job accordingly.
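
To illustrate how the paths line up when there are many steps, here is a small Hadoop-free sketch that only plans the input and output location of each job. The class ChainPlan, the Step type, and the "stage-N" directory naming are hypothetical and chosen just for this example; in a real driver each Step would be turned into a job configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the input/output path of each job in a chain.
final class ChainPlan {

    /** Input and output location of a single job. */
    static final class Step {
        final String input;
        final String output;

        Step(String input, String output) {
            this.input = input;
            this.output = output;
        }
    }

    /**
     * Plans a chain of jobs. Job i writes to workDir/stage-i, except the
     * last job, which writes to the final output path. Each job reads
     * what the previous job wrote.
     */
    static List<Step> plan(String input, String workDir, String output, int jobs) {
        if (jobs < 1) {
            throw new IllegalArgumentException("need at least one job");
        }
        List<Step> steps = new ArrayList<>();
        String current = input;
        for (int i = 1; i <= jobs; i++) {
            String next = (i == jobs) ? output : workDir + "/stage-" + i;
            steps.add(new Step(current, next));
            current = next; // the next job's input is this job's output
        }
        return steps;
    }

    public static void main(String[] args) {
        if (args.length != 4) {
            System.err.println("Usage: ChainPlan <input> <workDir> <output> <jobs>");
            System.exit(2);
        }
        for (Step s : plan(args[0], args[1], args[2], Integer.parseInt(args[3]))) {
            System.out.println(s.input + " -> " + s.output);
        }
    }
}
```

Every output except the last one is intermediate data, so those are the directories to remove once the whole chain has finished.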

Note that this is how the older mapred API does it, but it should still work. The newer mapreduce API has a similar call, Job.waitForCompletion(), which also blocks until the job has finished.
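
In the newer org.apache.hadoop.mapreduce API, the blocking call is Job.waitForCompletion(boolean). A minimal sketch of the same two-job chain, assuming Hadoop 2 or later, could look like this. The class name ChainedJobsNewApi, the buildJob helper, and the base Mapper and Reducer classes (which pass records through unchanged) used as placeholders are assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for two chained jobs using the newer API.
public class ChainedJobsNewApi {

    // Hypothetical helper: builds one job that reads from 'in' and writes to 'out'.
    static Job buildJob(Configuration conf, String name, Path in, Path out)
            throws IOException {
        Job job = Job.getInstance(conf, name);
        job.setJarByClass(ChainedJobsNewApi.class);
        job.setMapperClass(Mapper.class);   // placeholder for your mapper
        job.setReducerClass(Reducer.class); // placeholder for your reducer
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: ChainedJobsNewApi <input> <intermediate> <output>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // waitForCompletion(true) blocks and returns false if the job failed.
        if (!buildJob(conf, "one", input, intermediate).waitForCompletion(true)) {
            System.exit(1); // do not start the second job on bad data
        }
        boolean ok = buildJob(conf, "two", intermediate, output).waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }
}
```

Unlike JobClient.runJob(), waitForCompletion() reports failure through its return value, so the driver has to check it before starting the next job.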

As for removing intermediate data after a job has finished, you can do this in your code. The way I've done it before is with something like:

FileSystem.delete(Path f, boolean recursive);

Here, the path is the location of the data on HDFS. Make sure that you only delete this data once no other job requires it.
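
As a rough sketch of how that call might be used, the helper below deletes an intermediate directory once the chain is done. The class name IntermediateCleanup and the method deleteIfExists are hypothetical; FileSystem.exists(), FileSystem.delete() and Path.getFileSystem() are the actual Hadoop calls.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical cleanup helper for intermediate job output.
public class IntermediateCleanup {

    // Recursively deletes 'dir' if it exists; returns true if something was deleted.
    static boolean deleteIfExists(Configuration conf, Path dir) throws IOException {
        // Resolves the file system (HDFS, local, ...) that owns this path.
        FileSystem fs = dir.getFileSystem(conf);
        return fs.exists(dir) && fs.delete(dir, true);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: IntermediateCleanup <intermediate-dir>");
            System.exit(2);
        }
        boolean deleted = deleteIfExists(new Configuration(), new Path(args[0]));
        System.out.println(deleted ? "Deleted " + args[0] : "Nothing to delete");
    }
}
```

In a chained driver, you would call this only after the last job that reads the intermediate directory has completed successfully.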
