Setting the number of map tasks and reduce tasks

Question

asked Jul 7, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I am currently running a job in which I fixed the number of map tasks at 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong? I am using this command:

hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0

Output:

11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient: Job Counters
11/07/30 19:48:56 INFO mapred.JobClient: Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient: Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient: Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient: Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient: FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient: FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient: HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient: Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient: Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient: Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient: Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient: Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient: Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient: Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Reduce input records=40000000
[hcrc1425n30]s0907855:

1 Answer

Amit Rawat · Answer 1 · 2019-07-07T18:45:08+0000

The first point to keep in mind is that the number of map tasks for a given job is determined by the number of input splits, not by the mapred.map.tasks parameter. One map task is launched for each input split, so over the lifetime of a MapReduce job the number of map tasks equals the number of input splits. mapred.map.tasks is only a hint to the InputFormat about how many maps you would like.

In your example, Hadoop has determined that there are 24 input splits, so it launches 24 map tasks in total. What you do control is how many map tasks each TaskTracker can execute in parallel (on classic MapReduce this is the mapred.tasktracker.map.tasks.maximum setting on the TaskTracker nodes).
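To make the split arithmetic concrete, here is a minimal sketch in plain Java of the split-size rule used by the old-API FileInputFormat: split size = max(minSize, min(totalSize / requestedMaps, blockSize)), with a 10% slop on the last split. The figures are assumptions, not facts from the question: it assumes a single splittable input file, that its size equals the HDFS_BYTES_READ counter in the log (1556534680 bytes), and that the HDFS block size is 64 MB. The class and method names are hypothetical and are not Hadoop APIs.

```java
// Illustrative only: mimics the split-size rule of the old-API FileInputFormat.
// SplitEstimate and countSplits are hypothetical names, not part of Hadoop.
class SplitEstimate {
    // The last split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Counts the splits for one splittable file.
    // requestedMaps plays the role of the mapred.map.tasks hint.
    static int countSplits(long fileSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = fileSize / Math.max(requestedMaps, 1);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        long remaining = fileSize;
        int splits = 0;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly smaller, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 1556534680L;          // assumption: input size = HDFS_BYTES_READ
        long blockSize = 64L * 1024 * 1024;   // assumption: 64 MB HDFS block size
        // A hint of 20 asks for ~74 MB per split, which is capped at the block size.
        System.out.println(countSplits(fileSize, 20, 1L, blockSize)); // prints 24
    }
}
```

Under these assumptions, a hint of 20 cannot lower the count below 24 because the split size is capped at the block size. A hint larger than the number of blocks would shrink the splits and raise the count, which is why the parameter behaves as a lower-bound hint rather than a fixed setting.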

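Also, in your command the -D options are not picked up. They are placed after the application arguments, and there are spaces around the = sign, so they are not parsed as property=value pairs. This is why the job still ran reduce tasks. The -D options should come right after the class name, before the job's own arguments (this also assumes the driver runs through ToolRunner or GenericOptionsParser, which is what interprets -D):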

hadoop jar Test_Parallel_for.jar Test_Parallel_for -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=0 Matrix/test4.txt Result 3
