
MapReduce and YARN


MapReduce is the data processing component of Hadoop. It is a programming model for processing large data sets, and it does so by distributing the processing tasks across the nodes of a cluster. It consists of two phases:

  • Map
  • Reduce

Map converts a dataset into another set of data in which individual elements are broken down into key/value pairs.

The Reduce task takes the output of a map as its input and combines those data tuples into a smaller set of tuples. It is always executed after the map job.


Features of MapReduce system

It has the following features:

  • Provides a framework for MapReduce execution
  • Shields the developer from the complexity of distributed programming
  • Partial failure of the processing cluster is expected and tolerated
  • Redundancy and fault tolerance are built in
  • The MapReduce programming model is language independent
  • Automatic parallelization and distribution
  • Enables data-local processing
  • Shared-nothing architectural model
  • Manages inter-process communication
  • Manages the distributed servers running the various tasks in parallel
  • Manages all communications and data transfers between the various parts of the system
  • Provides for redundancy and failure handling, and for overall management of the whole process


MapReduce in simple steps

  1. The map function is executed on each input record
  2. The map function emits key/value pairs
  3. The outputs are shuffled, sorted and grouped by key
  4. The reduce function is executed on each group
  5. The output is emitted per group
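The five steps above can be walked through in a single process. The sketch below is only an illustration in plain Java and does not use Hadoop; the class name MapReduceSteps, its method names and the sample sentences are all made up for this example. A real cluster would run the map and reduce calls on different nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Toy, single-process walk-through of the five steps; this is not Hadoop code.
final class MapReduceSteps {

    // Steps 1-2: run the map function on one input line and emit (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Step 3: shuffle, sort and group by key; a TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Step 4: run the reduce function on one group of values.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs all five steps over the given input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        // Step 5: emit one output per group.
        shuffle(mapped).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(run(List.of("the quick fox", "the lazy dog", "the end")));
    }
}
```

The grouping step is the only place where pairs from different input lines meet, which is why the map calls can run independently of each other.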




Map Function

It operates on each key/value pair of the input and transforms the data according to the transformation logic provided in the map function. The map function always produces key/value pairs as output.

Map (key1, value1) -> List (key2, value2)


Reduce Function

It takes the list of values for each key and transforms the data according to the (aggregation) logic provided in the reduce function.

Reduce (key2, List (value2)) -> List (key3, value3)
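The two signatures can be written down as generic Java interfaces to make the type flow visible. This is a sketch only: MapFn, ReduceFn, PARSE and MAX are hypothetical names and are not part of the Hadoop API, and the "year,temperature" record format is an assumption chosen for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical interfaces mirroring the two signatures; not the Hadoop API.
final class Signatures {

    // Map (key1, value1) -> List (key2, value2)
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce (key2, List (value2)) -> List (key3, value3)
    interface ReduceFn<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    // Map: (line offset, "year,temperature") -> [(year, temperature)]
    static final MapFn<Long, String, String, Integer> PARSE = (offset, line) -> {
        String[] parts = line.split(",");
        return List.of(Map.entry(parts[0], Integer.parseInt(parts[1].trim())));
    };

    // Reduce: (year, [temperatures]) -> [(year, maximum temperature)]
    static final ReduceFn<String, Integer, String, Integer> MAX = (year, temps) -> {
        int max = Integer.MIN_VALUE;
        for (int t : temps) {
            max = Math.max(max, t);
        }
        return List.of(Map.entry(year, max));
    };

    public static void main(String[] args) {
        System.out.println(PARSE.map(0L, "1950,22"));               // prints [1950=22]
        System.out.println(MAX.reduce("1950", List.of(22, 0, 31))); // prints [1950=31]
    }
}
```

Note that the output key/value types of the map function (key2, value2) have to match the input types of the reduce function, while the input and final output types are free to differ.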


Map Function for Word Count

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    // Emit (word, 1) for every token found in the line.
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}




Reduce Function for Word Count

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    // Add up all the counts emitted for this word.
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}


[Figure: MapReduce process]




  • PayLoad – Applications that implement the Map and Reduce functions.
  • Mapper – Application that maps the input key/value pairs to a set of intermediate key/value pairs.
  • NameNode – Node that manages the HDFS.
  • DataNode – Node where the data is stored before any processing takes place.
  • MasterNode – Node where the JobTracker runs and which receives job requests from clients.
  • SlaveNode – Node where the Map and Reduce programs run.
  • JobTracker – Schedules jobs and tracks the jobs assigned to the TaskTrackers.
  • TaskTracker – Tracks its tasks and reports their status to the JobTracker.
  • Job – An execution of a Mapper and a Reducer across a data set.
  • Task – An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt – A particular attempt to execute a task on a SlaveNode.
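The relationship between a Job, its Tasks and their Task Attempts can be sketched as a small retry loop. This is a toy model in plain Java rather than Hadoop code: the class JobTaskAttempt, the retry limit of 4 and the failure pattern are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Toy model of Job -> Task -> Task Attempt; names and limits are hypothetical.
final class JobTaskAttempt {

    // Assumed retry limit for this sketch.
    static final int MAX_ATTEMPTS = 4;

    // Runs one task (one slice of data). Returns the number of attempts used,
    // or -1 if every attempt failed.
    static int runTask(int slice, BiPredicate<Integer, Integer> attemptSucceeds) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (attemptSucceeds.test(slice, attempt)) {
                return attempt;
            }
        }
        return -1;
    }

    // A job runs one task per slice and records how many attempts each task needed.
    static List<Integer> runJob(int slices, BiPredicate<Integer, Integer> attemptSucceeds) {
        List<Integer> attemptsPerTask = new ArrayList<>();
        for (int slice = 0; slice < slices; slice++) {
            attemptsPerTask.add(runTask(slice, attemptSucceeds));
        }
        return attemptsPerTask;
    }

    public static void main(String[] args) {
        // Made-up failure pattern: the task for slice 1 fails its first two attempts.
        System.out.println(runJob(3, (slice, attempt) -> slice != 1 || attempt >= 3));
        // prints [1, 3, 1]
    }
}
```

Here one job consists of three tasks, and the second task needs three attempts before it succeeds, while the other two succeed on their first attempt.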



YARN stands for Yet Another Resource Negotiator. It is the cluster resource management layer of Hadoop, the open source distributed processing framework. The objective of YARN is to provide a framework on Hadoop that allows cluster resources to be allocated to given applications, with MapReduce being only one of these applications.

YARN splits the responsibilities of the JobTracker into separate entities. In classic MapReduce, the JobTracker takes care of both job scheduling (matching tasks with TaskTrackers) and task progress monitoring (keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter totals).

YARN separates these two roles into two independent daemons: a resource manager, which manages the use of resources across the cluster, and an application master, which manages the lifecycle of applications running on the cluster.

The application master negotiates with the resource manager for cluster resources, expressed as a number of containers each with a certain memory limit, and then runs application-specific processes in those containers.

The containers are overseen by node managers running on the cluster nodes, which ensure that the application does not use more resources than it has been allocated.
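The negotiation for containers can be pictured with a toy allocator. The sketch below is plain Java and does not use the YARN API: ToyResourceManager, its methods, the node names and the memory sizes are all assumptions for illustration, and a real resource manager applies much richer scheduling policies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of container allocation; not the YARN API.
final class ToyResourceManager {

    // Free memory in MB per node, in registration order.
    private final Map<String, Integer> freeMb = new LinkedHashMap<>();

    void registerNode(String node, int memoryMb) {
        freeMb.put(node, memoryMb);
    }

    // Grants up to `count` containers of `memoryMb` each and returns the
    // node hosting each granted container. Fewer may be granted than requested.
    List<String> allocate(int count, int memoryMb) {
        List<String> granted = new ArrayList<>();
        for (Map.Entry<String, Integer> node : freeMb.entrySet()) {
            while (granted.size() < count && node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb); // reserve the memory
                granted.add(node.getKey());
            }
        }
        return granted;
    }

    int freeOn(String node) {
        return freeMb.get(node);
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        rm.registerNode("node1", 4096);
        rm.registerNode("node2", 2048);
        // An application master asks for 5 containers of 1024 MB each.
        System.out.println(rm.allocate(5, 1024)); // prints [node1, node1, node1, node1, node2]
        // A second request for 3 containers only gets what is left.
        System.out.println(rm.allocate(3, 1024)); // prints [node2]
    }
}
```

The second request shows why the application master has to negotiate: the cluster may not be able to grant everything that is asked for at once.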


MapReduce on YARN

MapReduce on YARN includes more entities than classic MapReduce. They are:

  • Client – Submits the MapReduce job.
  • YARN resource manager – Manages the allocation of compute resources on the cluster.
  • YARN node managers – Launch and monitor the compute containers on machines in the cluster.
  • MapReduce application master – Manages the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
  • Distributed filesystem (normally HDFS) – Used for sharing job files between the other entities.


[Figure: MapReduce on YARN]
