Hadoop YARN - Architecture, Components and Working


What is Hadoop YARN?

Apache Hadoop YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. YARN came into the picture with the introduction of Hadoop 2.x. It allows various data processing engines, such as those for interactive, graph, batch, and stream processing, to run on and process data stored in HDFS (Hadoop Distributed File System).

YARN was introduced so that more kinds of workloads could make use of the data in HDFS. It also handles job scheduling.

With YARN, the architecture of Hadoop 2.x provides a data processing platform that is no longer limited to MapReduce. It lets Hadoop run other purpose-built data processing systems as well, i.e., other frameworks can run on the same hardware on which Hadoop is installed.

Now that you have learned what YARN is, let’s see why we need Hadoop YARN.


Why is Hadoop YARN Used?

Although Hadoop 1.x was capable at data processing and computation, it had shortcomings such as delays in batch processing and scalability issues, because it relied on MapReduce alone for processing big datasets. With YARN, Hadoop supports a variety of processing approaches and a wider range of applications. Hadoop YARN clusters can run stream data processing and interactive querying side by side with MapReduce batch jobs. Because the YARN framework also runs non-MapReduce applications, it overcomes these shortcomings of Hadoop 1.x.

Next, let’s discuss the Hadoop YARN architecture.

Hadoop YARN Architecture

The Apache YARN framework consists of a Resource Manager (master daemon), a Node Manager (slave daemon) on each node, an Application Master for each application, and containers.

Let’s now discuss each component of Apache Hadoop YARN one by one in detail.


Resource Manager

The Resource Manager is the master daemon of YARN. It is responsible for managing the applications in the cluster and for the global assignment of resources such as CPU and memory. It also handles job scheduling. The Resource Manager has two components:

Scheduler: The Scheduler’s task is to distribute resources to the running applications. It deals only with scheduling, so it performs no tracking or monitoring of applications.
Application Manager: The Application Manager manages the applications running in the cluster. Tasks such as starting the Application Master and monitoring it are done by the Application Manager.
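
The split between the two components can be illustrated with a small in-memory model. This is only a sketch, not the YARN API: the class names (ToyScheduler, ToyApplicationManager), the two state names, and the resource figures are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    app_id: str
    memory_mb: int
    vcores: int


@dataclass
class ToyScheduler:
    """Hypothetical scheduler: hands out capacity, keeps no application status."""
    free_memory_mb: int
    free_vcores: int

    def allocate(self, req: Request) -> bool:
        # Grant the request only if enough of both resources is still free.
        if req.memory_mb <= self.free_memory_mb and req.vcores <= self.free_vcores:
            self.free_memory_mb -= req.memory_mb
            self.free_vcores -= req.vcores
            return True
        return False


@dataclass
class ToyApplicationManager:
    """Hypothetical manager: accepts applications and tracks their state."""
    scheduler: ToyScheduler
    states: dict = field(default_factory=dict)

    def submit(self, app_id: str, am_memory_mb: int, am_vcores: int) -> str:
        # The first container of an application hosts its Application Master.
        granted = self.scheduler.allocate(Request(app_id, am_memory_mb, am_vcores))
        # "RUNNING" and "PENDING" are simplified, made-up state names.
        self.states[app_id] = "RUNNING" if granted else "PENDING"
        return self.states[app_id]


scheduler = ToyScheduler(free_memory_mb=4096, free_vcores=4)
manager = ToyApplicationManager(scheduler)
first = manager.submit("app_1", am_memory_mb=3072, am_vcores=1)
second = manager.submit("app_2", am_memory_mb=2048, am_vcores=1)
print(first, second, scheduler.free_memory_mb)
```

Note that only the manager remembers which application is in which state; the scheduler just answers whether a request fits.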

Let’s move on with the second component of Apache Hadoop YARN.

Node Manager

The Node Manager is the slave daemon of YARN. It has the following responsibilities:

It monitors each container’s resource usage and reports it to the Resource Manager.
It tracks the health of the node on which it is running.
It manages the workflow and the user jobs on its own node.
It keeps the data in the Resource Manager updated.
It can also destroy or kill a container if it gets an order from the Resource Manager to do so.
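
As a rough illustration of these responsibilities, the sketch below models a Node Manager that reports container usage in a heartbeat and kills a container on request. The class names, the heartbeat fields, and the rule of killing containers that exceed their memory limit are assumptions made for this example, not YARN's actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ToyContainer:
    container_id: str
    limit_mb: int
    used_mb: int = 0


@dataclass
class ToyNodeManager:
    """Hypothetical per-node agent that tracks the containers it has launched."""
    node_id: str
    containers: dict = field(default_factory=dict)

    def launch(self, container: ToyContainer) -> None:
        self.containers[container.container_id] = container

    def heartbeat(self) -> dict:
        # Report per-container memory usage and flag containers over their limit.
        return {
            "node": self.node_id,
            "usage_mb": {cid: c.used_mb for cid, c in self.containers.items()},
            "over_limit": [cid for cid, c in self.containers.items()
                           if c.used_mb > c.limit_mb],
        }

    def kill(self, container_id: str) -> bool:
        # Remove the container; return False if it is not known on this node.
        return self.containers.pop(container_id, None) is not None


nm = ToyNodeManager("node-1")
nm.launch(ToyContainer("c1", limit_mb=1024, used_mb=512))
nm.launch(ToyContainer("c2", limit_mb=1024, used_mb=2048))
report = nm.heartbeat()
# Stand-in for the Resource Manager's reply: order a kill for each flagged container.
killed = [cid for cid in report["over_limit"] if nm.kill(cid)]
print(report["usage_mb"], killed)
```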

The third component of Apache Hadoop YARN is the Application Master.

Application Master

Every job submitted to the framework is an application, and every application has a specific Application Master associated with it. Application Master performs the following tasks:

It coordinates the execution of the application in the cluster and manages faults.
It negotiates resources from the Resource Manager.
It works with the Node Manager to execute and monitor the application’s component tasks.
It sends heartbeats to the Resource Manager at regular intervals to confirm its health and to update the record of its resource demands.

Now, we will step forward with the fourth component of Apache Hadoop YARN.

Container

A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node. The key points about containers are listed below:

It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
YARN containers are managed through a Container Launch Context (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
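
To make the contents of this record concrete, here is a minimal sketch of it as a plain data class. The field names, the sample path, and the command_line helper are hypothetical; this is not Hadoop's own class.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLaunchContext:
    """Hypothetical record mirroring the items listed for a launch context."""
    # Environment variables for the launched process.
    environment: dict = field(default_factory=dict)
    # Dependency name -> location in remotely accessible storage.
    local_resources: dict = field(default_factory=dict)
    # Security tokens, kept as opaque bytes in this sketch.
    tokens: bytes = b""
    # Payload for Node Manager services, keyed by service name.
    service_data: dict = field(default_factory=dict)
    # Command(s) needed to create the process.
    commands: list = field(default_factory=list)

    def command_line(self) -> str:
        # Render the environment and commands as a single shell-style line.
        env = " ".join(f"{k}={v}" for k, v in sorted(self.environment.items()))
        return f"{env} {' && '.join(self.commands)}".strip()


clc = ToyLaunchContext(
    environment={"APP_HOME": "/opt/app"},                       # made-up value
    local_resources={"app.jar": "hdfs:///apps/demo/app.jar"},   # made-up path
    commands=["java -jar app.jar"],
)
print(clc.command_line())
```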

How does Apache Hadoop YARN work?

YARN decouples resource management from MapReduce, so MapReduce is no longer the only way to process data in HDFS. This makes the Hadoop environment more suitable for applications that can’t wait for batch processing jobs to finish. The architecture lets you process data stored in a single platform with multiple processing engines, such as real-time streaming, interactive SQL, and batch processing, and approach analytics in a different manner. It can be considered the basis of the next generation of the Hadoop ecosystem and of a modern data architecture.

How is an application submitted in Hadoop YARN?

1. Submit the job
2. Get an application ID
3. Retrieval of the context of application submission
    Start Container Launch
    Launch Application Master
4. Allocate Resources
    Container
    Launching
5. Executing

Workflow of an Application in Apache Hadoop YARN

The client submits the application
A container is allocated for starting the Application Master
The Application Master registers with the Resource Manager
The Application Master asks the Resource Manager for containers
The Application Master notifies the Node Manager to launch the containers
The application code gets executed in the containers
The client contacts the Resource Manager or the Application Master to monitor the status of the application
The Application Master unregisters from the Resource Manager
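
The sequence above can be traced with a short simulation. It is a sketch under simplifying assumptions: the function run_application, the event strings, the FINISHED/FAILED status names, and the rule that every task needs exactly one container are all invented for illustration. The per-application process is called the Application Master here.

```python
from typing import Callable, List, Tuple


def run_application(free_containers: int,
                    tasks: List[Tuple[str, Callable[[], int]]]
                    ) -> Tuple[List[str], List[int]]:
    """Walk through the workflow with in-memory stand-ins for each daemon."""
    events = ["client: submit application"]
    results: List[int] = []
    if free_containers < 1:
        events.append("rm: no free container, application waits")
        return events, results

    free_containers -= 1  # one container is reserved for the Application Master
    events.append("rm: allocate container and start Application Master")
    events.append("am: register with Resource Manager")

    pending = list(tasks)
    while pending and free_containers > 0:
        # Ask for as many containers as needed, capped by what is free.
        granted = min(free_containers, len(pending))
        events.append(f"am: request containers, granted {granted}")
        batch, pending = pending[:granted], pending[granted:]
        for name, work in batch:
            events.append(f"nm: launch container for {name}")
            results.append(work())  # application code runs in the container
        # Containers are released once their tasks finish, so capacity is unchanged.

    status = "FINISHED" if not pending else "FAILED"
    events.append(f"client: poll status -> {status}")
    events.append("am: unregister from Resource Manager")
    return events, results


events, results = run_application(
    free_containers=3,
    tasks=[("square-2", lambda: 2 * 2),
           ("square-3", lambda: 3 * 3),
           ("square-4", lambda: 4 * 4)],
)
for line in events:
    print(line)
print(results)
```

With three free containers, one goes to the Application Master, so the three tasks run in two rounds of two and one.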

Features of Hadoop YARN

High-degree compatibility: Applications created using the MapReduce framework can be run easily on YARN.
Better cluster utilization: YARN allocates cluster resources efficiently and dynamically, which leads to better utilization than in the previous version of Hadoop.
Utmost scalability: As the number of nodes in the Hadoop cluster increases, the YARN Resource Manager continues to meet the user requirements.
Multi-tenancy: Because YARN is a highly versatile technology, the various engines that access data on the Hadoop cluster can work together efficiently.

YARN vs MapReduce

In Hadoop 1.x, the batch processing framework MapReduce was tightly coupled with HDFS. The addition of YARN to these two components, which gave birth to Hadoop 2.x, changed a lot about how Hadoop works. Let’s go through these differences.

Criteria	YARN	MapReduce
Type of processing	Real-time, batch, and interactive processing with multiple engines	Silo and batch processing with a single engine
Cluster resource optimization	Excellent due to central resource management	Average due to fixed Map and Reduce slots
Suitable for	MapReduce and non-MapReduce applications	Only MapReduce applications
Managing cluster resource	Done by YARN	Done by JobTracker
Namespace	Hadoop supports multiple namespaces	Supports only one namespace, i.e., HDFS
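
The cluster resource optimization row can be illustrated with simple arithmetic. The numbers below are made up, and the model assumes every task fits in one uniform container; it only shows why fixed Map and Reduce slots can sit idle while a shared pool does not.

```python
def slot_utilization(map_slots: int, reduce_slots: int,
                     map_tasks: int, reduce_tasks: int) -> float:
    """Fraction of fixed slots in use when each slot type serves only its own tasks."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running / (map_slots + reduce_slots)


def container_utilization(total_containers: int,
                          map_tasks: int, reduce_tasks: int) -> float:
    """Fraction of a shared container pool in use when any task can take any container."""
    running = min(total_containers, map_tasks + reduce_tasks)
    return running / total_containers


# Hypothetical node during a map-heavy phase: 20 map tasks waiting, no reduce tasks.
fixed = slot_utilization(map_slots=8, reduce_slots=4, map_tasks=20, reduce_tasks=0)
pooled = container_utilization(total_containers=12, map_tasks=20, reduce_tasks=0)
print(f"fixed slots: {fixed:.0%}, shared containers: {pooled:.0%}")
```

In this made-up case the four reduce slots stay empty under fixed slots, while the shared pool uses all twelve containers.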

In this section of the Hadoop tutorial, we learned about YARN in depth. In the next section of this tutorial, we shall be talking about Streaming in Hadoop.