What is YARN in Hadoop?
Apache YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. It was introduced in Hadoop 2.x. YARN allows various data processing engines, such as those for interactive, graph, batch, and stream processing, to run on and process data stored in HDFS (Hadoop Distributed File System).
YARN was introduced to make the most out of HDFS, and job scheduling is also handled by YARN.
With YARN, the Hadoop 2.x architecture provides a data processing platform that is not limited to MapReduce. It lets Hadoop run other purpose-built data processing systems as well, i.e., other frameworks can run on the same hardware on which Hadoop is installed.
Now that you have learned what YARN is, let’s see why we need Hadoop YARN.
Why is YARN in Hadoop used?
Although proficient at data processing and computation, Hadoop 1.x had some shortcomings, such as delays in batch processing and scalability issues, because it relied solely on MapReduce for processing big datasets. With YARN, Hadoop supports a variety of processing approaches and a wider range of applications. Hadoop YARN clusters can run stream data processing and interactive querying side by side with MapReduce batch jobs. Because YARN also runs non-MapReduce applications, it overcomes these shortcomings of Hadoop 1.x.
Next, let’s discuss the Hadoop YARN architecture.
Hadoop YARN Architecture
The Apache YARN framework consists of a Resource Manager (master daemon), Node Managers (slave daemons, one per node), a per-application Application Master, and Containers.
Let’s now discuss each component of Apache Hadoop YARN one by one in detail.
Resource Manager is the master daemon of YARN. It arbitrates resources such as CPU and memory among all the applications in the cluster and handles job scheduling. Resource Manager has two components:
- Scheduler: The Scheduler’s task is to distribute resources to the running applications. It deals only with scheduling, so it performs no tracking or monitoring of applications.
- Application Manager: The Application Manager manages the applications running in the cluster. Tasks such as starting an Application Master and monitoring it are done by the Application Manager.
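To make the Scheduler’s role concrete, here is a minimal Python sketch of a first-come, first-served allocator. It is a toy model under stated assumptions, not YARN’s actual scheduling code: the names `Resource`, `ToyScheduler`, and `allocate` are hypothetical, and real YARN schedulers add queues, priorities, and fairness policies on top of this idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A resource amount: memory in MB and virtual cores."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


class ToyScheduler:
    """Grants requests in arrival order while cluster capacity remains.

    Like the Scheduler described above, it only hands out resources;
    it does not track or monitor the applications afterwards.
    """

    def __init__(self, capacity: Resource):
        self.available = capacity

    def allocate(self, requests):
        """requests: list of (app_id, Resource). Returns the granted app ids."""
        granted = []
        for app_id, wanted in requests:
            if wanted.fits_in(self.available):
                self.available = self.available.minus(wanted)
                granted.append(app_id)
        return granted


scheduler = ToyScheduler(Resource(memory_mb=8192, vcores=8))
granted = scheduler.allocate([
    ("app_1", Resource(4096, 4)),
    ("app_2", Resource(6144, 2)),  # too large once app_1 has been granted
    ("app_3", Resource(2048, 2)),
])
print(granted)              # ['app_1', 'app_3']
print(scheduler.available)  # Resource(memory_mb=2048, vcores=2)
```

Note that the sketch simply skips a request that does not fit; a real scheduler would keep it pending until resources free up.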
Let’s move on with the second component of Apache Hadoop YARN.
Node Manager is the slave daemon of YARN. It has the following responsibilities:
- It monitors each container’s resource usage and reports it to the Resource Manager.
- It tracks the health of the node on which it is running.
- It manages the workflow and the user jobs on its own node.
- It keeps the data in the Resource Manager up to date.
- It can kill a container when the Resource Manager orders it to do so.
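The responsibilities above can be sketched as a small Python class. This is an illustration only: `ToyNodeManager` and its methods are hypothetical names, and the health check here is a deliberately simple assumption (the node counts as healthy while its containers stay within the node’s memory).

```python
class ToyNodeManager:
    """Tracks the containers on one node and reports on them."""

    def __init__(self, node_id, total_memory_mb):
        self.node_id = node_id
        self.total_memory_mb = total_memory_mb
        self.containers = {}  # container_id -> memory used in MB

    def launch(self, container_id, memory_mb):
        self.containers[container_id] = memory_mb

    def heartbeat(self):
        """Build the status report this node would send to the Resource Manager."""
        used = sum(self.containers.values())
        return {
            "node_id": self.node_id,
            "healthy": used <= self.total_memory_mb,  # simplified health check
            "used_memory_mb": used,
            "containers": sorted(self.containers),
        }

    def kill(self, container_id):
        """Remove a container on the Resource Manager's order.

        Returns True if the container existed on this node.
        """
        return self.containers.pop(container_id, None) is not None


nm = ToyNodeManager("node-1", total_memory_mb=4096)
nm.launch("container_1", 1024)
nm.launch("container_2", 2048)
report = nm.heartbeat()
print(report["used_memory_mb"])  # 3072

killed = nm.kill("container_2")
print(nm.heartbeat()["containers"])  # ['container_1']
```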
The third component of Apache Hadoop YARN is the Application Master.
Every job submitted to the framework is an application, and every application has a specific Application Master associated with it. Application Master performs the following tasks:
- It coordinates the execution of the application in the cluster and manages faults.
- It negotiates resources from the Resource Manager.
- It works with the Node Manager to execute and monitor the application’s tasks.
- At regular intervals, it sends heartbeats to the Resource Manager to confirm that it is still healthy and to update the record of its resource demands.
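The heartbeat mechanism in the last point can be sketched as follows. This is a simplified model, not YARN’s implementation: `ToyLivenessMonitor` is a hypothetical name, timestamps are passed in explicitly so the example stays deterministic, and the expiry interval of 10 seconds is an arbitrary value chosen for the illustration.

```python
class ToyLivenessMonitor:
    """Records Application Master heartbeats and reports the ones that went silent."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_heartbeat = {}  # app_id -> time of the last heartbeat
        self.pending = {}         # app_id -> containers still requested

    def heartbeat(self, app_id, now, pending_containers):
        """Each heartbeat proves liveness and refreshes the resource demand."""
        self.last_heartbeat[app_id] = now
        self.pending[app_id] = pending_containers

    def expired(self, now):
        """Application Masters whose last heartbeat is older than the expiry interval."""
        return sorted(
            app_id
            for app_id, seen in self.last_heartbeat.items()
            if now - seen > self.expiry_seconds
        )


monitor = ToyLivenessMonitor(expiry_seconds=10)
monitor.heartbeat("app_1", now=0, pending_containers=3)
monitor.heartbeat("app_2", now=0, pending_containers=1)
monitor.heartbeat("app_1", now=8, pending_containers=2)

print(monitor.expired(now=12))   # ['app_2']
print(monitor.pending["app_1"])  # 2
```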
Now, let’s move on to the fourth component of Apache Hadoop YARN, the Container.
A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node. Its key characteristics are listed below:
- It grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
- YARN containers are managed through a Container Launch Context (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
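As a rough picture of what a Container Launch Context carries, the fields listed above can be modeled as a Python dataclass. This is only a sketch: `ToyContainerLaunchContext`, its field names, and the sample paths are hypothetical, chosen to mirror the description rather than any real client library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyContainerLaunchContext:
    """Everything a Node Manager would need to start one container's process."""
    commands: List[str]                                             # command that creates the process
    environment: Dict[str, str] = field(default_factory=dict)       # environment variables
    local_resources: Dict[str, str] = field(default_factory=dict)   # name -> remote URI of a dependency
    tokens: bytes = b""                                             # security tokens
    service_data: Dict[str, bytes] = field(default_factory=dict)    # payload for Node Manager services

    def command_line(self) -> str:
        """Join the command parts into a single line."""
        return " ".join(self.commands)


clc = ToyContainerLaunchContext(
    commands=["python3", "worker.py", "--task", "7"],
    environment={"APP_HOME": "/opt/demo"},                           # hypothetical path
    local_resources={"worker.py": "hdfs:///apps/demo/worker.py"},    # hypothetical URI
)
print(clc.command_line())  # python3 worker.py --task 7
```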
How does Apache Hadoop YARN work?
YARN decouples resource management from MapReduce, so MapReduce becomes just one of several engines that run on top of HDFS. This makes the Hadoop environment more suitable for applications that can’t wait for batch processing jobs to finish. The architecture lets you process data stored in a single platform with multiple processing engines, such as real-time streaming, interactive SQL, and batch processing, and approach analytics in a completely different manner. YARN can be considered the basis of the next generation of the Hadoop ecosystem and a building block of a modern data architecture.
How is an application submitted in YARN?
1. Submit the job.
2. Get an application ID.
3. Retrieve the application submission context.
- Start the container launch.
- Launch the Application Master.
4. Allocate resources.
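One way to picture these steps is through the Resource Manager’s REST interface. The sketch below only builds the JSON body for a submission and does not contact a cluster. The two paths and the field names follow the Resource Manager REST API as I recall it, so treat them as assumptions and verify them against the documentation for your Hadoop version; the application ID, application name, and Application Master command are hypothetical placeholders.

```python
import json

# Assumed REST paths on the Resource Manager web service (verify for your version).
NEW_APPLICATION_PATH = "/ws/v1/cluster/apps/new-application"  # step 2: get an application ID
SUBMIT_APPLICATION_PATH = "/ws/v1/cluster/apps"               # step 3: post the submission context


def build_submission_context(app_id, name, am_command, memory_mb=1024, vcores=1):
    """Assemble the submission context for one application.

    The command under "am-container-spec" tells YARN how to launch the
    Application Master; "resource" is what that first container needs.
    """
    return {
        "application-id": app_id,
        "application-name": name,
        "application-type": "YARN",
        "am-container-spec": {"commands": {"command": am_command}},
        "resource": {"memory": memory_mb, "vCores": vcores},
    }


# Hypothetical reply to step 2; a real cluster returns its own ID.
new_app_reply = {"application-id": "application_0000000000000_0001"}

context = build_submission_context(
    app_id=new_app_reply["application-id"],
    name="demo-app",
    am_command="python3 app_master.py",
)
body = json.dumps(context, indent=2)
print(body)
```

Once such a body is posted, step 4 happens on the cluster side: the Resource Manager allocates a container in which the Application Master starts.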
Workflow of an Application in YARN
- The client submits the application.
- A container is allocated to start the Application Master.
- The Application Master registers with the Resource Manager.
- The Application Master asks the Resource Manager for containers.
- The Application Master notifies the Node Manager to launch the containers.
- The application code is executed in the containers.
- The client contacts the Resource Manager or the Application Master to monitor the status of the application.
- Once the application finishes, the Application Master unregisters from the Resource Manager.
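The workflow above can be traced with a short simulation. It is a sketch under simplifying assumptions: `ToyResourceManager` and `run_application` are hypothetical names, tasks are plain Python callables, the Node Manager is reduced to a log line, and the client’s status polling is left out.

```python
class ToyResourceManager:
    """Hands out container ids and remembers which Application Masters are registered."""

    def __init__(self):
        self.registered = set()
        self.next_container = 0

    def new_container(self):
        self.next_container += 1
        return f"container_{self.next_container}"

    def register(self, app_id):
        self.registered.add(app_id)

    def request_containers(self, app_id, count):
        if app_id not in self.registered:
            raise RuntimeError("Application Master must register first")
        return [self.new_container() for _ in range(count)]

    def unregister(self, app_id):
        self.registered.discard(app_id)


def run_application(rm, app_id, tasks):
    """Walk one application through the workflow and return its results and log."""
    log = []
    am_container = rm.new_container()            # container for the Application Master
    log.append(f"AM started in {am_container}")
    rm.register(app_id)                          # AM registers with the Resource Manager
    containers = rm.request_containers(app_id, len(tasks))  # AM asks for containers
    results = []
    for container, task in zip(containers, tasks):
        log.append(f"launch {container}")        # AM asks a Node Manager to launch it
        results.append(task())                   # application code runs in the container
    rm.unregister(app_id)                        # AM unregisters when the work is done
    return results, log


rm = ToyResourceManager()
results, log = run_application(rm, "app_1", [lambda: 1 + 1, lambda: 2 * 3])
print(results)  # [2, 6]
print(log)      # ['AM started in container_1', 'launch container_2', 'launch container_3']
```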
Features of YARN
- High-degree compatibility: Applications written for the MapReduce framework can run easily on YARN.
- Better cluster utilization: YARN allocates cluster resources efficiently and dynamically, which leads to better utilization than in Hadoop 1.x.
- Utmost scalability: As the number of nodes in the Hadoop cluster grows, the YARN Resource Manager schedules work across the added capacity so that the cluster keeps meeting user requirements.
- Multi-tenancy: Various engines that access data on the Hadoop cluster can work together efficiently because YARN shares the cluster’s resources among them.
In Hadoop 1.x, the batch processing framework MapReduce was closely paired with HDFS. With the addition of YARN to these two components, giving birth to Hadoop 2.x, came a lot of differences in how Hadoop worked. Let’s go through these differences.
| Criteria | Hadoop 2.x (with YARN) | Hadoop 1.x |
|---|---|---|
| Type of processing | Real-time, batch, and interactive processing with multiple engines | Silo and batch processing with a single engine |
| Cluster resource optimization | Excellent due to central resource management | Average due to fixed Map and Reduce slots |
| Supported applications | MapReduce and non-MapReduce applications | Only MapReduce applications |
| Managing cluster resource | Done by YARN | Done by JobTracker |
| Namespaces | Supports multiple namespaces | Supports only one namespace, i.e., HDFS |
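The difference in cluster resource optimization comes down to simple arithmetic. The sketch below compares fixed Map and Reduce slots with a shared pool of containers. It assumes, for simplicity, that every slot and every container is the same size and that each task needs exactly one; the function names and the numbers are made up for the illustration.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Hadoop 1.x style: map tasks may only use map slots, reduce tasks only reduce slots."""
    busy = min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)
    return busy / (map_slots + reduce_slots)


def container_utilization(total_containers, map_tasks, reduce_tasks):
    """YARN style: any free container can run any pending task."""
    busy = min(map_tasks + reduce_tasks, total_containers)
    return busy / total_containers


# A map-heavy phase: ten map tasks waiting, no reduce tasks yet.
fixed = slot_utilization(map_slots=4, reduce_slots=2, map_tasks=10, reduce_tasks=0)
shared = container_utilization(total_containers=6, map_tasks=10, reduce_tasks=0)
print(round(fixed, 2))  # 0.67
print(shared)           # 1.0
```

With fixed slots, the two reduce slots sit idle during the map-heavy phase, while the shared pool keeps all six containers busy.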
In this section of the Hadoop tutorial, we learned about YARN in depth. In the next section of this tutorial, we will talk about Streaming in Hadoop.