Introduction to Apache Hadoop YARN

Overview

YARN (Yet Another Resource Negotiator) was introduced in the second version of Hadoop to manage cluster resources. At launch, the Apache Software Foundation (ASF) described it as a redesigned resource manager; it is now often described as a large-scale distributed operating system for big data applications. YARN decouples resource management and job scheduling from the data processing layer, so engines other than MapReduce can share the same cluster. With YARN, it is possible, for example, to run interactive queries and stream processing on Hadoop alongside batch jobs.

Introduction

YARN was introduced in Hadoop version 2.0 by Yahoo and Hortonworks in 2012. Its basic idea is to split resource management and job scheduling/monitoring into separate daemon processes.

YARN allows various data processing engines, such as batch, graph, stream, and interactive processing engines, to run on the same cluster and process data stored in HDFS.

Why is YARN Hadoop Used?

Before Hadoop 2.0, Hadoop 1.x had two main components: the Hadoop Distributed File System (HDFS) and MapReduce. The MapReduce batch processing framework was tightly coupled to HDFS.

Because Hadoop 1.x relied solely on MapReduce, it ran into several challenges:

  • MapReduce had to handle both resource management and data processing.
  • The JobTracker was overloaded with responsibilities, including scheduling, task control, processing, and resource allocation.
  • A single JobTracker was a scalability bottleneck.
  • Cluster resources were used inefficiently.
  • Hadoop 1.x clusters could run only MapReduce applications.

Why is YARN Used?

  • YARN allocates cluster resources dynamically, which gives better cluster utilization than earlier versions of Hadoop.
  • YARN clusters can run stream processing and interactive queries in parallel with MapReduce batch jobs.
  • With YARN, Hadoop supports several processing models and therefore a wider range of applications.
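
The utilization point can be illustrated with a toy comparison. This is not Hadoop code: it assumes a hypothetical node whose capacity is either split into fixed map and reduce slots (Hadoop 1 style) or treated as one pool of generic containers (YARN style). The function names, slot counts, and task mix are invented for illustration.

```python
def run_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Hadoop 1 style: map tasks may only use map slots, reduce tasks only reduce slots."""
    running_maps = min(map_tasks, map_slots)
    running_reduces = min(reduce_tasks, reduce_slots)
    return running_maps + running_reduces


def run_with_shared_pool(map_tasks, reduce_tasks, total_containers):
    """YARN style: any pending task may use any free container in the pool."""
    return min(map_tasks + reduce_tasks, total_containers)


# Hypothetical map-heavy phase: 10 pending map tasks and 1 pending reduce task.
fixed = run_with_fixed_slots(map_tasks=10, reduce_tasks=1, map_slots=4, reduce_slots=4)
pooled = run_with_shared_pool(map_tasks=10, reduce_tasks=1, total_containers=8)

print(f"fixed slots: {fixed} of 8 units busy")
print(f"shared pool: {pooled} of 8 units busy")
```

With fixed slots, three reduce slots sit idle while map tasks wait; with a shared pool, the same capacity is fully used.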

How does Apache Hadoop YARN Work?

The basic idea of YARN is to split the two main responsibilities of the Hadoop 1.x JobTracker, resource management and job scheduling/monitoring, into separate entities.

YARN consists of the following components:

Resource Manager:

  • The resource manager's responsibility is to allocate available resources to applications.

Per-application ApplicationMaster:

  • The ApplicationMaster communicates with the Resource Manager on one side and with the Node Managers on the other. It negotiates resources from the Resource Manager and works with the Node Managers to execute and monitor the application's tasks.

Node Manager and Container:

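The Node Manager is a worker daemon that runs on each node in the cluster and is in charge of launching and monitoring the application's containers. A container is a bundle of resources, such as memory and CPU, on a single node.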

Features

Multitenancy

YARN lets multiple data processing engines, such as batch, stream, interactive, and graph processing engines, share one cluster. This gives organizations the benefit of multitenancy.

Cluster Utilization

YARN improves cluster utilization by allocating resources dynamically. Applications running on YARN process large amounts of data in parallel across multiple computing nodes, and a single computing job can be divided into hundreds or thousands of tasks.

Compatibility

YARN is backward compatible with MapReduce applications written for the first version of Hadoop, so existing MapReduce applications can continue to run on a YARN cluster.

Scalability

The scheduler in the YARN Resource Manager allows Hadoop to manage and scale clusters of thousands of nodes.

YARN Architecture in Hadoop

The architecture of YARN is shown in the figure below. The architecture consists of several components such as Resource Manager, Node Manager and Application Master.

The Resource Manager and the Node Managers are the components in charge of managing cluster resources and scheduling Hadoop jobs. The Application Master is responsible for executing an application's tasks in parallel: it carries out the compute jobs, checks them for errors, and sees them through to completion.

[Figure: YARN architecture]

Main Components of YARN Architecture in Hadoop

Resource Manager

The Resource Manager is the central decision maker for allocating resources among all system applications. When it receives processing requests, it forwards portions of them to the appropriate node managers, where the actual processing takes place. It acts as the cluster's resource arbitrator, allocating available resources to competing applications.

The Resource Manager consists of the following:

  1. Scheduler:
  • It is a pure scheduler: it performs no monitoring or tracking of application status.
  • It does not guarantee that tasks are restarted if they fail because of a hardware or application failure.
  • It schedules based on the resource requirements of the applications, using the abstraction of a resource container, which bundles resources such as memory, CPU, disk, and network.
  2. Application Manager:
  • The Application Manager is responsible for accepting job submissions, negotiating the first container to run the application-specific ApplicationMaster, and restarting the ApplicationMaster container in case of failure.
  • Each application's ApplicationMaster is responsible for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring progress.
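
To make the idea of a pure scheduler concrete, here is a minimal sketch. It is not one of YARN's actual schedulers and uses no Hadoop API: the `Node`, `Request`, and `schedule` names are hypothetical, and the policy (requests handled in arrival order, placed on the first node with enough free memory and vcores) is an assumption chosen for simplicity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mb: int
    free_vcores: int


@dataclass
class Request:
    app_id: str
    mb: int
    vcores: int


def schedule(nodes, pending):
    """Match container requests to free capacity; return (grants, still_waiting).

    A 'pure' scheduler: it only hands out resources. It does not monitor
    the tasks that run in the containers or restart them when they fail.
    """
    grants = []
    waiting = []
    queue = deque(pending)
    while queue:
        req = queue.popleft()
        # First node with enough free memory and vcores wins (first-fit).
        node = next((n for n in nodes
                     if n.free_mb >= req.mb and n.free_vcores >= req.vcores), None)
        if node is None:
            waiting.append(req)  # no capacity right now; stays pending
            continue
        node.free_mb -= req.mb
        node.free_vcores -= req.vcores
        grants.append((req.app_id, node.name, req.mb, req.vcores))
    return grants, waiting


# Hypothetical two-node cluster and four container requests.
nodes = [Node("node1", free_mb=4096, free_vcores=4),
         Node("node2", free_mb=2048, free_vcores=2)]
pending = [Request("app_1", 3072, 2), Request("app_2", 2048, 2),
           Request("app_3", 2048, 1), Request("app_1", 1024, 1)]

grants, waiting = schedule(nodes, pending)
for grant in grants:
    print("granted:", grant)
print("still waiting:", [r.app_id for r in waiting])
```

The request from `app_3` stays pending because no node has 2048 MB free at that point; in this sketch it would be retried on a later scheduling pass once containers are released.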

Node Manager

  • It monitors resource usage (CPU, memory, disk, etc.) per container and handles log management.
  • It registers with the Resource Manager and sends heartbeats containing the health status of the node.
  • It manages the application containers that the Resource Manager assigns to its node.
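
The Node Manager's duties listed above can be sketched as a small data model. This is an illustration only: the class names, the report fields, and the rule that a node is "healthy" while total usage fits within its memory are all assumptions, not the format of YARN's real heartbeat.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    container_id: str
    limit_mb: int   # memory granted to the container
    used_mb: int    # memory the container currently uses


@dataclass
class NodeManagerSketch:
    node_id: str
    total_mb: int
    containers: list = field(default_factory=list)

    def over_limit(self):
        """Per-container monitoring: IDs of containers using more memory than granted."""
        return [c.container_id for c in self.containers if c.used_mb > c.limit_mb]

    def heartbeat(self):
        """Build the status report that would be sent to the Resource Manager."""
        used = sum(c.used_mb for c in self.containers)
        return {
            "node_id": self.node_id,
            "healthy": used <= self.total_mb,  # assumed health rule for this sketch
            "used_mb": used,
            "free_mb": max(self.total_mb - used, 0),
            "containers_over_limit": self.over_limit(),
        }


nm = NodeManagerSketch("node1", total_mb=8192)
nm.containers.append(Container("container_01", limit_mb=2048, used_mb=1500))
nm.containers.append(Container("container_02", limit_mb=1024, used_mb=1300))

report = nm.heartbeat()
print(report)
```

Here `container_02` is flagged because it uses more memory than it was granted, while the node as a whole still reports itself as healthy.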

Application Master

  • The Application Master sends the Resource Manager a resource request for a container in which to run an application task.
  • Upon receiving the request from the Application Master, the Resource Manager evaluates the resource requirements, checks resource availability, and grants a container that fulfils the request.
  • Once the container is granted, the Application Master instructs the Node Manager to launch it and start the application-specific work. The Application Master also sends health reports to the Resource Manager from time to time.

Application Workflow in Hadoop YARN

  • A client submits an application.
  • The Resource Manager allocates a container in which to launch the Application Master.
  • The Application Master registers with the Resource Manager.
  • The Application Master negotiates containers with the Resource Manager.
  • The Application Master asks the Node Manager to launch the containers.
  • The application code runs in the containers.
  • The client contacts the Resource Manager or the Application Master to check the application's status.
  • Once processing is finished, the Application Master deregisters with the Resource Manager.
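
The steps above can be traced with a small simulation. It is a sketch under stated assumptions, not YARN's client or ApplicationMaster API: `ResourceManagerSketch` and `run_application` are hypothetical names, the Resource Manager is reduced to a counter of free containers, and each task is assumed to need exactly one container.

```python
class ResourceManagerSketch:
    """Hypothetical stand-in for the Resource Manager: it only counts free containers."""

    def __init__(self, free_containers):
        self.free_containers = free_containers
        self.registered = set()

    def allocate(self, wanted):
        granted = min(wanted, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count):
        self.free_containers += count


def run_application(rm, app_id, tasks):
    """Walk one application through the workflow and return an event log."""
    log = [f"client submits {app_id}"]

    if rm.allocate(1) == 0:
        log.append("no free container for the Application Master; application waits")
        return log
    log.append("Resource Manager allocates a container for the Application Master")

    rm.registered.add(app_id)
    log.append("Application Master registers with the Resource Manager")

    remaining = list(tasks)
    while remaining:
        granted = rm.allocate(len(remaining))
        if granted == 0:
            log.append("no free containers; remaining tasks are not run in this sketch")
            break
        log.append(f"Application Master obtains {granted} container(s)")
        batch, remaining = remaining[:granted], remaining[granted:]
        for task in batch:
            log.append(f"Node Manager launches a container; it runs {task}")
        rm.release(granted)  # finished containers go back to the pool

    status = "FINISHED" if not remaining else "INCOMPLETE"
    log.append(f"client checks status: {status}")

    rm.registered.discard(app_id)
    rm.release(1)  # the Application Master's own container
    log.append("Application Master deregisters from the Resource Manager")
    return log


rm = ResourceManagerSketch(free_containers=3)
events = run_application(rm, "app_1", ["map_0", "map_1", "map_2"])
for event in events:
    print(event)
```

Because one of the three containers is taken by the Application Master itself, the three tasks run in two rounds: two in the first negotiation and one in the second.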

YARN vs MapReduce

| Criterion | YARN | MapReduce |
| --- | --- | --- |
| Acronym | YARN stands for Yet Another Resource Negotiator. | MapReduce is not an acronym; the name is self-explanatory. |
| Suitable for | Both MapReduce and non-MapReduce applications | Only MapReduce applications |
| Cluster resource optimization | Excellent, through central resource management | Average, because Map and Reduce slots are fixed |
| Single point of failure | YARN runs many Application Masters; if one fails, it is restarted and execution continues, so no single master is a point of failure. | The single JobTracker is a single point of failure, and MapReduce also has poorer resource utilization and limited scalability compared to YARN. |

Conclusion

  • YARN, known as Yet Another Resource Negotiator, was introduced in Hadoop version 2.0 by Yahoo and Hortonworks in 2012.
  • YARN in Hadoop was developed to separate the job scheduling and resource allocation processes from the MapReduce engine.
  • This article covered the YARN architecture in Hadoop and its three main components: the Resource Manager, the Node Manager, and the Application Master.