Coding With Fun

Hadoop YARN


May 26, 2021 Hadoop



The old MapReduce architecture


  • JobTracker: Responsible for resource management (tracking resource consumption and availability) and job lifecycle management (scheduling job tasks, tracking progress, providing fault tolerance for tasks)
  • TaskTracker: Launches or shuts down tasks and periodically reports task status

This architecture has the following issues:

  1. JobTracker is the central processing point for MapReduce, so it is a single point of failure
  2. JobTracker takes on too many responsibilities, which consumes a lot of resources and causes heavy memory overhead when there are very many MapReduce jobs. This is also why the industry generally concluded that the MapReduce of old Hadoop could support at most about 4,000 nodes
  3. On the TaskTracker side, using the number of map/reduce tasks as the measure of resources is too simplistic because it ignores CPU and memory usage; if two memory-hungry tasks are scheduled together, an OOM can easily occur
  4. On the TaskTracker side, resources are rigidly divided into map task slots and reduce task slots. If the system has only map tasks or only reduce tasks, the other kind of slot sits idle, which wastes resources and hurts cluster utilization

In short, the problems come down to the single point of failure and poor resource utilization
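The slot problem described in issues 3 and 4 can be illustrated with a small model. This is only a sketch: `SlotNode` and `ResourceNode` are hypothetical names invented here, and the numbers (4 slots of each kind, 8192 MB per node) are assumptions for illustration, not Hadoop defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlotNode:
    """Old-style node (hypothetical model): fixed map slots and reduce slots."""
    map_slots: int
    reduce_slots: int

    def runnable(self, pending_maps: int, pending_reduces: int) -> int:
        # A map task may only use a map slot, a reduce task only a reduce slot.
        return min(pending_maps, self.map_slots) + min(pending_reduces, self.reduce_slots)


@dataclass
class ResourceNode:
    """Resource-based node (hypothetical model): memory is handed out per request."""
    memory_mb: int

    def runnable(self, task_memory_mb: List[int]) -> int:
        free = self.memory_mb
        started = 0
        for need in task_memory_mb:
            # A task starts only if its declared memory still fits on the node.
            if need <= free:
                free -= need
                started += 1
        return started


slot_node = SlotNode(map_slots=4, reduce_slots=4)
# Map-only workload: the four reduce slots stay idle.
print(slot_node.runnable(pending_maps=8, pending_reduces=0))  # 4

resource_node = ResourceNode(memory_mb=8192)
# The same eight tasks, each declaring 1024 MB, all fit.
print(resource_node.runnable([1024] * 8))  # 8
```

Because each task declares its memory need, two large tasks that do not fit together are simply not started together, which is the OOM case from issue 3.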

YARN architecture


YARN splits JobTracker's two responsibilities, resource management and task scheduling/monitoring, into separate processes: a global resource manager (ResourceManager) and a per-job manager (ApplicationMaster). ResourceManager and NodeManager allocate and manage computing resources, while ApplicationMaster is responsible for running the application

  • ResourceManager: Global resource management and task scheduling
  • NodeManager: Resource management and monitoring of individual nodes
  • ApplicationMaster: Resource management and task monitoring for a single job
  • Container: The unit in which resources are requested and granted, and the environment in which a task runs
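To make the division of roles concrete, here is a toy model of the four components. It is a sketch under stated assumptions, not the YARN API: the class names mirror the components above, but every method (`try_reserve`, `register`, `allocate`, `request_containers`) and the first-fit placement are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Container:
    """A grant of resources on one node, in which a single task runs."""
    container_id: int
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Tracks the free resources of a single node."""
    name: str
    free_memory_mb: int
    free_vcores: int

    def try_reserve(self, memory_mb: int, vcores: int) -> bool:
        if memory_mb <= self.free_memory_mb and vcores <= self.free_vcores:
            self.free_memory_mb -= memory_mb
            self.free_vcores -= vcores
            return True
        return False


@dataclass
class ResourceManager:
    """Global view of all nodes; hands out Containers on request."""
    nodes: Dict[str, NodeManager] = field(default_factory=dict)
    next_container_id: int = 1

    def register(self, node: NodeManager) -> None:
        self.nodes[node.name] = node

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        # First fit in registration order: a deliberate simplification.
        for node in self.nodes.values():
            if node.try_reserve(memory_mb, vcores):
                container = Container(self.next_container_id, node.name, memory_mb, vcores)
                self.next_container_id += 1
                return container
        return None  # nothing fits right now


@dataclass
class ApplicationMaster:
    """Per-job manager: asks the ResourceManager for one Container per task."""
    job_name: str
    containers: List[Container] = field(default_factory=list)

    def request_containers(self, rm: ResourceManager, count: int,
                           memory_mb: int, vcores: int) -> int:
        for _ in range(count):
            container = rm.allocate(memory_mb, vcores)
            if container is None:
                break
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager()
rm.register(NodeManager("node1", free_memory_mb=4096, free_vcores=4))
rm.register(NodeManager("node2", free_memory_mb=4096, free_vcores=4))

am = ApplicationMaster("wordcount")
granted = am.request_containers(rm, count=5, memory_mb=2048, vcores=1)
print(granted)                          # 4
print([c.node for c in am.containers])  # ['node1', 'node1', 'node2', 'node2']
```

The fifth request is refused because both nodes are out of memory, which shows the ResourceManager deciding on actual resources rather than on slot counts.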

Architectural comparison


Under the YARN architecture, a general-purpose resource management platform is separated from a general-purpose application computing platform. This avoids the single point of failure and the resource utilization problems of the old architecture, and the applications running on it are no longer limited to MapReduce

YARN basic process


1. Job submission

The client gets an Application ID from ResourceManager, checks the job's output configuration, and computes the input splits. It then copies the job resources (job JAR, configuration file, split information) to HDFS so that later tasks can use them
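Computing the input splits is the most mechanical part of this step, so a small sketch may help. It is modeled on the split-size rule commonly described for Hadoop's `FileInputFormat` (split size = max(min size, min(max size, block size)), with a 10% tolerance on the last split); the function names, default arguments, and the 128 MB block size are assumptions made for this example.

```python
from typing import List, Tuple

MB = 1024 * 1024


def compute_split_size(block_size: int, min_size: int = 1, max_size: int = 2**63 - 1) -> int:
    # Clamp the block size between the configured minimum and maximum split size.
    return max(min_size, min(max_size, block_size))


def compute_splits(file_length: int, split_size: int, slop: float = 1.1) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the whole file."""
    splits = []
    remaining = file_length
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > slop:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # The tail (up to 1.1 x split_size) becomes the last split.
    if remaining > 0:
        splits.append((file_length - remaining, remaining))
    return splits


split_size = compute_split_size(block_size=128 * MB)
print(len(compute_splits(300 * MB, split_size)))  # 3
print(len(compute_splits(260 * MB, split_size)))  # 2
```

The 260 MB file yields two splits rather than three because the 4 MB tail is folded into the second split instead of getting a tiny map task of its own.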

2. Job initialization

ResourceManager submits the job to Scheduler (there are many scheduling algorithms, typically based on priority), and Scheduler assigns a Container to the job. ResourceManager then loads an ApplicationMaster process in that Container and hands it over to NodeManager to manage.

ApplicationMaster mainly creates a series of monitoring objects to track the progress of the job, retrieves the input splits, and creates one map task per split along with the corresponding reduce tasks. ApplicationMaster also decides how to run the job: if the job is small (the threshold is configurable), it runs the tasks directly in its own JVM
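The "small job" decision can be sketched as a simple threshold check. The fields below are meant to correspond to the `mapreduce.job.ubertask.*` settings (`enable`, `maxmaps`, `maxreduces`, `maxbytes`); the concrete values used here, and the `UberLimits` and `runs_in_am_jvm` names, are assumptions for illustration, so check your cluster's configuration rather than relying on them.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class UberLimits:
    """Hypothetical holder for the small-job thresholds (values are assumed)."""
    enabled: bool = True         # assumed switched on for this example
    max_maps: int = 9
    max_reduces: int = 1
    max_bytes: int = 128 * MB    # assumed to equal one HDFS block


def runs_in_am_jvm(num_maps: int, num_reduces: int, input_bytes: int,
                   limits: Optional[UberLimits] = None) -> bool:
    """True if the job is small enough to run inside the ApplicationMaster's JVM."""
    limits = limits or UberLimits()
    return (limits.enabled
            and num_maps <= limits.max_maps
            and num_reduces <= limits.max_reduces
            and input_bytes <= limits.max_bytes)


print(runs_in_am_jvm(num_maps=3, num_reduces=1, input_bytes=40 * MB))     # True
print(runs_in_am_jvm(num_maps=200, num_reduces=10, input_bytes=25_000 * MB))  # False
```

Running a small job in one JVM avoids the cost of requesting and starting a separate Container per task, which would otherwise dominate the run time.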

3. Task assignment

ApplicationMaster requests resources from ResourceManager, one Container per task, with each request specifying that task's resource requirements. Containers are typically allocated based on data locality
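Allocation by data locality usually means preferring a node that holds the task's input, then a node in the same rack, then any node. The following sketch shows that fallback order; `pick_node`, the node names, and the rack map are all hypothetical, and a real scheduler weighs more factors than this.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(preferred_nodes: List[str], free_nodes: List[str],
              rack_of: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Choose a node for a task whose input replicas live on preferred_nodes."""
    # 1. Node-local: a free node that already stores the input split.
    for node in preferred_nodes:
        if node in free_nodes:
            return node, "node-local"
    # 2. Rack-local: a free node in the same rack as one of the replicas.
    preferred_racks = {rack_of[node] for node in preferred_nodes}
    for node in free_nodes:
        if rack_of[node] in preferred_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; the data must cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "unavailable"


rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
replicas = ["n1"]  # the input split is stored on n1

print(pick_node(replicas, ["n1", "n3"], rack_of))  # ('n1', 'node-local')
print(pick_node(replicas, ["n2", "n3"], rack_of))  # ('n2', 'rack-local')
print(pick_node(replicas, ["n3"], rack_of))        # ('n3', 'off-rack')
```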

4. Task execution

Based on the allocation from ResourceManager, ApplicationMaster starts a Container on the corresponding NodeManager. The task reads the resources it needs (job JAR, configuration file, etc.) from HDFS and then runs

5. Progress and status update

Each task periodically reports its progress and status to ApplicationMaster. The client periodically polls ApplicationMaster for the progress and status of the whole job

6. Job completion

The client periodically checks whether the whole job is complete. When the job is complete, temporary files, directories, and other working state are cleaned up
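The reporting and polling in steps 5 and 6 can be sketched as follows. This is a toy model, not the YARN client API: the `ApplicationMaster` class here, its `report`, `job_progress`, and `is_complete` methods, and the `poll_until_complete` helper are all invented for the example, and the `tick` callback stands in for time passing between polls.

```python
from typing import Callable, Dict, List


class ApplicationMaster:
    """Toy model: collects per-task progress and answers client polls."""

    def __init__(self, task_ids: List[str]) -> None:
        self._progress: Dict[str, float] = {task_id: 0.0 for task_id in task_ids}

    def report(self, task_id: str, fraction: float) -> None:
        # Called by a task; progress is clamped to [0, 1] and never moves backwards.
        clamped = max(0.0, min(1.0, fraction))
        self._progress[task_id] = max(self._progress[task_id], clamped)

    def job_progress(self) -> float:
        return sum(self._progress.values()) / len(self._progress)

    def is_complete(self) -> bool:
        return all(value >= 1.0 for value in self._progress.values())


def poll_until_complete(am: ApplicationMaster, tick: Callable[[], None],
                        max_polls: int = 100) -> List[float]:
    """Client side: record the job progress on each poll until the job finishes."""
    seen = []
    for _ in range(max_polls):
        seen.append(am.job_progress())
        if am.is_complete():
            break
        tick()  # in a real client this would be a sleep between polls
    return seen


am = ApplicationMaster(["map-0", "map-1"])
steps = iter([("map-0", 0.5), ("map-1", 1.0), ("map-0", 1.0)])


def tick() -> None:
    # Simulate one task status report arriving between two polls.
    task_id, fraction = next(steps)
    am.report(task_id, fraction)


history = poll_until_complete(am, tick)
print(history)  # [0.0, 0.25, 0.75, 1.0]
```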