
Hadoop Failover


May 26, 2021 Hadoop


YARN - Failover

Types of failure

  1. Program bugs
  2. Process crashes
  3. Hardware failures

Failure handling

Task failure

  1. Runtime exceptions or sudden JVM exits are reported to the ApplicationMaster
  2. Hung tasks are detected through the heartbeat (progress) timeout; a task is only declared failed after a configurable number of checks (see the configuration sketch after this list)
  3. If the proportion of failed tasks in a job exceeds the configured threshold, the whole job is considered failed
  4. Failed tasks or jobs are re-run by the ApplicationMaster
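
The thresholds above are ordinary MapReduce configuration properties. The following mapred-site.xml fragment is a minimal sketch assuming Hadoop 2.x property names; the values shown are the usual defaults and may differ in other releases.

<!-- mapred-site.xml: task-failure handling (sketch, Hadoop 2.x names) -->
<configuration>
  <!-- Maximum attempts for a single map or reduce task before it is
       declared failed -->
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value>
  </property>
  <!-- A task that reports no progress for this many milliseconds is
       considered hung and is killed -->
  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
  </property>
  <!-- Percentage of map tasks that may fail before the whole job is
       marked as failed (0 = any unrecoverable task failure fails the job) -->
  <property>
    <name>mapreduce.map.failures.maxpercent</name>
    <value>0</value>
  </property>
</configuration>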

ApplicationMaster failure

  1. The ApplicationMaster periodically sends heartbeats to the ResourceManager; if the heartbeats stop, the ApplicationMaster is considered failed, and a configurable number of ApplicationMaster attempts is allowed before the application as a whole is given up on (see the configuration sketch after this list)
  2. Once the ApplicationMaster fails, the ResourceManager launches a new ApplicationMaster
  3. The new ApplicationMaster is responsible for recovering the state of the failed one (yarn.app.mapreduce.am.job.recovery.enable=true); this works by saving the application's running state to shared storage, and the ResourceManager itself does not save or restore task state
  4. The client also polls the ApplicationMaster periodically for progress and status; if it finds the ApplicationMaster has failed, it asks the ResourceManager for the address of the new one
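
A rough sketch of the properties that govern this behaviour, assuming a Hadoop 2.x MapReduce-on-YARN setup; the first two belong in yarn-site.xml, the last in mapred-site.xml, and the values shown are common defaults.

<!-- ApplicationMaster failover knobs (sketch, Hadoop 2.x names) -->
<configuration>
  <!-- How long the ResourceManager waits without an AM heartbeat before
       declaring the ApplicationMaster dead -->
  <property>
    <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <!-- Cluster-wide cap on how many times an ApplicationMaster may be
       restarted for one application -->
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>2</value>
  </property>
  <!-- Let a restarted MapReduce ApplicationMaster recover completed tasks
       from the job history saved in shared storage (the property named in
       item 3 above) -->
  <property>
    <name>yarn.app.mapreduce.am.job.recovery.enable</name>
    <value>true</value>
  </property>
</configuration>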

NodeManager failure

  1. The NodeManager periodically sends heartbeats to the ResourceManager; if no heartbeat arrives for longer than a configurable interval, the ResourceManager removes the node from its pool of live nodes (see the configuration sketch after this list)
  2. Any tasks and ApplicationMasters that were running on that NodeManager are recovered on other NodeManagers
  3. If tasks fail too many times on a particular NodeManager, the ApplicationMaster blacklists that node (the ResourceManager does not), and subsequent tasks are not scheduled on it
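
A sketch of the two properties most relevant here, assuming Hadoop 2.x names: the heartbeat expiry interval used by the ResourceManager (yarn-site.xml) and the per-node task-failure threshold used by the MapReduce ApplicationMaster for blacklisting (mapred-site.xml).

<!-- NodeManager liveness and per-job blacklisting (sketch) -->
<configuration>
  <!-- A NodeManager that sends no heartbeat for this many milliseconds is
       removed from the ResourceManager's pool of live nodes -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <!-- After this many task failures on one node, the MapReduce
       ApplicationMaster blacklists that node for the current job -->
  <property>
    <name>mapreduce.job.maxtaskfailures.per.tracker</name>
    <value>3</value>
  </property>
</configuration>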

ResourceManager failure

  1. Through a checkpoint mechanism, the ResourceManager periodically saves its state to persistent storage and recovers from it when it restarts after a failure (see the configuration sketch after this list)
  2. State is synchronized between ResourceManagers through ZooKeeper to achieve transparent high availability (HA)
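
A minimal yarn-site.xml sketch of ResourceManager HA with a ZooKeeper-backed state store; the rm-ids, hostnames and ZooKeeper quorum below are placeholder values for illustration.

<!-- yarn-site.xml: ResourceManager HA via ZooKeeper (sketch) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
  </property>
  <!-- Persist RM state so a restarted or newly active RM can recover it -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <!-- ZooKeeper quorum used for leader election and the state store -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
</configuration>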

As you can see, failures are generally detected (via heartbeats) and recovered by the parent module of the failed component. The top-level module, the ResourceManager, instead achieves HA through periodic state saving and state synchronization via ZooKeeper.