
Hadoop Failover


May 26, 2021 Hadoop


YARN - Failover

Types of failure

  1. Program bugs
  2. Process crashes
  3. Hardware failures

Failure handling

Task failure

  1. Runtime exceptions or sudden JVM exits are reported to the ApplicationMaster
  2. Hung tasks are detected through the heartbeat (progress) timeout; a task is only declared failed after a configurable number of checks (see the configuration sketch after this list)
  3. If the proportion of failed tasks in a job exceeds the configured threshold, the whole job is considered failed
  4. Failed tasks or jobs are re-run by the ApplicationMaster
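
The thresholds above are ordinary MapReduce configuration properties. The following mapred-site.xml fragment is a minimal sketch assuming Hadoop 2.x property names; the values shown are the usual defaults and may differ in other releases.

<!-- mapred-site.xml: task-failure handling (sketch, Hadoop 2.x names) -->
<configuration>
  <!-- Maximum attempts for a single map or reduce task before it is
       declared failed -->
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value>
  </property>
  <!-- A task that reports no progress for this many milliseconds is
       considered hung and is killed -->
  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
  </property>
  <!-- Percentage of map tasks that may fail before the whole job is
       marked as failed (0 = any unrecoverable task failure fails the job) -->
  <property>
    <name>mapreduce.map.failures.maxpercent</name>
    <value>0</value>
  </property>
</configuration>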

ApplicationMaster failure

  1. The ApplicationMaster periodically sends heartbeats to the ResourceManager; if the heartbeats stop, the ApplicationMaster is considered failed, and a configurable number of ApplicationMaster attempts is allowed before the application as a whole is given up on (see the configuration sketch after this list)
  2. Once the ApplicationMaster fails, the ResourceManager launches a new ApplicationMaster
  3. The new ApplicationMaster is responsible for recovering the state of the failed one (yarn.app.mapreduce.am.job.recovery.enable=true); this works by saving the application's running state to shared storage, and the ResourceManager itself does not save or restore task state
  4. The client also polls the ApplicationMaster periodically for progress and status; if it finds the ApplicationMaster has failed, it asks the ResourceManager for the address of the new one
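
A rough sketch of the properties that govern this behaviour, assuming a Hadoop 2.x MapReduce-on-YARN setup; the first two belong in yarn-site.xml, the last in mapred-site.xml, and the values shown are common defaults.

<!-- ApplicationMaster failover knobs (sketch, Hadoop 2.x names) -->
<configuration>
  <!-- How long the ResourceManager waits without an AM heartbeat before
       declaring the ApplicationMaster dead -->
  <property>
    <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <!-- Cluster-wide cap on how many times an ApplicationMaster may be
       restarted for one application -->
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>2</value>
  </property>
  <!-- Let a restarted MapReduce ApplicationMaster recover completed tasks
       from the job history saved in shared storage (the property named in
       item 3 above) -->
  <property>
    <name>yarn.app.mapreduce.am.job.recovery.enable</name>
    <value>true</value>
  </property>
</configuration>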

NodeManager failure

  1. The NodeManager periodically sends heartbeats to the ResourceManager; if no heartbeat arrives for longer than a configurable interval, the ResourceManager removes the node from its pool of live nodes (see the configuration sketch after this list)
  2. Any tasks and ApplicationMasters that were running on that NodeManager are recovered on other NodeManagers
  3. If tasks fail too many times on a particular NodeManager, the ApplicationMaster blacklists that node (the ResourceManager does not), and subsequent tasks are not scheduled on it
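
A sketch of the two properties most relevant here, assuming Hadoop 2.x names: the heartbeat expiry interval used by the ResourceManager (yarn-site.xml) and the per-node task-failure threshold used by the MapReduce ApplicationMaster for blacklisting (mapred-site.xml).

<!-- NodeManager liveness and per-job blacklisting (sketch) -->
<configuration>
  <!-- A NodeManager that sends no heartbeat for this many milliseconds is
       removed from the ResourceManager's pool of live nodes -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <!-- After this many task failures on one node, the MapReduce
       ApplicationMaster blacklists that node for the current job -->
  <property>
    <name>mapreduce.job.maxtaskfailures.per.tracker</name>
    <value>3</value>
  </property>
</configuration>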

ResourceManager failure

  1. Through a checkpoint mechanism, the ResourceManager periodically saves its state to persistent storage and recovers from it when it restarts after a failure (see the configuration sketch after this list)
  2. State is synchronized between ResourceManagers through ZooKeeper to achieve transparent high availability (HA)
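
A minimal yarn-site.xml sketch of ResourceManager HA with a ZooKeeper-backed state store; the rm-ids, hostnames and ZooKeeper quorum below are placeholder values for illustration.

<!-- yarn-site.xml: ResourceManager HA via ZooKeeper (sketch) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
  </property>
  <!-- Persist RM state so a restarted or newly active RM can recover it -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <!-- ZooKeeper quorum used for leader election and the state store -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
</configuration>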

As you can see, failures are generally detected (via heartbeats) and recovered by the parent module of the failed component. The top-level module, the ResourceManager, instead achieves HA through periodic state saving and state synchronization via ZooKeeper.