YARN - Failover
Types of failure
-
Program issues
-
Process crashes
-
Hardware issues
Failure handling
Task failure
-
Runtime exceptions or JVM exits are reported to the ApplicationMaster
-
Hung tasks are detected through heartbeat timeouts; the check is repeated a configurable number of times before the task is declared failed
-
If a job's task failure rate exceeds the configured threshold, the job is considered to have failed
-
Failed tasks or jobs are re-run by the ApplicationMaster
ApplicationMaster failure
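The task-failure rules above can be sketched as a small simulation. This is an illustrative model only, not YARN code: the class TaskTracker and the parameters timeout_s, max_attempts, and max_failed_fraction are hypothetical names standing in for the configurable timeout, retry count, and failure-rate threshold described above, and the sketch assumes that "checking multiple times" means re-running a task a limited number of times.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Toy model of how an ApplicationMaster might track task health."""
    timeout_s: float = 600.0          # hypothetical heartbeat timeout
    max_attempts: int = 4             # hypothetical per-task attempt limit
    max_failed_fraction: float = 0.0  # hypothetical tolerated share of failed tasks
    last_heartbeat: dict = field(default_factory=dict)  # task id -> last timestamp
    attempts: dict = field(default_factory=dict)        # task id -> attempts used
    failed: set = field(default_factory=set)            # permanently failed tasks

    def start(self, task_id, now):
        """Launch (or relaunch) a task and count the attempt."""
        self.attempts[task_id] = self.attempts.get(task_id, 0) + 1
        self.last_heartbeat[task_id] = now

    def heartbeat(self, task_id, now):
        """Record that a running task reported in."""
        self.last_heartbeat[task_id] = now

    def check(self, now):
        """Re-run timed-out tasks; mark them failed once attempts run out."""
        rerun = []
        for task_id, seen in list(self.last_heartbeat.items()):
            if now - seen <= self.timeout_s:
                continue  # task is still reporting in
            if self.attempts[task_id] < self.max_attempts:
                self.start(task_id, now)
                rerun.append(task_id)
            else:
                self.failed.add(task_id)
                del self.last_heartbeat[task_id]
        return rerun

    def job_failed(self):
        """The job fails when too large a share of its tasks has failed."""
        total = len(self.attempts)
        return total > 0 and len(self.failed) / total > self.max_failed_fraction
```

With max_failed_fraction left at 0.0, a single permanently failed task fails the whole job; raising it models a job that tolerates some task failures.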
-
ApplicationMaster periodically sends a heartbeat to ResourceManager. Usually the application is considered failed as soon as its ApplicationMaster fails, but it can be configured to fail only after several ApplicationMaster failures
-
Once the ApplicationMaster fails, ResourceManager launches a new ApplicationMaster
-
The new ApplicationMaster is responsible for restoring the state of the previous, failed ApplicationMaster (yarn.app.mapreduce.am.job.recovery.enable=true). This is achieved by saving the application's running state to shared storage; ResourceManager is not responsible for saving or restoring task state
-
The client also periodically polls the ApplicationMaster for progress and status, and asks ResourceManager for the new ApplicationMaster if it finds that the current one has failed
NodeManager failure
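The recovery step above, where a new ApplicationMaster picks up from state saved to shared storage, can be sketched as follows. This is a minimal illustration under stated assumptions, not the MapReduce implementation: JobStateStore, run_attempt, crash_after, and demo are hypothetical names, and a local JSON file stands in for the shared storage.

```python
import json
import os
import tempfile


class JobStateStore:
    """Toy stand-in for shared storage holding the set of completed tasks."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return set()
        with open(self.path) as f:
            return set(json.load(f))

    def save(self, completed):
        with open(self.path, "w") as f:
            json.dump(sorted(completed), f)


def run_attempt(store, tasks, crash_after=None):
    """Run one ApplicationMaster attempt; return the tasks it executed."""
    completed = store.load()      # recover what earlier attempts finished
    executed = []
    for task in tasks:
        if task in completed:
            continue              # already done by a previous attempt
        if crash_after is not None and len(executed) == crash_after:
            raise RuntimeError("simulated ApplicationMaster crash")
        executed.append(task)
        completed.add(task)
        store.save(completed)     # persist progress after every task
    return executed


def demo():
    """First attempt crashes after two tasks; the second finishes the rest."""
    tasks = ["m0", "m1", "m2", "r0"]
    with tempfile.TemporaryDirectory() as d:
        store = JobStateStore(os.path.join(d, "job-state.json"))
        try:
            run_attempt(store, tasks, crash_after=2)
        except RuntimeError:
            pass                  # a new attempt would now be launched
        return run_attempt(store, tasks)
```

Because the saved state lives outside the ApplicationMaster process, the second attempt only runs the tasks the first one did not complete, which matches the division of labor described above: ResourceManager restarts the ApplicationMaster, and the ApplicationMaster restores its own task state.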
-
NodeManager periodically sends a heartbeat to ResourceManager; if ResourceManager receives no heartbeat for longer than a timeout, it removes that NodeManager
-
Any tasks and ApplicationMasters running on that NodeManager are recovered on other NodeManagers
-
If a NodeManager fails too many times, ApplicationMaster blacklists it (ResourceManager does not), and tasks are no longer scheduled on it
ResourceManager failure
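The two NodeManager mechanisms above, heartbeat expiry on the ResourceManager side and blacklisting on the ApplicationMaster side, can be sketched like this. The names expire_nodes, NodeBlacklist, and max_failures are hypothetical and chosen for illustration; the threshold value is an assumption, not a YARN default.

```python
from collections import Counter


def expire_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes a ResourceManager would drop for missing heartbeats."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout_s
    )


class NodeBlacklist:
    """Toy per-application blacklist kept by an ApplicationMaster."""

    def __init__(self, max_failures=3):  # hypothetical threshold
        self.max_failures = max_failures
        self.failures = Counter()        # node -> number of task failures seen

    def record_failure(self, node):
        self.failures[node] += 1

    def is_blacklisted(self, node):
        return self.failures[node] >= self.max_failures

    def pick_node(self, candidates):
        """Return the first candidate that is not blacklisted, or None."""
        for node in candidates:
            if not self.is_blacklisted(node):
                return node
        return None
```

Keeping the blacklist inside the ApplicationMaster means a node that misbehaves for one application can still be offered to others, which is consistent with the note that ResourceManager itself does not blacklist.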
-
Through a checkpoint mechanism, ResourceManager periodically saves its state to disk and recovers from that state when it is restarted after a failure
-
State is synchronized and transparent HA is achieved through ZooKeeper
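The checkpoint-and-restore idea above can be sketched as follows. This is a generic illustration, not the ResourceManager's actual state store: checkpoint, recover, and the JSON layout with an "applications" key are assumptions made for the example, and ZooKeeper-based synchronization between an active and a standby instance is not modeled.

```python
import json
import os
import tempfile


def checkpoint(state, path):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())     # make sure the bytes reach the disk
    os.replace(tmp_path, path)   # atomic rename within one filesystem


def recover(path):
    """Load the last checkpoint, or start empty if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"applications": {}}
```

Writing to a temporary file and renaming it means a restart always sees either the previous checkpoint or the new one, never a partial write. Anything that changed after the last checkpoint is lost in this model, which is one reason an HA setup synchronizes state continuously instead of relying on periodic saves alone.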
As you can see, failures are generally monitored (through heartbeats) and recovered by the parent module of the module that failed.
The top-level module, on the other hand, implements HA through periodic state saves, state synchronization, and ZooKeeper.