May 17, 2021 | Spark Programming Guide
In this section, we discuss Spark Streaming's behavior in the event of a node failure. To understand this, let's first recall some of Spark RDDs' basic fault-tolerance semantics.
Spark operates on data in fault-tolerant file systems such as HDFS or S3, so any RDD derived from fault-tolerant data is itself fault-tolerant. However, this is not the case for Spark Streaming, because its data mostly arrives over the network. To give the resulting RDDs the same fault tolerance, the received data is replicated across multiple Spark executors on worker nodes (the default replication factor is 2). This means there are two kinds of data that may need to be recovered when a failure occurs:

- Data received and replicated: this survives the failure of a single worker node, since a copy exists on another node.
- Data received but buffered for replication: this has not yet been replicated, so the only way to recover it is to get it again from the source.
Furthermore, there are two kinds of failures we need to be concerned about:

- Failure of a worker node: any worker node running executors can fail, and all in-memory data on those nodes will be lost. If any receivers were running on the failed node, their buffered data will be lost.
- Failure of the driver node: if the driver running the application fails, the SparkContext is lost, and all executors along with their in-memory data are lost.
If all the input data already exists in a fault-tolerant file system such as HDFS, Spark Streaming can always recover from any failure and reprocess all of the data. This gives exactly-once semantics: no matter what fails, all the data will be processed exactly once.
For receiver-based input sources, the fault-tolerance semantics depend on both the failure scenario and the type of receiver. As discussed earlier, there are two types of receivers: reliable receivers, which acknowledge data to the source only after ensuring it has been replicated, and unreliable receivers, which do not send acknowledgment.
Which semantics you get depends on the type of receiver used. If a worker node fails, a reliable receiver loses no data, while an unreliable receiver loses data that was received but not yet replicated. If the driver node fails, then in addition to the losses above, all data that was previously received and replicated in memory is lost, which can affect the results of stateful transformations.
To avoid losing previously received data, Spark 1.2 introduced write-ahead logs, which save received data into a fault-tolerant storage system. With write-ahead logs enabled and a reliable receiver, we can achieve zero data loss and exactly-once semantics.
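Assuming Spark 1.2 or later, write-ahead logging is enabled with a configuration property, together with a checkpoint directory on fault-tolerant storage (the path below is a placeholder, not from this document):

```
# spark-defaults.conf (or set programmatically on SparkConf):
# persist all received data to the write-ahead log before acknowledging it
spark.streaming.receiver.writeAheadLog.enable  true
```

The StreamingContext must also have checkpointing enabled, e.g. `streamingContext.checkpoint("hdfs://...")`, so the logs have a fault-tolerant directory to live in.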
The following table summarizes the failure semantics:
Deployment Scenario | Worker Failure | Driver Failure |
---|---|---|
Spark 1.1 or earlier, or Spark 1.2 without write-ahead logs | Buffered data lost with an unreliable receiver; zero data loss with a reliable receiver or files | Buffered data lost with an unreliable receiver; past received data lost in all cases; zero data loss with files |
Spark 1.2 with write-ahead logs | Zero data loss with a reliable receiver or files | Zero data loss with a reliable receiver or files |
Because all data is modeled as RDDs, and an RDD is deterministically recomputed from its lineage, every recomputation produces the same result. Consequently, all DStream transformations have exactly-once semantics: even if a worker node fails, the final transformed result is the same. However, output operations such as foreachRDD have at-least-once semantics, which means that in the event of a worker failure, the transformed data may be written to an external system more than once.
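A common way to get effectively exactly-once output on top of these at-least-once semantics is to make the write transactional, keyed on a batch identifier. The sketch below is plain Python (not Spark API) modeling redelivery of a batch; `output_store`, `committed_batches`, and the batch ids are illustrative assumptions:

```python
# Model an external sink that applies each batch at most once by
# checking a batch identifier before performing the write.
output_store = {}          # sink contents, keyed by record id
committed_batches = set()  # batch ids already applied

def write_batch(batch_id, records):
    """Transactional write: a redelivered batch is skipped."""
    if batch_id in committed_batches:
        return  # already applied; at-least-once delivery does no harm
    for rec_id, value in records:
        output_store[rec_id] = value
    committed_batches.add(batch_id)

# A worker failure may cause the same batch to be delivered twice.
write_batch(0, [(1, "a"), (2, "b")])
write_batch(0, [(1, "a"), (2, "b")])  # redelivery: no duplicate effect
write_batch(1, [(3, "c")])
```

The same idea applies inside a real foreachRDD body: derive a stable identifier for each batch and let the external system reject duplicates.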
With saveAs***Files operations, it is acceptable to write to HDFS more than once, because the file is simply overwritten with the same data.
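The overwrite argument can be sketched in plain Python (again, not Spark API): deterministic recomputation yields identical batch contents, so a retried save just rewrites the same file with the same data. The `files` dict standing in for an HDFS directory is an illustrative assumption:

```python
files = {}  # stands in for an HDFS directory: path -> contents

def save_as_file(path, data):
    """Overwrite semantics: writing the same data twice leaves the sink unchanged."""
    files[path] = data

# A retry after a worker failure recomputes the identical batch,
# so the second write is an effective no-op.
save_as_file("out/batch-0", ["a", "b"])
save_as_file("out/batch-0", ["a", "b"])  # retry: same path, same data
```

This is why the at-least-once output semantics are harmless for idempotent operations like saveAs***Files.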