
Spark Streaming fault tolerance semantics


May 17, 2021 Spark Programming guide



In this section, we'll discuss Spark Streaming's behavior in the event of a node failure. To understand this, let's first recall the basic fault-tolerance semantics of Spark's RDDs.

  • An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD remembers the lineage of deterministic operations that were used to create it from a fault-tolerant input dataset.
  • If any partition of an RDD is lost due to a node failure, that partition can be recomputed by replaying its lineage from the original fault-tolerant dataset.
  • Assuming that all RDD transformations are deterministic, the data in the final transformed RDD is always the same, regardless of failures in the Spark cluster.
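The lineage-based recovery described above can be sketched in a few lines of plain Python. This is an illustrative simulation, not Spark's actual API: `compute_partition` and `lineage` are hypothetical names standing in for an RDD's recorded chain of deterministic operations.

```python
# Illustrative sketch (not Spark's API): a partition remembers the
# deterministic lineage of operations that produced it, so a lost
# partition can be rebuilt by replaying that lineage against the
# fault-tolerant source data.
source = [1, 2, 3, 4]                         # stands in for an HDFS input split
lineage = [lambda x: x * 2, lambda x: x + 1]  # deterministic transformations

def compute_partition(source_data, ops):
    """Replay the lineage over the source data to (re)build a partition."""
    data = list(source_data)
    for op in ops:
        data = [op(x) for x in data]
    return data

partition = compute_partition(source, lineage)   # first computation
partition = None                                 # simulate losing the node
recovered = compute_partition(source, lineage)   # replay lineage: same result
```

Because every operation in the lineage is deterministic, the recomputed partition is byte-for-byte identical to the lost one.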

Spark operates on data in fault-tolerant file systems such as HDFS or S3. Hence, any RDD derived from fault-tolerant data is itself fault-tolerant. However, this is not the case for Spark Streaming, because its data mostly arrives over the network. To give the generated RDDs the same fault-tolerance guarantees, the received data is replicated across multiple Spark executors on worker nodes (the default replication factor is 2). This results in two kinds of data that need to be recovered when a failure occurs:

  • Data received and replicated: This data survives the failure of a single worker node, because another node holds a copy of it.
  • Data received but buffered for replication: Since this data has not yet been replicated, the only way to recover it is to fetch it again from the source.
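The replication step can be sketched as follows. This is a simplified simulation under assumed names (`store_block`, `nodes`), not Spark's block manager: a received block is placed on two executors, so losing any single one still leaves a copy.

```python
import random

# Illustrative sketch: a received block is stored on `replication`
# executors (Spark Streaming's default replication factor is 2), so the
# failure of any single worker node still leaves one in-memory copy.
def store_block(block, nodes, replication=2):
    """Place the block on `replication` distinct nodes; return node -> block."""
    holders = random.sample(nodes, replication)
    return {node: block for node in holders}

nodes = ["worker-1", "worker-2", "worker-3"]
copies = store_block("block-0 payload", nodes, replication=2)

failed = next(iter(copies))                               # one holder crashes
survivors = {n: b for n, b in copies.items() if n != failed}
# the replica on the other node survives the single-node failure
```

With replication factor 2 this tolerates exactly one worker failure; data still buffered for replication at crash time has no second copy and must be re-fetched.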

Furthermore, there are two kinds of failures we need to be concerned about:

  • Worker node failure: Any worker node running an executor can fail, and all in-memory data on that node will be lost. If any receivers were running on the failed node, their buffered data is lost as well.
  • Driver node failure: If the driver node running the Spark Streaming application fails, the SparkContext is clearly lost, and with it all executors and their in-memory data.

Semantics with files as input source

If all the input data already exists in a fault-tolerant file system such as HDFS, Spark Streaming can always recover from any failure and reprocess all the data. This gives exactly-once semantics: no matter what fails, all the data will be processed exactly once.

Semantics with receiver-based input sources

For receiver-based input sources, the fault-tolerance semantics depend on both the failure scenario and the type of receiver. As discussed earlier, there are two types of receivers:

  • Reliable receiver: These receivers acknowledge reliable sources only after ensuring that the received data has been replicated. If such a receiver fails, the buffered (unreplicated) data is never acknowledged to the source. When the receiver restarts, the source resends that data, so no data is lost.
  • Unreliable receiver: Such receivers can lose data when a worker or driver node fails.
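The replicate-then-acknowledge protocol of a reliable receiver can be sketched as a small simulation. The class and method names here (`Source`, `ReliableReceiver`, `crash_after`) are invented for illustration; they are not Spark's receiver API.

```python
# Illustrative sketch: a reliable receiver acknowledges the source only
# after the data is replicated, so data still buffered at the moment of a
# crash is simply resent by the source when the receiver restarts.
class Source:
    def __init__(self, records):
        self.unacked = list(records)     # records the source will resend
    def pending(self):
        return list(self.unacked)
    def ack(self, record):
        self.unacked.remove(record)

class ReliableReceiver:
    def __init__(self):
        self.replicated = []
    def receive(self, source, crash_after=None):
        for i, rec in enumerate(source.pending()):
            if crash_after is not None and i >= crash_after:
                return                    # crash: buffered data never acked
            self.replicated.append(rec)   # 1. replicate the record first...
            source.ack(rec)               # 2. ...only then acknowledge it

src = Source(["a", "b", "c"])
rx = ReliableReceiver()
rx.receive(src, crash_after=1)            # crashes after replicating "a"
rx2 = ReliableReceiver()                  # restarted receiver
rx2.receive(src)                          # source resends unacked "b", "c"
```

An unreliable receiver would acknowledge (or never be resent) records before replication, so the records in flight at the crash would simply vanish.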

Which type of receiver to choose depends on these semantics. If a worker node fails, a reliable receiver loses no data, while an unreliable receiver loses data that was received but not yet replicated. If the driver node fails, then in addition to the losses above, all data that was previously received and replicated in memory is lost, which can affect the results of stateful transformations.

To avoid this loss of previously received data, Spark 1.2 introduced write-ahead logs, which save received data to a fault-tolerant storage system. With write-ahead logs enabled and reliable receivers, we can achieve zero data loss and exactly-once semantics.
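The write-ahead-log idea reduces to one ordering rule: persist to durable storage before buffering in memory. The sketch below is a deliberately simplified simulation (the names `durable_log` and `receive_with_wal` are made up, and a Python list stands in for a log file on HDFS/S3); it is not Spark's implementation.

```python
# Illustrative sketch of the write-ahead log: received data is appended to
# durable storage *before* it is buffered in memory, so data that existed
# only in executor/driver memory can be replayed after a crash.
durable_log = []                   # stands in for a log file on HDFS or S3

def receive_with_wal(record, in_memory):
    durable_log.append(record)     # 1. persist to the write-ahead log
    in_memory.append(record)       # 2. only then buffer for processing

memory = []
for rec in ["e1", "e2", "e3"]:
    receive_with_wal(rec, memory)

memory = []                        # a crash wipes all in-memory state
recovered = list(durable_log)      # recovery replays the durable log
```

Because the log write happens first, there is no window in which a record exists only in volatile memory, which is exactly what closes the data-loss gap described above.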

The following table summarizes the error semantics:

| Deployment Scenario | Worker Failure | Driver Failure |
| --- | --- | --- |
| Spark 1.1 or earlier, or Spark 1.2 without write-ahead log | Buffered data lost with unreliable receivers; zero data loss with reliable receivers and files | Buffered data lost with unreliable receivers; past data lost with all receivers; zero data loss with files |
| Spark 1.2 with write-ahead log | Zero data loss with reliable receivers and files | Zero data loss with reliable receivers and files |

Semantics of output operations

Because all data is modeled as RDDs, whose recomputation is determined by their lineage of deterministic operations, every recomputation produces the same result. All DStream transformations therefore have exactly-once semantics: even if a worker node fails, the final transformed result is the same. However, output operations such as foreachRDD have at-least-once semantics, which means that in the event of a worker failure, the transformed data may be written to an external system more than once. With saveAs***Files operations writing to HDFS, writing more than once is acceptable, because the file is simply overwritten with the same data.
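Why is a duplicated write acceptable for saveAs***Files? Because overwriting a file path with identical contents is idempotent. The sketch below illustrates this with a dict standing in for HDFS (the names `sink` and `save_as_file` are invented for the illustration).

```python
# Illustrative sketch: output operations are at-least-once, so the same
# transformed batch may be written twice after a worker failure. An
# idempotent sink, like saveAs***Files overwriting the same file path with
# the same data, makes the duplicate write harmless.
sink = {}                            # stands in for HDFS: path -> contents

def save_as_file(path, contents):
    sink[path] = contents            # overwriting with identical data is a no-op

batch = [("batch-0001.txt", "a,b,c")]
for path, contents in batch:
    save_as_file(path, contents)     # first write
for path, contents in batch:
    save_as_file(path, contents)     # retried write after a worker failure
```

A non-idempotent sink, such as appending rows to a table, would instead see the batch twice; such sinks need deduplication or transactional writes to get effectively-exactly-once output.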