May 17, 2021 Spark Programming Guide
There are several optimizations in Spark that can reduce batch processing time; these are discussed in the Optimization Guide. This section focuses on a few of the most important ones.
Receiving data over the network (e.g. Kafka, Flume, sockets, etc.) requires the data to be deserialized and stored in Spark. If data reception becomes a bottleneck in the system, consider receiving the data in parallel. Note that each input DStream creates a single receiver (running on a worker machine) that receives a single stream of data. To receive multiple streams, create multiple input DStreams and configure them to receive different partitions of the data stream from the source. For example, a single Kafka input DStream receiving data from two topics can be split into two Kafka input streams, one per topic. This runs two receivers, allowing data to be received in parallel and increasing overall throughput.
These multiple DStreams can be unioned into a single DStream, so that the transformations that were applied to a single input DStream can be applied to the merged DStream.
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
Another parameter to consider is the receiver's block interval. For most receivers, the received data is coalesced into blocks of data before being stored in Spark's memory. The number of blocks in each batch determines the number of tasks that will be used to process the received data in map-like transformations.
The block interval is determined by the configuration property
spark.streaming.blockInterval
and the default value is 200 milliseconds. The number of tasks per receiver per batch is therefore approximately the batch interval divided by the block interval.
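As a concrete illustration of that ratio (a minimal sketch; the object and method names below are made up for this example and are not part of Spark's API):

```scala
// Illustrative helper (not a Spark API): estimates how many map-like tasks
// a single receiver contributes to each batch.
object BlockIntervalEstimate {
  // One block is created every blockIntervalMs, so a batch covering
  // batchIntervalMs contains roughly batchIntervalMs / blockIntervalMs
  // blocks, and hence about that many tasks.
  def tasksPerReceiver(batchIntervalMs: Long, blockIntervalMs: Long): Long =
    batchIntervalMs / blockIntervalMs
}

// A 2-second batch with the default 200 ms block interval yields
// about 10 tasks per receiver per batch:
println(BlockIntervalEstimate.tasksPerReceiver(2000, 200))  // prints 10
```

Lowering the block interval raises the task count (and parallelism) at the cost of more scheduling overhead per batch.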
An alternative to receiving data with multiple input streams/receivers is to explicitly repartition the input data stream with
inputStream.repartition(<number of partitions>)
which distributes the received batches of data across the specified number of machines in the cluster before further processing.
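Continuing the earlier union example, a repartition call might look like the sketch below (the partition count 10 is an arbitrary choice for illustration, and `unifiedStream` is the stream built above):

```scala
// Sketch: spread the received blocks over 10 partitions before heavy
// processing, so downstream transformations use the whole cluster.
val repartitionedStream = unifiedStream.repartition(10)
```

This is a configuration-style fragment; it assumes a live StreamingContext and input stream exist.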
If the number of parallel tasks used in any stage of the computation is not large enough, the resources of the cluster will not be fully utilized. For example, for distributed reduce operations such as
reduceByKey
and
reduceByKeyAndWindow
the default number of parallel tasks is determined by the configuration property (configuration.html#spark-properties)
spark.default.parallelism
You can either pass the level of parallelism as an argument to the
PairDStreamFunctions
operations (api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions), or set
spark.default.parallelism
to modify the default value.
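For example (a sketch; `pairDStream` is a placeholder for a DStream of key/value pairs defined elsewhere, and 32 is an arbitrary parallelism level):

```scala
// Sketch: pass the desired parallelism directly to the reduce operation...
val counts = pairDStream.reduceByKey((a: Int, b: Int) => a + b, 32)  // 32 reduce tasks

// ...or raise the global default for all such operations instead:
// conf.set("spark.default.parallelism", "32")
```

Passing the partition count per operation gives finer control; the global property changes the default everywhere.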
The total cost of data serialization is usually significant, especially when sub-second batch sizes are desired. There are two related points:
The number of tasks launched per second can be very high (50 or more). Sending tasks out to the slaves then takes a significant amount of time, making it difficult to achieve sub-second responses. You can reduce these overheads with the changes below.
These changes may reduce batch processing time by hundreds of milliseconds, thus making sub-second batch sizes viable.
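One change commonly applied at this point is switching Spark's serializer to Kryo. This is a hedged sketch, assuming Kryo suits your record types; the application name is illustrative:

```scala
import org.apache.spark.SparkConf

// Sketch: enable Kryo serialization to cut per-record and per-task
// serialization overhead (assumption: Kryo handles your classes).
val conf = new SparkConf()
  .setAppName("sub-second-streaming")  // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```

This is a configuration fragment; the SparkConf would then be used to construct the StreamingContext.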