
Spark Streaming reduces the execution time of batch data


May 17, 2021 Spark Programming Guide



There are several optimizations in Spark that can reduce the processing time of each batch. These are discussed in detail in the Optimization Guide. This section highlights a few of the most important ones.

The parallel level of data receiving

Receiving data over the network (e.g. Kafka, Flume, sockets, etc.) requires that the data be deserialized and stored in Spark. If data reception becomes a bottleneck in the system, consider receiving the data in parallel. Note that each input DStream creates a single receiver (running on a worker machine) that receives a single stream of data. To receive multiple streams, create multiple input DStreams and configure them to receive different partitions of the data stream from the source. For example, a single Kafka input DStream receiving data from two topics can be split into two Kafka input streams, one per topic. This runs two receivers on two workers, allowing data to be received in parallel and increasing overall throughput. The multiple DStreams can then be unioned into a single DStream, so the transformations that would have been applied to the single input DStream can be applied to the merged stream:

val numStreams = 5
// Create numStreams Kafka input streams (the createStream arguments are elided here)
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
// Union them into a single DStream for the downstream transformations
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()

Another parameter to consider is the receiver's block interval. For most receivers, the received data is coalesced into blocks before being stored in Spark's memory. The number of blocks in each batch determines the number of tasks that will process the received data in map-like transformations. The block interval is determined by spark.streaming.blockInterval, and the default value is 200 milliseconds.
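
To make this concrete: with a 2-second batch interval and the default 200 ms block interval, each receiver produces roughly 2000 / 200 = 10 blocks, and hence about 10 map tasks, per batch. Below is a minimal sketch of lowering the block interval to increase that task count (the application name and interval values are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A 100 ms block interval with a 2-second batch interval yields roughly
// 2000 / 100 = 20 blocks (and therefore map tasks) per receiver per batch.
val conf = new SparkConf()
  .setAppName("BlockIntervalExample") // illustrative name
  .set("spark.streaming.blockInterval", "100ms")
val ssc = new StreamingContext(conf, Seconds(2))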

An alternative to receiving data with multiple input streams/receivers is to explicitly repartition the input stream with inputStream.repartition(<number of partitions>), which distributes the received batches of data across the specified number of machines in the cluster before further processing.
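
For example, reusing unifiedStream from the earlier snippet (a sketch; the partition count of 10 is an illustrative value that would be tuned to the cluster):

// Redistribute each received batch across 10 partitions before the
// heavier transformations run, spreading the load over more machines.
val repartitionedStream = unifiedStream.repartition(10)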

The parallel level of data processing

If the number of parallel tasks used in any stage of the computation is not large enough, the cluster's resources will be under-utilized. For example, for distributed reduce operations such as reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is determined by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument to the PairDStreamFunctions operations, or set spark.default.parallelism to change the default.
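
A minimal sketch of both approaches, assuming a hypothetical word-count job reading from a socket (the host, port, and parallelism values are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Raise the default parallelism used by distributed reduce operations...
val conf = new SparkConf()
  .setAppName("ParallelismExample") // illustrative name
  .set("spark.default.parallelism", "16")
val ssc = new StreamingContext(conf, Seconds(1))

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// ...or pass the level of parallelism to an individual operation.
val wordCounts = pairs.reduceByKey(_ + _, numPartitions = 16)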

Data serialization

The cost of data serialization is usually significant, especially when sub-second batch sizes are targeted. There are two relevant aspects:

  • Serialization of RDD data in Spark. Refer to the data serialization section of the Spark optimization guide. Note that, unlike in core Spark, streaming RDDs are by default persisted as serialized byte arrays to reduce garbage-collection pauses (a Kryo configuration sketch follows this list).
  • Serialization of input data. Data received from external sources arrives in Spark as bytes, which must be deserialized from the external format and then re-serialized using Spark's serialization format. This deserialization of input data can therefore be a bottleneck.
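
As a minimal sketch of the first point, Kryo serialization can be enabled through SparkConf (MyRecord is a hypothetical application class standing in for real data types):

import org.apache.spark.SparkConf

// Hypothetical record type standing in for the application's data classes.
case class MyRecord(id: Long, value: String)

// Enable Kryo and register application classes up front, so that full class
// names do not have to be written alongside every serialized object.
val conf = new SparkConf()
  .setAppName("KryoExample") // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))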

Task launch overhead

If the number of tasks launched per second is high (say, 50 or more), the overhead of sending tasks to the slaves becomes significant, making it hard to achieve sub-second latencies. The overhead can be reduced with the changes below:

  • Task serialization. Using Kryo serialization for tasks can reduce task sizes, and thus the time it takes to send tasks to the slaves.
  • Execution mode. Running Spark in Standalone mode or coarse-grained Mesos mode gives shorter task launch times than fine-grained Mesos mode (see the configuration sketch after this list). Refer to the Running Spark on Mesos guide for more information.
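
For illustration, coarse-grained Mesos mode can be selected through configuration (a sketch; the master URL is a placeholder):

import org.apache.spark.SparkConf

// Coarse-grained mode launches fewer, longer-lived executors, which lowers
// the per-task launch overhead compared to fine-grained mode.
val conf = new SparkConf()
  .setMaster("mesos://host:5050") // placeholder master URL
  .set("spark.mesos.coarse", "true")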

These changes may reduce batch processing time by hundreds of milliseconds, making sub-second batch sizes viable.