
Spark Streaming: setting the right batch interval


May 17, 2021 · Spark Programming Guide


Table of contents


Setting the right batch interval

For a Spark Streaming application to run stably on a cluster, the system must be able to process data as fast as it is received (that is, the processing rate must be greater than or equal to the rate at which data arrives). You can check this in the streaming web UI: the batch processing time should be less than the batch interval.

Depending on the nature of the streaming computation, the batch interval can significantly affect the data rate the application can sustain. Consider the WordCountNetwork example: for a given data rate, the system may be able to keep up with printing word counts every 2 seconds (a batch interval of 2 seconds), but not every 500 milliseconds. Therefore, to sustain the desired data rate in production, you should set an appropriate batch interval (that is, the size of each batch of data).
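
As a minimal sketch of where the batch interval is set: it is fixed when the StreamingContext is created. The socket source on localhost:9999 is an assumption matching the standard network word count example; everything else uses Spark's public streaming API.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WordCountNetwork {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountNetwork")

        // The batch interval is fixed here: one batch every 2 seconds.
        // To test 500 ms batches instead, use Milliseconds(500).
        val ssc = new StreamingContext(conf, Seconds(2))

        // Assumption: a text source is listening on localhost:9999 (e.g. `nc -lk 9999`).
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print() // printed once per 2-second batch

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because the interval is part of the StreamingContext itself, changing it means restarting the application, which is why it is worth finding a workable value before going to production.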

A good way to find the right batch interval is to test your application with a conservative batch interval (5-10 seconds) and a low data rate. To verify that the system can keep up with the data rate, check the end-to-end latency of each processed batch (look for "Total delay" in the Spark driver's log4j logs, or use the StreamingListener interface). If the delay remains stable, the system is stable. If the latency keeps growing, the system cannot keep up with the data rate and is unstable. Once you have a stable configuration, you can try increasing the data rate and/or reducing the batch interval for further testing. Note that a momentary increase in delay caused by a temporary spike in the data rate may be fine, as long as the delay falls back to a low value (less than the batch interval).
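
As a sketch of the StreamingListener approach mentioned above: the listener trait, its batch-completion callback, and the totalDelay field are part of Spark's public API, while the threshold check and log messages here are illustrative assumptions.

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs the end-to-end ("total") delay of each completed batch so you can
    // watch whether it stays bounded or keeps growing over time.
    class TotalDelayListener(batchIntervalMs: Long) extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        // totalDelay = scheduling delay + processing time, in milliseconds.
        batchCompleted.batchInfo.totalDelay.foreach { delayMs =>
          if (delayMs > batchIntervalMs) {
            // A brief spike is fine; a steady upward trend means the system
            // cannot keep up with the data rate at this batch interval.
            println(s"WARN: total delay $delayMs ms exceeds batch interval $batchIntervalMs ms")
          } else {
            println(s"Total delay: $delayMs ms")
          }
        }
      }
    }

    // Register the listener before starting the context:
    // ssc.addStreamingListener(new TotalDelayListener(batchIntervalMs = 2000))

If the values printed by such a listener trend upward batch after batch rather than hovering near the batch interval, that is the instability signal described above.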