
DStreams cache or persist


May 17, 2021 Spark Programming guide




Cache or persist

Similar to RDDs, DStreams also allow developers to persist the stream's data in memory. Calling the persist() method on a DStream automatically persists every RDD of that DStream in memory. This is useful if the data in the DStream will be computed more than once, for example when several operations are applied to the same data. For window-based operations such as reduceByWindow and reduceByKeyAndWindow, and state-based operations such as updateStateByKey, persistence is implicit: the DStreams they generate are persisted automatically, without the developer calling persist().
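As a rough illustration of the case described above, the following sketch persists a DStream that feeds two separate output operations, so each batch is not recomputed for the second one. The socket source on localhost:9999, the one-second batch interval, and the object name PersistExample are assumptions made for this example, not part of the original guide.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PersistExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("PersistExample")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed source for illustration: text lines from a local socket
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))

    // words is used by two output operations below, so persist it;
    // every RDD generated for this DStream is then kept in memory
    words.persist()

    // First use: number of words in each batch
    words.count().print()

    // Second use: per-word counts in each batch
    words.map(w => (w, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it locally, something like `nc -lk 9999` can provide the socket input before the application is started.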

For input streams that receive data over the network (e.g. Kafka, Flume, sockets), the default persistence level replicates the data to two different nodes for fault tolerance.

Note that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory. This is discussed further in the Performance Tuning section. For more information on the different persistence levels, see the RDD Persistence section.
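The persistence levels mentioned above can also be set explicitly. The sketch below is only an illustration under assumed names (the socket source, port, and object name StorageLevelExample are hypothetical): it passes a replicated, serialized level to a network input stream and chooses a deserialized level for a derived DStream in place of the serialized default.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StorageLevelExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StorageLevelExample")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Received data: serialized and replicated to two nodes (the "_2" suffix).
    // This is also the value socketTextStream uses when no level is given.
    val lines = ssc.socketTextStream(
      "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    val words = lines.flatMap(_.split(" "))

    // persist() with no argument stores serialized data (MEMORY_ONLY_SER);
    // here a deserialized level is requested instead, trading memory for CPU
    words.persist(StorageLevel.MEMORY_ONLY)

    words.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The storage level has to be chosen before the streaming context is started; whether a deserialized level is worthwhile depends on the batch size and available memory.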