May 17, 2021
Spark Programming Guide
One of Spark's most important capabilities is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores in memory the partitions of it that it computes, and reuses them in other actions on that dataset (or on datasets derived from it). This makes future actions much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() method on it. The first time the RDD is computed in an action, it will be kept in memory on the nodes.
Spark's cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
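For example, a minimal sketch of this pattern (the application name and input path below are placeholders, not part of the guide):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Mark an RDD for caching and reuse it across two actions.
val sc = new SparkContext(new SparkConf().setAppName("CacheSketch"))
val lines   = sc.textFile("data.txt")   // placeholder input path
val lengths = lines.map(_.length)
lengths.cache()                         // nothing is computed yet
println(lengths.count())                // first action computes and caches the RDD
println(lengths.reduce(_ + _))          // second action is served from the cache
```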
In addition, each persisted RDD can be stored using a different storage level. These levels let us, for example, persist the dataset on disk, persist it in memory as serialized Java objects, replicate it across nodes, or store it off-heap in Tachyon. A storage level is set by passing a StorageLevel object to the persist() method.
The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY.
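A minimal sketch, assuming `lines` is an existing RDD that has not yet been assigned a storage level:

```scala
import org.apache.spark.storage.StorageLevel

// Persist with an explicit storage level instead of the default.
lines.persist(StorageLevel.MEMORY_AND_DISK)

// Note: an RDD's storage level can only be assigned once;
// rdd.cache() is simply rdd.persist(StorageLevel.MEMORY_ONLY).
```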
The full set of storage levels is described below:
Storage Level | Meaning
---|---
MEMORY_ONLY | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default storage level.
MEMORY_AND_DISK | Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that do not fit on disk and read them from there when they are needed.
MEMORY_ONLY_SER | Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that do not fit in memory to disk instead of recomputing them each time they are needed.
DISK_ONLY | Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental) | Store the RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory. This makes it attractive in environments with large heaps or multiple concurrent applications.
NOTE: In Python, stored objects are always serialized with the Pickle library, so it does not matter whether you choose a serialized storage level.
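For instance, a sketch of serialized in-memory caching combined with the Kryo serializer (the application name and data below are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Enable a fast serializer so the per-partition byte arrays stay compact.
val conf = new SparkConf()
  .setAppName("SerializedCacheSketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val records = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
records.persist(StorageLevel.MEMORY_ONLY_SER) // one byte array per partition
println(records.count())
```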
Spark also automatically persists some intermediate data in shuffle operations (for example, in reduceByKey), even if the user never calls persist. This is done to avoid recomputing the entire input if a node fails during the shuffle.
We still recommend that users call the persist method on the resulting RDD if they plan to reuse it.
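A sketch of that recommendation, assuming an existing SparkContext `sc`:

```scala
// reduceByKey's shuffle output is kept automatically for fault recovery,
// but persisting the result explicitly is still the way to reuse it.
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _)
counts.persist()                         // defaults to StorageLevel.MEMORY_ONLY

println(counts.collect().mkString(", ")) // first action computes and persists
println(counts.count())                  // reuses the persisted result
```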
Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to choose one:
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, and it lets operations on the RDDs run as fast as possible.
- If not, try MEMORY_ONLY_SER and pick a fast serialization library. This makes the objects much more space-efficient while still keeping access reasonably fast.
- Don't spill RDDs to disk unless the functions that computed them are expensive, or they filter a large amount of the data; otherwise, recomputing a partition can be as fast as reading it from disk.
- If you want faster fault recovery, use the replicated storage levels (see the sketch after this list). All storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on an RDD without waiting to recompute a lost partition.
- In environments with large amounts of memory or multiple concurrent applications, the experimental OFF_HEAP mode has several advantages: it lets multiple executors share the same pool of memory in Tachyon, it significantly reduces garbage-collection costs, and cached data is not lost if individual executors crash.
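As a sketch of the replicated option above (`results` is a hypothetical RDD that has not yet been assigned a storage level):

```scala
import org.apache.spark.storage.StorageLevel

// Replicate each cached partition on two nodes, so a lost executor
// does not force recomputation before tasks can continue.
results.persist(StorageLevel.MEMORY_ONLY_2)
```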
Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion.
If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
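A minimal sketch, reusing the hypothetical `counts` RDD persisted earlier:

```scala
// Immediately remove the RDD's cached blocks from memory and disk,
// rather than waiting for LRU eviction.
counts.unpersist()
```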