What is the difference between cache and persist?

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-05T15:19:35+0000

With persist(), you can choose among several storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY), while cache() always uses the default storage level, which is MEMORY_ONLY for RDDs. (For Datasets and DataFrames the default is MEMORY_AND_DISK.)

Caching and persistence are optimization techniques for iterative and interactive Spark computations. They save interim partial results so that they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or in more durable storage such as disk, and they can optionally be replicated. RDDs can be cached with the cache operation or persisted with the persist operation.
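To make the reuse of interim results concrete, here is a minimal pure-Python sketch. It is a toy model and does not use Spark: LazyDataset, expensive_squares and compute_calls are hypothetical names invented for illustration. It only mimics the idea that an uncached result is recomputed on every action, while a cached one is computed once and then reused.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class)."""

    def __init__(self, compute):
        self._compute = compute       # function that produces the data
        self._cached = None           # materialized result, once stored
        self._cache_enabled = False
        self.compute_calls = 0        # how many times the work actually ran

    def cache(self):
        # Only marks the dataset for caching; nothing is computed yet.
        self._cache_enabled = True
        return self

    def collect(self):
        # An "action": runs the computation unless a cached result exists.
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


def expensive_squares():
    # Placeholder for an expensive interim computation.
    return [x * x for x in range(5)]


uncached = LazyDataset(expensive_squares)
uncached.collect()
uncached.collect()

cached = LazyDataset(expensive_squares).cache()
cached.collect()
cached.collect()

print(uncached.compute_calls)  # 2: recomputed for each action
print(cached.compute_calls)    # 1: computed once, then reused
```

As in Spark, calling cache() in this sketch does not compute anything by itself; the result is stored the first time an action runs.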

Both functions, persist() and cache(), set the storage level of an RDD. When it needs to free up memory, Spark uses the storage level to decide what happens to the affected partitions, for example whether they are dropped and recomputed later (MEMORY_ONLY) or written to disk (MEMORY_AND_DISK). The parameterless persist() and cache() are both shorthand for persist(StorageLevel.MEMORY_ONLY).

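Warning: Once a storage level has been assigned to an RDD, it cannot be changed to a different level directly! You have to call unpersist() first and then persist again with the new level.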
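The rules above can be sketched as a small pure-Python toy model. This is not Spark's implementation: ToyRDD and this StorageLevel enum are hypothetical stand-ins that only mirror the names of the real API, and the ValueError is an arbitrary choice for the sketch. It assumes, as in Spark, that a level can be assigned again after unpersist().

```python
from enum import Enum


class StorageLevel(Enum):
    # Hypothetical stand-in for a few of Spark's StorageLevel constants.
    NONE = "NONE"
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"
    DISK_ONLY = "DISK_ONLY"


class ToyRDD:
    """Toy model of how an RDD tracks its storage level (not a Spark class)."""

    def __init__(self):
        self.storage_level = StorageLevel.NONE

    def persist(self, level=StorageLevel.MEMORY_ONLY):
        # Reject a switch to a different level once one has been assigned.
        already_set = self.storage_level is not StorageLevel.NONE
        if already_set and level is not self.storage_level:
            raise ValueError("storage level already assigned; call unpersist() first")
        self.storage_level = level
        return self

    def cache(self):
        # cache() is just persist() with the default level.
        return self.persist(StorageLevel.MEMORY_ONLY)

    def unpersist(self):
        # Clears the level so that a new one can be assigned.
        self.storage_level = StorageLevel.NONE
        return self


rdd = ToyRDD().cache()
try:
    rdd.persist(StorageLevel.DISK_ONLY)   # different level: rejected
    changed = True
except ValueError:
    changed = False

# After unpersist(), a new level can be assigned.
rdd.unpersist().persist(StorageLevel.DISK_ONLY)
print(changed, rdd.storage_level.name)
```

In this sketch, persisting again with the same level is allowed and has no effect; only a switch to a different level is rejected.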

