in Big Data Hadoop & Spark by (11.4k points)

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

1 Answer

by (32.3k points)

With persist(), you can choose among several storage levels (such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER and DISK_ONLY), while cache() always uses the default storage level for RDDs, MEMORY_ONLY.
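As a rough illustration of that difference, here is a minimal sketch. It assumes a local Spark session; the object name, app name, master URL and sample data are made up for the example and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-vs-persist")
      .getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)
    val evens   = sc.parallelize(1 to 1000).filter(_ % 2 == 0)

    squares.cache()                              // no argument: MEMORY_ONLY for RDDs
    evens.persist(StorageLevel.MEMORY_AND_DISK)  // storage level chosen explicitly

    // Both calls are lazy: partitions are only stored once an action runs.
    println(squares.count())
    println(evens.count())

    // Inspect the level each RDD was marked with.
    println(squares.getStorageLevel.description)
    println(evens.getStorageLevel.description)

    spark.stop()
  }
}
```

Note that cache() and persist() only mark the RDD. The data is materialized by the first action and reused by later ones.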

Caching and persistence are optimization techniques for iterative and interactive Spark computations. They save interim partial results so that they can be reused in subsequent stages. These interim results, held as RDDs, are kept in memory (the default) or in more durable storage such as disk, and can optionally be replicated. RDDs can be cached with the cache operation or persisted with the persist operation.


These functions (persist() and cache()) set the storage level of an RDD. When it needs to free up memory, Spark uses the storage level to decide which partitions should be kept. The parameterless variants persist() and cache() are just shorthand for persist(StorageLevel.MEMORY_ONLY).

Warning: Once a storage level has been assigned to an RDD, it cannot be changed to a different one!
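The sketch below shows what this restriction looks like in practice. It is not from the original answer: the object name, local session settings and sample data are assumptions. It also relies on the behaviour that unpersist() resets the level, so a new one can be assigned afterwards.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ChangeStorageLevel {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("change-storage-level")
      .getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100)

    rdd.persist(StorageLevel.MEMORY_ONLY)

    // Assigning a different level to an already-persisted RDD is rejected.
    try {
      rdd.persist(StorageLevel.DISK_ONLY)
    } catch {
      case e: UnsupportedOperationException =>
        println(s"Rejected: ${e.getMessage}")
    }

    // Dropping the RDD from storage first clears its level,
    // after which a new level can be assigned.
    rdd.unpersist()
    rdd.persist(StorageLevel.DISK_ONLY)
    println(rdd.getStorageLevel.description)

    spark.stop()
  }
}
```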
