0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

How does the behavior of the MEMORY_ONLY and MEMORY_AND_DISK caching levels in Spark differ?

1 Answer

0 votes
by (32.3k points)

From the official documentation:

MEMORY_ONLY

Stores RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK

Stores RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

With MEMORY_ONLY, Spark tries to keep all partitions in memory. If some partitions cannot fit, or if partitions are lost from RAM (for example, because of node loss), Spark recomputes them from lineage information each time they are needed.

With MEMORY_AND_DISK, Spark still keeps every computed partition cached: it tries to hold them in RAM, but partitions that do not fit are spilled to disk and read back from there when needed, instead of being recomputed.
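To make the difference concrete, here is a minimal Scala sketch (assuming a local SparkContext and a hypothetical input path hdfs:///data/events.log) showing how each level is requested with persist():

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingLevels {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-levels").setMaster("local[*]"))

    // Hypothetical input path, used only for illustration.
    val lines = sc.textFile("hdfs:///data/events.log")

    // MEMORY_ONLY (also what rdd.cache() gives you): partitions that do not
    // fit in RAM are simply not cached and are recomputed from lineage every
    // time an action needs them again.
    val memOnly = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: partitions that do not fit in RAM are spilled to local
    // disk and read back from there instead of being recomputed.
    val memAndDisk = lines.map(_.toLowerCase).persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes and caches each RDD; later actions reuse
    // the cached partitions (or recompute the dropped ones under MEMORY_ONLY).
    println(memOnly.count())
    println(memAndDisk.count())

    sc.stop()
  }
}
```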

Also, have a look at the persistence levels compared in terms of efficiency (as explained in the documentation):

Level                | Space used | CPU time | In memory | On disk | Serialized
MEMORY_ONLY          | High       | Low      | Y         | N       | N
MEMORY_ONLY_SER      | Low        | High     | Y         | N       | Y
MEMORY_AND_DISK      | High       | Medium   | Some      | Some    | Some
MEMORY_AND_DISK_SER  | Low        | High     | Some      | Some    | Y
DISK_ONLY            | Low        | High     | N         | Y       | Y
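Continuing the sketch above (reusing the hypothetical lines RDD), the serialized and disk-only levels from the table are selected the same way; they trade extra CPU for a smaller in-memory footprint:

```scala
import org.apache.spark.storage.StorageLevel

// Serialized levels keep each partition as a single byte array: less space
// in memory, more CPU to deserialize when the data is read back.
val serInMemory = lines.map(_.length).persist(StorageLevel.MEMORY_ONLY_SER)
val serSpilling = lines.map(_.split(" ").length).persist(StorageLevel.MEMORY_AND_DISK_SER)
val diskOnly    = lines.filter(_.nonEmpty).persist(StorageLevel.DISK_ONLY)

// getStorageLevel shows which level a persisted RDD is using.
println(serInMemory.getStorageLevel.description)  // e.g. "Memory Serialized 1x Replicated"
```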
