
We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed?
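For example, something along these lines (a rough sketch, the path and names are just made up):

import org.apache.spark.storage.StorageLevel

val data   = sc.textFile("hdfs:///some/path")                       // illustrative input
val cached = data.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)

cached.count()   // first action materializes and caches the RDD
cached.first()   // second use is served from the cache

// Do I have to call cached.unpersist() here myself, or will Spark clean it up?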

1 Answer

Yes, Spark does unpersist the RDD when it is garbage collected.

In RDD.persist you can see:

sc.cleaner.foreach(_.registerRDDForCleanup(this))

This places a WeakReference to the RDD in a ReferenceQueue, which leads to ContextCleaner.doCleanupRDD once the RDD is garbage collected. And there:

sc.unpersistRDD(rddId, blocking)

For more context, you can have a look at ContextCleaner in general and the commit that added it.
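To see the difference in practice, here is a minimal sketch (the RDD and values are made up) contrasting an explicit unpersist with relying on the cleaner; sc.getPersistentRDDs lets you check what is still cached:

import org.apache.spark.storage.StorageLevel

var rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)
rdd.count()                          // materializes the cached blocks

println(sc.getPersistentRDDs.keys)   // the RDD's id shows up here while it is cached

// Explicit, deterministic cleanup:
rdd.unpersist(blocking = true)

// Or: drop the last strong reference and wait for the ContextCleaner.
// The WeakReference registered in RDD.persist is only processed after a GC
// on the driver, so the cached blocks may linger on the executors for a while.
rdd = null
System.gc()                          // only a hint; timing is not guaranteed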

If you rely on garbage collection to unpersist RDDs, you should be aware of the following:

  • The RDD uses resources on the executors, while garbage collection happens on the driver. The RDD won't be unpersisted automatically until there is enough memory pressure on the driver, no matter how full the executors' memory or disk gets.

  • Also, you cannot unpersist part of an RDD (i.e. some partitions or records). If you build one persisted RDD from another, both will have to fit entirely on the executors at the same time (see the sketch after this list).
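As an illustration of the second point, a small sketch (names are made up): unpersisting the parent explicitly once the derived RDD has been materialized frees the executors much earlier than waiting for a driver-side GC would.

import org.apache.spark.storage.StorageLevel

val parent = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
val child  = parent.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

child.count()        // materializes child; at this point parent AND child
                     // both occupy executor memory/disk

parent.unpersist()   // free the parent explicitly instead of waiting for
                     // the driver's garbage collector to notice it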
