I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDDs, but I can't find how to do it with DataFrames. persist and cache (which are synonyms for each other) are available for DataFrames, but they do not "break the lineage" and are thus unsuitable for methods that could loop for hundreds (or thousands) of iterations.

1 Answer


Try this:

sc.setCheckpointDir("/DIR")

df.rdd.checkpoint()

You then have to perform your action on the underlying df.rdd. Calling df.ACTION currently does not work for this; only df.rdd.ACTION does.
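The steps above can be put together roughly as follows. This is only a sketch, assuming Spark 2.x with a SparkSession; the helper name checkpointViaRdd, the object name, and the checkpoint directory are hypothetical and not part of the original answer. It rebuilds a DataFrame from the checkpointed RDD so that later steps start from the truncated lineage.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddCheckpointSketch {
  // Hypothetical helper: checkpoint a DataFrame through its underlying RDD.
  def checkpointViaRdd(spark: SparkSession, df: DataFrame): DataFrame = {
    val rdd = df.rdd      // hold on to a single reference to the underlying RDD[Row]
    rdd.checkpoint()      // only marks the RDD for checkpointing; nothing is written yet
    rdd.count()           // an action on the RDD itself materializes the checkpoint
    // Rebuild a DataFrame on top of the checkpointed RDD, reusing the original schema.
    spark.createDataFrame(rdd, df.schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-checkpoint-sketch")
      .getOrCreate()

    // Assumed directory for illustration; use a reliable location (e.g. HDFS) on a cluster.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val df = spark.range(0, 1000).toDF("value")
    val checkpointed = checkpointViaRdd(spark, df)
    println(checkpointed.count())

    spark.stop()
  }
}
```

Because the checkpoint is written by a separate job, the RDD is computed twice unless it is persisted first.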

Spark 2.1 added Dataset.checkpoint (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L543), so a DataFrame can now be checkpointed directly. We therefore no longer need to checkpoint edges manually in CC, though we still need to verify that performance is not affected.
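For the iterative case from the question, a loop using Dataset.checkpoint might look like the sketch below. It assumes Spark 2.1 or later; the checkpoint directory, the interval of 10 iterations, and the toy transformation are illustrative assumptions. Note that checkpoint() returns a new Dataset, so the result has to be reassigned.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DatasetCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-checkpoint-sketch")
      .getOrCreate()

    // Assumed directory for illustration; it must be set before calling checkpoint().
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    var df: DataFrame = spark.range(0, 1000).toDF("value")

    for (i <- 1 to 100) {
      // Toy iteration step standing in for the real per-iteration logic.
      df = df.withColumn("value", col("value") + 1)

      // Every 10th iteration (an arbitrary choice), truncate the lineage.
      // checkpoint() is eager by default and returns the checkpointed Dataset.
      if (i % 10 == 0) {
        df = df.checkpoint()
      }
    }

    println(df.count())
    spark.stop()
  }
}
```

How often to checkpoint is a trade-off between the cost of writing to the checkpoint directory and the cost of planning an ever-growing lineage, so the interval is worth measuring for the actual workload.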
