
I'm looking for a way to checkpoint DataFrames. checkpoint is currently an operation on RDD, but I can't find an equivalent for DataFrames. persist and cache (cache being shorthand for persist with the default storage level) are available for DataFrames, but they do not break the lineage, so they are unsuitable for methods that could loop for hundreds (or thousands) of iterations.

As an example, suppose I have a list myfunctions of functions whose signature is DataFrame => DataFrame. I want a way to compute the fold below even when myfunctions has hundreds or thousands of entries:

def foo(dataset: DataFrame, g: DataFrame => Unit): DataFrame =
    myfunctions.foldLeft(dataset) {
        case (df, f) =>
            val nextDF = f(df)
            g(nextDF) // hook where lineage truncation would happen
            nextDF
    }

1 Answer


TL;DR: Since Spark 2.1, DataFrames have a native checkpoint() method. For earlier versions (up to 1.6), my suggested solution is based on another answer, but with one extra line: rebuild the DataFrame from its RDD and schema.

val df2 = sqlContext.createDataFrame(df.rdd, df.schema)
// df2 has a fresh lineage: its logical plan no longer references
// the chain of transformations that produced df
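For a fuller picture, here is a sketch of the complete up-to-1.6 approach. The helper name checkpointDF is mine, not part of Spark's API; it assumes sc.setCheckpointDir(...) has already been called on the SparkContext. It checkpoints the DataFrame's underlying RDD, forces materialization with an action, and then rebuilds a DataFrame whose lineage starts at the checkpoint:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not a Spark API): truncate a DataFrame's lineage
// by checkpointing its underlying RDD and rebuilding the DataFrame.
// Assumes the SparkContext's checkpoint directory is already set.
def checkpointDF(df: DataFrame): DataFrame = {
  val rdd = df.rdd   // the DataFrame's rows as an RDD[Row]
  rdd.checkpoint()   // mark the RDD for reliable checkpointing
  rdd.count()        // run an action so the checkpoint files are actually written
  df.sqlContext.createDataFrame(rdd, df.schema) // fresh DataFrame, short lineage
}
```

In an iterative fold like the one in the question, you could apply this helper every N iterations so the query plan never grows beyond N steps. On Spark 2.1 and later, df.checkpoint() achieves the same effect directly.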

