Checkpointing saves an RDD to a reliable storage system (e.g. HDFS, S3) and forgets the RDD's lineage completely. Recovery from the checkpointed data takes place when the driver restarts.
There are two types of data that we checkpoint in Spark:
Metadata Checkpointing – Metadata means data about data. It refers to saving the metadata of the streaming application to fault-tolerant storage like HDFS. The metadata includes configuration, DStream operations, and incomplete batches. Configuration is the configuration used to create the streaming application; DStream operations are the operations that define the streaming application; incomplete batches are batches whose jobs are queued but have not yet completed.
Data Checkpointing – It refers to saving the generated RDDs to reliable storage, which is required by some stateful transformations. In these transformations, each new RDD depends on RDDs from previous batches, so the dependency chain keeps growing with time. To avoid an unbounded increase in recovery time, the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage, which cuts down the dependency chain.
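Both kinds of streaming checkpointing are enabled with a single call to `ssc.checkpoint(directory)`. The sketch below is a minimal illustration, not a complete application: the checkpoint directory, the socket source on `localhost:9999`, and the batch interval are all assumptions for the example. `StreamingContext.getOrCreate` is what lets a restarted driver recover from the checkpoint instead of rebuilding the context from scratch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical checkpoint directory for this sketch.
val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointDemo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // enables metadata and data checkpointing

  val lines = ssc.socketTextStream("localhost", 9999)
  // updateStateByKey is a stateful transformation: each batch's state
  // depends on previous batches, so Spark requires checkpointing here.
  val counts = lines
    .flatMap(_.split(" "))
    .map((_, 1))
    .updateStateByKey[Int]((values, state) => Some(values.sum + state.getOrElse(0)))
  counts.print()
  ssc
}

// On a driver restart, recover the context from the checkpoint if one exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```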
To set the Spark checkpoint directory, call: SparkContext.setCheckpointDir(directory: String)
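For plain RDDs, the call above is combined with `RDD.checkpoint()`. A minimal sketch, assuming a hypothetical HDFS path and a toy RDD; note that `checkpoint()` only marks the RDD, and the actual write happens when an action materializes it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CheckpointDirDemo"))

// Assumed path for illustration; on a cluster this must be a
// fault-tolerant file system such as HDFS, not a local directory.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint() // marks the RDD for checkpointing
rdd.count()      // the checkpoint is written when this action runs
```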
3. Types of Checkpointing in Apache Spark
There are two types of Apache Spark checkpointing:
Reliable Checkpointing – It refers to checkpointing in which the actual RDD is saved to a reliable distributed file system, e.g. HDFS. To set the checkpoint directory, call: SparkContext.setCheckpointDir(directory: String). When running on a cluster, the directory must be an HDFS path, not a local one; otherwise the driver would try to recover the checkpointed RDD from a file on its own local disk, while the checkpoint files are actually on the executors' machines.
Local Checkpointing – This checkpointing is used when we only need to truncate the RDD lineage graph (as in Spark Streaming or GraphX) and do not need recovery from reliable storage. The RDD is persisted to local storage on the executors.
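Local checkpointing is exposed through `RDD.localCheckpoint()`, which needs no checkpoint directory because the data stays on the executors. A brief sketch, assuming an existing `SparkContext` named `sc`; this trades fault tolerance for speed, since losing an executor loses its part of the checkpointed data:

```scala
// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 100).map(_ * 2)

rdd.localCheckpoint() // truncates lineage; blocks are kept on executor storage
rdd.count()           // materializes the RDD, completing the local checkpoint
```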
Difference between Spark Checkpointing and Persist
Checkpointing stores the RDD in HDFS and deletes the lineage that created it.
When we persist an RDD with the DISK_ONLY storage level, the RDD is stored on disk so that subsequent uses of it read the stored data instead of recomputing the lineage up to that point; the lineage itself is kept.
Unlike cached data, checkpoint files are not deleted when the job run completes.
Checkpointing an RDD results in double computation: the RDD is first computed for the actual job, and then computed a second time when it is written to the checkpoint directory. (Persisting the RDD before checkpointing it lets the second pass read from the cache instead of recomputing.)
After persist() is called, Spark still remembers the lineage of the RDD even though it does not recompute it while the persisted data is available.
Furthermore, after the job run is complete, the cache is cleared and its files are destroyed.
Hence checkpointing is reliable, while persisting is not: if an executor is lost, the data it persisted is lost with it, whereas checkpointed data survives in reliable storage.
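The contrast above can be sketched in a few lines, again assuming an existing `SparkContext` named `sc`; `toDebugString` makes the difference visible, since a persisted RDD still prints its full lineage while a checkpointed RDD's lineage starts from the checkpoint file:

```scala
import org.apache.spark.storage.StorageLevel

// Persist: data written to executor disk, lineage retained.
val persisted = sc.parallelize(1 to 10).map(_ + 1).persist(StorageLevel.DISK_ONLY)
persisted.count()
println(persisted.toDebugString) // still shows the map/parallelize lineage

// Checkpoint: data written to the checkpoint directory, lineage truncated.
val checkpointed = sc.parallelize(1 to 10).map(_ * 2)
checkpointed.cache()      // avoids recomputing during the checkpoint write
checkpointed.checkpoint()
checkpointed.count()
println(checkpointed.toDebugString)  // lineage now begins at the checkpoint
println(checkpointed.isCheckpointed) // true once the action has completed
```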