How to checkpoint DataFrames?

asked Jul 17, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I'm looking for a way to checkpoint DataFrames. Checkpoint is currently an operation on RDDs, but I can't find how to do it with DataFrames. persist and cache (cache is persist with the default storage level) are available for DataFrames, but they do not "break the lineage" and are thus unsuitable for methods that could loop for hundreds (or thousands) of iterations.

apache-spark

1 Answer

answered Jul 17, 2019 by Amit Rawat (32.3k points)

Try this:

sc.setCheckpointDir("/DIR")
df.rdd.checkpoint()

You then have to run your action on the underlying df.rdd. Calling df.ACTION will not currently trigger the checkpoint; only df.rdd.ACTION will.
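The RDD route can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the object name, the helper checkpointViaRdd, the local master, and the path /tmp/spark-checkpoints are all hypothetical, and rebuilding a DataFrame from the checkpointed RDD with createDataFrame is one possible way to get a DataFrame back, not something stated in the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddCheckpointSketch {
  // Hypothetical helper: checkpoints the RDD behind `df`, then rebuilds a
  // DataFrame on top of the checkpointed RDD using the original schema.
  def checkpointViaRdd(spark: SparkSession, df: DataFrame): DataFrame = {
    val rdd = df.rdd   // RDD[Row] underlying the DataFrame
    rdd.checkpoint()   // only marks the RDD; nothing is written yet
    rdd.count()        // an action on the RDD itself triggers the write
    spark.createDataFrame(rdd, df.schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("rdd-checkpoint-sketch")
      .getOrCreate()

    // Assumption: any directory the cluster can write to; this path is made up.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val df = spark.range(0, 100).toDF("id")
    val checkpointed = checkpointViaRdd(spark, df)
    println(checkpointed.count())

    spark.stop()
  }
}
```

Because the RDD is not persisted here, Spark may compute it a second time when it writes the checkpoint files; persisting the RDD before calling checkpoint() avoids that recomputation.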

Spark 2.1 added Dataset.checkpoint (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L543), which checkpoints a DataFrame directly and returns a new Dataset. So we do not need to manually checkpoint edges in CC. We still need to verify that performance is not affected.
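For the iterative case from the question, Dataset.checkpoint could be used along these lines. This is only a sketch under stated assumptions: the object name, the iterate helper, the "value" column, the checkpoint interval, and the checkpoint path are all made up for illustration. The checkpoint directory still has to be set on the SparkContext first.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DatasetCheckpointSketch {
  // Hypothetical iterative job: each pass adds 1 to "value"; every
  // `checkpointEvery` passes the lineage is truncated with checkpoint().
  def iterate(df: DataFrame, iterations: Int, checkpointEvery: Int): DataFrame = {
    var current = df
    for (i <- 1 to iterations) {
      current = current.withColumn("value", col("value") + 1)
      if (i % checkpointEvery == 0) {
        // checkpoint() is eager by default and returns a new Dataset,
        // so the result must be reassigned.
        current = current.checkpoint()
      }
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("dataset-checkpoint-sketch")
      .getOrCreate()

    // Assumption: a writable directory; this path is made up.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val start = spark.range(0, 10).toDF("value")
    val result = iterate(start, iterations = 20, checkpointEvery = 5)
    result.show()

    spark.stop()
  }
}
```

How often to checkpoint is a trade-off: each checkpoint writes the data out, so a very small interval may cost more than the lineage growth it prevents.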
