
I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.

Any ideas how to make Spark save to a file with a specified file name?

1 Answer


Since Spark uses the Hadoop FileSystem API to write data to files, you just need this command:

rdd.saveAsTextFile("dumdata")

The output will be saved under "dumdata/part-XXXXX", with one part-* file per partition of the RDD you are saving. Each partition is written as a separate file to provide fault tolerance: if the task writing the 3rd partition (i.e. part-00002) fails, Spark simply re-runs that task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If all tasks wrote to the same file, recovering from a single task failure would be much harder.
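To make the partition-to-file mapping concrete, here is a minimal sketch (the RDD contents and the path "dumdata" are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("parts-demo"))

// An RDD with 3 partitions yields 3 part files:
// dumdata/part-00000, dumdata/part-00001, dumdata/part-00002
val rdd = sc.parallelize(1 to 100, numSlices = 3)
rdd.saveAsTextFile("dumdata")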

The part-XXXXX files are usually not a problem if you are going to consume them again from Spark or other Hadoop-based frameworks: since they all use the HDFS API, asking them to read "dumdata" will make them read all the part-XXXXX files inside that directory.
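If you really do need a single file with a name of your choosing, one common workaround is to coalesce the RDD to one partition, save it, and then rename the lone part file through the same Hadoop FileSystem API. A sketch, assuming the records are already comma-separated strings and that "dumdata" and "result.csv" are illustrative names:

import org.apache.hadoop.fs.{FileSystem, Path}

// Collapse to a single partition so exactly one part file is written.
// Note: this funnels all data through one task, so it only makes
// sense when the output is small enough for one node to handle.
rdd.coalesce(1).saveAsTextFile("dumdata")

// Rename dumdata/part-00000 to the file name we actually want.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("dumdata/part-00000"), new Path("result.csv"))

DataFrame writers behave the same way: df.coalesce(1).write.csv("dumdata") also produces a part file inside a directory, so the same rename step applies there too.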
