
I'm trying to find an effective way to save the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.

Any ideas on how to make Spark save to a file with a specified file name?

1 Answer


Spark writes files through the Hadoop FileSystem API, so the path you pass to saveAsTextFile names an output directory, not a single file:

rdd.saveAsTextFile("dumdata")

The output will be saved as "dumdata/part-XXXXX", with one part-* file for every partition in the RDD you are saving. Each partition is written as a separate file to provide fault tolerance. If the task writing the 3rd partition (i.e. part-00002) fails, Spark simply re-runs that task and overwrites the partially written or corrupted part-00002, with no effect on the other parts. If all tasks wrote to the same file, it would be much harder to recover from a single task failure.
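To make the recovery argument concrete, here is a small plain-Python sketch of the idea. It is not Spark's actual implementation, and the helper `write_partition` is a hypothetical name used only for illustration. It writes three partitions to separate part files, corrupts the third one, and then re-runs only that write.

```python
import os
import tempfile

def write_partition(out_dir, index, records):
    """Write one partition to its own part file, replacing any earlier attempt."""
    path = os.path.join(out_dir, f"part-{index:05d}")
    with open(path, "w") as f:  # mode "w" truncates, so a retry overwrites
        for record in records:
            f.write(f"{record}\n")
    return path

# Hypothetical output directory and sample data, for illustration only.
out_dir = os.path.join(tempfile.mkdtemp(), "dumdata")
os.makedirs(out_dir)
partitions = [["a,1", "b,2"], ["c,3"], ["d,4", "e,5"]]

for i, records in enumerate(partitions):
    write_partition(out_dir, i, records)

# Simulate a partially written third partition, then re-run only that task.
with open(os.path.join(out_dir, "part-00002"), "w") as f:
    f.write("d,4\ne,")
write_partition(out_dir, 2, partitions[2])

part_names = sorted(os.listdir(out_dir))
```

Only part-00002 is touched by the retry; part-00000 and part-00001 are left exactly as they were.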

The part-XXXXX files are usually not a problem if you are going to consume the output again in Spark or other Hadoop-based frameworks. They all use the HDFS API, so if you ask them to read "dumdata", they will read every part-XXXXX file inside it.
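If you do need one file with a name of your choosing, one option is to merge the part files after the job finishes. The sketch below uses only the Python standard library and assumes the output directory is on a local filesystem (or has already been copied out of HDFS). The function `merge_parts` and the file name `result.csv` are hypothetical, and the demo data stands in for real Spark output.

```python
import glob
import os
import shutil
import tempfile

def merge_parts(src_dir, dest_file):
    """Concatenate part-* files from src_dir, in name order, into dest_file."""
    # Zero-padded names sort correctly as strings; "part-*" skips
    # marker files such as _SUCCESS and hidden checksum files.
    part_paths = sorted(glob.glob(os.path.join(src_dir, "part-*")))
    with open(dest_file, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
    return len(part_paths)

# Demo with stand-in part files (assumed data, not real job output).
work = tempfile.mkdtemp()
src = os.path.join(work, "dumdata")
os.makedirs(src)
for i, text in enumerate(["a,1\nb,2\n", "c,3\n"]):
    with open(os.path.join(src, f"part-{i:05d}"), "w") as f:
        f.write(text)

result_csv = os.path.join(work, "result.csv")
merged_count = merge_parts(src, result_csv)
```

For output that lives on HDFS, `hadoop fs -getmerge` does a similar merge into a local file. Calling `rdd.coalesce(1)` before saving reduces the output to a single part file, but that file is still named part-00000 and all the data then passes through one task.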

