I'm trying to find an effective way of saving the result of my Spark job as a CSV file. I'm using Spark with Hadoop, and so far all my files are saved as part-00000.

Any ideas on how to make Spark save to a file with a specified file name?

Since Spark uses the Hadoop File System API to write data to files, you just need to add this command:


It will be saved as "dumdata/part-XXXXX", with one part-* file for every partition in the RDD you are trying to save. To provide fault tolerance, each partition in the RDD is written as a separate file. If the task writing the third partition (i.e. part-00002) fails, Spark simply re-runs that task and overwrites the partially written or corrupted part-00002, with no effect on the other parts. If all tasks wrote to the same file, it would be much harder to recover from the failure of a single task.
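To make the recovery argument concrete, here is a minimal plain-Python simulation of the one-file-per-partition layout. It is not Spark's actual writer (which also involves task attempts and commit steps); `write_partition` and the sample rows are hypothetical names and data used only for illustration.

```python
import os
import tempfile

def write_partition(out_dir, index, rows):
    """Write one partition to its own part-XXXXX file, replacing any earlier attempt."""
    path = os.path.join(out_dir, f"part-{index:05d}")
    # Mode "w" truncates the file, so a re-run replaces a partial write.
    with open(path, "w", newline="") as f:
        for row in rows:
            f.write(row + "\n")
    return path

# Hypothetical data: three partitions of CSV rows.
partitions = [["a,1", "b,2"], ["c,3"], ["d,4", "e,5"]]
out_dir = os.path.join(tempfile.mkdtemp(), "dumdata")
os.makedirs(out_dir)

for i, rows in enumerate(partitions):
    write_partition(out_dir, i, rows)

# Simulate a failed task that left the third partition truncated...
with open(os.path.join(out_dir, "part-00002"), "w") as f:
    f.write("d,4\ne,")

# ...then re-run only that task; the other part files are never touched.
write_partition(out_dir, 2, partitions[2])

part_files = sorted(os.listdir(out_dir))
```

Because every task owns exactly one file, the retry only has to rewrite part-00002. With a single shared output file, the retry would first have to locate and remove the failed task's partial rows.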

The part-XXXXX files are usually not a problem if you are going to consume the output again in Spark or another Hadoop-based framework: they all use the HDFS API, so if you ask them to read "dumdata", they will read all the part-XXXXX files inside it.
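If you do need a single file with a name of your choosing, one option is to merge the part files after the job has finished. The following is a rough sketch that assumes the output directory is on a local (or locally mounted) filesystem; `merge_parts`, `result.csv`, and the sample data are hypothetical and only there for illustration.

```python
import glob
import os
import shutil
import tempfile

def merge_parts(src_dir, dest_file):
    """Concatenate src_dir/part-* into dest_file, in part-number order."""
    # Part numbers are zero-padded, so a lexicographic sort keeps them in order.
    parts = sorted(glob.glob(os.path.join(src_dir, "part-*")))
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)

# Build a small stand-in for Spark's output directory.
work = tempfile.mkdtemp()
src_dir = os.path.join(work, "dumdata")
os.makedirs(src_dir)
for i, text in enumerate(["a,1\nb,2\n", "c,3\n"]):
    with open(os.path.join(src_dir, f"part-{i:05d}"), "w") as f:
        f.write(text)

merged = os.path.join(work, "result.csv")
count = merge_parts(src_dir, merged)
```

For output that lives on HDFS, `hadoop fs -getmerge dumdata result.csv` does a similar concatenation into a local file. Either way the merge runs on one machine, so it is only practical when the result fits comfortably there.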

