How to write to CSV in Spark


1 Answer

Amit Rawat · Answer 1 · 2019-07-07T19:00:48+0000

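Spark writes files through the Hadoop File System API, so saving an RDD as text takes a single call. Note that this writes the string form of each element as one line, so if you need CSV, the elements should already be comma-separated strings: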

rdd.saveAsTextFile("dumdata")

The output will be saved as "dumdata/part-XXXXX", with one part-* file for every partition in the RDD you are saving. Each partition is written as a separate file to provide fault tolerance. If the task writing the third partition (i.e. part-00002) fails, Spark simply re-runs that task and overwrites the partially written or corrupted part-00002, with no effect on the other parts. If all tasks wrote to the same file, it would be much harder to recover from the failure of a single task.
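Since saveAsTextFile only writes strings, one way to get real CSV out of an RDD of tuples is to format each record as a CSV line before saving. Here is a minimal sketch in Python, assuming the RDD elements are lists or tuples of fields. The helper names to_csv_line and save_as_csv are made up for this example, they are not part of Spark:

```python
import csv
import io


def to_csv_line(row):
    """Format one record (a sequence of fields) as a single CSV line."""
    buf = io.StringIO()
    # Empty line terminator: saveAsTextFile adds the newline for each element.
    csv.writer(buf, lineterminator="").writerow(row)
    return buf.getvalue()


def save_as_csv(rdd, path):
    """Hypothetical helper: write an RDD of tuples/lists as CSV part files.

    Assumes no field contains an embedded newline, because the output
    is read back line by line.
    """
    rdd.map(to_csv_line).saveAsTextFile(path)
```

You would call it as save_as_csv(rdd, "dumdata"). Using the csv module instead of ",".join(...) means fields that contain commas or quotes get quoted properly. If your data is already a DataFrame (Spark 2.0 or later), df.write.csv("dumdata") does this formatting for you and produces the same kind of part-file directory.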

The part-XXXXX files are usually not a problem if you are going to read the data back in Spark or another Hadoop-based framework. They all use the HDFS API, so if you ask them to read "dumdata", they will read all the part-XXXXX files inside it.
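If the consumer is not a Hadoop-based tool and needs a single file, the part files can be concatenated in name order, which matches partition order. This is only a rough sketch, assuming the output directory is on the local file system (not HDFS) and is small enough to copy on one machine. merge_part_files is a made-up helper name:

```python
import glob
import os
import shutil


def merge_part_files(output_dir, merged_path):
    """Concatenate the part-* files of a saveAsTextFile output directory.

    Files are appended in sorted name order (part-00000, part-00001, ...).
    Marker files such as _SUCCESS are skipped because they do not match
    the "part-*" pattern. Returns the number of part files merged.
    """
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

For example, merge_part_files("dumdata", "dumdata.csv") would produce one file next to the output directory. Keep the merged file outside "dumdata" so it is not picked up the next time the directory is read.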
