Write single CSV file using spark-csv

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-05T14:45:14+0000

By default, Spark writes CSV output as multiple part-* files inside a folder. The reason is simple: each partition is saved individually, so you get one file per partition. You can work around this with either of the following methods.

Method 1

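If the expected DataFrame is small, you can use either repartition(1) or coalesce(1) to produce a single output file, written as filename.csv/part-00000. Note that filename.csv is still a folder; the data sits in the single part file inside it.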

dataframe
   .repartition(1)
   .write
   .mode("overwrite")
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("filename.csv")

repartition(1) shuffles all the data into a single partition, so the write is expensive and can take a long time if the data is large.

Method 2

coalesce(1) avoids a full shuffle, but a single task still has to process and write all of the data. It is therefore not a good solution for a very large dataset, because that one task can run out of memory.

dataframe
   .coalesce(1)
   .write
   .mode("overwrite")
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("filename.csv")

All data will be written to filename.csv/part-00000.
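Both methods still leave the data in a part file inside a folder. If you need a plain file with a name of your choice, one option is to move that part file after the write finishes. The following is only a minimal sketch, assuming the output folder is on the local filesystem (on HDFS you would do the equivalent with the Hadoop FileSystem API instead). The object SingleCsv, the helper promotePartFile, and the target name are hypothetical and not part of Spark or spark-csv.

```scala
import java.io.File
import java.nio.file.{Files, Path, StandardCopyOption}

object SingleCsv {
  // Hypothetical helper: moves the only part-* file found in outputDir to target.
  // Assumes a local filesystem and a write that produced exactly one partition.
  def promotePartFile(outputDir: Path, target: Path): Path = {
    // listFiles() returns null when outputDir is not a readable directory
    val entries = Option(outputDir.toFile.listFiles()).getOrElse(Array.empty[File])
    // Matching on the "part-" prefix skips _SUCCESS and hidden .crc checksum files
    val parts = entries.filter(f => f.isFile && f.getName.startsWith("part-"))
    require(parts.length == 1, s"expected exactly one part file, found ${parts.length}")
    Files.move(parts.head.toPath, target, StandardCopyOption.REPLACE_EXISTING)
  }
}
```

For example, after saving to filename.csv, calling SingleCsv.promotePartFile(Paths.get("filename.csv"), Paths.get("result.csv")) (with java.nio.file.Paths imported) would leave the data in result.csv. The prefix match is used because the exact part file name can vary between Spark versions.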

