I am trying to write a single CSV file from Spark, but instead of a file it creates a folder.

I need a Scala function that takes a path and a file name as parameters and writes the CSV file there.

1 Answer


By default, Spark writes CSV output as multiple part-* files inside a folder named after the output path. It creates multiple files because each partition is saved individually. You can work around this with either of the following methods.
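To see the partition-to-file relationship directly, here is a minimal sketch. The object PartFileCount, the function writeAndCount and the outDir parameter are hypothetical names used only for illustration, and the count assumes every partition holds at least one row (empty partitions may not produce a file).

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

object PartFileCount {
  // Hypothetical helper: writes df as CSV into outDir and returns
  // (number of partitions, number of part files found in outDir).
  def writeAndCount(df: DataFrame, outDir: String): (Int, Int) = {
    df.write.mode("overwrite").option("header", "true").csv(outDir)

    val dir = new Path(outDir)
    val fs = dir.getFileSystem(df.sparkSession.sparkContext.hadoopConfiguration)
    // Data files are named part-*; _SUCCESS and checksum files do not match.
    val partFiles = fs.globStatus(new Path(dir, "part-*"))

    (df.rdd.getNumPartitions, partFiles.length)
  }
}
```

For a DataFrame that has been repartitioned to 4 non-empty partitions, both numbers should come back as 4.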


 

Method 1

If the DataFrame is expected to be small, you can use either repartition(1) or coalesce(1) to produce a single part file. Note that filename.csv is still a folder; the data ends up inside it, for example as filename.csv/part-00000.

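dataframe
   .repartition(1)
   .write
   .mode("overwrite")
   // "csv" is the built-in CSV source in Spark 2.0+; on Spark 1.x with the
   // spark-csv package, use "com.databricks.spark.csv" instead
   .format("csv")
   .option("header", "true")
   .save("filename.csv")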

repartition(1) shuffles all of the data into a single partition before writing, so the write is expensive and can take a long time if the output is large.

Method 2

coalesce(1) avoids the full shuffle, but a single task then has to process all of the data. It is therefore not a good solution for very large outputs either, because that one task can be slow or run out of memory.

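dataframe
   .coalesce(1)
   .write
   .mode("overwrite")
   // "csv" is the built-in CSV source in Spark 2.0+; on Spark 1.x with the
   // spark-csv package, use "com.databricks.spark.csv" instead
   .format("csv")
   .option("header", "true")
   .save("filename.csv")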

All data will be written to a single part file inside the filename.csv folder, for example filename.csv/part-00000.
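Neither method gives you a plain file with the name you chose, which is what the question asks for. One common workaround is to write the single part file to a temporary folder and then rename it with the Hadoop FileSystem API. The following is only a sketch under some assumptions: Spark 2.0+ with the built-in csv source, a small DataFrame, and a file system where rename is supported. The object SingleCsvWriter, the function writeSingleCsv and the ".tmp" folder name are hypothetical, not part of Spark.

```scala
import java.io.IOException

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

object SingleCsvWriter {
  // Hypothetical helper: writes df as one CSV file at <path>/<fileName>.
  def writeSingleCsv(df: DataFrame, path: String, fileName: String): Unit = {
    val tmpDir = new Path(path, s"$fileName.tmp") // temporary output folder
    val target = new Path(path, fileName)         // final single file
    val fs = tmpDir.getFileSystem(df.sparkSession.sparkContext.hadoopConfiguration)

    // One partition in, one part file out.
    df.coalesce(1)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv(tmpDir.toString)

    // Locate the single part-* file Spark produced.
    val parts = fs.globStatus(new Path(tmpDir, "part-*"))
    require(parts != null && parts.length == 1,
      s"expected exactly one part file in $tmpDir")

    // Overwrite semantics: remove any previous file or folder at the target.
    if (fs.exists(target)) fs.delete(target, true)
    if (!fs.rename(parts.head.getPath, target)) {
      throw new IOException(s"could not rename ${parts.head.getPath} to $target")
    }

    // Remove the temporary folder together with _SUCCESS and checksum files.
    fs.delete(tmpDir, true)
  }
}
```

It could be called as SingleCsvWriter.writeSingleCsv(dataframe, "/data/out", "filename.csv"), where /data/out is a placeholder directory. Because everything still passes through one task, the same size caveats as for Method 1 and Method 2 apply.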
