0 votes
in Big Data Hadoop & Spark by (11.4k points)

Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can obtain a DataFrameWriter from a DataFrame (Dataset[Row]) and use its .csv method to write the file.

The function is defined as

def csv(path: String): Unit
    path: the directory to write the output to, not the file name.


Spark stores the CSV output at the specified location as files named part-*.csv.

Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix other than part-r?

Code :

df.coalesce(1).write.csv("sample_path")

Current Output :

sample_path
|
+-- part-r-00000.csv


Desired Output :

sample_path
|
+-- my_file.csv

by (100 points)
Can you provide a solution to the above question for writing to ADLS from Databricks in PySpark?

1 Answer

0 votes
by (32.3k points)

Spark writes output using Hadoop's file output format, which partitions the data - that's why you get part- files. To change the filename, rename the part file after writing, with something like this in your code:

import org.apache.hadoop.fs._

// Write to a temporary directory, then move the part file to the desired name
df.coalesce(1).write.csv("mydata.csv-temp")

val fs = FileSystem.get(sc.hadoopConfiguration)

// Locate the single part file Spark produced
val file = fs.globStatus(new Path("mydata.csv-temp/part*"))(0).getPath.getName

fs.rename(new Path("mydata.csv-temp/" + file), new Path("mydata.csv"))

// Remove the now-empty temporary output directory
fs.delete(new Path("mydata.csv-temp"), true)

or just rename the part file in place:

import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)

fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
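For the comment above about doing this from Databricks in PySpark against ADLS: a hedged sketch of the same write-then-rename pattern. The abfss:// paths are hypothetical placeholders for your own storage account and container, and dbutils is only available inside a Databricks notebook, so those lines are shown as comments; the runnable part below demonstrates the identical rename pattern on a local filesystem with the standard library:

```python
# On Databricks, the same steps would use dbutils.fs (paths are placeholders):
#
#   tmp = "abfss://container@account.dfs.core.windows.net/tmp_csv"
#   df.coalesce(1).write.mode("overwrite").csv(tmp)
#   part = [f.path for f in dbutils.fs.ls(tmp) if f.name.startswith("part-")][0]
#   dbutils.fs.mv(part, "abfss://container@account.dfs.core.windows.net/my_file.csv")
#   dbutils.fs.rm(tmp, recurse=True)
#
# The same rename pattern on a local filesystem, in plain Python:
import glob
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
out_dir = os.path.join(tmp_dir, "sample_path")   # what .csv(path) would create
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000.csv"), "w") as f:
    f.write("a,b\n1,2\n")                        # stand-in for Spark's part file

part = glob.glob(os.path.join(out_dir, "part-*.csv"))[0]
target = os.path.join(tmp_dir, "my_file.csv")
os.rename(part, target)                          # give the part file the desired name
shutil.rmtree(out_dir)                           # drop the now-empty output directory

print(os.path.basename(target))
```

Note that coalesce(1) pulls all data onto a single executor, so this pattern is only appropriate for output small enough to fit on one node.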

