in Big Data Hadoop & Spark by (11.4k points)

Consider that I have a defined schema for loading 10 CSV files in a folder. Is there a way to load the tables automatically using Spark SQL? I know this can be done by using an individual DataFrame for each file [given below], but can it be automated with a single command? Rather than pointing to a file, can I point to a folder?

df = sqlContext.read \
       .format("csv") \
       .option("header", "true") \
       .load("../Downloads/2019.csv")

1 Answer

by (32.3k points)

I would suggest you use a wildcard, e.g. just replace 2019 with *:

(PySpark v2.3):

df = sqlContext.read \
       .format("csv") \
       .option("header", "true") \
       .load("../Downloads/*.csv")

Hopefully, this will work fine for you.
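If you are on Spark 2.x, the same wildcard also works through the SparkSession entry point. A minimal sketch, assuming a SparkSession named spark is already available (the path is just an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load_csvs").getOrCreate()

# Read every CSV matching the glob into a single DataFrame
df = spark.read \
       .option("header", "true") \
       .csv("../Downloads/*.csv")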

Another approach:

(Spark 2.x) For example, let's say you have 3 directories holding CSV files:

dir1, dir2, dir3

You then define paths as a comma-delimited string of paths, as follows:

paths = "dir1/,dir2/,dir3/*"

Then use the following function and pass this paths variable to it:

def get_df_from_csv_paths(paths):
    # custom_schema is assumed to be a StructType defined beforehand
    df = spark.read.format("csv").option("header", "false"). \
        schema(custom_schema). \
        option('delimiter', '\t'). \
        option('mode', 'DROPMALFORMED'). \
        load(paths.split(','))
    return df

Then run:

df = get_df_from_csv_paths(paths)

Now you have a single Spark DataFrame containing the data from all the CSVs found in these 3 directories.
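Note that custom_schema in the function above is assumed to be a StructType you have defined beforehand. A minimal sketch of what that might look like, together with registering the combined DataFrame as a temporary view so it can be queried with Spark SQL (the field names and types here are placeholders, not from the original post):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Placeholder schema - replace the fields with the actual columns of your CSVs
custom_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("value", StringType(), True)
])

# Build the combined DataFrame as above
df = get_df_from_csv_paths(paths)

# Register the combined DataFrame as a table and query it with Spark SQL
df.createOrReplaceTempView("all_csv_data")
spark.sql("SELECT COUNT(*) FROM all_csv_data").show()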
