Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)
edited by

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load tables using Spark SQL. I know this can be performed by using an individual dataframe for each file [given below], but can it be automated with a single command rather than pointing a file can I point a folder?

df =
       .option("header", "true")

1 Answer

0 votes
by (32.3k points)

I would suggest you to use wildcard, e.g. just replace 2019 with *:

(PySpark v2.3):

df =


       .option("header", "true")


Hopefully, this will work fine for you.

Another approach:

(Spark 2.x)For Example, Let's say you have 3 directories holding csv files:

dir1, dir2, dir3

You then define paths as a string of comma delimited list of paths as follows:

paths = "dir1/,dir2/,dir3/*"

Then use the following function and pass this path's variable to it:

def get_df_from_csv_paths(paths):

        df ="csv").option("header", "false").\


            option('delimiter', '\t').\

            option('mode', 'DROPMALFORMED').\


        return df

By then running:

df = get_df_from_csv_paths(paths)

Now, you have a single spark dataframe containing the data from all the CSVs found in these 3 directories. 

Related questions

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer

Browse Categories