Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)
edited by

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load tables using Spark SQL. I know this can be performed by using an individual dataframe for each file [given below], but can it be automated with a single command rather than pointing a file can I point a folder?

df = sqlContext.read
       .format("dashdash.csv")
       .option("header", "true")
       .load("../Downloads/2019.csv"
)

1 Answer

0 votes
by (32.3k points)

I would suggest you to use wildcard, e.g. just replace 2019 with *:

(PySpark v2.3):

df = sqlContext.read

       .format("dash.csv")

       .option("header", "true")

       .load("../Downloads/*.csv")

Hopefully, this will work fine for you.

Another approach:

(Spark 2.x)For Example, Let's say you have 3 directories holding csv files:

dir1, dir2, dir3

You then define paths as a string of comma delimited list of paths as follows:

paths = "dir1/,dir2/,dir3/*"

Then use the following function and pass this path's variable to it:

def get_df_from_csv_paths(paths):

        df = spark.read.format("csv").option("header", "false").\

            schema(custom_schema).\

            option('delimiter', '\t').\

            option('mode', 'DROPMALFORMED').\

            load(paths.split(','))

        return df

By then running:

df = get_df_from_csv_paths(paths)

Now, you have a single spark dataframe containing the data from all the CSVs found in these 3 directories. 

Related questions

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
...