How do I read partitioned Parquet data with a condition as a DataFrame?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions exist for day=1 through day=30. Is it possible to read something like a range (day=5 to day=6), or specific days (day=5, day=6)?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")


If I put *, it gives me all 30 days of data, and that is too big.

1 Answer


In older versions of Spark (before 1.6), sqlContext.read.parquet could simply take multiple paths as input.

As of Spark 1.6, you need to provide a "basePath" option for Spark to generate the partition columns automatically. To create a DataFrame with the columns data, year, month, and day, restricted to day=5 and day=6, you can simply pass two paths:

val dataframe = sqlContext
     .read
     .option("basePath", "file:///your/path/")
     .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
              "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, say country=XX, then country will automatically be added as a column in the DataFrame as well.
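If the range gets larger, listing each path by hand becomes tedious. A sketch of an alternative, assuming the same directory layout as above (the file:///your/path/ prefix is a placeholder): read from the month directory and let Spark's partition pruning restrict the scan.

// Reading the month directory makes Spark discover the day=X subfolders
// as a "day" partition column (numeric type is inferred by default).
// The filter below is pushed down, so only the day=5 and day=6
// directories are actually scanned (partition pruning).
val dataframe = sqlContext
     .read
     .parquet("file:///your/path/data=jDD/year=2015/month=10/")
     .filter("day >= 5 AND day <= 6")

This generalizes to any range without enumerating paths, and it still only reads the matching partitions rather than all 30 days.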
