in Big Data Hadoop & Spark by (11.4k points)

How can I read partitioned Parquet data into a DataFrame, restricted to certain partitions?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions exist for day=1 to day=30. Is it possible to read only a range (day=5 to day=6) or a list of days (day=5, day=6), something like this?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")


If I use *, I get the data for all 30 days, which is too big.

1 Answer

by (32.3k points)

sqlContext.read.parquet can take multiple paths as input. This already worked in older versions (Spark < 1.6) and still does, so you can pass only the partition directories you need.

As of Spark 1.6, you need to provide the "basePath" option so that Spark generates the partition columns automatically. To create a DataFrame with the columns "data", "year", "month" and "day" that contains only day=5 and day=6, pass the two partition paths:

val dataframe = sqlContext
  .read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
           "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, such as country=XX, country will automatically be added as a column in the DataFrame.
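
For a longer range of days, typing every path by hand gets tedious. Here is a minimal sketch of how the paths could be generated instead. The helper name dayPaths and its parameters are hypothetical (not part of Spark), and it assumes the directory layout from the question with unpadded numbers such as month=10 and day=5:

```scala
// Hypothetical helper (not a Spark API): builds one partition path per day.
// Assumes the layout data=<..>/year=<..>/month=<..>/day=<..> from the question.
def dayPaths(basePath: String, data: String, year: Int, month: Int, days: Seq[Int]): Seq[String] = {
  // Drop a trailing slash so the joined path does not contain "//".
  val root = basePath.stripSuffix("/")
  days.map(d => s"$root/data=$data/year=$year/month=$month/day=$d/")
}

val basePath = "file:///your/path/"
// 5 to 6 is an inclusive range, so this yields the day=5 and day=6 directories.
val paths = dayPaths(basePath, "jDD", 2015, 10, 5 to 6)

// With a SQLContext in scope, as in the answer above, the generated paths
// can be passed as varargs:
// val dataframe = sqlContext.read.option("basePath", basePath).parquet(paths: _*)
```

Passing the same basePath to both the helper and the option keeps the "data", "year", "month" and "day" columns in the resulting DataFrame.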
