
I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:

df.coalesce(1).write.partitionBy("entity", "year", "month", "day", "status")

I've tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset and all the partitioning, compression and saving of files has to be done by one CPU core.

Is there a better way to do this using the standard Spark SQL API?

1 Answer


Another way to do this is with DataFrame.repartition(). The problem with coalesce(1) is that your parallelism drops to 1, which is slow at best and can error out at worst. Increasing the number does not help either: with coalesce(5) you get more parallelism, but you can end up with as many as 5 files per output partition.
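As a rough illustration of why coalesce() limits parallelism, the sketch below prints the number of Spark partitions (and therefore the maximum number of concurrent write tasks) after each call. The object name, the local master setting and the generated data are assumptions made for the example, not part of the question.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceParallelism {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .appName("coalesce-parallelism")
      .master("local[4]")
      .getOrCreate()

    // Hypothetical data set, explicitly spread over 8 partitions.
    val df = spark.range(0, 1000).repartition(8)

    // Each Spark partition is processed by one task, so these counts
    // bound how many cores can work on the write at the same time.
    println(s"original:    ${df.rdd.getNumPartitions}")
    println(s"coalesce(1): ${df.coalesce(1).rdd.getNumPartitions}")
    println(s"coalesce(5): ${df.coalesce(5).rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Each task that holds rows for a given output directory writes its own file there, which is why coalesce(5) can leave several files per output partition.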

To get one file per output partition without using coalesce(), repartition the DataFrame on the same columns that you pass to partitionBy():

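import org.apache.spark.sql.SaveMode
import spark.implicits._  // enables the $"column" syntax

// Shuffle so that all rows with the same partition values end up in the same task,
// then write one directory per combination of values.
df.repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(location)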

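Once you do this, each write produces one Parquet file per output partition instead of multiple files, because repartition() on those columns sends every row with the same combination of values to the same task. Note that with SaveMode.Append, each later write adds another file to the directories it touches.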
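If you want to check the result yourself, here is a minimal self-contained sketch that writes a small data set this way and counts the Parquet files in each output directory. The sample rows, the "value" column, the temporary directory and the parquetFiles helper are all made up for the example; only the column names and the repartition/partitionBy calls come from the answer above.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{SaveMode, SparkSession}

object OneFilePerPartition {
  // Hypothetical helper: recursively collect the .parquet files under a directory.
  def parquetFiles(dir: File): Seq[File] = {
    val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    children.flatMap { f =>
      if (f.isDirectory) parquetFiles(f)
      else if (f.getName.endsWith(".parquet")) Seq(f)
      else Seq.empty
    }
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .appName("one-file-per-partition")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows using the column names from the question.
    val df = Seq(
      ("a", 2020, 1, 1, "ok", 1.0),
      ("a", 2020, 1, 1, "ok", 2.0),
      ("a", 2020, 1, 2, "failed", 3.0),
      ("b", 2020, 2, 1, "ok", 4.0),
      ("b", 2020, 2, 1, "ok", 5.0)
    ).toDF("entity", "year", "month", "day", "status", "value")

    // Fresh temporary output directory (assumption for the demo).
    val location = Files.createTempDirectory("parquet-demo").toString

    df.repartition($"entity", $"year", $"month", $"day", $"status")
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .mode(SaveMode.Append)
      .parquet(location)

    // Print how many Parquet files ended up in each leaf directory.
    parquetFiles(new File(location))
      .groupBy(_.getParentFile.getPath)
      .foreach { case (dir, files) => println(s"$dir -> ${files.size} file(s)") }

    spark.stop()
  }
}
```

One thing to keep in mind with this approach: since all rows for one combination of values go through a single task, a heavily skewed combination will make that one task (and its file) much larger than the others.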
