DataFrame partitionBy to a single Parquet file (per partition)

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-09T06:03:39+0000

Another way to do this is DataFrame.repartition(). The problem with coalesce(1) is that your parallelism drops to 1, so the write is slow at best and errors out at worst. Raising the number does not help either: coalesce(5) gives you more parallelism, but you can end up with 5 files per partition.

In order to get one file per partition without using coalesce(), use repartition().
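To see why the two calls behave differently, here is a minimal sketch in plain Scala (no Spark needed). It is only a simplified model, assuming each task writes one file for every distinct partition key it holds. The object FileCountModel and its methods are hypothetical names made up for this illustration, and the hashing is not the function Spark actually uses.

```scala
// Simplified model of how many files land in each output directory.
// Assumption: a task writes one file per distinct partition key it holds.
object FileCountModel {

  // Each inner Seq holds the partition keys of the rows owned by one task.
  def filesPerDirectory(tasks: Seq[Seq[String]]): Map[String, Int] =
    tasks
      .flatMap(_.distinct) // one file per (task, key) pair
      .groupBy(identity)
      .map { case (key, files) => key -> files.size }

  // Rough stand-in for coalesce(n): rows are spread over n tasks
  // without looking at their key.
  def coalesceLike(keys: Seq[String], n: Int): Seq[Seq[String]] =
    keys.zipWithIndex
      .groupBy { case (_, i) => i % n }
      .values
      .map(_.map(_._1))
      .toSeq

  // Rough stand-in for repartition(columns): the task is chosen from a
  // hash of the key, so all rows with the same key share one task.
  def repartitionLike(keys: Seq[String], n: Int): Seq[Seq[String]] =
    keys.groupBy(k => Math.floorMod(k.hashCode, n)).values.toSeq
}
```

With 60 rows cycling through the keys day=1, day=2 and day=3, filesPerDirectory(coalesceLike(keys, 5)) reports 5 files for every key, while filesPerDirectory(repartitionLike(keys, 5)) reports 1 file for every key, whatever the number of tasks.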

In Spark, the write looks like this:

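import org.apache.spark.sql.SaveMode  // needed for SaveMode.Append
import spark.implicits._              // enables the $"column" syntax
df.repartition($"entity", $"year", $"month", $"day", $"status")  // same columns as partitionBy below
  .write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")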

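Once you do this, each write produces one Parquet file per output partition instead of several, because all rows that share the same values for the partition columns are shuffled to the same task. Note that with SaveMode.Append, every later write adds another file to the partitions it touches.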
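If you want to check the result yourself, a small helper like the one below can count the Parquet files in each partition directory. This is only a sketch under the assumption that the output location is a path on the local filesystem (for HDFS or S3 you would have to list files through the Hadoop FileSystem API instead). ParquetFileCounter is a hypothetical name, not part of Spark.

```scala
import java.io.File

// Counts the .parquet files in every directory below a local output path.
// Assumption: the data was written to the local filesystem.
object ParquetFileCounter {

  // Recursively collects every regular file below `dir`.
  private def allFiles(dir: File): Seq[File] = {
    // listFiles returns null for unreadable or missing directories
    val children = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
    children.filter(_.isFile) ++ children.filter(_.isDirectory).flatMap(allFiles)
  }

  // Maps each directory path to the number of Parquet files it contains.
  // Checksum files (*.parquet.crc) and _SUCCESS markers are not counted.
  def countPerDirectory(root: File): Map[String, Int] =
    allFiles(root)
      .filter(_.getName.endsWith(".parquet"))
      .groupBy(_.getParentFile.getPath)
      .map { case (dir, files) => dir -> files.size }
}
```

After a single write to an empty location, ParquetFileCounter.countPerDirectory(new File(location)).foreach(println) should list each partition directory with a count of 1.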
