How to partition and write DataFrame in Spark without deleting partitions with no new data?

Question

asked Jul 17, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:

dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)

As mentioned in this question, partitionBy will delete the full existing hierarchy of partitions at path and replaced them with the partitions in dataFrame. Since new incremental data for a particular day will come in periodically, what I want is to replace only those partitions in the hierarchy that dataFrame has data for, leaving the others untouched.

To do this it appears I need to save each partition individually using its full path, something like this:

singlePartition.write.mode(SaveMode.Overwrite).parquet(path + "/eventdate=2017-01-01/hour=0/processtime=1234567890")

However I'm having trouble understanding the best way to organize the data into single-partition DataFrames so that I can write them out using their full path. One idea was something like:

dataFrame.repartition("eventdate", "hour", "processtime").foreachPartition ...
But foreachPartition operates on an Iterator[Row] which is not ideal for writing out to Parquet format.

I also considered using a select...distinct eventdate, hour, processtime to obtain the list of partitions, and then filtering the original data frame by each of those partitions and saving the results to their full partitioned path. But the distinct query plus a filter for each partition doesn't seem very efficient since it would be a lot of filter/write operations.

I'm hoping there's a cleaner way to preserve existing partitions for which dataFrame has no data?

1 Answer

Amit Rawat · Answer 1 · 2019-07-17T14:26:59+0000

In your case the mode option “Append“ can help you out.

df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/db_name.db/" + tableName)

After testing this and saw that this will keep the existing partition files. However, the problem this time is that if you run the same code twice (with the same data), then it will create new parquet files instead of replacing the existing ones for the same data (Spark 1.6). So, instead of using Append, we can still solve this problem with Overwrite. Instead of overwriting at the table level, we should overwrite at the partition level.

df.write.mode(SaveMode.Overwrite)
.parquet("/data/hive/warehouse/db_name.db/" + tableName + "/y=" + year + "/m=" + month + "/d=" + day)

There as another approach as well. Assuming that you have a hive table over the directory you want to write to, one way to deal with this problem is to create a temp view from dataFrame which should be added to the table and then use a normal hive-like insert overwrite table ... command:

dataFrame.createOrReplaceTempView("temp_view")
spark.sql("insert overwrite table table_name partition ('eventdate', 'hour', 'processtime')select * from temp_view")

It will successfully preserve old partitions while (over)writing to only new partitions.

Learn Spark with this Spark Certification Course by Intellipaat.

How to partition and write DataFrame in Spark without deleting partitions with no new data?

1 Answer

Related questions

Browse Categories

Browse By Domains

Popular Courses

Popular Tutorials

Popular Resources