Spark splits data into partitions and performs the computation on each partition in parallel. To run a Spark application efficiently, it is important to understand how the data is partitioned and when you need to adjust the partitioning manually.
The Spark RDD API provides two methods to increase or decrease the number of partitions: coalesce() and repartition().
val data = 1 to 15
val numOfPartitions = 5
val rdd = spark.sparkContext.parallelize(data, numOfPartitions)
rdd.saveAsTextFile("output_rdd")
In the above program we create an RDD with 5 partitions and then save it by invoking the saveAsTextFile(path: String) method. If you open the output_rdd directory, one output file is created for each partition:
part-00000: 1 2 3
part-00001: 4 5 6
part-00002: 7 8 9
part-00003: 10 11 12
part-00004: 13 14 15
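To verify how the elements are laid out without writing files, you can print each partition's contents with mapPartitionsWithIndex. This is a sketch that assumes the rdd created above; note that the exact element-to-partition assignment depends on how parallelize splits the range:

```scala
// Tag every element with the index of the partition it lives in,
// then collect to the driver and print.
rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(value => s"partition $index -> $value")
}.collect()
 .foreach(println)
```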
coalesce() uses the existing partitions to minimize the amount of data that is shuffled. As a result, coalesce produces partitions with different amounts of data, sometimes with very different sizes.
val coalesceRDD = rdd.coalesce(3)
part-00000: 1 2 3
part-00001: 4 5 6 7 8 9
part-00002: 10 11 12 13 14 15
If you analyze the output, the data from partition part-00002 is merged with part-00001 and the data from part-00004 is merged with part-00003. This minimizes the amount of data that is shuffled, but it results in partitions with different amounts of data.
The repartition method can be used to either increase or decrease the number of partitions in an RDD.
val repartitionRDD = rdd.repartition(3)
part-00000: 3 6 8 10 13
part-00001: 1 4 9 11 14
part-00002: 2 5 7 12 15
If you analyze the output, the entire dataset is shuffled across the network, but the data is distributed evenly across the partitions.
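Internally, repartition is simply coalesce with the shuffle flag enabled, so the call above can equivalently be written as follows (a sketch using the same rdd as above):

```scala
// repartition(n) is defined as coalesce(n, shuffle = true):
// the forced network shuffle is what rebalances the data evenly.
val repartitionRDD = rdd.coalesce(3, shuffle = true)
```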
| coalesce() | repartition() |
| --- | --- |
| Used only to reduce the number of partitions. | Used to increase or decrease the number of partitions. |
| Tries to minimize data movement by avoiding a network shuffle. | Triggers a network shuffle, which increases data movement. |
| Creates unequal-sized partitions. | Creates roughly equal-sized partitions. |
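One consequence of avoiding the shuffle is that a plain coalesce cannot increase the partition count: asking for more partitions than currently exist simply keeps the current count. A sketch using the 5-partition rdd from above:

```scala
// coalesce alone cannot add partitions, so the count stays at 5.
rdd.coalesce(10).getNumPartitions

// repartition shuffles the data, so it can grow the count to 10.
rdd.repartition(10).getNumPartitions
```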