Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

According to Learning Spark

One difference I get is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

2 Answers

0 votes
by (32.3k points)
edited by

Spark splits data into partitions and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually modify the partitioning to run spark application efficiently.

In Spark RDD API there are 2 methods available to increase or decrease the number of partitions.

repartition() method

coalesce() method

val data = 1 to 15

val numOfPartitions = 5

val rdd = spark.sparkContext.parallelize(data ,  numOfPartitions)

rdd.getNumPartitions 

5

rdd.saveAsTextFile("C:/npntraining/output_rdd")

In the above program we are creating an rdd with 5 partitions and then we are saving an rdd by invoking saveAsTextFile(str:String) method. If you open the output_rdd for each partition one output file will be created

part-00000 :

1 2 3

part-00001 : 

4 5 6

part-00002 : 

7 8 9

part-00003 : 

10 11 12

part-00004 : 

13 14 15

coalesce() method

coalesce() uses existing partitions to minimize the amount of data that’s shuffled

coalesce results in partitions with different amounts of data (sometimes partitions that have much different sizes)

val coalesceRDD = rdd.coalesceRDD(3)

coalesceRDD.saveAsTextFile("C:/npntraining/coalesce-output");

part-00000 :

1 2 3

part-00001 : 

4 5 6 7 8 9

part-00002 : 

10 11 12 13 14 15

If you analyze the output the data from partition part-00002 is merged with part-00001 and data from partition-0004 is merged with part-00002 hence minimize the amount of data that’s shuffled but results in partitions with different amounts of data.

repartition() method:

The repartition method can be used to either increase or decrease the number of partitions in a RDD.

val repartitionRDD = rdd.repartition(3)

repartition.saveAsTextFile("C:/npntraining/repartition-output");

part-00000 :

3 6 8 10 13

part-00001 : 

1 4 9 11 14

part-00002 : 

2 5 7 12 15

If you analyze the entire data is shuffled i.e but data is equally partitioned across the partitions.


 

Coalesce()

Repartition()

Used to reduce the number of partitions

Used to reduce or decrease the number of partitions.

Tries to minimize data movement by avoiding network shuffle.

A network shuffle will be triggered which can increase data movement.

Creates unequal sized partitions

Creates equal sized partitions


 

If you want to know more about Spark, then do check out this awesome video tutorial:

0 votes
by (33.1k points)

Repartition avoids a full shuffle. If it's known that the number is decreasing then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes, onto the nodes that we kept.

So, it would look something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12


Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)


Notice that Node 1 and Node 3 did not require its original data to move.

I hope this answer would help you!

Related questions

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer

31k questions

32.8k answers

501 comments

693 users

Browse Categories

...