Spark splits data into partitions and performs the computation on each partition in parallel. To run a Spark application efficiently, it is important to understand how the data is partitioned and when you need to adjust the partitioning manually.
The Spark RDD API provides two methods to increase or decrease the number of partitions: coalesce() and repartition().
val data = 1 to 15
val numOfPartitions = 5
val rdd = spark.sparkContext.parallelize(data, numOfPartitions)
rdd.saveAsTextFile("output_rdd")
In the above program we create an RDD with 5 partitions and then save it by invoking the saveAsTextFile(path: String) method. If you open the output_rdd directory, one output file is created for each partition:
part-00000: 1 2 3
part-00001: 4 5 6
part-00002: 7 8 9
part-00003: 10 11 12
part-00004: 13 14 15
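To verify how the elements are laid out without writing files, you can print each partition's contents with mapPartitionsWithIndex. This is a sketch that assumes the rdd created above; note that the exact element-to-partition assignment depends on how parallelize splits the range:

```scala
// Tag every element with the index of the partition it lives in,
// then collect to the driver and print.
rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(value => s"partition $index -> $value")
}.collect()
 .foreach(println)
```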
coalesce() uses the existing partitions to minimize the amount of data that is shuffled. As a result, coalesce produces partitions with different amounts of data, sometimes with very different sizes.
val coalesceRDD = rdd.coalesce(3)
part-00000: 1 2 3
part-00001: 4 5 6 7 8 9
part-00002: 10 11 12 13 14 15
If you analyze the output, the data from partition part-00002 is merged with part-00001 and the data from part-00004 is merged with part-00003. This minimizes the amount of data that is shuffled, but it results in partitions with different amounts of data.
The repartition method can be used to either increase or decrease the number of partitions in an RDD.
val repartitionRDD = rdd.repartition(3)
part-00000: 3 6 8 10 13
part-00001: 1 4 9 11 14
part-00002: 2 5 7 12 15
If you analyze the output, the entire dataset is shuffled across the network, but the data is distributed evenly across the partitions.
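Internally, repartition is simply coalesce with the shuffle flag enabled, so the call above can equivalently be written as follows (a sketch using the same rdd as above):

```scala
// repartition(n) is defined as coalesce(n, shuffle = true):
// the forced network shuffle is what rebalances the data evenly.
val repartitionRDD = rdd.coalesce(3, shuffle = true)
```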
| coalesce() | repartition() |
| --- | --- |
| Used only to reduce the number of partitions. | Used to increase or decrease the number of partitions. |
| Tries to minimize data movement by avoiding a network shuffle. | Triggers a network shuffle, which increases data movement. |
| Creates unequal-sized partitions. | Creates roughly equal-sized partitions. |
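One consequence of avoiding the shuffle is that a plain coalesce cannot increase the partition count: asking for more partitions than currently exist simply keeps the current count. A sketch using the 5-partition rdd from above:

```scala
// coalesce alone cannot add partitions, so the count stays at 5.
rdd.coalesce(10).getNumPartitions

// repartition shuffles the data, so it can grow the count to 10.
rdd.repartition(10).getNumPartitions
```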