When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled according to the key K into another RDD. In this transformation, a lot of unnecessary data is transferred over the network.
Spark persists (spills) data to disk when more data is shuffled onto a single executor machine than can fit in its memory.
For example:
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
val group = data.groupByKey().collect()
group.foreach(println)
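Run in a spark-shell, this prints one entry per key with all of that key's values grouped into a single Iterable, roughly along these lines (the exact ordering and buffer type may differ):

(p,CompactBuffer(7, 5))
(t,CompactBuffer(8))
(s,CompactBuffer(3, 4))
(k,CompactBuffer(5, 6))

Note that every value has to travel across the network before it can be grouped, which is what makes groupByKey the more expensive of the two.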
But when reduceByKey is applied to a dataset of (K, V) pairs, the pairs on the same machine with the same key are combined before the data is shuffled.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.collect.foreach(println)
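For comparison, here is a minimal sketch of the same word count written with groupByKey, reusing the words array and spark session from above. It produces the same counts, but every (word, 1) pair is shuffled across the network before the sums are computed, whereas reduceByKey sums within each partition first and only shuffles the partial results:

val grouped = spark.sparkContext.parallelize(words)
  .map(w => (w, 1))
  .groupByKey()        // shuffles every (word, 1) pair
  .mapValues(_.sum)    // sums only after the shuffle
grouped.collect.foreach(println)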
I hope this answer helps you!