When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled according to the key K into another RDD. In this transformation, a lot of unnecessary data is transferred over the network.
Spark persists (spills) data to disk when more data is shuffled onto a single executor machine than can fit in its memory.
For example:
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
val group = data.groupByKey().collect()
group.foreach(println)
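Run in a spark-shell, this prints one entry per key with all of that key's values grouped into a single Iterable, roughly along these lines (the exact ordering and buffer type may differ):

(p,CompactBuffer(7, 5))
(t,CompactBuffer(8))
(s,CompactBuffer(3, 4))
(k,CompactBuffer(5, 6))

Note that every value has to travel across the network before it can be grouped, which is what makes groupByKey the more expensive of the two.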
But when reduceByKey is applied to a dataset of (K, V) pairs, the pairs on the same machine with the same key are combined before the data is shuffled.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.collect.foreach(println)
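For comparison, here is a minimal sketch of the same word count written with groupByKey, reusing the words array and spark session from above. It produces the same counts, but every (word, 1) pair is shuffled across the network before the sums are computed, whereas reduceByKey sums within each partition first and only shuffles the partial results:

val grouped = spark.sparkContext.parallelize(words)
  .map(w => (w, 1))
  .groupByKey()        // shuffles every (word, 1) pair
  .mapValues(_.sum)    // sums only after the shuffle
grouped.collect.foreach(println)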
I hope this answer helps you!