in Big Data Hadoop & Spark

Which is better: groupByKey or reduceByKey?

1 Answer


When groupByKey() is applied on a dataset of (K, V) pairs, the records are shuffled across the network according to the key K to build another RDD. In this transformation a lot of unnecessary data is transferred over the network, because every individual value travels to its key's partition before any aggregation happens.

Spark spills shuffle data to disk when more data arrives at a single executor machine than can fit in memory.

For Example:

val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)

val group = data.groupByKey().collect()

group.foreach(println)
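Note that groupByKey only gathers the values; if what you actually want is a per-key aggregate, you still have to reduce the grouped values afterwards, and by then every value has already crossed the network. A minimal plain-Scala sketch of those semantics (no Spark required; the data mirrors the RDD above):

```scala
object GroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(('k', 5), ('s', 3), ('s', 4), ('p', 7), ('p', 5), ('t', 8), ('k', 6))

    // groupByKey semantics: all values for a key are gathered into one collection.
    val grouped = data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    grouped.foreach(println)   // e.g. (k, List(5, 6))

    // A per-key sum still needs a second pass over the grouped values.
    val sums = grouped.map { case (k, vs) => (k, vs.sum) }
    sums.foreach(println)      // e.g. (k, 11)
  }
}
```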

But when reduceByKey is applied on a dataset of (K, V) pairs, the pairs on the same machine that share a key are combined with the supplied reduce function before the data is shuffled, so far fewer records cross the network.

Example:

val words = Array("one","two","two","four","five","six","six","eight","nine","ten")

val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)

data.collect.foreach(println)
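To see why the map-side combine matters, here is a minimal plain-Scala sketch (no Spark required) that simulates two hypothetical partitions of the word-count data and counts how many records would be shuffled in each case:

```scala
object CombineSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical (word, 1) records split across two partitions.
    val partitions = Seq(
      Seq(("two", 1), ("two", 1), ("six", 1)),
      Seq(("six", 1), ("one", 1))
    )

    // groupByKey: every (K, V) record crosses the network as-is.
    val shuffledByGroup = partitions.flatten

    // reduceByKey: each partition pre-combines with the reduce function
    // (here _ + _), so at most one record per key leaves each partition.
    val shuffledByReduce = partitions.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
    }

    println(s"records shuffled with groupByKey:  ${shuffledByGroup.size}")  // 5
    println(s"records shuffled with reduceByKey: ${shuffledByReduce.size}") // 4
  }
}
```

On real datasets, where each key repeats many times per partition, this gap grows much larger, which is why reduceByKey (or aggregateByKey) is preferred over groupByKey for aggregations.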

I hope this answer helps you!


