in Big Data Hadoop & Spark by (50.2k points)

Which is better, groupByKey or reduceByKey?

1 Answer

by (108k points)

There are two ways to compute word counts on a pair RDD:

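// Sample input: six words, three distinct keys
val words1 = Array("one", "two", "two", "three", "three", "three")

// Pair each word with a count of 1 (sc is the SparkContext, e.g. the one spark-shell provides)
val wordPairsRDD = sc.parallelize(words1).map(word => (word, 1))

// Option 1: reduceByKey sums the counts for each key
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()

// Option 2: groupByKey gathers all values for each key, then map sums them
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()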

Both produce the same counts, but they move data differently. reduceByKey aggregates the values by key within each partition before shuffling, whereas groupByKey shuffles all of the key-value pairs unchanged, as the diagrams show.

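On large datasets the difference is significant, because groupByKey transfers every pair over the network while reduceByKey transfers only one partial result per key from each partition. For aggregations like this count, reduceByKey is generally the better choice.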
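To make the shuffle difference concrete, here is a minimal sketch that imitates the two strategies with plain Scala collections, so it runs without Spark. The object name ShuffleSketch, the merge helper, and the two-partition layout are assumptions made for illustration; how Spark actually partitions the data depends on the cluster and configuration.

```scala
object ShuffleSketch {
  // Hypothetical layout: the six (word, 1) pairs split across two partitions.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("one", 1), ("two", 1), ("two", 1)),
    Seq(("three", 1), ("three", 1), ("three", 1))
  )

  // Sum the counts per key; used for the local combine and for the final merge.
  def merge(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // groupByKey-style: every pair leaves its partition unchanged.
  val shuffledByGroup: Seq[(String, Int)] = partitions.flatten

  // reduceByKey-style: combine inside each partition first,
  // then send only the partial sums.
  val shuffledByReduce: Seq[(String, Int)] =
    partitions.flatMap(part => merge(part).toSeq)

  def main(args: Array[String]): Unit = {
    println(s"groupByKey-style shuffle:  ${shuffledByGroup.size} records")  // 6
    println(s"reduceByKey-style shuffle: ${shuffledByReduce.size} records") // 3
    println(merge(shuffledByGroup) == merge(shuffledByReduce))              // true
  }
}
```

In this toy layout the reduceByKey-style path shuffles 3 records instead of 6 and still reaches the same final counts. The gap widens as the number of repeated keys per partition grows.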
