Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
3 views
in Big Data Hadoop & Spark by (50.2k points)

Which is better groupByKey or reduceByKey?

1 Answer

0 votes
by (107k points)
edited ago by

There are two different ways to compute counts:

val words1 = Array("one", "two", "two", "three", "three", "three")

val wordPairsRDD = sc.parallelize(words1).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD .reduceByKey(_ + _) .collect() 

val wordCountsWithGroup = wordPairsRDD .groupByKey() .map(t => (t._1, t._2.sum)) .collect()

reduceByKey will aggregate y key before rearranging, and on the opposite hand, groupByKey can rearrange all the value key pairs because the diagrams show.

On large size data, the difference is obvious.

31k questions

32.8k answers

501 comments

693 users

Browse Categories

...