groupByKey() simply groups your dataset based on a key. It results in a data shuffle when the RDD is not already partitioned by that key.
aggregateByKey(), on the other hand, is logically the same as reduceByKey(), but it lets you return the result as a different type. In other words, you can have input values of type x and an aggregate result of type y: for example, (1,3),(1,2) as input and (1,"five") as output. It also takes a zero value that is applied at the start of each key, once per partition.
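A minimal sketch of both operations, assuming an existing SparkContext named sc (e.g. from spark-shell). Spelling the sum out as a word is omitted here; the point is only that the input values are Ints while the result is a String:

```scala
val pairs = sc.parallelize(Seq((1, 3), (1, 2)))

// groupByKey: just groups the values per key, no aggregation
val grouped = pairs.groupByKey()            // RDD[(Int, Iterable[Int])]

// aggregateByKey: Int values in, String result out
val summed = pairs.aggregateByKey("0")(     // zero value, of the result type
  (acc, v) => (acc.toInt + v).toString,     // merge an Int value into the String accumulator (within a partition)
  (a, b)   => (a.toInt + b.toInt).toString  // merge two String accumulators (across partitions)
)
summed.collect()                            // Array((1,"5"))
```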
| groupByKey | aggregateByKey |
| --- | --- |
| Does not use a combiner | Uses a combiner |
| Takes no function parameters; generally followed by map or flatMap | Requires 3 parameters |
aggregateByKey() is similar to combineByKey(); there is a slight difference in functioning and arguments.
With aggregateByKey() we can pass an initial value, which is used within each partition (a comparison with combineByKey() is sketched after the parameter list below).
The aggregateByKey function requires 3 parameters:
An initial ‘zero’ value that does not affect the total values to be collected. For example, if we were adding numbers, the initial value would be 0; in the case of collecting unique elements per key, the initial value would be an empty set.
A combining function accepting two parameters. The second parameter is merged into the first parameter. This function combines/merges values within a partition.
A merging function accepting two parameters. In this case, the parameters are merged into one. This step merges values across partitions.
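To make the similarity to combineByKey() concrete (again assuming a SparkContext sc), the same per-key sum can be written with either operation; the visible difference in arguments is that combineByKey() takes a createCombiner function instead of a zero value:

```scala
val nums = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// aggregateByKey: the zero value 0 starts the accumulator in each partition
val viaAggregate = nums.aggregateByKey(0)(_ + _, _ + _)

// combineByKey: createCombiner builds the accumulator from the first value instead
val viaCombine = nums.combineByKey(
  (v: Int) => v,                  // createCombiner: called on the first value for a key in a partition
  (acc: Int, v: Int) => acc + v,  // mergeValue: combines values within a partition
  (a: Int, b: Int) => a + b       // mergeCombiners: merges accumulators across partitions
)
```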
Example:
Imagine you have a list of pairs. You parallelize it and then you want to "combine" them by key producing a sum. In this case, reduceByKey and aggregateByKey work similarly, as shown below:
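A minimal sketch of that equivalence, assuming a SparkContext sc:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduceByKey: input and output value types must match
val sums1 = pairs.reduceByKey(_ + _)

// aggregateByKey: zero value 0, add within partitions, then add across partitions
val sums2 = pairs.aggregateByKey(0)(_ + _, _ + _)

sums1.collect()   // e.g. Array((a,3), (b,3))
sums2.collect()   // e.g. Array((a,3), (b,3))
```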
Now, imagine that you want the aggregated result to be a set of the values, i.e. a type different from that of the input values:
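One way to sketch this, collecting the distinct Int values per key into a Set[Int]:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 1), ("b", 3)))

val uniques = pairs.aggregateByKey(Set.empty[Int])( // zero value: an empty set
  (acc, v) => acc + v,   // within a partition: add the value to the set
  (s1, s2) => s1 ++ s2   // across partitions: union the partial sets
)
uniques.collect()        // e.g. Array((a,Set(1, 2)), (b,Set(3)))
```

Here reduceByKey would not work directly, because its function must return the same type as the input values.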