I have a list of tuples of type (user id, name, count).
For example,
val x = sc.parallelize(List(
  ("a", "b", 1),
  ("a", "b", 1),
  ("c", "b", 1),
  ("a", "d", 1)
))
I'm attempting to reduce this collection so that each (id, name) pair appears once, with its counts summed.
So the val x above should be converted to:
(a,ArrayBuffer((d,1), (b,2)))
(c,ArrayBuffer((b,1)))
Here is the code I am currently using:
val byKey = x.map { case (id, uri, count) => ((id, uri), count) }
val grouped = byKey.groupByKey
val count = grouped.map { case ((id, uri), counts) => (id, (uri, counts.sum)) }
val grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey
(Note: on newer Spark versions, groupByKey returns Iterable[(String, Int)] rather than Seq.)
grouped2.foreach(println)
I'd like to use reduceByKey instead, since it performs better than groupByKey.
How can reduceByKey be used in place of the code above to produce the same result?
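One way this could be sketched (untested against Spark; variable names are mine) is to key on (id, uri), sum with reduceByKey(_ + _), then re-key by id and groupByKey only for the final grouping: x.map { case (id, uri, n) => ((id, uri), n) }.reduceByKey(_ + _).map { case ((id, uri), n) => (id, (uri, n)) }.groupByKey. The same aggregation logic, runnable with plain Scala collections and no SparkContext:

```scala
// Sample data matching the RDD above
val data = List(("a", "b", 1), ("a", "b", 1), ("c", "b", 1), ("a", "d", 1))

// Step 1: sum counts per (id, name) key -- the collections analogue of reduceByKey(_ + _)
val summed: Map[(String, String), Int] = data
  .groupBy { case (id, name, _) => (id, name) }
  .map { case (key, rows) => (key, rows.map(_._3).sum) }

// Step 2: re-key by id and regroup, keeping (name, total) pairs
val result: Map[String, List[(String, Int)]] = summed
  .toList
  .map { case ((id, name), total) => (id, (name, total)) }
  .groupBy(_._1)
  .map { case (id, pairs) => (id, pairs.map(_._2)) }

println(result) // per-id lists of (name, total); ordering within each list may vary
```

The difference from the groupByKey pipeline is that the per-pair sums are computed first (in Spark, reduceByKey does this map-side before the shuffle), so much less data moves across the network.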