Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-10T09:59:00+0000

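One way to join your RDDs is to create a custom partitioner and then use zipPartitions. The custom partitioner places each 2-tuple key of the second RDD in the partition chosen by its first element, so matching keys from both RDDs end up in the same partition and can be joined locally. I suggest the following approach: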

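import org.apache.spark.HashPartitioner

// Routes tuple keys by their first element, so rdd2's (Int, String) keys
// land in the same partition as rdd1's matching Int keys.
class RDD2Partitioner(partitions: Int) extends HashPartitioner(partitions) {
  override def getPartition(key: Any): Int = key match {
    // Matching on the tuple shape avoids the unchecked Tuple2[Int, String] type pattern.
    case (first, _) => super.getPartition(first)
    case _          => super.getPartition(key)
  }
}

val numSplits = 8

// zipPartitions requires both RDDs to have the same number of partitions.
val rdd1 = sc.parallelize(Seq((1, "X"), (2, "Y"), (3, "Z"))).partitionBy(new HashPartitioner(numSplits))
val rdd2 = sc.parallelize(Seq(((1, "M"), 111), ((1, "MM"), 111), ((1, "NN"), 123), ((2, "Y"), 222), ((3, "X"), 333))).partitionBy(new RDD2Partitioner(numSplits))

val result = rdd2.zipPartitions(rdd1)(
  (iter2, iter1) => {
    // Build a lookup table from this partition of rdd1.
    val m = iter1.toMap
    for {
      ((t: Int, w), u) <- iter2
      if m.contains(t)
    } yield ((t, w), (u, m(t)))
  }
).partitionBy(new HashPartitioner(numSplits)) // reshuffles the result by its full (Int, String) key

// Inspect the contents of each partition.
result.glom().collect()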
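
To see why the per-partition lookup is enough, here is a minimal sketch in plain Scala collections with no Spark involved. It simulates the partitions with groupBy and then joins bucket by bucket. The names CoPartitionSketch, bucket and bucketFor are made up for illustration, and the non-negative hashCode modulo is only an assumption standing in for what the hash partitioner does.

```scala
// Plain-Scala simulation (no Spark) of joining two co-partitioned datasets.
object CoPartitionSketch {
  val numSplits = 8

  // Assumed stand-in for a hash partitioner: non-negative modulo of hashCode.
  def bucket(key: Any): Int = {
    val raw = key.hashCode % numSplits
    if (raw < 0) raw + numSplits else raw
  }

  // Tuple keys are routed by their first element, all other keys by themselves.
  def bucketFor(key: Any): Int = key match {
    case (first, _) => bucket(first)
    case other      => bucket(other)
  }

  val small = Seq((1, "X"), (2, "Y"), (3, "Z"))
  val large = Seq(((1, "M"), 111), ((1, "MM"), 111), ((1, "NN"), 123), ((2, "Y"), 222), ((3, "X"), 333))

  // Simulated partitions: bucket index -> records assigned to that bucket.
  val smallParts: Map[Int, Seq[(Int, String)]] =
    small.groupBy { case (k, _) => bucketFor(k) }
  val largeParts: Map[Int, Seq[((Int, String), Int)]] =
    large.groupBy { case (k, _) => bucketFor(k) }

  // Join each bucket locally, the way the zipPartitions function does per partition.
  val joined: Seq[((Int, String), (Int, String))] =
    (0 until numSplits).flatMap { p =>
      val lookup = smallParts.getOrElse(p, Seq.empty).toMap
      largeParts.getOrElse(p, Seq.empty).flatMap { case ((t, w), u) =>
        lookup.get(t).map(v => ((t, w), (u, v)))
      }
    }
}
```

Because a key such as (1, "M") goes to the same bucket as the plain key 1, every record of the larger dataset finds its match inside its own bucket, and no cross-bucket lookup is needed.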
