Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
4 views
in Big Data Hadoop & Spark by (11.4k points)
Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

1 Answer

0 votes
by (32.3k points)

No. If two RDDs have the same partitioner, there will be no shuffle caused by the join. You can see this in CoGroupedRDD.scala:

override def getDependencies: Seq[Dependency[_]] = {

  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>

    if (rdd.partitioner == Some(part)) {

      logDebug("Adding one-to-one dependency with " + rdd)

      new OneToOneDependency(rdd)

    } else {

      logDebug("Adding shuffle dependency with " + rdd)

      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)

    }

  }

}

However, keep in mind that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).

This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.

Related questions

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
...