I need to split an RDD into two parts: one part that satisfies a condition, and another part that does not. I could call filter twice on the original RDD, but that seems inefficient. Is there a way to do what I'm after? I can't find anything in the API or in the literature.

1 Answer


Spark doesn't support this out of the box. Filtering the same data twice isn't that bad if you cache it first, and the filtering itself is quick.
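
As a minimal sketch of that cache-then-filter-twice approach (the variable names are just for illustration, and sc is assumed to be an existing SparkContext, e.g. in spark-shell):

val nums = sc.parallelize(1 to 100).cache() // cache so both filters reuse the same in-memory data
val evens = nums.filter(_ % 2 == 0)         // elements that satisfy the condition
val odds  = nums.filter(_ % 2 != 0)         // elements that do not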

If it's really just two different types, you can use a helper method:

import org.apache.spark.rdd.RDD

implicit class RDDOps[T](rdd: RDD[T]) {
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark has no filterNot, so negate the predicate
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)

And keep in mind that as soon as you have more than two categories of data, you can just assign each filtered result to its own val, as sketched below.
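
A hypothetical three-way split along those lines (the category boundaries and names are made up for illustration; sc is again an existing SparkContext):

val nums = sc.parallelize(1 to 100).cache() // cache once, then filter once per category
val low  = nums.filter(_ < 34)
val mid  = nums.filter(n => n >= 34 && n < 67)
val high = nums.filter(_ >= 67)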
