
I'm looking for a way to split an RDD into two or more RDDs.

1 Answer


It is not possible to yield multiple RDDs from a single transformation. If you want to split an RDD, you have to apply a separate filter for each split condition. For example:

# Split by applying one filter per condition
def even(x): return x % 2 == 0
def odd(x): return not even(x)

rdd = sc.parallelize(range(20))  # sc is the active SparkContext
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
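Note that each filter is an independent transformation, so without caching Spark recomputes the parent RDD for every split. A minimal usage sketch, assuming the sc and rdd defined above:

rdd.cache()  # avoid recomputing the parent for each filter
print(rdd_odd.collect())   # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(rdd_even.collect())  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]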

If you have only a binary condition and the predicate is expensive to compute, you may prefer something like this:

# Compute the predicate once and store the result alongside each value
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()

# Both splits reuse the cached (value, flag) pairs
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
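A brief usage sketch of the cached variant; the first action materializes and caches the pairs, and unpersist releases them once both splits have been computed:

print(rdd_odd.collect())   # first action computes and caches kv_rdd
print(rdd_even.collect())  # reuses the cached (value, flag) pairs
kv_rdd.unpersist()         # free the cached data when it is no longer needed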

This way the predicate is computed only once per element, but it requires an additional pass over all the data.

It is important to note that, as long as the input RDD is properly cached and there are no additional assumptions about the data distribution, there is no significant difference in time complexity between repeated filters and a for-loop with nested if-else.
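The same pattern generalizes to more than two splits. A hedged sketch, assuming a hypothetical bucket function (not part of the original answer) that maps each element to one of n split indices:

def bucket(x):
    # hypothetical routing condition: one of 3 buckets
    return x % 3

keyed = rdd.map(lambda x: (x, bucket(x)))
keyed.cache()  # evaluate the bucketing once, shared by every filter

# one filter per bucket; i=i pins the loop variable in each closure
splits = [keyed.filter(lambda kv, i=i: kv[1] == i).keys() for i in range(3)]

Each element of splits is an independent RDD holding one bucket, and the total cost stays comparable to a single nested if-else pass as long as keyed remains cached.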

