I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.

I understand what it does, of course. However, it has always been my understanding that:

the order of elements in an RDD is a meaningless concept;
the number of partitions and their sizes are an implementation detail, exposed to the user only for performance tuning.

In other words, an RDD is a (multi)set, not a sequence (and, of course, in Python, for example, one gets AttributeError: 'set' object has no attribute 'zip').

What is wrong with my understanding above?

1 Answer

It isn't true that RDDs are always unordered. For example, an RDD produced by a sortBy operation has a well-defined order.

An RDD is not a set; it may contain duplicates.

Partitioning is not hidden from the caller, and can be controlled and queried.

Many operations, such as map, preserve both order and partitioning. That said, the assumptions zip depends on (the same number of partitions in both RDDs, and the same number of elements in each partition) are subtle and easy to violate by accident, but the method certainly has a purpose.
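
To make the point about order and partitioning concrete, here is a toy model in plain Python. It is not Spark code: the helpers toy_map and toy_zip are hypothetical names invented for this sketch, and an "RDD" is modelled simply as a list of partitions, each an ordered list. The only thing it tries to show is why pairing works after a map-like step and fails once the partition layout changes.

```python
# Toy stand-in for an RDD: an ordered list of partitions,
# where each partition is an ordered list of elements.
# This is an illustrative assumption, not how Spark stores data.

def toy_map(f, parts):
    """Apply f to every element; partition boundaries and order stay the same."""
    return [[f(x) for x in part] for part in parts]


def toy_zip(left, right):
    """Pair elements partition by partition.

    Raises ValueError when the two layouts differ, which is the
    kind of mismatch that breaks a partition-wise zip.
    """
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for lpart, rpart in zip(left, right):
        if len(lpart) != len(rpart):
            raise ValueError("different number of elements in a partition")
        zipped.append(list(zip(lpart, rpart)))
    return zipped


numbers = [[1, 2], [3, 4, 5]]                    # two partitions
squares = toy_map(lambda x: x * x, numbers)      # same layout as numbers
pairs = toy_zip(numbers, squares)
print(pairs)  # [[(1, 1), (2, 4)], [(3, 9), (4, 16), (5, 25)]]
```

In this model, calling toy_zip(numbers, [[1, 4, 9], [16, 25]]) raises ValueError even though both sides hold five elements, because the elements are split across partitions differently. That is the sort of accident the answer warns about: any step that changes the layout of one side only makes the two sides unsafe to zip.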
