in Big Data Hadoop & Spark by (11.4k points)

I have a dataset of (user, product, review) triples and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numeric (Int) IDs, while mine are String usernames and String SKUs.

Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.

I was wondering whether there is a better way of doing this. One approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

1 Answer

by (32.3k points)

Spark 1.0+ provides two methods to solve this problem:

  • RDD.zipWithIndex - Much like Seq.zipWithIndex, it assigns contiguous Long indices to the elements of your dataset. To do that, it first has to count the elements in each partition, so your input is evaluated twice; cache your input RDD if you want to avoid recomputation (see the first sketch after this list).

It zips this RDD with its element indices. The ordering is first based on the partition index and then on the ordering of items within each partition, so the first item in the first partition gets index 0 and the last item in the last partition receives the largest index.

  • RDD.zipWithUniqueId - It also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They happen to be contiguous only when each partition has the same number of elements.) The upside is that it does not need to know anything about the input, so it will not cause double evaluation (see the second sketch below).
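
As a rough sketch of how zipWithIndex could be used for this ALS use case (the file path, field layout, and ALS parameters below are made up for illustration, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on older Spark versions
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // Hypothetical input: (username, SKU, review score) triples.
    val raw: RDD[(String, String, Double)] = sc
      .textFile("reviews.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1), f(2).toDouble))
      .cache()  // reused several times below

    // Cache before zipWithIndex, since it runs an extra job to count
    // the elements in each partition.
    val userIds: RDD[(String, Long)] =
      raw.map(_._1).distinct().cache().zipWithIndex()
    val productIds: RDD[(String, Long)] =
      raw.map(_._2).distinct().cache().zipWithIndex()

    // Join the numeric IDs back onto the triples. MLlib's Rating takes
    // Ints, so this assumes fewer than Int.MaxValue distinct users/SKUs.
    val ratings: RDD[Rating] = raw
      .map { case (user, sku, score) => (user, (sku, score)) }
      .join(userIds)
      .map { case (_, ((sku, score), uid)) => (sku, (uid, score)) }
      .join(productIds)
      .map { case (_, ((uid, score), pid)) => Rating(uid.toInt, pid.toInt, score) }

    val model = ALS.train(ratings, 10, 10)  // rank = 10, iterations = 10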
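
And a minimal sketch of the zipWithUniqueId variant, assuming the same raw RDD as above. For ALS the gaps do not matter, because it only needs the IDs to be unique:

    // No count job is needed, so no caching is required here. Items in
    // partition k get IDs k, n + k, 2n + k, ... where n is the number
    // of partitions: unique, but possibly with gaps.
    val userIds: RDD[(String, Long)] =
      raw.map(_._1).distinct().zipWithUniqueId()
    val productIds: RDD[(String, Long)] =
      raw.map(_._2).distinct().zipWithUniqueId()

The joins and the call to ALS.train then work exactly as in the first sketch.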
