in Big Data Hadoop & Spark by (11.4k points)

I have a dataset of (user, product, review) triples and want to feed it into MLlib's ALS algorithm.

The algorithm needs users and products to be numeric (Int) IDs, while mine are String usernames and String SKUs.

Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.

I was wondering whether there is a better way of doing this. One approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

1 Answer

by (32.3k points)

Spark 1.0+ provides two methods to solve this problem:

  • RDD.zipWithIndex - Much like Seq.zipWithIndex, it assigns contiguous Long indices to the elements of your dataset. To do that, it first has to count the elements in each partition, so your input is evaluated twice; cache your input RDD if you want to avoid recomputation (see the first sketch after this list).

It zips this RDD with its element indices. The ordering is first based on the partition index and then on the ordering of items within each partition, so the first item in the first partition gets index 0 and the last item in the last partition receives the largest index.

  • RDD.zipWithUniqueId - It also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They happen to be contiguous only when each partition has the same number of elements.) The upside is that it does not need to know anything about the input, so it will not cause double evaluation (see the second sketch below).
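
As a rough sketch of how zipWithIndex could be used for this ALS use case (the file path, field layout, and ALS parameters below are made up for illustration, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on older Spark versions
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // Hypothetical input: (username, SKU, review score) triples.
    val raw: RDD[(String, String, Double)] = sc
      .textFile("reviews.csv")
      .map(_.split(","))
      .map(f => (f(0), f(1), f(2).toDouble))
      .cache()  // reused several times below

    // Cache before zipWithIndex, since it runs an extra job to count
    // the elements in each partition.
    val userIds: RDD[(String, Long)] =
      raw.map(_._1).distinct().cache().zipWithIndex()
    val productIds: RDD[(String, Long)] =
      raw.map(_._2).distinct().cache().zipWithIndex()

    // Join the numeric IDs back onto the triples. MLlib's Rating takes
    // Ints, so this assumes fewer than Int.MaxValue distinct users/SKUs.
    val ratings: RDD[Rating] = raw
      .map { case (user, sku, score) => (user, (sku, score)) }
      .join(userIds)
      .map { case (_, ((sku, score), uid)) => (sku, (uid, score)) }
      .join(productIds)
      .map { case (_, ((uid, score), pid)) => Rating(uid.toInt, pid.toInt, score) }

    val model = ALS.train(ratings, 10, 10)  // rank = 10, iterations = 10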
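
And a minimal sketch of the zipWithUniqueId variant, assuming the same raw RDD as above. For ALS the gaps do not matter, because it only needs the IDs to be unique:

    // No count job is needed, so no caching is required here. Items in
    // partition k get IDs k, n + k, 2n + k, ... where n is the number
    // of partitions: unique, but possibly with gaps.
    val userIds: RDD[(String, Long)] =
      raw.map(_._1).distinct().zipWithUniqueId()
    val productIds: RDD[(String, Long)] =
      raw.map(_._2).distinct().zipWithUniqueId()

The joins and the call to ALS.train then work exactly as in the first sketch.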
