in Big Data Hadoop & Spark by (11.4k points)

I'm new to Apache Spark and am learning its basic functionality. Suppose I have an RDD of (key, value) tuples and want to keep only the unique ones, so I call the distinct() function. On what basis does distinct() decide that two tuples are different? Is it based on the keys, the values, or both?

1 Answer

by (32.3k points)

According to the API docs, calling distinct() on an RDD, as in RDD.distinct(), returns a new RDD containing the distinct elements of the existing RDD.

In my experience, for an RDD of tuples the tuple as a whole is considered: two elements count as duplicates only when both the key and the value are equal.
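A minimal plain-Python sketch of that behaviour, with no Spark dependency. The helper `distinct_elements` is hypothetical and only models the semantics of distinct() on a list standing in for the RDD. Assuming a live SparkContext named `sc`, the PySpark counterpart would be `sc.parallelize(pairs).distinct()`.

```python
def distinct_elements(elements):
    """Model of RDD.distinct(): keep one copy of each element, compared as a whole."""
    seen = set()
    result = []
    for element in elements:
        # Tuple membership checks compare every position, not just the key.
        if element not in seen:
            seen.add(element)
            result.append(element)
    return result


pairs = [("k1", "v11"), ("k1", "v11"), ("k1", "v12"), ("k2", "v11")]
unique_pairs = distinct_elements(pairs)
print(unique_pairs)
```

Only the exact repeat of ("k1", "v11") is dropped. ("k1", "v12") shares a key with it and ("k2", "v11") shares a value, yet both survive. Note that this sketch keeps first-seen order, whereas a real RDD makes no ordering promise after distinct().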

If you want distinct keys or distinct values instead, then depending on exactly what you want to accomplish, you can either:

A. call groupByKey() to collapse the pairs into one entry per key, transforming {(k1,v11),(k1,v12),(k2,v21),(k2,v22)} into {(k1,[v11,v12]), (k2,[v21,v22])};

B. keep only the keys by calling keys(), or only the values by calling values(), and then call distinct() on the result.
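Both options can be modelled in plain Python, again without Spark. This is a sketch of the semantics only: the variable names are illustrative, and a list stands in for the RDD. In PySpark the corresponding calls would be `rdd.groupByKey()`, `rdd.keys().distinct()` and `rdd.values().distinct()`.

```python
from collections import defaultdict

pairs = [("k1", "v11"), ("k1", "v12"), ("k2", "v21"), ("k2", "v22"), ("k2", "v21")]

# Option A: gather all values under their key (models groupByKey()).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
grouped = dict(grouped)

# Option B: project out keys or values, then deduplicate
# (models keys().distinct() and values().distinct()).
distinct_keys = sorted({key for key, _ in pairs})
distinct_values = sorted({value for _, value in pairs})

print(grouped)
print(distinct_keys)
print(distinct_values)
```

One design point worth noting: grouping yields one entry per key but does not deduplicate the values inside each group, so the repeated "v21" still appears twice under "k2". Sorting here is only to make the printed result deterministic.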

distinct() relies on the hashCode and equals methods of the elements to make this determination. Tuples come with built-in equality that delegates to the equality of each element, position by position. So distinct() operates on the entire Tuple2 object.
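The same idea can be illustrated with a Python analogue, where `__hash__` and `__eq__` play the role of hashCode and equals. This is an analogy under that assumption, not the JVM mechanism itself, and the classes `PlainPoint` and `ValuePoint` are hypothetical.

```python
from dataclasses import dataclass


class PlainPoint:
    """Defines no __eq__ or __hash__, so instances are compared by identity."""

    def __init__(self, x, y):
        self.x, self.y = x, y


@dataclass(frozen=True)
class ValuePoint:
    """Frozen dataclass: __eq__ and __hash__ are generated from the fields."""

    x: int
    y: int


# Two separately built objects with equal fields.
plain_count = len({PlainPoint(1, 2), PlainPoint(1, 2)})
value_count = len({ValuePoint(1, 2), ValuePoint(1, 2)})

# A tuple delegates equality and hashing to each element in turn.
tuple_count = len({("k1", ValuePoint(1, 2)), ("k1", ValuePoint(1, 2))})

print(plain_count, value_count, tuple_count)
```

The practical consequence is that custom objects stored inside tuples need value-based equality and hashing, otherwise equal-looking elements are not recognised as duplicates.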
