How does the distinct() function work in Spark?

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-23T05:31:29+0000

From reading the API docs, this is what I learned about the distinct function:

When you call distinct() on an RDD, as in RDD.distinct(), it returns a new RDD containing the distinct elements of the existing RDD.

In my experience, on an RDD of tuples the tuple as a whole is compared: two elements count as duplicates only if both the key and the value match.
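To make the whole-tuple behaviour concrete, here is a minimal plain-Python sketch that models the semantics locally instead of calling Spark. The helper distinct_model is hypothetical and exists only for this illustration; on a real pair RDD the corresponding call would be rdd.distinct(), and Spark makes no promise about the order of the result.

```python
def distinct_model(elements):
    """Local stand-in for RDD.distinct(): keep one copy of each element.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    seen = {}
    for element in elements:
        # Dict lookup uses hash + equality, so each tuple is compared as a whole.
        seen.setdefault(element, None)
    return list(seen)  # dicts keep first-seen order in Python 3.7+


pairs = [("k1", "v11"), ("k1", "v12"), ("k1", "v11"), ("k2", "v21")]

# ("k1", "v11") and ("k1", "v12") share a key but differ as tuples, so both stay.
# Only the exact repeat of ("k1", "v11") is dropped.
unique_pairs = distinct_model(pairs)
print(unique_pairs)
```

Sharing a key is not enough to be treated as a duplicate; the whole pair has to match.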

If you want distinct keys or distinct values instead, then depending on exactly what you want to accomplish, you can either:

A. Call groupByKey() to transform {(k1,v11),(k1,v12),(k2,v21),(k2,v22)} into {(k1,[v11,v12]), (k2,[v21,v22])}, which leaves one entry per key;

Or

B. Reduce the pairs to just the keys by calling keys(), or to just the values by calling values(), and then call distinct() on the result.
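The two options can be sketched with plain Python collections. This is a local model of the behaviour, not Spark code, and the sample data (including the repeated pair) is made up for the illustration; the comments name the RDD calls each step stands in for.

```python
from collections import defaultdict

pairs = [("k1", "v11"), ("k1", "v12"), ("k2", "v21"), ("k2", "v22"), ("k1", "v11")]

# Option A: collect the values under each key, as groupByKey() does on a pair RDD.
# Each key appears once, but repeated values inside a group are kept.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
grouped = dict(grouped)

# Option B: project onto keys or values first, then deduplicate.
# The Spark equivalents are rdd.keys().distinct() and rdd.values().distinct().
distinct_keys = sorted({key for key, _ in pairs})
distinct_values = sorted({value for _, value in pairs})

print(grouped)
print(distinct_keys)
print(distinct_values)
```

Option A is the better fit when you still need the values attached to each key; option B is simpler when only the set of keys (or values) matters.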

distinct uses the hashCode and equals methods of the objects to decide which elements are duplicates. Tuples come with built-in equality that delegates to the equality of the object at each position, so distinct works against the entire Tuple2 object.
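The same dependence on hashing and equality can be seen in a small Python analogy, where __hash__ and __eq__ play the roles of hashCode and equals. Both classes below are hypothetical and the deduplication is done with a local set rather than Spark, so treat this as a sketch of the principle only.

```python
class PointWithoutEquality:
    """Inherits identity-based equality and hashing from object."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointWithEquality:
    """Defines value-based equality and a matching hash."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, PointWithEquality) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal objects must produce equal hashes, so hash the same fields.
        return hash((self.x, self.y))


# Two objects with the same fields are still different elements here.
without_eq = {PointWithoutEquality(1, 2), PointWithoutEquality(1, 2)}

# With value-based __eq__ and __hash__ they collapse into a single element.
with_eq = {PointWithEquality(1, 2), PointWithEquality(1, 2)}

print(len(without_eq), len(with_eq))
```

Tuples behave like the second class out of the box, which is why deduplicating them needs no extra work.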
