How to sort an RDD in Scala Spark?

1 Answer

Amit Rawat · Answer 1 · 2019-07-09T10:22:38+0000

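If you only need the top 10 elements, use rdd.top(10). It is faster than sorting the whole RDD and taking the first 10, because it avoids a full sort and the shuffle that comes with it.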
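As a rough illustration of the difference, here is a minimal sketch that assumes a local Spark session and a made-up RDD of integers. The object name, the `run` helper, and the sample data are hypothetical; `top`, `takeOrdered`, and `sortBy` are the standard RDD methods.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object TopVsSort {
  // Returns (largest 10 via top, first 10 of a full descending sort).
  def run(sc: SparkContext): (Array[Int], Array[Int]) = {
    // Hypothetical sample data: the integers 1 to 1000 spread over 8 partitions.
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)

    // Largest 10 elements, returned to the driver in descending order.
    val top10 = rdd.top(10)

    // Full sort: only worth it when the whole RDD must be ordered,
    // because sortBy shuffles the data across the cluster.
    val sortedDesc = rdd.sortBy(x => x, ascending = false)

    (top10, sortedDesc.take(10))
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for the sake of the example.
    val spark = SparkSession.builder()
      .appName("top-vs-sort")
      .master("local[*]")
      .getOrCreate()
    val (viaTop, viaSort) = run(spark.sparkContext)
    println(viaTop.mkString(", "))
    println(viaSort.mkString(", "))
    spark.stop()
  }
}
```

For the smallest elements instead of the largest, rdd.takeOrdered(10) works the same way. Both methods return an array to the driver, so they are meant for small N.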

rdd.top makes one parallel pass through the data, collecting the top N of each partition in a heap, and then merges the heaps. For a small, fixed N this is roughly an O(rdd.count) operation (O(rdd.count log N) in general), and only the per-partition results are sent to the driver. Sorting would be O(rdd.count log rdd.count) and would incur a lot of data transfer: it does a shuffle, so all of the data would be transmitted over the network.
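The per-partition heap idea can be sketched in plain Scala without Spark. This is an illustration only, not Spark's actual source: the object and method names are made up, partitions are modelled as ordinary sequences, and the final merge is simplified to sorting the small set of candidates.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep the n largest elements of one partition in a min-heap of size at most n.
  def topOfPartition(part: Iterator[Int], n: Int): Seq[Int] = {
    // Reversed ordering turns Scala's max-heap PriorityQueue into a min-heap,
    // so heap.head is the smallest element kept so far.
    val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
    part.foreach { x =>
      if (heap.size < n) heap.enqueue(x)
      else if (heap.nonEmpty && x > heap.head) {
        heap.dequeue() // drop the smallest kept element
        heap.enqueue(x)
      }
    }
    heap.toSeq
  }

  // Merge step: at most n candidates per partition, so this stays small.
  def topN(partitions: Seq[Seq[Int]], n: Int): Seq[Int] = {
    val candidates = partitions.flatMap(p => topOfPartition(p.iterator, n))
    candidates.sorted(Ordering[Int].reverse).take(n)
  }
}
```

Each element costs at most one heap operation of size N, and only N candidates per partition take part in the merge, which is why no shuffle of the full data set is needed.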
