What's the difference between an RDD's map and mapPartitions methods? And does flatMap behave like map or like mapPartitions? Thanks.

(edit) i.e. what is the difference (either semantically or in terms of execution) between

def map[A, B](rdd: RDD[A], fn: (A => B))
             (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
  rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) },
    preservesPartitioning = true)
}

And:

def map[A, B](rdd: RDD[A], fn: (A => B))
             (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
  rdd.map(fn)
}

by (32.3k points)

Both map() and mapPartitions() are RDD transformations.

map():

It returns a new RDD by applying the given function to each element of the RDD. The function passed to map returns exactly one item for each input element.

mapPartitions():

Similar to map, but runs separately on each partition (block) of the RDD, so the function must be of type Iterator<T> ⇒ Iterator<U> when running on an RDD of type T.

Let me explain with an example. If there are 2000 rows and 20 partitions, then each partition will contain 2000/20 = 100 rows, assuming the rows are evenly distributed.

When we apply map(func) to the RDD, func is applied to every row, so in this case it is called 2000 times. If func does expensive setup on every call, that overhead can matter in time-critical applications.

If we call mapPartitions(func) on the RDD, func is called once per partition instead of once per row, so in this case it is called 20 times (the number of partitions). func receives an iterator over the partition's rows and still has to process each of them, but any one-time work inside func now runs 20 times rather than 2000.

map() applies the function at the element level, whereas mapPartitions() applies it at the partition level.
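The call counts in the 2000-row example can be checked with a small local job. This is only a sketch: it assumes Spark is on the classpath and runs in local mode, the object and accumulator names are made up for illustration, and accumulators updated inside transformations can over-count if a task is retried.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CallCounts {
  // Returns (calls made by map's function, calls made by mapPartitions' function, results equal?)
  def run(): (Long, Long, Boolean) = {
    // Local mode with 2 worker threads; the app name is arbitrary.
    val conf = new SparkConf().setMaster("local[2]").setAppName("call-counts")
    val sc = new SparkContext(conf)
    try {
      // 2000 elements spread over 20 partitions (100 per partition).
      val rdd = sc.parallelize(1 to 2000, numSlices = 20)

      // Accumulators count how often each function is invoked.
      val mapCalls = sc.longAccumulator("mapCalls")
      val partitionCalls = sc.longAccumulator("partitionCalls")

      val viaMap = rdd.map { x => mapCalls.add(1); x * 2 }
      val viaPartitions = rdd.mapPartitions { iter =>
        partitionCalls.add(1) // runs once per partition
        iter.map(_ * 2)       // still visits every element, lazily
      }

      // Transformations are lazy; an action such as collect() triggers them.
      val a = viaMap.collect()
      val b = viaPartitions.collect()

      (mapCalls.value.longValue, partitionCalls.value.longValue, a.sameElements(b))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (perElement, perPartition, same) = run()
    println(s"map function calls: $perElement")
    println(s"mapPartitions function calls: $perPartition")
    println(s"same results: $same")
  }
}
```

Barring task retries, the two counters should come out as 2000 and 20 while both RDDs hold the same data, which shows that mapPartitions changes how often the function is entered, not how many elements are processed.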

As for flatMap(): it behaves like map() in that the function is applied to each element individually, not to a whole partition. The difference from map() is that the function returns a sequence of zero or more items for each element, and those sequences are flattened into the resulting RDD.
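To see how flatMap() relates to map(), here is a minimal sketch, again assuming Spark on the classpath in local mode; the object name and sample strings are invented for illustration. Both functions receive one element at a time, but map() yields exactly one result per element while flatMap() yields zero or more.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlatMapDemo {
  def run(): (Array[Int], Array[String]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("flatmap-demo")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("a b", "", "c d e"))

      // map: exactly one output per input element (a word count per line).
      val counts = lines.map(_.split(" ").count(_.nonEmpty)).collect()

      // flatMap: zero or more outputs per input element, flattened into one RDD.
      val words = lines.flatMap(_.split(" ").filter(_.nonEmpty)).collect()

      (counts, words)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (counts, words) = run()
    println(counts.mkString(", "))
    println(words.mkString(", "))
  }
}
```

The three input lines give three counts from map(), but five words from flatMap(), because the empty line contributes nothing and the other lines contribute several items each.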

by (33.1k points)

map():

The map method is available on both mutable and immutable Scala collections, and an RDD's map works the same way. It takes a function (a transformation, not a predicate) and applies it to every element in the collection, creating a new collection from the results.

mapPartitions():

> mapPartitions() can be used as an alternative to map() and foreach().

> mapPartitions() is called once for each partition, while map() and foreach() are called for each element in an RDD.

> Hence one can do initialization on a per-partition basis rather than a per-element basis.
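The per-partition initialization point can be sketched as follows. The example assumes Spark on the classpath in local mode and uses java.security.MessageDigest as a stand-in for any object that is costly to create or not serializable; the object name and input data are illustrative only.

```scala
import java.security.MessageDigest
import org.apache.spark.{SparkConf, SparkContext}

object PerPartitionInit {
  def run(): Array[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("per-partition-init")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

      rdd.mapPartitions { iter =>
        // Created once per partition rather than once per element.
        val digest = MessageDigest.getInstance("SHA-256")
        iter.map { s =>
          // digest(bytes) hashes the input and resets the instance for reuse.
          digest.digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString
        }
      }.collect()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    run().foreach(println)
}
```

Because the digest is created inside the function passed to mapPartitions, it never has to be serialized and shipped from the driver. Writing the same logic with map() would either construct a new instance for every element or require capturing one in the closure.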