+2 votes
2 views
in Data Science by (11.4k points)
I'm currently learning Spark and developing custom machine learning algorithms. My question is what is the difference between .map() and .mapValues() and what are cases where I clearly have to use one instead of the other?

4 Answers

+2 votes
by (32.3k points)

map() is a transformation and is narrow in nature, i.e. no data is shuffled between partitions. It takes a function as an argument, applies it to each element, and returns a new RDD. It is one of the most widely used operations in the Spark RDD API.

If we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in the value (and not the key). In such cases, we can use mapValues() instead of map().

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost; see the note on partitioning at the bottom):

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is clearer and shorter, so when you just want to transform the values and keep the keys as-is, it is recommended to use mapValues.
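To make the comparison concrete, here is a minimal sketch that runs both forms against a small pair RDD in local mode. The function f, the sample data, and the object name are made up for illustration; only map, mapValues, parallelize, and collect come from the Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapVsMapValues {
  // Hypothetical value transformation standing in for f: B => C.
  def f(v: Int): String = s"value-$v"

  // Rewrites the values with map, rebuilding each (key, value) tuple by hand.
  def viaMap(rdd: RDD[(String, Int)]): RDD[(String, String)] =
    rdd.map { case (k, v) => (k, f(v)) }

  // Rewrites the values with mapValues; the keys pass through untouched.
  def viaMapValues(rdd: RDD[(String, Int)]): RDD[(String, String)] =
    rdd.mapValues(f)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("map-vs-mapValues"))
    try {
      val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      // Sort before comparing, since collect() order depends on partition layout.
      val left = viaMap(rdd).collect().sorted
      val right = viaMapValues(rdd).collect().sorted
      println(left.sameElements(right))
    } finally sc.stop()
  }
}
```

Both calls should produce the same pairs; the difference is only in how much of the tuple you have to spell out yourself.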

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply cannot use mapValues, because it only passes the values to your function.
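As a rough illustration of the cases where map is the only option, the sketch below uses two hypothetical helpers: describe plays the role of f: (A, B) => C, and upperCaseKeys rewrites the key itself. The names and data are assumptions for the example only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapWithKeys {
  // Hypothetical f: (A, B) => C that needs both the key and the value.
  def describe(k: String, v: Int): String = s"$k has $v items"

  // Produces RDD[C]; the result is no longer a pair RDD.
  def describeAll(rdd: RDD[(String, Int)]): RDD[String] =
    rdd.map { case (k, v) => describe(k, v) }

  // Changes the key itself, which mapValues has no way to do.
  def upperCaseKeys(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.map { case (k, v) => (k.toUpperCase, v) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("map-with-keys"))
    try {
      val rdd = sc.parallelize(Seq(("apples", 3), ("pears", 5)))
      describeAll(rdd).collect().foreach(println)
      upperCaseKeys(rdd).collect().foreach(println)
    } finally sc.stop()
  }
}
```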

Now, the last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result's partitioner becomes None), as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
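You can check the partitioning behaviour yourself by inspecting the partitioner field of each result. This is a small sketch assuming a HashPartitioner with 4 partitions and made-up sample data; partitionersAfter is a hypothetical helper, not part of Spark.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerCheck {
  // Returns the partitioner seen after mapValues and after an equivalent map.
  def partitionersAfter(
      rdd: RDD[(String, Int)]): (Option[Partitioner], Option[Partitioner]) = {
    val viaMapValues = rdd.mapValues(_ * 10)
    val viaMap = rdd.map { case (k, v) => (k, v * 10) }
    (viaMapValues.partitioner, viaMap.partitioner)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioner-check"))
    try {
      val partitioned = sc
        .parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
        .partitionBy(new HashPartitioner(4))
      val (kept, dropped) = partitionersAfter(partitioned)
      println(s"after mapValues: $kept") // expected: Some(...) holding the HashPartitioner
      println(s"after map: $dropped")    // expected: None
    } finally sc.stop()
  }
}
```

Note that map does not physically move any data here; it only drops the metadata that tells Spark how the keys are laid out.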

+2 votes
by (108k points)
mapValues is only suitable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value).

In other terms, given f: B => C and rdd: RDD[(A, B)], the following two are identical:

1. val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

2. val result: RDD[(A, C)] = rdd.mapValues(f)

The second is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.

On the other hand, if you want to modify the keys too (e.g. if you want to apply f: (A, B) => C), you just can't use mapValues because it would only pass the values to your function.

The last difference is based on partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result's partitioner becomes None) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
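One practical consequence, sketched below under some assumptions: when a key-based operation such as reduceByKey is given the same partitioner the RDD already has, Spark can skip the shuffle, and that only works if the partitioner survived the previous step. needsShuffle is a hypothetical helper that looks at the direct dependencies of the result; the data and the partition count are made up.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCheck {
  // True when the RDD's direct parents include a shuffle boundary.
  def needsShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shuffle-check"))
    try {
      val p = new HashPartitioner(4)
      val partitioned =
        sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).partitionBy(p)

      // Partitioner preserved, so reduceByKey with the same partitioner can stay local.
      val afterMapValues = partitioned.mapValues(_ + 1).reduceByKey(p, _ + _)
      // Partitioner dropped by map, so reduceByKey has to repartition by key.
      val afterMap =
        partitioned.map { case (k, v) => (k, v + 1) }.reduceByKey(p, _ + _)

      println(needsShuffle(afterMapValues)) // expected: false
      println(needsShuffle(afterMap))       // expected: true
    } finally sc.stop()
  }
}
```

Both pipelines compute the same sums; the difference shows up only in the execution plan, which matters more as the data grows.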

Hope this answer helps you!
0 votes
by (19k points)

The map function takes a function that transforms each element of a collection:

map(f: T => U)

RDD[T] => RDD[U]

Here T may be a tuple, and we may want to act only on the values, not the keys. mapValues takes a function that maps the values in the input to the values in the output:

mapValues(f: V => W)

RDD[(K, V)] => RDD[(K, W)]
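The two signatures above can be seen side by side in a short sketch. The type annotations are written out explicitly so the shapes are visible; the word list and the helper names are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SignatureDemo {
  // map on a plain RDD: RDD[T] => RDD[U], here String => Int.
  def lengths(words: RDD[String]): RDD[Int] =
    words.map(_.length)

  // map can also build a pair RDD, since T and U may be any types.
  def withLengths(words: RDD[String]): RDD[(String, Int)] =
    words.map(w => (w, w.length))

  // mapValues on a pair RDD: RDD[(K, V)] => RDD[(K, W)], here Int => Int.
  def doubled(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.mapValues(_ * 2)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("signature-demo"))
    try {
      val words = sc.parallelize(Seq("spark", "rdd"))
      lengths(words).collect().foreach(println)
      doubled(withLengths(words)).collect().foreach(println)
      // words.mapValues(...) would not compile: mapValues is only
      // available on RDDs whose elements are (key, value) tuples.
    } finally sc.stop()
  }
}
```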

Hope this answer helps you!

0 votes
by (33.1k points)

Let's understand the difference between these two:

mapValues is only applicable to PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical:

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)


The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result's partitioner becomes None) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
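If the new value needs to read the key but the key itself stays unchanged, one possible workaround (not something the answers above cover) is mapPartitions with preservesPartitioning = true. The sketch below assumes you guarantee yourself that the keys are not modified; Spark does not check this. labelWithKey and the sample data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object KeyAwareValues {
  // Builds each new value from key and value, leaving the key as it was.
  // preservesPartitioning = true is a promise from the caller that keys are untouched.
  def labelWithKey(rdd: RDD[(String, Int)]): RDD[(String, String)] =
    rdd.mapPartitions(
      iter => iter.map { case (k, v) => (k, s"$k:$v") },
      preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("key-aware-values"))
    try {
      val partitioned = sc
        .parallelize(Seq(("a", 1), ("b", 2)))
        .partitionBy(new HashPartitioner(4))
      val labelled = labelWithKey(partitioned)
      // The partitioner should carry over, unlike with a plain map.
      println(labelled.partitioner == partitioned.partitioner)
      labelled.collect().foreach(println)
    } finally sc.stop()
  }
}
```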

Hope this will answer your query to some extent.
