map vs mapValues in Spark

Question

4 Answers

Amit Rawat · Answer 1 · 2019-07-18T14:36:12+0000

map() is a transformation operation and is narrow in nature, i.e. no data shuffling takes place between partitions. It takes a function as an input argument, applies it to each element, and returns a new RDD. It is one of the most widely used operations in the Spark RDD API.

If we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in the value (and not the key). In such cases, we can use mapValues() instead of map().

In other words, given f: B => C and rdd: RDD[(A, B)], these two are almost identical (the one difference, partitioning, is covered at the end of this answer):

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is clearer and shorter, so when you just want to transform the values and keep the keys as-is, mapValues is the recommended choice.
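To make the equivalence concrete, here is a minimal sketch that runs both forms on a small pair RDD and compares the collected results. The object name, the sample data, the function f, and the local[2] master are all assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapValues {
  // Runs both variants on the same input and returns their collected output.
  def run(): (Seq[(String, Int)], Seq[(String, Int)]) = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-vs-mapValues")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val f: Int => Int = _ * 10

      // map sees the whole (key, value) tuple and must rebuild it.
      val viaMap = rdd.map { case (k, v) => (k, f(v)) }
      // mapValues sees only the value; the key is carried over untouched.
      val viaMapValues = rdd.mapValues(f)

      (viaMap.collect().toSeq, viaMapValues.collect().toSeq)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = run()
    println(a == b) // both variants yield the same records
  }
}
```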

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply cannot use mapValues because it would only pass the values to your function.
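As a rough illustration of the cases where map is required, the sketch below changes the keys in one call and applies a two-argument function f: (A, B) => C in another. The object name, the sample data, and the particular f are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KeyTransform {
  def run(): (Seq[(String, Int)], Seq[String]) = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("key-transform")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))

      // Changing the key needs map, because mapValues never sees the key.
      val upperKeys = rdd.map { case (k, v) => (k.toUpperCase, v) }

      // A function of both key and value, f: (A, B) => C, also needs map.
      val f: (String, Int) => String = (k, v) => s"$k=$v"
      val combined = rdd.map { case (k, v) => f(k, v) }

      (upperKeys.collect().toSeq, combined.collect().toSeq)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (upper, combined) = run()
    println(upper)
    println(combined)
  }
}
```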

Now, the last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the resulting RDD has no partitioner set) because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
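The partitioning behaviour can be checked directly through the RDD's partitioner field. The following is a small sketch, assuming a HashPartitioner with 4 partitions and made-up sample data; the object name is hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerDemo {
  // Returns whether a partitioner is set on: the input, the map result, the mapValues result.
  def run(): (Boolean, Boolean, Boolean) = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-demo")
    val sc = new SparkContext(conf)
    try {
      val partitioned = sc
        .parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
        .partitionBy(new HashPartitioner(4))

      // The values change, the keys do not, but map cannot know that.
      val viaMap = partitioned.map { case (k, v) => (k, v + 1) }
      val viaMapValues = partitioned.mapValues(_ + 1)

      (partitioned.partitioner.isDefined,
        viaMap.partitioner.isDefined,
        viaMapValues.partitioner.isDefined)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (input, afterMap, afterMapValues) = run()
    println(s"input: $input, after map: $afterMap, after mapValues: $afterMapValues")
  }
}
```

This matters mostly for what comes next: a key-based operation such as reduceByKey or join on the mapValues result can reuse the existing partitioner, whereas the map result typically has to be shuffled again.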

vinita · Answer 2 · 2019-07-31T06:01:12+0000

mapValues is only suitable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], the following two are identical:

1. val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

2. val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.

On the other hand, if you want to modify the keys too (e.g. if you want to apply f: (A, B) => C), you just can't use mapValues because it would only pass the values to your function.

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the resulting RDD has no partitioner set) because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

Hope this answer helps you!

ParasSharma1 · Answer 3 · 2019-08-24T06:33:35+0000

The map function takes a function that transforms each element of a collection:

map(f: T => U)
RDD[T] => RDD[U]

Here T may be a tuple, and we may want to act only on the values, not the keys. mapValues takes a function that maps the values in the input to the values in the output:

mapValues(f: V => W)
RDD[(K, V)] => RDD[(K, W)]
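The two signatures can be seen side by side in a short sketch with explicit type annotations. The object name and the sample strings are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Signatures {
  def run(): (Seq[Int], Seq[(String, Int)]) = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("signatures")
    val sc = new SparkContext(conf)
    try {
      // map(f: T => U): RDD[T] => RDD[U], here T = String and U = Int.
      val words: RDD[String] = sc.parallelize(Seq("spark", "rdd"))
      val lengths: RDD[Int] = words.map(_.length)

      // mapValues(f: V => W): RDD[(K, V)] => RDD[(K, W)],
      // here K = String, V = String and W = Int.
      val pairs: RDD[(String, String)] = sc.parallelize(Seq(("x", "spark"), ("y", "rdd")))
      val pairLengths: RDD[(String, Int)] = pairs.mapValues(_.length)

      (lengths.collect().toSeq, pairLengths.collect().toSeq)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (lengths, pairLengths) = run()
    println(lengths)
    println(pairLengths)
  }
}
```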

Hope this answer helps you!

Anurag · Answer 4 · 2019-12-01T14:02:12+0000

Let's understand the difference between these two:

mapValues is only applicable to PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical:

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the resulting RDD has no partitioner set) because the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

Hope this will answer your query to some extent.
