
How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?

I have tried adding cache() to the map call, but that still doesn't do the trick. My map function actually uploads results to HDFS, so it's not useless, but Spark thinks it is.

1 Answer


To force Spark to execute a transformation, you need to request a result from it. In other words, after applying your transformations, call an action on the resulting RDD. Often a simple count() is sufficient.
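A minimal PySpark-style sketch of this, assuming you already have an RDD. The function upload_record is a hypothetical placeholder for the HDFS upload described in the question, and the two helper names are made up for illustration. The only RDD methods used are map, count, and foreach.

```python
def upload_record(record):
    """Hypothetical placeholder for the HDFS upload; swap in real client code."""
    # Returning the record keeps the function usable inside map().
    return record


def run_map_with_count(rdd, fn=upload_record):
    # map() alone only records the transformation. count() is an action,
    # so Spark has to evaluate fn on every element to produce the number.
    return rdd.map(fn).count()


def run_with_foreach(rdd, fn=upload_record):
    # foreach() is itself an action intended for side effects, so neither
    # map() nor a dummy count() is needed when the results are not used.
    rdd.foreach(fn)
```

With a SparkContext named sc (an assumption here), run_map_with_count(sc.parallelize(range(100))) would run upload_record on every element. Because the map in the question exists only for its side effect, foreach (or foreachPartition, if one connection per partition is preferable) is probably the more natural fit than map followed by count.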

RDDs support two types of operations:

Transformations - These are the operations that create a new dataset from an existing one.

Actions - These are the operations that return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away.

Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through a map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
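The idea of "remembering" transformations can be illustrated with a toy model in plain Python. This is not Spark code and not how Spark is implemented; LazyDataset and square are hypothetical names used only to show that a map does nothing until an action asks for a result.

```python
from functools import reduce as _reduce


class LazyDataset:
    """Toy stand-in for an RDD; not Spark's implementation."""

    def __init__(self, source, steps=()):
        self._source = source        # base data, e.g. the lines of a file
        self._steps = tuple(steps)   # remembered transformations, not yet run

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (fn,))

    def _compute(self):
        # Apply every remembered step to each element, one at a time.
        for item in self._source:
            for fn in self._steps:
                item = fn(item)
            yield item

    def reduce(self, fn):
        # Action: only now are the remembered steps actually applied.
        return _reduce(fn, self._compute())


calls = []


def square(x):
    calls.append(x)  # side effect that shows when the function really runs
    return x * x


squared = LazyDataset([1, 2, 3]).map(square)
print(calls)                               # [] because map alone ran nothing
print(squared.reduce(lambda a, b: a + b))  # 14
print(calls)                               # [1, 2, 3]
```

Calling reduce a second time in this model would run square on every element again, which mirrors the recomputation behaviour of an RDD that has not been persisted.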

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk or replicated across multiple nodes.
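This also explains why adding cache() in the question had no effect: cache() and persist() are lazy themselves and only mark the RDD for storage. A small sketch, assuming an existing RDD (cache_and_materialize is a hypothetical helper name):

```python
def cache_and_materialize(rdd):
    # cache() only marks the RDD for in-memory storage; nothing runs yet.
    cached = rdd.cache()
    # The first action computes the RDD and fills the cache.
    cached.count()
    # Later actions on `cached` can reuse the stored elements.
    return cached
```

To keep data on disk as well, rdd.persist(StorageLevel.MEMORY_AND_DISK) can be used in place of cache(), with StorageLevel imported from pyspark. Either way, an action is still needed before anything is computed.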
