in Big Data Hadoop & Spark by (11.5k points)

I am having trouble finding in the Spark documentation which operations cause a shuffle and which do not. In this list, which ones cause a shuffle and which ones do not?

Map and filter do not. However, I am not sure about the others.

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks])
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)

1 Answer

by (32.5k points)

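It is actually extremely easy to find this out without the documentation. For any of these functions, just create an RDD, apply the function, and call toDebugString on the result; a ShuffledRDD in the printed lineage means the operation shuffles. Take distinct as one example; you can do the rest on your own.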
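As a minimal sketch of that check (not the original poster's exact session), the snippet below assumes Spark is on the classpath and runs in local mode. The object name ShuffleCheck and the helper hasShuffle are hypothetical names introduced for illustration. It prints the lineage of a map and of a distinct so the two can be compared.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ShuffleCheck {
  // Hypothetical helper: true if the printed lineage mentions a ShuffledRDD.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.toDebugString.contains("ShuffledRDD")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are arbitrary.
    val spark = SparkSession.builder()
      .appName("shuffle-check")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
    val mapped = numbers.map(_ * 2)   // narrow transformation
    val deduped = numbers.distinct()  // regroups equal elements across partitions

    // Print both lineages so they can be compared line by line.
    println(mapped.toDebugString)
    println(deduped.toDebugString)

    println(s"map shuffles: ${hasShuffle(mapped)}")
    println(s"distinct shuffles: ${hasShuffle(deduped)}")

    spark.stop()
  }
}
```

In the distinct lineage, one of the lines should name a ShuffledRDD, while the map lineage should not. The exact RDD ids and formatting vary between Spark versions.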

For distinct, the lineage shows a shuffle. It is also important to check this way rather than rely on the docs, because whether a given function needs a shuffle can depend on the situation. For example, join usually requires a shuffle, but if you join two RDDs that branch from the same RDD, Spark can sometimes omit the shuffle.
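The join case can be sketched as follows, under the assumption that both branches keep the parent's partitioner (here via partitionBy followed by mapValues). The object name JoinShuffleCheck and the helper shuffleIds are hypothetical; the helper walks the dependency graph and collects shuffle ids instead of reading toDebugString, so a shared ancestor is counted once.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinShuffleCheck {
  // Hypothetical helper: ids of all distinct shuffle dependencies in the lineage.
  def shuffleIds(rdd: RDD[_]): Set[Int] =
    rdd.dependencies.flatMap {
      case s: ShuffleDependency[_, _, _] => shuffleIds(s.rdd) + s.shuffleId
      case d => shuffleIds(d.rdd)
    }.toSet

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-shuffle-check").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Two branches of one hash-partitioned parent; mapValues keeps the partitioner.
    val parent = sc.parallelize(Seq((1, "a"), (2, "bb")))
      .partitionBy(new HashPartitioner(4))
    val upper = parent.mapValues(_.toUpperCase)
    val lengths = parent.mapValues(_.length)
    val coPartitioned = upper.join(lengths)

    // Two unrelated RDDs with no partitioner.
    val left = sc.parallelize(Seq((1, "a"), (2, "bb")))
    val right = sc.parallelize(Seq((1, 10), (2, 20)))
    val plain = left.join(right)

    // Only the partitionBy shuffle is expected in the first lineage;
    // the join itself should add none.
    println(s"co-partitioned join: ${shuffleIds(coPartitioned).size} shuffle(s)")
    // Here the join is expected to shuffle both inputs.
    println(s"plain join: ${shuffleIds(plain).size} shuffle(s)")

    spark.stop()
  }
}
```

Using map instead of mapValues on either branch would drop the partitioner, and the join would then be expected to shuffle that side again.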

Generally, the operations given below might cause a shuffle:

cogroup

groupWith

join: hash partition

leftOuterJoin: hash partition

rightOuterJoin: hash partition

groupByKey: hash partition

reduceByKey: hash partition

combineByKey: hash partition

sortByKey: range partition

distinct

intersection: hash partition

repartition

coalesce
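The "hash partition" and "range partition" notes in this list, and the "might" for coalesce, can be checked with a short sketch like the one below. It assumes a local Spark session; the object name PartitionerCheck is hypothetical. It inspects the partitioner that reduceByKey and sortByKey attach to their results, and compares coalesce with repartition.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}
import org.apache.spark.sql.SparkSession

object PartitionerCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioner-check").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 3)), 4)

    // reduceByKey is expected to hash-partition its output by key.
    val reduced = pairs.reduceByKey(_ + _)
    println(reduced.partitioner.exists(_.isInstanceOf[HashPartitioner]))

    // sortByKey is expected to range-partition, so each partition holds
    // a contiguous range of keys.
    val sorted = pairs.sortByKey()
    println(sorted.partitioner.exists(_.isInstanceOf[RangePartitioner[_, _]]))

    // coalesce merges partitions without a shuffle unless shuffle = true is
    // passed; repartition(n) is the shuffling form.
    val merged = pairs.coalesce(2)
    val rebalanced = pairs.repartition(2)
    println(merged.toDebugString.contains("ShuffledRDD"))
    println(rebalanced.toDebugString.contains("ShuffledRDD"))

    spark.stop()
  }
}
```

So coalesce belongs on the list only conditionally: with its default arguments it narrows the partition count in place, and it shuffles when called with shuffle = true, which is what repartition does.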
