Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I am joining two big datasets using Spark RDD. One dataset is very much skewed so few of the executor tasks taking a long time to finish the job. How can I solve this scenario?

1 Answer

0 votes
by (32.3k points)

Let’s assume that you have to join two tables abc and pqr on abc.id=pqr.id. And table abc has skew on id=1.

i.e. select abc.id from abc join pqr on abc.id = pqr.id

Now, to solve the skew join issue in such cases just break your query/dataset into 2 parts - one containing only skew and the other containing non skewed data. In the above example. query will become -

 1. select abc.id from abc join pqr on abc.id = pqr.id where abc.id <> 1;

 2. select abc.id from abc join pqr on abc.id = pqr.id where abc.id = 1 and pqr.id = 1;

Now, the first query won’t have any skew, so all the tasks of ResultStage will finish at roughly the same time.

If we assume that pqr has only few rows with pqr.id = 1, then it will fit into memory. So, the second query will convert into a broadcast join. This is also called Map-side join in Hive.

Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

The partial results of the two queries can then be merged to get the final results.

Also, I would suggest you to visit this article: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Related questions

0 votes
1 answer
asked Jul 25, 2019 in Data Science by Aarav (11.4k points)
0 votes
1 answer
asked Dec 16, 2020 in Python by laddulakshana (16.4k points)
0 votes
1 answer

Browse Categories

...