
I am joining two big datasets using Spark RDDs. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish. How can I solve this?

1 Answer


Let’s assume that you have to join two tables, abc and pqr, on abc.id = pqr.id, and that table abc is skewed on id = 1.

i.e. select * from abc join pqr on abc.id = pqr.id

To solve the skew join issue in such cases, break your query/dataset into two parts: one containing only the skewed key and the other containing the non-skewed data. In the example above, the query becomes:

 1. select * from abc join pqr on abc.id = pqr.id where abc.id <> 1;

 2. select * from abc join pqr on abc.id = pqr.id where abc.id = 1 and pqr.id = 1;
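To make the split concrete, here is a minimal sketch in plain Python, using lists of (id, value) pairs as a stand-in for the two datasets. The sample rows, the `SKEWED_ID` constant, and the `split_on_key` helper are all hypothetical, and the hot key is assumed to be known in advance. On real RDDs the same split would typically be done with `filter`.

```python
# Toy rows as (id, value) pairs; id=1 is the hot key in abc (illustrative data).
abc = [(1, "a1"), (1, "a2"), (1, "a3"), (1, "a4"), (2, "a5"), (3, "a6")]
pqr = [(1, "p1"), (2, "p2"), (3, "p3"), (4, "p4")]

SKEWED_ID = 1  # hypothetical: the hot key, assumed to be known in advance


def split_on_key(rows, key):
    """Return (rows whose id differs from key, rows whose id equals key)."""
    rest = [row for row in rows if row[0] != key]
    hot = [row for row in rows if row[0] == key]
    return rest, hot


abc_rest, abc_hot = split_on_key(abc, SKEWED_ID)
pqr_rest, pqr_hot = split_on_key(pqr, SKEWED_ID)

print(len(abc_rest), len(abc_hot))  # 2 4
print(len(pqr_rest), len(pqr_hot))  # 3 1
```

Most of abc lands in the hot part, while the matching slice of pqr is a single row, which is what makes the second query cheap to handle separately.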

Now, the first query won’t have any skew, so all the tasks of ResultStage will finish at roughly the same time.

If we assume that pqr has only a few rows with pqr.id = 1, those rows will fit into memory, so the second query will be converted into a broadcast join. This is also called a map-side join in Hive.
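The idea behind a broadcast (map-side) join can be sketched in plain Python: the small side is loaded into an in-memory lookup table, and the large side is streamed past it without any shuffle. The `map_side_join` helper and the sample rows are hypothetical. In Spark, the lookup table would usually be shipped to the executors as a broadcast variable (for example via `SparkContext.broadcast`).

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows):
    """Inner-join (id, value) rows on id, hashing only the small side."""
    # Build the in-memory lookup from the small side: id -> list of values.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    # Stream the big side once; each row is matched locally against the lookup.
    return [
        (key, (big_value, small_value))
        for key, big_value in big_rows
        for small_value in lookup.get(key, [])
    ]


# Illustrative data: the skewed slice of abc and the matching rows of pqr.
abc_hot = [(1, "a1"), (1, "a2"), (1, "a3")]
pqr_hot = [(1, "p1")]

joined = map_side_join(abc_hot, pqr_hot)
print(joined)  # [(1, ('a1', 'p1')), (1, ('a2', 'p1')), (1, ('a3', 'p1'))]
```

Because every row of the big side is matched where it already lives, no single task has to collect all the rows for id = 1.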


The partial results of the two queries can then be merged to get the final results.
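As a sanity check on the merge step, the following plain-Python sketch joins the two parts separately, concatenates them, and confirms the result matches a direct join. The `inner_join` helper, the `SKEWED_ID` constant, and the sample rows are hypothetical. With RDDs, the concatenation would typically be a `union` of the two partial results.

```python
def inner_join(left, right):
    """Naive inner join of (id, value) rows on id."""
    return [
        (left_key, (left_value, right_value))
        for left_key, left_value in left
        for right_key, right_value in right
        if left_key == right_key
    ]


# Illustrative data; id=1 is the skewed key.
abc = [(1, "a1"), (1, "a2"), (1, "a3"), (2, "a4"), (3, "a5")]
pqr = [(1, "p1"), (2, "p2"), (4, "p3")]
SKEWED_ID = 1  # hypothetical: assumed to be known in advance

# Query 1: everything except the skewed key.
part1 = inner_join([row for row in abc if row[0] != SKEWED_ID], pqr)

# Query 2: only the skewed key, on both sides.
part2 = inner_join(
    [row for row in abc if row[0] == SKEWED_ID],
    [row for row in pqr if row[0] == SKEWED_ID],
)

# Merge the partial results; row order is not significant.
merged = part1 + part2
assert sorted(merged) == sorted(inner_join(abc, pqr))
print(len(merged))  # 4
```

The two filters are mutually exclusive and together cover every row of abc, so the merged output contains each joined row exactly once.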

