Let’s assume that you have to join two tables abc and pqr on abc.id=pqr.id. And table abc has skew on id=1.
i.e. select abc.id from abc join pqr on abc.id = pqr.id
Now, to solve the skew join issue in such cases just break your query/dataset into 2 parts - one containing only skew and the other containing non skewed data. In the above example. query will become -
1. select abc.id from abc join pqr on abc.id = pqr.id where abc.id <> 1;
2. select abc.id from abc join pqr on abc.id = pqr.id where abc.id = 1 and pqr.id = 1;
Now, the first query won’t have any skew, so all the tasks of ResultStage will finish at roughly the same time.
If we assume that pqr has only few rows with pqr.id = 1, then it will fit into memory. So, the second query will convert into a broadcast join. This is also called Map-side join in Hive.
The partial results of the two queries can then be merged to get the final results.
Also, I would suggest you to visit this article: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/