
Not sure why I'm having a difficult time with this; it seems so simple, considering it's fairly easy to do in R or pandas. I wanted to avoid pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in PySpark.

I have 2 dataframes, df1 and df2. I want to remove from df1 all rows where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or SQL. For example:

df1:


+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:


+------+----------+---------+
|userid|   group  |   pick  |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:


+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

1 Answer


You will get your desired result using a LEFT ANTI JOIN:

df1.join(df2, ["userid", "group"], "leftanti")
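
If it helps to see it end to end, here is a minimal sketch that rebuilds the example frames from the question and runs the anti join (the SparkSession setup is an assumption for a local run):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anti_join_example").getOrCreate()

# Rebuild the example dataframes from the question
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"])

# Keep only df1 rows whose (userid, group) pair has no match in df2
df1.join(df2, ["userid", "group"], "leftanti").show()
# Expected: the rows for userid 567, 580 and 448 (row order may vary)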

The same result can be achieved with a left outer join, filtering out the rows that found a match:

(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
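
Since the question also mentions SQL, the same anti join can be expressed through temporary views; the view names below are just placeholders:

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

result = spark.sql("""
    SELECT df1.*
    FROM df1
    LEFT ANTI JOIN df2
      ON df1.userid = df2.userid AND df1.`group` = df2.`group`
""")

Whichever form you use, note that the left anti join returns only df1's columns, while the left outer join variant brings in pick from df2 and therefore needs the extra where/drop steps.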
