Not sure why I'm having such a hard time with this; it seems simple enough, given that it's fairly easy to do in R or pandas. I wanted to avoid pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in PySpark.
I have two dataframes, df1 and df2. I want to filter df1, removing every row where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure whether I should use filter(), join(), or SQL. For example:
df1:
+------+-----+------------------+
|userid|group|         all_picks|
+------+-----+------------------+
|   348|    2| [225, 2235, 2225]|
|   567|    1|      [1110, 1150]|
|   595|    1|[1150, 1150, 1150]|
|   580|    2|      [2240, 2225]|
|   448|    1|            [1130]|
+------+-----+------------------+
df2:
+------+-----+----+
|userid|group|pick|
+------+-----+----+
|   348|    2|2270|
|   595|    1|2125|
+------+-----+----+
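For reference, here's a minimal sketch to reproduce these dataframes (assuming an existing SparkSession; the setup here is just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the two example dataframes from the tables above
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"],
)

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"],
)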
Result I want:
+------+-----+------------------+
|userid|group|         all_picks|
+------+-----+------------------+
|   567|    1|      [1110, 1150]|
|   580|    2|      [2240, 2225]|
|   448|    1|            [1130]|
+------+-----+------------------+
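For what it's worth, the join() route I was leaning toward is a "left anti" join, which, as I understand it, keeps only the rows of the left dataframe that have no match in the right one on the join keys. A sketch of that idea (I'm not sure whether it's the right or most efficient approach):

# Keep only rows of df1 whose (userid, group) pair does NOT appear in df2
result = df1.join(df2, on=["userid", "group"], how="left_anti")
result.show()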