
I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:

df = df.withColumn('new_column',
    IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)

I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.

1 Answer


There are a few concise ways to implement this. Start with the imports:

from pyspark.sql.functions import col, expr, when

You can use the Hive-style IF function inside expr:

new_column_1 = expr(
    """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)

or when + otherwise:

new_column_2 = when(
    col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)

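Finally, you can use the following trick. Comparing with a null yields null, so casting the comparison to int gives 1, 0, or null, and coalesce replaces the null with 3: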

from pyspark.sql.functions import coalesce, lit

new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
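
To see why the coalesce trick works, here is a minimal plain-Python sketch of the null handling involved. It does not use Spark at all; sql_equals, cast_int and coalesce below are hypothetical stand-ins written for illustration, not PySpark functions, and they only approximate SQL semantics for this specific case.

```python
# Plain-Python model of the null handling behind the coalesce trick.
# No Spark required; the helper names are hypothetical, not PySpark APIs.

def sql_equals(a, b):
    """Mimic SQL `=`: the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b

def cast_int(value):
    """Mimic casting a boolean to int: True -> 1, False -> 0, NULL stays NULL."""
    return None if value is None else int(value)

def coalesce(*values):
    """Return the first non-NULL argument, like SQL COALESCE."""
    for v in values:
        if v is not None:
            return v
    return None

# Same sample rows as in the answer's example data.
rows = [("orange", "apple"), ("kiwi", None), (None, "banana"),
        ("mango", "mango"), (None, None)]

flags = [coalesce(cast_int(sql_equals(f1, f2)), 3) for f1, f2 in rows]
print(flags)  # [0, 3, 3, 1, 3]
```

The null case never needs an explicit check: it falls out of the comparison itself and is mapped to 3 at the end.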

With example data:

df = sc.parallelize([
    ("orange", "apple"), ("kiwi", None), (None, "banana"),
    ("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])

you can use:

(df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3)
    .show())

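and the result will be displayed as:

+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple|           0|           0|           0|
|  kiwi|  null|           3|           3|           3|
|  null|banana|           3|           3|           3|
| mango| mango|           1|           1|           1|
|  null|  null|           3|           3|           3|
+------+------+------------+------------+------------+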
