Filter Pyspark dataframe column with None value

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-10T08:17:21+0000

I suggest using Column.isNull / Column.isNotNull:

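from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())     # rows where dt_mvmt is NULL
df.where(col("dt_mvmt").isNotNull())  # rows where dt_mvmt is not NULL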

To simply drop the rows that have NULL in that column, use na.drop with the subset argument:

df.na.drop(subset=["dt_mvmt"])

In SQL, NULL is undefined, so equality-based comparisons with NULL do not work: any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+

sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+

To test a value against NULL, use “IS NULL / IS NOT NULL” rather than = or !=. These are equivalent to the “isNull / isNotNull” method calls.
