Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm trying to filter a PySpark dataframe that has None as a row value:'dt_mvmt').distinct().collect()


and I can filter correctly with an string value:

df[df.dt_mvmt == '2016-03-31']
# some results here

but this fails:

df[df.dt_mvmt == None].count()
df[df.dt_mvmt != None].count()

But there are definitely values on each category. What's going on?

1 Answer

0 votes
by (32.3k points)
edited by

I would suggest you to use Column.isNull / Column.isNotNull:



To simply drop NULL values, use na.drop with subset argument:["dt_mvmt"])

Since, in SQL “NULL” is undefined, the equality based comparisons with NULL will not work. Therefore, any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()

## +-------------+

## |(NULL = NULL)|

## +-------------+

## |         null|

## +-------------+


sqlContext.sql("SELECT NULL != NULL").show()

## +-------------------+

## |(NOT (NULL = NULL))|

## +-------------------+

## |               null|

## +-------------------+

“IS / IS NOT” is the only valid method to compare value with NULL. This method is also equivalent to the  “isNull / isNotNull” method calls.

If you want to know more about PySpark, then do check out this awesome video tutorial:

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

29.3k questions

30.6k answers


104k users

Browse Categories