0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

I'm trying to filter a PySpark dataframe that has None as a row value:

df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
 Row(dt_mvmt=u'2016-03-30'),
 Row(dt_mvmt=u'2016-03-31')]


and I can filter correctly with an string value:

df[df.dt_mvmt == '2016-03-31']
# some results here

but this fails:

df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0

But there are definitely values on each category. What's going on?

1 Answer

0 votes
by (31.4k points)
edited by

I would suggest you to use Column.isNull / Column.isNotNull:

df.where(col("dt_mvmt").isNull())

df.where(col("dt_mvmt").isNotNull())

To simply drop NULL values, use na.drop with subset argument:

df.na.drop(subset=["dt_mvmt"])

Since, in SQL “NULL” is undefined, the equality based comparisons with NULL will not work. Therefore, any attempt to compare it with another value returns NULL:

sqlContext.sql("SELECT NULL = NULL").show()

## +-------------+

## |(NULL = NULL)|

## +-------------+

## |         null|

## +-------------+


 

sqlContext.sql("SELECT NULL != NULL").show()

## +-------------------+

## |(NOT (NULL = NULL))|

## +-------------------+

## |               null|

## +-------------------+

“IS / IS NOT” is the only valid method to compare value with NULL. This method is also equivalent to the  “isNull / isNotNull” method calls.

If you want to know more about PySpark, then do check out this awesome video tutorial:

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...