Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I am trying to get all rows within a dataframe where a columns value is not within a list (so filtering by exclusion).

As an example:

df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')]
,schema=('id','bar'))


I get the data frame:

+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+


I only want to exclude rows where bar is ('a' or 'b').

Using an SQL expression string it would be:

df.filter('bar not in ("a","b")').show()


Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?

P.S.: I am likely to have a list, ['a','b'], of the excluded values that I would like to use.

1 Answer

0 votes
by (32.3k points)

In your case, I think you should use ”~”, as it will provide you with the functionality that you need. Do something like:

 

df.filter(~col('bar').isin(['a','b'])).show()

 

 

 

+---+---+

| id|bar|

+---+---+

|  4| c|

|  5| d|

+---+---+

Related questions

Browse Categories

...