Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

See my attempt below, which results in an error.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()


## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+

## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple

## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()


## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+

1 Answer

0 votes
by (32.3k points)

It is as simple as this:

from pyspark.sql.functions import col, when

def blank_as_null(x):

    return when(col(x) != "", col(x)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))

dfWithEmptyReplaced.show()

## +----+----+

## |col1|col2|

## +----+----+

## | foo|   1|

## |null|   2|

## |null|null|

## +----+----+

dfWithEmptyReplaced.na.drop().show()

## +----+----+

## |col1|col2|

## +----+----+

## | foo|   1|

## +----+----+

In order to fill multiple columns you can use reduce:

to_convert = set([...]) # Some set of columns

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)

or use comprehension:

exprs = [

    blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]

testDF.select(*exprs)

If you want to learn more about Big Data, visit Big Data Tutorial and Big Data Certification by Intellipaat.

Related questions

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

29.3k questions

30.6k answers

501 comments

104k users

Browse Categories

...