Create new DataFrame with empty/null field values

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-18T14:40:35+0000

You can add the column with lit(null):

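import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._ // needed for toDF; already in scope in spark-shell, where the session is named spark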

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))

But this raises one problem: the new column's type is null (NullType), despite the String type ascription on the literal:

scala> dfWithFoobar.printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: null (nullable = true)

A column of this type is also not retained by the csv writer. If a concrete type is a hard requirement, you can cast the column to a specific type (let’s say String), either with a DataType:

import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))

or with a string description of the type:

df.withColumn("foobar", lit(null).cast("string"))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: string (nullable = true)
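
To check the column types in code rather than by reading printSchema output, the three variants above can be compared side by side. This is only a minimal sketch: the object name NullColumnSketch, the local[1] master, and the tuple-based test data are assumptions made for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}
import org.apache.spark.sql.types.{NullType, StringType}

// Hypothetical wrapper object, used only so the sketch runs outside spark-shell.
object NullColumnSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-column-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the Record example, built from tuples for brevity.
    val df = Seq((1, "foo"), (2, "bar")).toDF("foo", "bar")

    // An uncast null literal produces a NullType column.
    val untyped = df.withColumn("foobar", lit(null))
    assert(untyped.schema("foobar").dataType == NullType)

    // Casting gives the column a concrete type; the values stay null.
    val viaCast = df.withColumn("foobar", lit(null).cast(StringType))
    assert(viaCast.schema("foobar").dataType == StringType)

    // A UDF returning Option[String] also yields a string column.
    val getNull = udf(() => None: Option[String])
    val viaUdf = df.withColumn("foobar", getNull())
    assert(viaUdf.schema("foobar").dataType == StringType)

    // Both rows hold null in the new column.
    assert(viaCast.filter($"foobar".isNull).count() == 2)

    spark.stop()
  }
}
```

The cast is usually the simpler of the two typed variants, since it avoids calling a UDF for every row just to produce a constant null.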
