
I am creating a new DataFrame from an existing DataFrame, but I need to add a new column ("field1" in the code below) to this new DF. How do I do so? A working code example would be appreciated.

val edwDf = omniDataFrame
  .withColumn("field1", callUDF((value: String) => None))
    callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df"))))

  .select("field1", "field2")
  .save("odsoutdatafldr", "com.databricks.spark.csv");

1 Answer


It is possible to use lit(null):


import org.apache.spark.sql.functions.{lit, udf}


case class Record(foo: Int, bar: String)

val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF


val dfWithFoobar = df.withColumn("foobar", lit(null: String))

There is one problem with this approach, though: the resulting column has the type null rather than string:


scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

In addition, the csv writer does not retain a column of this type. If a typed column is a hard requirement, you can cast the column to a specific type (say, String), using either a DataType:


import org.apache.spark.sql.types.StringType

df.withColumn("foobar", lit(null).cast(StringType))


or a string description of the type:


df.withColumn("foobar", lit(null).cast("string"))


or use a UDF that returns None for an Option of the desired type:


val getNull = udf(() => None: Option[String]) // Or some other type


df.withColumn("foobar", getNull()).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)
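Applied to the DataFrame in the question, the cast approach could look roughly like the sketch below. It is only an illustration under several assumptions: Spark 2.0 or later (where the CSV writer is built in, so the com.databricks.spark.csv package is not needed), a local SparkSession, made-up sample rows standing in for omniDataFrame, a temporary output directory, and a hypothetical helper named withNullStringColumn that is not part of Spark.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

object AddNullColumn {
  // Hypothetical stand-in for one row of the existing DataFrame.
  case class Record(field2: String, some_field_in_old_df: String)

  // Hypothetical helper: adds an all-null column whose type is string, not null.
  def withNullStringColumn(df: DataFrame, name: String): DataFrame =
    df.withColumn(name, lit(null).cast(StringType))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-null-column")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data in place of the real omniDataFrame.
    val omniDataFrame = Seq(Record("a", "x"), Record("b", "y")).toDF()

    val edwDf = withNullStringColumn(omniDataFrame, "field1")
      .select("field1", "field2")

    // field1 should be reported as a string column rather than a null column.
    edwDf.printSchema()

    // Write to a fresh temporary location; the directory name is an assumption.
    val out = Files.createTempDirectory("odsoutdatafldr").resolve("csv").toString
    edwDf.write.option("header", "true").csv(out)

    spark.stop()
  }
}
```

Casting with lit(null) keeps the expression simple and avoids defining a UDF just to produce an empty column. The UDF variant above remains an option when the null value has to come out of existing UDF-based logic.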
