
I am creating a new DataFrame from an existing DataFrame, but I need to add a new column ("field1" in the code below) to the new DataFrame. How do I do that? A working code sample would be appreciated.

val edwDf = omniDataFrame
  .withColumn("field1", callUDF((value: String) => None))
  .withColumn("field2",
    callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df"))))

edwDf
  .select("field1", "field2")
  .save("odsoutdatafldr", "com.databricks.spark.csv");

1 Answer


It is possible to use lit(null):

import org.apache.spark.sql.functions.{lit, udf}

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))

One problem with this approach is that the type of the new column is null rather than string:

scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

Also, a column of this type is not retained by the csv writer. If you need the column in the output, you can cast it to a specific type (say, String), using either a DataType

import org.apache.spark.sql.types.StringType

df.withColumn("foobar", lit(null).cast(StringType))

or a string description

df.withColumn("foobar", lit(null).cast("string"))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema

root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)
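
Applied to the code in the question, the cast approach might look like the sketch below. It rests on a few assumptions: Spark 2.x or later (for SparkSession and write.csv), a small stand-in for omniDataFrame, and the built-in upper function as a placeholder for devicetypeUDF, whose definition is not shown in the question. The output directory is a temporary one created only for illustration.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, upper}
import org.apache.spark.sql.types.StringType

// In spark-shell a session already exists; getOrCreate simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-column-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the asker's omniDataFrame.
val omniDataFrame = Seq("phone", "tablet").toDF("some_field_in_old_df")

val edwDf = omniDataFrame
  // Typed null column: cast gives it StringType instead of NullType.
  .withColumn("field1", lit(null).cast(StringType))
  // Placeholder for devicetypeUDF, which is not defined in the question.
  .withColumn("field2", upper(col("some_field_in_old_df")))

edwDf.select("field1", "field2").printSchema()

// Assumption: a temporary directory stands in for "odsoutdatafldr".
val outDir = Files.createTempDirectory("odsoutdatafldr").toString
edwDf
  .select("field1", "field2")
  .write
  .mode("overwrite") // the temp directory already exists
  .option("header", "true")
  .csv(outDir)
```

Because field1 is cast to a string, it appears in the schema as a nullable string column and is kept in the selected output, with an empty value in every row.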
