in Big Data Hadoop & Spark

I have a DataFrame (call it DFA) with a huge parseable metadata string in a single column, ColmnA.

I would like to break this column, ColmnA, into multiple columns through a function, ClassXYZ = Func1(ColmnA). This function returns an instance of a class ClassXYZ with multiple variables, and each of these variables now has to be mapped to a new column, such as ColmnA1, ColmnA2, etc.

How would I do such a transformation from one DataFrame to another with these additional columns by calling Func1 just once, without having to repeat it to create all the columns?

1 Answer


As far as I know, there is no direct way to derive multiple columns from a single column of a DataFrame: a UDF can return only a single column at a time. However, this limitation can be worked around in two ways.

1st approach:

Return a column of a complex type. The most general solution is a StructType, but you can also consider ArrayType or MapType.

import org.apache.spark.sql.functions.udf

val df = Seq(
  (1L, 3.0, "x"), (2L, -1.0, "y"), (3L, 0.0, "z")
).toDF("a", "b", "c")

case class Foobar(foo: Double, bar: Double)

val foobarUdf = udf((a: Long, b: Double, c: String) =>
  Foobar(a * b, c.head.toInt * b))

val df1 = df.withColumn("foobar", foobarUdf($"a", $"b", $"c"))

df1.show
// +---+----+---+-------------+
// |  a|   b|  c|       foobar|
// +---+----+---+-------------+
// |  1| 3.0|  x|  [3.0,360.0]|
// |  2|-1.0|  y|[-2.0,-121.0]|
// |  3| 0.0|  z|    [0.0,0.0]|
// +---+----+---+-------------+

df1.printSchema
// root
//  |-- a: long (nullable = false)
//  |-- b: double (nullable = false)
//  |-- c: string (nullable = true)
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: double (nullable = false)
//  |    |-- bar: double (nullable = false)
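To get the "multiple new columns" the question asks for, the nested fields of the struct can then be pulled out into top-level columns with a single select, so the UDF still runs only once per row. A minimal sketch continuing from the df1 built above (df3 is a name I introduce here):

```scala
// Expand the struct returned by the UDF into separate top-level columns;
// referencing $"foobar.foo" yields a column named "foo"
val df3 = df1.select($"a", $"b", $"c", $"foobar.foo", $"foobar.bar")

df3.show
```

The resulting DataFrame has columns a, b, c, foo, bar, with the UDF evaluated once per row when foobar is computed.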

2nd approach:

Switch to RDDs, reshape, and rebuild the DataFrame:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

def foobarFunc(a: Long, b: Double, c: String): Seq[Any] =
  Seq(a * b, c.head.toInt * b)

val schema = StructType(df.schema.fields ++
  Array(StructField("foo", DoubleType), StructField("bar", DoubleType)))

val rows = df.rdd.map(r => Row.fromSeq(
  r.toSeq ++
  foobarFunc(r.getAs[Long]("a"), r.getAs[Double]("b"), r.getAs[String]("c"))))

val df2 = sqlContext.createDataFrame(rows, schema)

df2.show
// +---+----+---+----+------+
// |  a|   b|  c| foo|   bar|
// +---+----+---+----+------+
// |  1| 3.0|  x| 3.0| 360.0|
// |  2|-1.0|  y|-2.0|-121.0|
// |  3| 0.0|  z| 0.0|   0.0|
// +---+----+---+----+------+
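Mapped back to the original question, the first approach would look roughly like this. This is a hedged sketch: ClassXYZ, Func1, DFA, and ColmnA are the asker's names, and the comma-split parsing is a hypothetical stand-in for the real metadata parser:

```scala
import org.apache.spark.sql.functions.udf

// Hypothetical return type: two parsed fields from the metadata string
case class ClassXYZ(colmnA1: String, colmnA2: String)

// Hypothetical parser standing in for the asker's Func1
def func1(s: String): ClassXYZ = {
  val parts = s.split(",", 2)
  ClassXYZ(parts(0), if (parts.length > 1) parts(1) else "")
}

val func1Udf = udf(func1 _)

// The UDF is referenced once, so Func1 runs once per row;
// the struct it returns is then split into the new columns
val dfB = dfA
  .withColumn("parsed", func1Udf($"ColmnA"))
  .select($"*", $"parsed.colmnA1".as("ColmnA1"), $"parsed.colmnA2".as("ColmnA2"))
  .drop("parsed")
```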
