Automatically and Elegantly flatten DataFrame in Spark SQL

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-15T14:20:44+0000

There is no standard way to flatten a Spark SQL table (for example, one read from Parquet) whose columns are nested StructType. You can, however, do it with a recursive function that builds the arguments for your select(...) call by walking through DataFrame.schema.

The recursive function should return an Array[Column]. Every time it hits a field of type StructType, it calls itself on that struct, and the returned Array[Column] is merged into its own result. Any other field becomes a single column referenced by its full dotted path.

Something like:

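import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Returns one Column per leaf field, addressed by its dotted path (e.g. "a.b.c").
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    // Build the full path of this field from the enclosing struct's path.
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      // Nested struct: recurse, passing the path so far as the prefix.
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: select it by its full path.
      case _ => Array(col(colName))
    }
  })
}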

You would then use it like this:

df.select(flattenSchema(df.schema):_*)
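
One thing to watch for: selecting col("address.city") produces an output column named just city, so two structs that share a leaf name end up with duplicate column names. A minimal sketch of a variant that aliases each leaf with its full path is below. The names leafPaths and flattenSchemaAliased, the underscore separator, and the sample schema are my own assumptions for illustration, not part of the answer above. It also assumes field names do not themselves contain dots.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper: collects the dotted path of every leaf field.
def leafPaths(schema: StructType, prefix: Option[String] = None): Array[String] =
  schema.fields.flatMap { f =>
    val path = prefix.map(_ + "." + f.name).getOrElse(f.name)
    f.dataType match {
      case st: StructType => leafPaths(st, Some(path)) // recurse into nested struct
      case _              => Array(path)               // leaf field
    }
  }

// Hypothetical variant: aliases each leaf, e.g. "address.city" becomes "address_city".
def flattenSchemaAliased(schema: StructType): Array[Column] =
  leafPaths(schema).map(p => col(p).alias(p.replace(".", "_")))

// Sample schema, made up for illustration only.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("address", StructType(Seq(
    StructField("city", StringType),
    StructField("zip", StringType)
  )))
))

val paths = leafPaths(schema) // "id", "address.city", "address.zip"
```

It is used the same way as before: df.select(flattenSchemaAliased(df.schema):_*)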
