
Is there an elegant and accepted way to flatten a Spark SQL table (Parquet) whose columns are of nested StructType?

For example, if my schema is:

foo
 |_bar
 |_baz
x
y
z


How do I select it into a flattened tabular form without resorting to manually running:

df.select("foo.bar","foo.baz","x","y","z")


In other words, how do I obtain the result of the above code programmatically, given just a StructType and a DataFrame?

1 Answer


There is no single accepted way to flatten a Spark SQL table (Parquet) with columns of nested StructType, but you can do it with a recursive function that builds the arguments to select(...) by walking through the DataFrame's schema (df.schema).

The recursive function should return an Array[Column]. Whenever it encounters a field of type StructType, it calls itself with that field's name as the prefix and merges the returned Array[Column] into its own result. Any other field becomes a single column.

Something like:

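import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively collect one Column per leaf field, using dotted paths for nested fields.
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    // Full path of this field, e.g. "foo.bar" for bar nested inside foo
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName) // descend into the struct
      case _ => Array(col(colName))                     // leaf: select it directly
    }
  })
}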

You would then use it like this:

df.select(flattenSchema(df.schema):_*)
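
One thing to watch for: selecting col("foo.bar") produces an output column named just bar, so two structs that share a leaf name would yield duplicate column names. The sketch below is one possible variant, not part of the original answer. It assumes you want each leaf aliased with its full path joined by underscores (foo_bar, foo_baz), and it builds the schema from the question on a local SparkSession purely for illustration. The names FlattenExample and flattenWithAliases are made up for this example.

```scala
import org.apache.spark.sql.{Column, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object FlattenExample {
  // Hypothetical variant of flattenSchema: same recursion, but every leaf is
  // aliased with its full path (dots replaced by underscores) to keep names unique.
  def flattenWithAliases(schema: StructType, prefix: Option[String] = None): Array[Column] =
    schema.fields.flatMap { f =>
      val path = prefix.map(_ + "." + f.name).getOrElse(f.name)
      f.dataType match {
        case st: StructType => flattenWithAliases(st, Some(path))
        case _              => Array(col(path).alias(path.replace(".", "_")))
      }
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for this demonstration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-example")
      .getOrCreate()

    // Schema from the question: foo is a struct with bar and baz; x, y, z are top-level.
    val schema = StructType(Seq(
      StructField("foo", StructType(Seq(
        StructField("bar", StringType),
        StructField("baz", StringType)))),
      StructField("x", LongType),
      StructField("y", LongType),
      StructField("z", LongType)))

    val rows = java.util.Arrays.asList(Row(Row("a", "b"), 1L, 2L, 3L))
    val df = spark.createDataFrame(rows, schema)

    // Flattened result has the columns foo_bar, foo_baz, x, y, z.
    val flat = df.select(flattenWithAliases(df.schema): _*)
    flat.printSchema()

    spark.stop()
  }
}
```

Like the original function, this only descends into StructType fields. Arrays and maps are left as they are, and field names that themselves contain a dot would need backtick quoting, which this sketch does not handle.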
