in Big Data Hadoop & Spark by (11.4k points)

I have a dataframe with the following structure:

 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- keyNote: struct (nullable = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- note: string (nullable = true)
 |    |-- details: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)


How can I flatten the structure and create a new dataframe with this schema?

     |-- id: long (nullable = true)
     |-- keyNote: struct (nullable = true)
     |    |-- key: string (nullable = true)
     |    |-- note: string (nullable = true)
     |-- details: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)

1 Answer

by (32.3k points)

I suggest using the function below. It flattens one level of nesting: each top-level struct column is expanded into one column per field, named struct_field (for example foo_a), so it can handle several struct columns whose fields share the same names:

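from pyspark.sql import functions as F

def flatten_df(nested_df):
    # Top-level columns that are not structs are kept unchanged.
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    # Top-level struct columns are expanded one level into their fields.
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    # Alias each field as <struct>_<field> so that fields with the same
    # name in different structs do not clash.
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc+'.'+c).alias(nc+'_'+c)
                                for nc in nested_cols
                                for c in nested_df.select(nc+'.*').columns])
    return flat_df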
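The first two lines of the function rely on DataFrame.dtypes, which returns (column name, type string) pairs in which a struct column's type string starts with "struct". As a rough sketch of that partition step in isolation, the hypothetical helper split_columns below (not part of the answer's function) runs on hand-written pairs that mirror the example schema, so it needs no Spark session:

```python
def split_columns(dtypes):
    """Split (name, type_string) pairs into flat and struct column names."""
    flat = [name for name, dtype in dtypes if not dtype.startswith("struct")]
    nested = [name for name, dtype in dtypes if dtype.startswith("struct")]
    return flat, nested


# Hand-written stand-in for nested_df.dtypes (an assumption for illustration).
example_dtypes = [
    ("x", "string"),
    ("y", "string"),
    ("foo", "struct<a:float,b:float,c:int>"),
    ("bar", "struct<a:float,b:float,c:int>"),
]

print(split_columns(example_dtypes))
# (['x', 'y'], ['foo', 'bar'])
```

Map and array columns land in the flat list, which is why flatten_df leaves them untouched.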

Before:

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)
 |-- foo: struct (nullable = true)
 |    |-- a: float (nullable = true)
 |    |-- b: float (nullable = true)
 |    |-- c: integer (nullable = true)
 |-- bar: struct (nullable = true)
 |    |-- a: float (nullable = true)
 |    |-- b: float (nullable = true)
 |    |-- c: integer (nullable = true)

After:

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)
 |-- foo_a: float (nullable = true)
 |-- foo_b: float (nullable = true)
 |-- foo_c: integer (nullable = true)
 |-- bar_a: float (nullable = true)
 |-- bar_b: float (nullable = true)
 |-- bar_c: integer (nullable = true)
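Note that the function above prefixes every expanded field, so on the dataframe from the question it would yield data_id, data_keyNote and data_details rather than the unprefixed names asked for. If the unprefixed names are wanted, one possible alternative is to select the struct with the ".*" expansion that the function already uses internally. The helper name unwrap_struct is hypothetical, and this is only a sketch that assumes the promoted field names do not collide with other top-level columns:

```python
def unwrap_struct(df, struct_col):
    """Promote the fields of one struct column to top-level columns, unprefixed."""
    # Keep every other top-level column, then expand the struct via "<name>.*".
    others = [c for c in df.columns if c != struct_col]
    return df.select(*others, struct_col + ".*")
```

With the question's dataframe bound to a variable df (an assumed name), unwrap_struct(df, "data").printSchema() should show id, keyNote and details at the top level, with keyNote still a struct and details still a map.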
