Explore Courses Blog Tutorials Interview Questions
0 votes
in Big Data Hadoop & Spark by (11.4k points)

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (that contains a json string). That will return X values, each of which needs to be stored in their own separate column.

That functionality will be implemented in a UDF. However, I am not sure how to return a list of values from that UDF and feed these into individual columns. Below is a simple example:

from pyspark.sql.functions import udf
def udf_test(n):
    return [n/2, n%2]

test_udf=udf(udf_test)'amount','trans_date').withColumn("test", test_udf("amount")).show(4)
That produces the following:

|amount|trans_date|                test|
|  28.0|2016-02-07|         [14.0, 0.0]|
| 31.01|2016-02-07|[15.5050001144409...|
| 13.41|2016-02-04|[6.70499992370605...|
| 307.7|2015-02-17|[153.850006103515...|
| 22.09|2016-02-05|[11.0450000762939...|

only showing top 5 rows
What would be the best way to store the two (in this example) values being returned by the udf on separate columns? Right now they are being typed as strings:'amount','trans_date').withColumn("test", test_udf("amount")).printSchema()

 |-- amount: float (nullable = true)
 |-- trans_date: string (nullable = true)
 |-- test: string (nullable = true)

1 Answer

0 votes
by (32.3k points)

Creating multiple top level columns from a single UDF call, isn't possible but you can create a new struct. For that you will require an UDF with specified returnType.

Here is how I did it:




Now simply use select to flatten the schema:


Related questions

Browse Categories