
Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be df.ix[x, y] = new_value.

1 Answer


While you cannot modify a column in place, you can operate on a column and return a new DataFrame that reflects the change. First create a UserDefinedFunction that implements the operation to apply, then apply that function selectively to the targeted column only.

In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'

# UDF that replaces every value with the string 'new_value'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())

# Apply the UDF only to the targeted column; all other columns pass through unchanged
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column was also of type StringType), but every value in the column target_column will be 'new_value'.
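
If you only need to replace that one column, withColumn is a more concise equivalent, since it overwrites an existing column when the name matches. A minimal sketch, assuming the same old_df and the hypothetical column name target_column from above:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same replacement logic as before, expressed as a named UDF
set_new_value = udf(lambda x: 'new_value', StringType())

# withColumn replaces 'target_column' because a column of that name already exists
new_df = old_df.withColumn('target_column', set_new_value(old_df['target_column']))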
