
Looking at the new spark dataframe api, it is unclear whether it is possible to modify dataframe columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be df.at[x, y] = new_value (older pandas used df.ix[x, y] = new_value, but .ix has since been removed).

1 Answer


While you cannot modify a column in place, you can operate on a column and return a new DataFrame reflecting that change. First create a UserDefinedFunction that implements the operation to apply, then apply that function selectively to the target column only.

In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())

# Apply the UDF only to the target column; pass every other column through unchanged
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column is of type StringType as well), but every value in column target_column will be 'new_value'.
