0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln

Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln

1 Answer

0 votes
by (31.4k points)

For Spark 1.5 and later, I would suggest you to use the functions package and do something like this:

from pyspark.sql.functions import *

newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Crisp explanation:

  • The function withColumn is called to add (or replace, if the name exists) a column to the data frame.

  • The function regexp_replace will generate a new column by replacing all substrings that match the pattern.

If you wish to learn Pyspark visit this Pyspark Tutorial.
Welcome to Intellipaat Community. Get your technical queries answered by top developers !