in Big Data Hadoop & Spark by (11.4k points)

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example, this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln


Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln

1 Answer

by (32.3k points)

For Spark 1.5 and later, I would suggest using regexp_replace from the pyspark.sql.functions module, like this:

from pyspark.sql.functions import regexp_replace

newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Brief explanation:

  • withColumn returns a new DataFrame with the given column added (or replaced, if a column with that name already exists). The original df is not modified.

  • regexp_replace builds a new column in which every substring matching the regular-expression pattern is replaced with the given string.
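One thing to watch for: the pattern 'lane' matches anywhere in the string, so it would also rewrite the "lane" inside a longer word such as "planet". A minimal sketch of one way around that, assuming whole-word matching with \b is what you want. The helper names (word_pattern, normalize_column, normalize_text) and the "street" entry are hypothetical and not part of the original question:

```python
import re

# Hypothetical suffix table; only "lane" -> "ln" comes from the question.
SUFFIXES = {"lane": "ln", "street": "st"}


def word_pattern(word):
    # \b is a word boundary in both Python's re and Java regex (used by Spark).
    # Assumes `word` is plain letters with no regex metacharacters.
    return r"\b" + word + r"\b"


def normalize_column(df, column, suffixes=SUFFIXES):
    # Imported here so the helpers can be tried without Spark installed.
    from pyspark.sql.functions import regexp_replace

    # Chain one regexp_replace per suffix over the same column.
    for word, abbr in suffixes.items():
        df = df.withColumn(column, regexp_replace(column, word_pattern(word), abbr))
    return df


def normalize_text(text, suffixes=SUFFIXES):
    # Plain-Python equivalent for quickly checking the patterns on one string.
    for word, abbr in suffixes.items():
        text = re.sub(word_pattern(word), abbr, text)
    return text
```

With the dataframe from the question, newDf = normalize_column(df, 'address') should give the same result as the one-liner above, while leaving words that merely contain "lane" alone. normalize_text is only there to sanity-check the patterns locally before running them on a cluster.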
