in Big Data Hadoop & Spark by (11.4k points)

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example, this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln


Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln

1 Answer

by (32.3k points)

For Spark 1.5 and later, I would suggest using the functions package and doing something like this:

from pyspark.sql.functions import regexp_replace

newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Explanation:

  • withColumn adds a column to the DataFrame, or replaces it if a column with that name already exists.

  • regexp_replace generates the new column by replacing every substring that matches the regular-expression pattern. Note that 'lane' is interpreted as a regex and matches anywhere in the string, so it would also rewrite words that merely contain "lane"; a pattern like r'\blane\b' restricts the replacement to the whole word.
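Spark's regexp_replace uses Java's regex engine, which supports the same \b word boundary as Python's re module. A minimal sketch in plain Python (with hypothetical addresses, not from this thread) shows why the anchored pattern is safer:

```python
import re

addresses = ["2 foo lane", "10 bar lane", "24 pants ln", "7 laneway drive"]

# Bare substring pattern: also rewrites "laneway" -> "lnway"
naive = [re.sub("lane", "ln", a) for a in addresses]

# Word-boundary pattern: only the whole word "lane" is replaced
safe = [re.sub(r"\blane\b", "ln", a) for a in addresses]

print(naive)  # ['2 foo ln', '10 bar ln', '24 pants ln', '7 lnway drive']
print(safe)   # ['2 foo ln', '10 bar ln', '24 pants ln', '7 laneway drive']
```

The same r'\blane\b' pattern can be passed directly to regexp_replace in the Spark snippet above.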

