
I've searched for an answer to the following question but haven't found one yet. I have a large dataset like this small example:

df =

A  B

1  I bought 3 apples in 2013

3  I went to the store in 2020 and got milk

1  In 2015 and 2019 I went on holiday to Spain

2  When I was 17, in 2014 I got a new car

3  I got my present in 2018 and it broke down in 2019

What I would like is to extract all the numbers greater than 1950 (the years) from column B, join them with an underscore, and have this as an end result:

A  B                                                    C

1  I bought 3 apples in 2013                            2013

3  I went to the store in 2020 and got milk             2020

1  In 2015 and 2019 I went on holiday to Spain          2015_2019

2  When I was 17, in 2014 I got a new car               2014

3  I got my present in 2018 and it broke down in 2019   2018_2019

I tried to extract values first, but didn't get further than:

df["C"] = df["B"].str.extract('(\d+)').astype(int)

df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())

But all I get are error messages (I only started with Python and text processing a few weeks ago). Could someone help me?

1 Answer

0 votes
by (25.1k points)

Use this regex '\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})', which matches 1951-1999 or any four digits starting with 2-9, with a lambda in Series.apply and join the matches with '_' like this:

import re

pattern = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')

# findall returns every captured year in the row; join them with '_'
df['C'] = df['B'].apply(lambda x: '_'.join(pattern.findall(x)))
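
To check the approach on the question's sample rows, here is a minimal sketch. The helper names extract_years and numbers_above are made up for illustration, and the pandas call is only shown in a comment so the snippet runs without any third-party package. The second helper is an alternative I am assuming could fit the "> 1950" wording more literally: it pulls out every number and filters by value instead of encoding the range in the regex.

```python
import re

# Regex from the answer: 1951-1999, or any four digits starting with 2-9.
YEAR_PATTERN = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')


def extract_years(text):
    """Join every regex match in `text` with '_' (hypothetical helper)."""
    return '_'.join(YEAR_PATTERN.findall(text))


def numbers_above(text, threshold=1950):
    """Alternative: take every run of digits and keep those > threshold."""
    return '_'.join(n for n in re.findall(r'\d+', text) if int(n) > threshold)


# Sample sentences copied from column B in the question.
rows = [
    "I bought 3 apples in 2013",
    "I went to the store in 2020 and got milk",
    "In 2015 and 2019 I went on holiday to Spain",
    "When I was 17, in 2014 I got a new car",
    "I got my present in 2018 and it broke down in 2019",
]

by_regex = [extract_years(row) for row in rows]
by_value = [numbers_above(row) for row in rows]

# With pandas, the same helpers would be applied per row, e.g.:
#   df['C'] = df['B'].apply(extract_years)
```

The regex version has no trailing word boundary, so a longer number such as 20201 would yield 2020; the numeric filter would instead keep 20201 as a whole. Which behaviour is wanted depends on the real data.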
