
I want to use a regex to extract the substrings that meet the following criteria:

  • The first 4 characters are digits, and the substring ends with a digit or a letter
  • It is 15 or 18 characters long
  • If two substrings meet the criteria, return only the first one

import pandas as pd

df1 = pd.DataFrame(data={
    "Messy_IDS": ["Looking for ID : 7010M000002N8c5T7A",
                  "5634M000002N8c5T7A,7010M000002N8c5T7A",
                  "https://website.com/12340000000f5F5"],
    "Desired_Output": ["7010M000002N8c5T7A",
                       "5634M000002N8c5T7A",
                       "12340000000f5F5"],
})

df1

                               Messy_IDS      Desired_Output
0    Looking for ID : 7010M000002N8c5T7A  7010M000002N8c5T7A
1  5634M000002N8c5T7A,7010M000002N8c5T7A  5634M000002N8c5T7A
2    https://website.com/12340000000f5F5     12340000000f5F5

1 Answer


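Use Series.str.extract with a regex that matches 4 digits followed by 11 to 14 more digits or letters. It returns only the first match in each string, which covers the "return the first one" requirement. Note that {11,14} is a range, so runs of 16 or 17 characters are accepted as well as 15 or 18:

df1['new'] = df1['Messy_IDS'].str.extract(r'([0-9]{4}[0-9A-Za-z]{11,14})')

Or, with the shorthand classes (here \w also matches the underscore, so this version is slightly more permissive):

df1['new'] = df1['Messy_IDS'].str.extract(r'(\d{4}\w{11,14})')

print(df1)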

                               Messy_IDS      Desired_Output                 new
0    Looking for ID : 7010M000002N8c5T7A  7010M000002N8c5T7A  7010M000002N8c5T7A
1  5634M000002N8c5T7A,7010M000002N8c5T7A  5634M000002N8c5T7A  5634M000002N8c5T7A
2    https://website.com/12340000000f5F5     12340000000f5F5     12340000000f5F5
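The patterns in this answer accept any length from 15 to 18, while the question asks for exactly 15 or 18 characters. If that distinction matters for your data, a stricter pattern is one option. The sketch below uses the standard re module rather than pandas so it runs on its own. The names LOOSE_ID, STRICT_ID and first_id are made up for this illustration, and the choice to reject IDs embedded in a longer alphanumeric run is an assumption about what the question intends.

```python
import re

# Loose pattern from the answer: {11,14} is a range, so 15 to 18 characters match.
LOOSE_ID = r"(\d{4}[0-9A-Za-z]{11,14})"

# Stricter sketch: exactly 15 or 18 characters, not embedded in a longer
# alphanumeric run (assumption). The 14-character alternative is tried first
# so that an 18-character ID is not cut down to 15.
STRICT_ID = (
    r"(?<![0-9A-Za-z])"                            # no letter/digit just before
    r"(\d{4}(?:[0-9A-Za-z]{14}|[0-9A-Za-z]{11}))"  # 4 digits + 14 or 11 more
    r"(?![0-9A-Za-z])"                             # no letter/digit just after
)


def first_id(text, pattern=STRICT_ID):
    """Return the first match of `pattern` in `text`, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# The three strings from the question.
samples = [
    "Looking for ID : 7010M000002N8c5T7A",
    "5634M000002N8c5T7A,7010M000002N8c5T7A",
    "https://website.com/12340000000f5F5",
]
results = [first_id(s) for s in samples]
print(results)

# A 16-character run: accepted by the loose pattern, rejected by the strict one.
sixteen = "1234" + "A" * 12
loose_hit = first_id(sixteen, LOOSE_ID)
strict_hit = first_id(sixteen)
print(loose_hit, strict_hit)
```

Series.str.extract is documented as taking a regular expression with capture groups, so passing STRICT_ID to it in place of the pattern above should work the same way. I have only checked the pattern with re directly, so test it on your own column first.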
