Explore Courses Blog Tutorials Interview Questions
0 votes
in Data Science by (50.2k points)

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unusual / regex characters. The comparison should not involve regex and is case insensitive.

For example:

lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*']

I currently apply the mask like this:

mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])

df = df[mask]

My dataframe is large (~1mio rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.

1 Answer

0 votes
by (108k points)

Here you will need to correctly escape any special characters in the substrings first to guarantee that they are matched and not used as regex metacharacters.

This is easy to do using re.escape:

>>> import re

>>> esc_lst = [re.escape(s) for s in lst]

These escaped substrings can then be combined using a regex pipe |. Each of the substrings can be verified against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst)

The masking stage through the rows becomes a single low-level loop :

df[col].str.contains(pattern, case=False)

For more info, you can check the following link:

Browse Categories