in Data Science by (17.6k points)

I'm struggling with slicing. I thought I understood it, but my ideas don't work in the situation below.

Situation: In one column of my DataFrame, I want to remove from every row a substring that is present in some rows and absent in others.

The problem looks like this:

1. I don't know the exact position where this string starts (it can differ from row to row).

2. The string varies from row to row, but it always starts with the same prefix, say "¯main_".

3. "¯main_" is usually followed by digits; the digits vary, but there are always 9 of them.

4. I have already split the data and have about 40 columns, each with a similar problem. That's why I'm looking for a more efficient approach than splitting again, generating about 40 more columns, and then dropping them.

5. Sometimes the "¯main_" string is followed by additional text that I'd like to keep in the same column.

Example:

Column1

A1-19

B2-52

C3-1245¯main_123456789

D4

Z89028

F7¯main_123456789,Z241

Looking for a result like this:

Column1

A1-19

B2-52

C3-1245

D4

Z89028

F7,Z241

The best solution I have come up with so far:

a = test.find("¯")

b = a+14

df[0].str.slice(start = a, stop = b)

But:

1. It doesn't work properly.

2. I'm aware that test.find() returns -1 when it doesn't find the character, and I don't know how to handle that case. Should I write a loop? I believe a better (more efficient) solution exists, but after a few hours of searching I decided to ask for help.
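
For reference, here is a minimal plain-Python sketch of why the -1 matters. The helper name `cut` and the width of 15 (the 6 characters of "¯main_" plus 9 digits) are assumptions made for illustration; the sample strings are taken from the example above.

```python
MARKER = "¯main_"

with_marker = "C3-1245¯main_123456789"
without_marker = "A1-19"

# str.find returns the index of the first match, or -1 when there is none.
pos_found = with_marker.find(MARKER)
pos_missing = without_marker.find(MARKER)


def cut(text, start, width=15):
    """Drop `width` characters starting at `start`, with no guard against -1."""
    return text[:start] + text[start + width:]


cut_found = cut(with_marker, pos_found)
# With start == -1 the slice text[:-1] silently drops the last character.
cut_missing = cut(without_marker, pos_missing)
```

So an unguarded slice damages rows that do not contain the marker, which is why the -1 case has to be checked before slicing.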

1 Answer

by (41.4k points)

Follow these steps:

1. Loop over all columns.

2. Split each value at the position of "¯".

3. Append the text before and after the removed substring to a helper list.

4. Finally, assign the list back to the column.

print(df)
                   Column1
0                      NaN
1                    B2-52
2  C3-1245 ¯main_123456789
3                       D4
4                   Z89028
5  F7 ¯main_123456789,Z241

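for c in df.columns:
    out = []
    for x in df[c]:
        # NaN is the only value not equal to itself, so this skips missing values
        if x == x:
            p = x.find('¯')
            if p != -1:
                # '¯main_' (6 characters) + 9 digits = 15 characters to drop
                out.append(x[:p] + x[p+15:])
            else:
                out.append(x)
        else:
            out.append(x)
    df[c] = out

print(df)
    Column1
0       NaN
1     B2-52
2  C3-1245 
3        D4
4    Z89028
5  F7 ,Z241

The space that precedes "¯" in this sample data is left in place.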
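
As an alternative to tracking positions, a regular expression can describe the substring directly. The sketch below is not part of the original answer. It assumes the substring is always "¯main_" followed by exactly 9 digits, as described in the question, and the names `MARKER_RE` and `strip_marker` are made up for illustration.

```python
import re

# Assumed pattern: the marker "¯main_" followed by exactly nine digits.
MARKER_RE = re.compile(r"¯main_\d{9}")


def strip_marker(value):
    """Remove the marker from a string; pass non-strings (e.g. NaN) through."""
    if isinstance(value, str):
        return MARKER_RE.sub("", value)
    return value


# Sample values from the question, plus a missing value.
samples = [
    float("nan"),
    "B2-52",
    "C3-1245¯main_123456789",
    "D4",
    "Z89028",
    "F7¯main_123456789,Z241",
]
cleaned = [strip_marker(v) for v in samples]
```

With pandas, the same function could probably be applied per column with df[c].map(strip_marker), or the pattern could be passed straight to df[c].str.replace(r"¯main_\d{9}", "", regex=True). Both avoid handling the -1 case by hand, because rows without a match are simply left unchanged.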
