I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index  groups
0      ['a','b','c']
1      ['c']
2      ['b','c','e']
3      ['a','c']
4      ['b','e']

What I would like to do is create a set of dummy columns that identify which groups each user belongs to, so that I can run some analyses:

index  a  b  c  d  e
0      1  1  1  0  0
1      0  0  1  0  0
2      0  1  1  0  1
3      1  0  1  0  0
4      0  1  0  0  1

pd.get_dummies(df['groups'])

won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows. Any advice would be appreciated!

1 Answer


For a large dataframe, you can use sklearn.preprocessing.MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data: each row holds the list of groups one user belongs to
df = pd.DataFrame({
    'groups': [
        ['a', 'b', 'c'],
        ['c'],
        ['b', 'c', 'e'],
        ['a', 'c'],
        ['b', 'e'],
    ]
})

s = df['groups']

# Learn the distinct labels, then build one 0/1 column per label
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)

Result:

   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1
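
The result above has no column for d, because no user belongs to that group, while the desired output in the question does list one. If the full set of group names is known in advance, one option is to pass it through the classes parameter of MultiLabelBinarizer. The following is a minimal sketch under that assumption; the list all_groups and the names dummies and result are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']]
})

# Assumption: the complete list of possible groups is known up front
all_groups = ['a', 'b', 'c', 'd', 'e']

# Passing classes fixes the column set and order, so unused groups
# such as 'd' still get a column (filled with zeros)
mlb = MultiLabelBinarizer(classes=all_groups)
dummies = pd.DataFrame(
    mlb.fit_transform(df['groups']),
    columns=mlb.classes_,
    index=df.index,
)

# Attach the dummy columns to the original dataframe
result = df.join(dummies)
```

Here result keeps the original groups column alongside the five indicator columns, which may be convenient for later analysis.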

For more information, refer to the scikit-learn documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
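
Since the question mentions 500,000+ rows, memory use may become a concern when there are many distinct groups. MultiLabelBinarizer accepts sparse_output=True, and pandas can wrap the resulting SciPy sparse matrix with DataFrame.sparse.from_spmatrix. The sketch below shows that combination on the sample data; whether it actually helps depends on how many groups there are and how sparse the memberships are, so treat it as something to measure rather than a guaranteed improvement.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']]
})

# sparse_output=True makes fit_transform return a SciPy sparse matrix
# instead of a dense NumPy array
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_matrix = mlb.fit_transform(df['groups'])

# Wrap the sparse matrix in a dataframe backed by sparse columns
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=df.index,
    columns=mlb.classes_,
)
```

If a later step needs ordinary dense columns, sparse_df.sparse.to_dense() converts the dataframe back.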
