
I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.

get_dummies looks at the data available in each dataframe to find out how many categories there are, and creates the corresponding number of dummy variables. However, in the problem I'm working on right now, I know in advance what the possible categories are, and not all of them necessarily appear in each individual dataframe.
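
As a minimal sketch of that behaviour (the frames `df1` and `df2` are hypothetical, made up for illustration), each call only produces columns for the values it actually sees, so the column sets differ from frame to frame:

```python
import pandas as pd

# Two hypothetical frames drawn from the same known category set {'a', 'b', 'c'}.
df1 = pd.DataFrame({'cat': ['a', 'b', 'a']})
df2 = pd.DataFrame({'cat': ['c', 'c', 'b']})

# get_dummies infers the categories from each frame separately.
cols1 = list(pd.get_dummies(df1, prefix='cat').columns)
cols2 = list(pd.get_dummies(df2, prefix='cat').columns)

print(cols1)  # ['cat_a', 'cat_b']
print(cols2)  # ['cat_b', 'cat_c']
```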

My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?

Something that would make this:

    categories = ['a', 'b', 'c']

       cat
    1    a
    2    b
    3    a

Become this:

       cat_a  cat_b  cat_c
    1      1      0      0
    2      0      1      0
    3      1      0      0

1 Answer


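Call get_dummies, then use the transpose and reindex methods in pandas: reindexing against the full list of categories adds the ones that are missing from the dataframe, and fillna(0) turns them into columns of 0s.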

For example:

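    import pandas as pd

    # Full list of categories, known in advance
    cats = ['a', 'b', 'c']
    df = pd.DataFrame({'cat': ['a', 'b', 'a']})

    # Empty prefix so the dummy columns are named 'a', 'b' and match cats;
    # dtype=float keeps the result numeric where get_dummies defaults to bool
    dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)

    # Transpose, reindex the rows against cats (missing ones become NaN),
    # transpose back, then replace NaN with 0
    dummies = dummies.T.reindex(cats).T.fillna(0)
    print(dummies)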

Output:

         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  1.0  0.0  0.0
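
The double transpose is not strictly required. As a hedged sketch of two alternatives (not part of the original answer; the variable names `by_reindex` and `by_categorical` are only illustrative), you can either reindex the columns directly with a fill value, or declare the full category set on the column so that get_dummies emits a column for every category, observed or not. Both variants below also keep the `cat_` prefix asked for in the question:

```python
import pandas as pd

cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})

# Option 1: reindex the columns directly; fill_value=0 fills the
# columns for categories that did not occur in this frame.
by_reindex = (
    pd.get_dummies(df['cat'], prefix='cat', dtype=int)
    .reindex(columns=[f'cat_{c}' for c in cats], fill_value=0)
)

# Option 2: attach the full category list to the column's dtype, so
# get_dummies creates a column for every category, seen or unseen.
as_categorical = df['cat'].astype(pd.CategoricalDtype(categories=cats))
by_categorical = pd.get_dummies(as_categorical, prefix='cat', dtype=int)

print(by_reindex)
print(by_categorical)
```

Option 2 may be the more convenient choice for a set of dataframes, since the category list travels with the column and every frame then yields the same columns in the same order.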

Hope this answer helps.
