in Machine Learning by (19k points)

pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step?

Example: This data frame contains columns with string values such as "type 2" which I would like to convert to numerical values - and possibly translate them back later.


1 Answer


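You can factorize each column separately by passing pd.factorize to DataFrame.apply. Each column then gets its own codes, numbered from 0 in order of first appearance.

For example: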

import pandas as pd

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print (df)
 

Output:

       A      B      C

0  type1  type1  type1

1  type2  type2  type3

2  type2  type3  type3

print (df.apply(lambda x: pd.factorize(x)[0]))

#Output

  A  B  C

0  0  0  0

1  1  1  1

2  1  2  1
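The question also asks about the reverse step. pd.factorize returns a (codes, uniques) pair, so one way to translate back is to keep the uniques for each column. This is a minimal sketch, assuming the three-column frame above and no missing values; the names encoded, uniques_by_col and decoded are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

encoded = pd.DataFrame(index=df.index)
uniques_by_col = {}  # hypothetical name: column -> array of original values

for col in df.columns:
    codes, uniques = pd.factorize(df[col])
    encoded[col] = codes                       # integer codes, 0-based
    uniques_by_col[col] = uniques.to_numpy()   # position i holds the value for code i

# Reverse step: index each column's uniques with that column's codes.
decoded = pd.DataFrame(
    {col: uniques_by_col[col][encoded[col].to_numpy()] for col in encoded.columns},
    index=encoded.index,
)

print(encoded)
print(decoded)
```

Note that factorize encodes missing values as -1 by default, so a frame with NaN would need that code handled separately before indexing.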

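To share one set of codes across all columns instead, stack the frame into a single Series, rank it, and unstack. Ranking with method='dense' numbers the distinct values from 1 in sorted order and returns floats:

print (df.stack().rank(method='dense').unstack())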

#Output

     A    B    C

0  1.0  1.0  1.0

1  2.0  2.0  3.0

2  2.0  3.0  3.0

To apply this to selected columns only:

df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()

print (df)

#Output

       A    B    C

0  type1  1.0  1.0

1  type2  2.0  3.0

2  type2  3.0  3.0

The same for selected columns using factorize, which gives integer codes starting at 0:

stacked = df[['B','C']].stack()

df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()

print (df)

       A  B  C

0  type1  0  0

1  type2  1  2

2  type2  2  2
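Because the stacked factorize call uses one shared set of codes, a single uniques array is enough to translate both columns back. The following is a sketch under the same assumptions (the original frame of strings, no missing values); lookup, encoded and decoded are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

cols = ['B', 'C']
stacked = df[cols].stack()

# Keep both return values: codes for encoding, uniques for decoding.
codes, uniques = stacked.factorize()
codes_df = pd.Series(codes, index=stacked.index).unstack()

encoded = df.copy()
for col in cols:
    encoded[col] = codes_df[col]

# Reverse step: code -> original value, shared by all encoded columns.
lookup = dict(enumerate(uniques))

decoded = encoded.copy()
for col in cols:
    decoded[col] = encoded[col].map(lookup)

print(encoded)
print(decoded)
```

Working on copies keeps the original df intact, which matters if later steps expect the string values to still be there.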

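To map the codes back to the original values, build a dict from code to value. The steps below assume df is the original frame of strings (re-create it if you ran the assignments above); drop_duplicates keeps one entry per distinct value: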

vals = df.stack().drop_duplicates().values

b = [x for x in df.stack().drop_duplicates().rank(method='dense')]

d1 = dict(zip(b, vals))

print (d1)

{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()

print (df1)

     A    B    C

0  1.0  1.0  1.0

1  2.0  2.0  3.0

2  2.0  3.0  3.0

print (df1.stack().map(d1).unstack())

       A      B      C

0  type1  type1  type1

1  type2  type2  type3

2  type2  type3  type3
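As an alternative to the rank-and-dict approach, pandas' categorical dtype stores the code-to-value mapping itself. This is a different technique from the one in the answer above, sketched here under the assumption that all columns should share one sorted set of categories; cat_type, encoded and decoded are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# One shared, sorted list of categories for every column.
categories = sorted(pd.unique(df.to_numpy().ravel()))
cat_type = pd.CategoricalDtype(categories=categories)

# Encode: position of each value in `categories` (0-based integers).
encoded = pd.DataFrame(
    {col: df[col].astype(cat_type).cat.codes for col in df.columns}
)

# Decode: rebuild the values from the codes and the shared dtype.
decoded = pd.DataFrame(
    {col: pd.Categorical.from_codes(encoded[col].to_numpy(), dtype=cat_type).astype(object)
     for col in encoded.columns}
)

print(encoded)
print(decoded)
```

Here the codes are integers starting at 0 rather than the floats starting at 1.0 that rank produces, so the two encodings are not interchangeable.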

Hope this answer helps you!
