pandas.factorize on an entire data frame


Anurag · Answer 1 · 2019-07-29T10:18:19+0000

To factorize each column separately, apply pd.factorize to every column with DataFrame.apply.

For example, start from this frame:

import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})
print(df)

Output:

       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

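Each column is then encoded on its own, with codes assigned in order of first appearance within that column:

print(df.apply(lambda x: pd.factorize(x)[0]))

#Output
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1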
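Because every column is factorized independently, the same code can stand for different values in different columns. Here is a minimal sketch that also keeps the uniques returned by pd.factorize, so each column can be decoded later. The names codes, uniques, encoded and decoded_c are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

codes = {}    # column name -> integer codes for that column
uniques = {}  # column name -> unique values, in order of first appearance

for col in df.columns:
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes, index=df.index)
print(encoded)

# Code 1 means 'type2' in column B but 'type3' in column C,
# so each column has to be decoded with its own uniques.
decoded_c = uniques['C'].take(encoded['C'].to_numpy())
print(list(decoded_c))
```

If the codes need to be comparable across columns, the whole frame has to be encoded against one shared set of values instead.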

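To encode the whole frame with one shared set of codes instead, stack it into a single Series, rank it with method='dense', and unstack it again. The ranks start at 1, follow sorted order, and come back as floats:

print(df.stack().rank(method='dense').unstack())

#Output
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0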
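If integer codes are preferred over float ranks, one possible alternative (not part of the approach above) is to flatten the frame with NumPy and call pd.factorize once with sort=True. This is a sketch that assumes all columns hold comparable values; flat, codes, uniques and encoded are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Flatten every cell into one 1-D array so all columns share one code table.
flat = df.to_numpy().ravel()

# sort=True numbers the codes in sorted order of the unique values,
# i.e. the dense rank minus one.
codes, uniques = pd.factorize(flat, sort=True)

# ravel() and reshape() both use row-major order, so cells line up again.
encoded = pd.DataFrame(codes.reshape(df.shape),
                       index=df.index, columns=df.columns)
print(encoded)
print(uniques)
```

Without sort=True the codes would follow order of first appearance, which happens to coincide with sorted order for this particular frame.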

To apply this to selected columns only:

df[['B', 'C']] = df[['B', 'C']].stack().rank(method='dense').unstack()
print(df)

#Output
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0

The same with factorize, which gives 0-based integer codes in order of first appearance:

stacked = df[['B', 'C']].stack()
df[['B', 'C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print(df)

#Output
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2

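To map the ranks back to the original values, build a dict from rank to value. Duplicates have to be removed with drop_duplicates first, so that each value appears exactly once. This starts again from the original all-string df:

# Unique values of the whole frame, in order of first appearance
uniq = df.stack().drop_duplicates()
vals = uniq.values
b = uniq.rank(method='dense').tolist()
d1 = dict(zip(b, vals))
print(d1)

#Output
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}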

df1 = df.stack().rank(method='dense').unstack()
print(df1)

#Output
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

print(df1.stack().map(d1).unstack())

#Output
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
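As an alternative to the rank-to-value dict, the uniques returned by factorize can serve as the lookup table directly, since code i is simply position i in uniques. The following is a sketch under that assumption, starting from the original all-string frame; encoded and restored are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# Encode the whole frame against one shared set of values.
stacked = df.stack()
codes, uniques = stacked.factorize()
encoded = pd.Series(codes, index=stacked.index).unstack()
print(encoded)

# Decode: index the uniques array with the 2-D array of codes.
restored = pd.DataFrame(uniques.to_numpy()[encoded.to_numpy()],
                        index=encoded.index, columns=encoded.columns)
print(restored)
```

This avoids drop_duplicates and the float keys of the dict, at the cost of having to keep uniques around for as long as the codes are in use.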

Hope this answer helps you!
