All four functions seem really similar to me. In some situations some of them give the same result, in others they do not. Any help would be greatly appreciated!

As far as I know, factorize and LabelEncoder work the same way internally and do not differ much in their results. I am not sure whether they take a similar amount of time on large amounts of data.

get_dummies and OneHotEncoder yield the same result, but OneHotEncoder can only handle numbers while get_dummies takes all kinds of input. get_dummies generates new column names automatically for each input column, but OneHotEncoder does not (it assigns the new columns names like 1, 2, 3, ...). So get_dummies is better in all respects.

Please correct me if I am wrong! Thank you!

1 Answer

These four encoders can be split into two categories:

  • Encode labels into categorical variables (integer codes): pandas factorize and scikit-learn LabelEncoder. The result has one dimension.
  • Encode a categorical variable into dummy/indicator (binary) variables: pandas get_dummies and scikit-learn OneHotEncoder. The result has n dimensions, one per distinct value of the encoded categorical variable.

The major difference between pandas and scikit-learn encoders is that scikit-learn encoders are built to be used in scikit-learn pipelines with the fit and transform methods.
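
A minimal sketch of what that difference means in practice, assuming scikit-learn 0.20 or later (where OneHotEncoder accepts strings). The `train` and `new` frames are made-up data for illustration. The encoder learns its categories once in fit and applies the same column layout to any later data, whereas get_dummies derives its columns from whatever frame it is handed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'new' lacks 'B' and 'C' and contains an unseen value 'D'
train = pd.DataFrame({'Col': ['A', 'B', 'B', 'C']})
new = pd.DataFrame({'Col': ['A', 'D']})

# fit learns the categories from the training data only
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# transform reuses those categories: always 3 columns, unseen 'D' becomes all zeros
encoded_new = enc.transform(new).toarray()
print(encoded_new.shape)            # (2, 3)

# get_dummies only sees the frame it is given, so its columns differ
dummies_new = pd.get_dummies(new)
print(list(dummies_new.columns))    # ['Col_A', 'Col_D']
```

The handle_unknown='ignore' setting is a choice made for this sketch; with the default setting, transform raises an error on the unseen value 'D'.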

Encode labels into categorical variables

Pandas factorize and scikit-learn LabelEncoder belong to the first category. Both map each distinct value of a column to an integer code, which can then be used as a categorical variable.

For example:

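import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# factorize returns (codes, uniques); keep only the integer codes
df['Fact'] = pd.factorize(df['Col'])[0]

# LabelEncoder learns the classes in fit and returns the codes in transform
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])

print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2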
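
In the example above the two columns are identical only because the input happens to be in sorted order already. The following sketch, with made-up unsorted values, shows where the two differ: factorize numbers values in order of first appearance, while LabelEncoder numbers them in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(['C', 'A', 'B', 'A'])

# factorize numbers values in order of first appearance
codes, uniques = pd.factorize(values)
print(codes.tolist())           # [0, 1, 2, 1]
print(list(uniques))            # ['C', 'A', 'B']

# LabelEncoder numbers values in sorted order
le = LabelEncoder()
labels = le.fit_transform(values)
print(labels.tolist())          # [2, 0, 1, 0]
print(le.classes_.tolist())     # ['A', 'B', 'C']

# factorize with sort=True produces the same codes as LabelEncoder
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)
print(sorted_codes.tolist())    # [2, 0, 1, 0]

# Both mappings can be reversed
print(list(uniques[codes]))                     # ['C', 'A', 'B', 'A']
print(le.inverse_transform(labels).tolist())    # ['C', 'A', 'B', 'A']
```

So the results match whenever the data is sorted or sort=True is passed to factorize; otherwise the integer codes can differ even though both encodings carry the same information.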

Encode categorical variable into dummy/indicator (binary) variables:

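df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# dtype=float reproduces the 1.0/0.0 output below; the default dtype differs between pandas versions
df = pd.get_dummies(df, dtype=float)

print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0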

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# Older scikit-learn releases (before 0.20) required integer input for OneHotEncoder,
# so the strings are label-encoded first; newer releases also accept strings directly
le = LabelEncoder()
df['Col'] = le.fit_transform(df['Col'])

# fit_transform returns a sparse matrix; toarray() makes it dense
enc = OneHotEncoder()
df = pd.DataFrame(enc.fit_transform(df).toarray())

print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
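
The label-encoding step above was needed with older scikit-learn releases. As a sketch, assuming scikit-learn 0.20 or later, OneHotEncoder can be applied to the string column directly, and readable column names can be rebuilt from the fitted categories_ attribute. The `Col_` prefix is chosen here only to mirror the get_dummies naming.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])

# No LabelEncoder step: the encoder is fitted on the strings themselves
enc = OneHotEncoder()
encoded = enc.fit_transform(df[['Col']]).toarray()

# categories_ holds one array of learned categories per input column
column_names = ['Col_' + str(c) for c in enc.categories_[0]]
result = pd.DataFrame(encoded, columns=column_names)

print(list(result.columns))    # ['Col_A', 'Col_B', 'Col_C']
print(result.values.tolist())
```

With the names restored this way, the frame has the same layout as the get_dummies result, so neither of the question's two objections to OneHotEncoder holds for current versions.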

Hope this answer helps you! 
