These four encoders fall into two categories:
- Encode labels into categorical variables: pandas factorize and scikit-learn LabelEncoder. The result is one-dimensional, with a single integer code per value.
- Encode a categorical variable into dummy/indicator (binary) variables: pandas get_dummies and scikit-learn OneHotEncoder. The result has n dimensions, one per distinct value of the encoded categorical variable.
The major difference between the pandas and scikit-learn encoders is that the scikit-learn encoders are built to be used in scikit-learn pipelines: fit learns the mapping from the data and transform applies it, so the same mapping can be reused on new data.
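As a minimal sketch of what the fit/transform split allows, the snippet below fits a OneHotEncoder on one DataFrame and reuses it on another. The `train` and `new` frames are made-up toy data, and `handle_unknown="ignore"` is an option chosen here for illustration so that an unseen value does not raise an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration only.
train = pd.DataFrame({"Col": ["A", "B", "B", "C"]})
new = pd.DataFrame({"Col": ["C", "A", "D"]})  # "D" was never seen during fit

# fit() learns the categories once; handle_unknown="ignore" encodes
# unseen values as an all-zero row instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# transform() reuses the learned categories, so the output always has
# one column per category seen during fit (here: A, B, C).
encoded_new = enc.transform(new).toarray()
print(enc.categories_)
print(encoded_new)
```

The encoded output keeps three columns no matter which values appear in `new`, which is what lets a downstream model rely on a fixed input width.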
Encode labels into categorical variables
Pandas factorize and scikit-learn LabelEncoder belong to the first category. They can be used to create categorical variables.
For example:
import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
df['Fact'] = pd.factorize(df['Col'])[0]
le = preprocessing.LabelEncoder()
df['Lab'] = le.fit_transform(df['Col'])
print(df)
#   Col  Fact  Lab
# 0   A     0    0
# 1   B     1    1
# 2   B     1    1
# 3   C     2    2
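The two encoders give identical codes above only because the input happens to be sorted already. A small sketch with unsorted toy data (the `values` Series is made up for illustration) shows where they differ and how each one maps codes back to the original labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(["B", "A", "B", "C"])  # toy data, deliberately unsorted

# factorize numbers the categories in order of first appearance.
codes, uniques = pd.factorize(values)
print(codes.tolist())   # [0, 1, 0, 2]

# LabelEncoder numbers the categories in sorted order.
le = LabelEncoder()
labels = le.fit_transform(values)
print(labels.tolist())  # [1, 0, 1, 2]

# Both keep enough information to recover the original labels.
restored_fact = uniques.take(codes)
restored_lab = le.inverse_transform(labels)
print(list(restored_fact))
print(list(restored_lab))
```

If the order should be sorted with pandas as well, factorize accepts `sort=True`.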
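Encode categorical variable into dummy/indicator (binary) variables
Pandas get_dummies and scikit-learn OneHotEncoder belong to the second category. They produce one binary column per distinct value.
For example: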
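df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# dtype=float reproduces the 1.0/0.0 output below; recent pandas versions return booleans by default
df = pd.get_dummies(df, dtype=float)
print(df)
#    Col_A  Col_B  Col_C
# 0    1.0    0.0    0.0
# 1    0.0    1.0    0.0
# 2    0.0    1.0    0.0
# 3    0.0    0.0    1.0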
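Because get_dummies does not remember the categories it saw, encoding a second DataFrame can yield a different set of columns. One possible workaround, sketched below with made-up `train` and `new` frames, is to align the new columns to the training columns with reindex:

```python
import pandas as pd

# Toy data, made up for illustration only.
train = pd.DataFrame({"Col": ["A", "B", "B", "C"]})
new = pd.DataFrame({"Col": ["C", "A"]})  # this batch contains no "B"

train_dummies = pd.get_dummies(train, dtype=float)
new_dummies = pd.get_dummies(new, dtype=float)
print(list(new_dummies.columns))  # ['Col_A', 'Col_C']

# Align to the training columns: missing columns are added and filled
# with 0.0, and any column not present in training would be dropped.
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0.0)
print(new_aligned)
```

This is a manual step that the scikit-learn encoder performs implicitly through its fitted state.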
# The same encoding with scikit-learn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame(['A', 'B', 'B', 'C'], columns=['Col'])
# Since scikit-learn 0.20, OneHotEncoder accepts string categories directly,
# so converting to integers with LabelEncoder first is only needed on older versions
enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
df = pd.DataFrame(enc.fit_transform(df).toarray())
print(df)
#      0    1    2
# 0  1.0  0.0  0.0
# 1  0.0  1.0  0.0
# 2  0.0  1.0  0.0
# 3  0.0  0.0  1.0
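The OneHotEncoder output above has plain integer column names. As a rough sketch, the fitted encoder's `categories_` attribute can be used to label the columns, and `inverse_transform` to recover the original values. The `Col_` prefix is just a naming choice made here to mirror get_dummies.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Col": ["A", "B", "B", "C"]})
enc = OneHotEncoder()
onehot = enc.fit_transform(df).toarray()

# categories_ holds one array per input column, in output-column order.
labels = [f"Col_{c}" for c in enc.categories_[0]]
named = pd.DataFrame(onehot, columns=labels)
print(named)

# inverse_transform maps the binary rows back to the original values.
original = enc.inverse_transform(onehot)
print(original.ravel().tolist())
```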
Hope this answer helps you!