in Machine Learning by (19k points)

I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of performance and usage.

I found a tutorial on how to use OneHotEncoder() at https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this feature. I have a feeling I'm not doing it correctly...but

Can someone explain the pros and cons of using pd.get_dummies over sklearn.preprocessing.OneHotEncoder() and vice versa? I know that OneHotEncoder() gives you a sparse matrix, but other than that I'm not sure how it is used and what the benefits are over the pandas method. Am I using it inefficiently?

import pandas as pd
import numpy as np
import seaborn as sns  # needed for sns.set() below
from sklearn.datasets import load_iris

sns.set()
%matplotlib inline

# Load the iris data
iris = load_iris()
n_samples, m_features = iris.data.shape
X, y = iris.data, iris.target

# Map integer targets to species names
D_target_dummy = dict(zip(np.arange(iris.target_names.shape[0]), iris.target_names))

DF_data = pd.DataFrame(X, columns=iris.feature_names)
DF_data["target"] = pd.Series(y).map(D_target_dummy)

#    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
# 0                5.1               3.5               1.4               0.2
# 1                4.9               3.0               1.4               0.2
# 2                4.7               3.2               1.3               0.2
# 3                4.6               3.1               1.5               0.2
# 4                5.0               3.6               1.4               0.2
# 5                5.4               3.9               1.7               0.4

DF_dummies = pd.get_dummies(DF_data["target"])

#    setosa  versicolor  virginica
# 0       1           0          0
# 1       1           0          0
# 2       1           0          0
# 3       1           0          0
# 4       1           0          0
# 5       1           0          0

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def f1(DF_data):
    Enc_ohe, Enc_label = OneHotEncoder(), LabelEncoder()
    DF_data["Dummies"] = Enc_label.fit_transform(DF_data["target"])
    DF_dummies2 = pd.DataFrame(Enc_ohe.fit_transform(DF_data[["Dummies"]]).todense(),
                               columns=Enc_label.classes_)
    return DF_dummies2

%timeit pd.get_dummies(DF_data["target"])
# 1000 loops, best of 3: 777 µs per loop

%timeit f1(DF_data)
# 100 loops, best of 3: 2.91 ms per loop

1 Answer

by (33.1k points)

OneHotEncoder: in scikit-learn versions before 0.20, it could not process string values directly, so string features had to be mapped to integers first (e.g. with LabelEncoder). From version 0.20 onward, OneHotEncoder accepts string categories directly.

Code for OneHotEncoder in scikit-learn:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)    # assume for simplicity that all features are categorical
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)



pandas.get_dummies: this method converts string/object and categorical columns into a one-hot representation; by default it encodes every such column unless you restrict it with the `columns` argument. Because it derives the dummy columns from whatever data it is given, encoding train and test sets separately can produce mismatched columns.
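A short sketch (assumed toy data) of that pitfall, and one common workaround of aligning the test frame to the training columns:

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["green", "blue"]})   # "blue" never appears in train

d_train = pd.get_dummies(train)   # columns: color_green, color_red
d_test = pd.get_dummies(test)     # columns: color_blue, color_green

# Align test to the training columns, zero-filling any missing dummies
d_test = d_test.reindex(columns=d_train.columns, fill_value=0)
```

After the `reindex`, the unseen "blue" row becomes all zeros, mirroring what `OneHotEncoder(handle_unknown="ignore")` does automatically.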

Hope this answer helps.

