
I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of performance and usage.

I found a tutorial on how to use OneHotEncoder() at https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this feature. I have a feeling I'm not doing it correctly, though.

Can someone explain the pros and cons of using pd.get_dummies over sklearn.preprocessing.OneHotEncoder() and vice versa? I know that OneHotEncoder() gives you a sparse matrix, but other than that I'm not sure how it is used and what the benefits are over the pandas method. Am I using it inefficiently?

import pandas as pd
import numpy as np
import seaborn as sns  # needed for sns.set() below
from sklearn.datasets import load_iris

sns.set()
%matplotlib inline

# Load the iris data
iris = load_iris()
n_samples, m_features = iris.data.shape

X, y = iris.data, iris.target
D_target_dummy = dict(zip(np.arange(iris.target_names.shape[0]), iris.target_names))

DF_data = pd.DataFrame(X, columns=iris.feature_names)
DF_data["target"] = pd.Series(y).map(D_target_dummy)

#    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
# 0                5.1               3.5               1.4               0.2
# 1                4.9               3.0               1.4               0.2
# 2                4.7               3.2               1.3               0.2
# 3                4.6               3.1               1.5               0.2
# 4                5.0               3.6               1.4               0.2
# 5                5.4               3.9               1.7               0.4

DF_dummies = pd.get_dummies(DF_data["target"])

#    setosa  versicolor  virginica
# 0       1           0          0
# 1       1           0          0
# 2       1           0          0
# 3       1           0          0
# 4       1           0          0
# 5       1           0          0

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def f1(DF_data):
    Enc_ohe, Enc_label = OneHotEncoder(), LabelEncoder()
    DF_data["Dummies"] = Enc_label.fit_transform(DF_data["target"])
    DF_dummies2 = pd.DataFrame(Enc_ohe.fit_transform(DF_data[["Dummies"]]).todense(),
                               columns=Enc_label.classes_)
    return DF_dummies2

%timeit pd.get_dummies(DF_data["target"])

#1000 loops, best of 3: 777 µs per loop

%timeit f1(DF_data)

#100 loops, best of 3: 2.91 ms per loop
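Beyond raw speed, a practical difference worth knowing: pd.get_dummies encodes each frame independently, while OneHotEncoder remembers the categories it saw at fit time and reuses them, so train and test always get the same columns. A minimal sketch (the color column and its values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "red"]})  # "green" and "blue" absent

# get_dummies looks only at the frame it is given, so the column sets differ:
print(pd.get_dummies(train["color"]).columns.tolist())  # ['blue', 'green', 'red']
print(pd.get_dummies(test["color"]).columns.tolist())   # ['red']

# OneHotEncoder learns the categories at fit time and reuses them at transform
# time, so both sets are encoded into the same three columns.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
print(enc.transform(test).toarray().shape)  # (2, 3)
```

This is why OneHotEncoder is the safer choice inside a fit/transform pipeline: the column layout is fixed by the training data, not by whatever happens to appear in each batch.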

1 Answer


OneHotEncoder: in older versions of scikit-learn (before 0.20) it could not process string values directly, so string features first had to be mapped to integers (e.g. with LabelEncoder, as in the question's f1). Since scikit-learn 0.20, OneHotEncoder accepts string categories directly.

Code for OneHotEncoder in scikit learn:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)    # assume for simplicity that all features are categorical
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
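The handle_unknown="ignore" option above is what makes this pattern safe when X_test contains a category never seen during fit: the unknown value is encoded as an all-zero row instead of raising an error. A small sketch with a made-up fruit column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"fruit": ["apple", "banana"]})
X_test = pd.DataFrame({"fruit": ["cherry"]})  # unseen at fit time

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)  # learned categories: ['apple', 'banana']

# "cherry" matches neither learned category, so its row is all zeros:
row = encoder.transform(X_test).toarray()
print(row)  # [[0. 0.]]
```

With the default handle_unknown="error", the same transform call would raise a ValueError instead.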


 

pandas.get_dummies: by default this converts all string (object/category) columns into one-hot columns and passes numeric columns through unchanged; if you pass the columns= argument, only the listed columns are encoded.
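For example (a made-up frame with one numeric and one string column):

```python
import pandas as pd

df = pd.DataFrame({"size": [1, 2], "color": ["red", "blue"]})

# By default only the object column is expanded; "size" passes through:
print(pd.get_dummies(df).columns.tolist())
# ['size', 'color_blue', 'color_red']

# Passing columns= encodes exactly the listed columns, numeric or not:
print(pd.get_dummies(df, columns=["size", "color"]).columns.tolist())
# ['size_1', 'size_2', 'color_blue', 'color_red']
```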

Hope this answer helps.

