0 votes
2 views
in Machine Learning by (19k points)

I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of performance and usage.

I found a tutorial on how to use OneHotEncoder() on https://xgdgsc.wordpress.com/2015/03/20/note-on-using-onehotencoder-in-scikit-learn-to-work-on-categorical-features/ since the sklearn documentation wasn't too helpful on this feature. I have a feeling I'm not doing it correctly...but

Can someone explain the pros and cons of using pd.get_dummies over sklearn.preprocessing.OneHotEncoder() and vice versa? I know that OneHotEncoder() gives you a sparse matrix, but other than that I'm not sure how it is used and what the benefits are over the pandas method. Am I using it inefficiently?

import pandas as pd
import numpy as np
import seaborn as sns   # needed for sns.set() below
from sklearn.datasets import load_iris

sns.set()
%matplotlib inline

# Load the iris data
iris = load_iris()
n_samples, m_features = iris.data.shape

X, y = iris.data, iris.target

# Map integer targets (0, 1, 2) to species names
D_target_dummy = dict(zip(np.arange(iris.target_names.shape[0]), iris.target_names))

DF_data = pd.DataFrame(X, columns=iris.feature_names)
DF_data["target"] = pd.Series(y).map(D_target_dummy)

#    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
# 0                5.1               3.5                1.4               0.2
# 1                4.9               3.0                1.4               0.2
# 2                4.7               3.2                1.3               0.2
# 3                4.6               3.1                1.5               0.2
# 4                5.0               3.6                1.4               0.2
# 5                5.4               3.9                1.7               0.4

DF_dummies = pd.get_dummies(DF_data["target"])

#    setosa  versicolor  virginica
# 0       1           0          0
# 1       1           0          0
# 2       1           0          0
# 3       1           0          0
# 4       1           0          0
# 5       1           0          0

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def f1(DF_data):
    Enc_ohe, Enc_label = OneHotEncoder(), LabelEncoder()
    DF_data["Dummies"] = Enc_label.fit_transform(DF_data["target"])
    DF_dummies2 = pd.DataFrame(Enc_ohe.fit_transform(DF_data[["Dummies"]]).todense(),
                               columns=Enc_label.classes_)
    return DF_dummies2

%timeit pd.get_dummies(DF_data["target"])
# 1000 loops, best of 3: 777 µs per loop

%timeit f1(DF_data)
# 100 loops, best of 3: 2.91 ms per loop

1 Answer

0 votes
by (33.1k points)

OneHotEncoder: In older versions of scikit-learn (before 0.20) it could not process string values directly, so string features first had to be mapped to integers (e.g. with LabelEncoder, as in your f1 function). Since scikit-learn 0.20, OneHotEncoder accepts string features directly, which makes that extra step unnecessary.

Code for OneHotEncoder in scikit-learn:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")

encoder.fit(X_train)    # Assume for simplicity all features are categorical.

X_train = encoder.transform(X_train)   # sparse one-hot matrix
X_test = encoder.transform(X_test)     # unseen categories encode as all zeros
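For completeness, here is a minimal self-contained sketch of the same idea (assuming scikit-learn >= 0.20, where strings are accepted directly; the toy target frame below is made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for a categorical feature column
df = pd.DataFrame({"target": ["setosa", "versicolor", "virginica", "setosa"]})

enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(df[["target"]])   # sparse matrix by default
print(onehot.toarray()[0])                   # first row: [1. 0. 0.] (setosa)

# With handle_unknown="ignore", a category unseen at fit time
# becomes an all-zero row instead of raising an error:
print(enc.transform(pd.DataFrame({"target": ["unknown"]})).toarray())  # [[0. 0. 0.]]
```

Note that the categories are ordered alphabetically at fit time (enc.categories_), and the fitted encoder applies the same columns to any later data, which is what makes it safe for train/test splits.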


 

Pandas.get_dummies: This method one-hot encodes all string (object/category) columns of a DataFrame by default; pass the columns= argument to restrict encoding to particular columns.
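To illustrate one practical difference (a toy sketch, not from the original answer): get_dummies encodes each frame independently, so train and test data can end up with mismatched columns, whereas a fitted OneHotEncoder remembers the categories it saw at fit time:

```python
import pandas as pd

train = pd.DataFrame({"target": ["setosa", "versicolor", "virginica"]})
test = pd.DataFrame({"target": ["setosa", "setosa"]})

# Each call encodes only the categories present in that particular frame:
print(list(pd.get_dummies(train).columns))
# ['target_setosa', 'target_versicolor', 'target_virginica']
print(list(pd.get_dummies(test).columns))
# ['target_setosa']  <- fewer columns than train
```

This column mismatch is a common pitfall when using get_dummies in a train/test pipeline, and a main reason to prefer OneHotEncoder for model workflows.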

Hope this answer helps.

Learn about Scikit Learn with the help of this Scikit Learn Tutorial.
