How can I one hot encode in Python?

Question

asked Jun 14, 2019 in Machine Learning by ParasSharma1 (19k points)
edited Jun 18, 2019 by ParasSharma1

I have a machine learning classification problem with 80% categorical variables. Must I use one-hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without the encoding?

I am trying to do the following for feature selection:

I read the training file:
num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
I change the type of the categorical features to 'category':
non_categorial_features = ['orig_destination_distance', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'cnt']
for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')
I use one-hot encoding:
train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

The problem is that the third part often gets stuck, although I am using a powerful machine.

Without the one-hot encoding, I can't do any feature selection to determine the importance of the features.

What do you recommend?

closed

1 Answer

answered Jun 17, 2019 by Anurag (33.1k points)
selected Jun 18, 2019 by ParasSharma1

Best answer

For this problem, you can use either the pandas library or scikit-learn.

Using Pandas:

In pandas, we use get_dummies to encode the values.

For example:

>>> df = pd.DataFrame({'Name': ['John Smith', 'Mary Brown'],
...                    'Gender': ['M', 'F'], 'Smoker': ['Y', 'N']})
>>> print(df)
  Gender        Name Smoker
0      M  John Smith      Y
1      F  Mary Brown      N
>>> df_with_dummies = pd.get_dummies(df, columns=['Gender', 'Smoker'])
>>> print(df_with_dummies)
         Name  Gender_F  Gender_M  Smoker_N  Smoker_Y
0  John Smith       0.0       1.0       0.0       1.0
1  Mary Brown       1.0       0.0       1.0       0.0
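The same example as a standalone script may be easier to run than the interactive session. This is only a sketch: the dtype=int argument is my addition, so that the result is 0/1 integers regardless of the pandas version (newer releases return booleans by default), and the nunique() check is a suggested way to see how many dummy columns each column would produce before encoding a large frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John Smith", "Mary Brown"],
    "Gender": ["M", "F"],
    "Smoker": ["Y", "N"],
})

# Number of distinct values per column, i.e. how many dummy columns
# each column would turn into if it were encoded.
cardinality = df.nunique()
print(cardinality)

# Encode only the listed columns; "Name" is left untouched.
# dtype=int (an assumption added here) forces 0/1 integers.
df_with_dummies = pd.get_dummies(df, columns=["Gender", "Smoker"], dtype=int)
print(df_with_dummies)
```

Columns with a very large number of distinct values (IDs, timestamps) produce one dummy column per value, so it is usually worth checking the counts and passing only the columns you really want encoded through the columns argument.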

Using Scikit-learn:

In Scikit-learn, we can use OneHotEncoder to encode the values. After fitting, we can use get_feature_names_out (get_feature_names in versions before 1.0) to get the names of the generated features.

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
enc.transform([['Female', 1], ['Male', 4]]).toarray()

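This code learns the categories of each column with the fit function and encodes new values with the transform function. Because handle_unknown='ignore' is set, a value that was not seen during fit (here, 4) is encoded as all zeros for that column instead of raising an error.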
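To see what the encoder actually produces, here is a small sketch that extends the snippet above with the feature names and the dense output. The hasattr check is my own addition to cope with different scikit-learn versions; it is not part of the original example.

```python
from sklearn.preprocessing import OneHotEncoder

X = [["Male", 1], ["Female", 3], ["Female", 2]]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)

# Categories learned during fit, one array per input column.
print(enc.categories_)

# get_feature_names_out is available in scikit-learn >= 1.0;
# older releases expose get_feature_names instead.
if hasattr(enc, "get_feature_names_out"):
    names = list(enc.get_feature_names_out())
else:
    names = list(enc.get_feature_names())
print(names)

# transform returns a sparse matrix by default; toarray() makes it dense.
dense = enc.transform([["Female", 1], ["Male", 4]]).toarray()
print(dense)
```

The second row has no 1 in the columns for the numeric feature, because 4 was not among the values seen during fit.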
