0 votes
in Machine Learning by (19k points)

I have a machine learning classification problem in which 80% of the variables are categorical. Must I one-hot encode them before using a classifier, or can I pass the data to a classifier without encoding?

I am trying to do the following for feature selection:

  1. I read the training file:
    import pandas as pd
    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)

  2. I change the type of the categorical features to 'category':
    non_categorial_features = ['orig_destination_distance', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'cnt']
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')

  3. I use one-hot encoding:
    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

The problem is that the third step often hangs, even though I am using a powerful machine.

As a result, I never get the one-hot encoded data, and without it I can't do any feature selection to determine the importance of the features.

What do you recommend?


1 Answer

+1 vote
by (33.1k points)
Best answer

For this problem, you can use either Pandas or Scikit-learn to one-hot encode the categorical columns.

Using Pandas:

In pandas, we use get_dummies to encode the values.

For example:

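>>> import pandas as pd
>>> df = pd.DataFrame({'Name': ['John Smith', 'Mary Brown'],
...                    'Gender': ['M', 'F'], 'Smoker': ['Y', 'N']})
>>> print(df)
  Gender        Name Smoker
0      M  John Smith      Y
1      F  Mary Brown      N
>>> df_with_dummies = pd.get_dummies(df, columns=['Gender', 'Smoker'])
>>> print(df_with_dummies)
         Name  Gender_F  Gender_M  Smoker_N  Smoker_Y
0  John Smith       0.0       1.0       0.0       1.0
1  Mary Brown       1.0       0.0       1.0       0.0

(This output comes from an older pandas release. Newer releases keep the column order Name, Gender, Smoker and print the dummy columns as True/False unless you pass a dtype.)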
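Because get_dummies creates one new column per distinct value, it can help to count the distinct values before encoding a wide frame like the one in the question. The snippet below is only a sketch under assumptions: the frame and the column names city, device, and cnt are made up for illustration and are not from the original dataset, and dtype=int is passed so the result looks the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real training data.
df = pd.DataFrame({
    "city": ["a", "b", "a", "c"],
    "device": [1, 2, 1, 3],
    "cnt": [1, 5, 2, 3],
})
categorical_cols = ["city", "device"]

# Distinct values per categorical column; their sum is the number
# of dummy columns that get_dummies will create for these columns.
cardinality = df[categorical_cols].nunique()
expected_dummy_cols = int(cardinality.sum())

# Encode only the listed columns; 'cnt' is passed through unchanged.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(cardinality)
print(expected_dummy_cols, encoded.shape)
```

If the expected number of dummy columns is very large, that is a likely reason for the encoding step to run slowly or exhaust memory.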

Using Scikit-learn:

In Scikit-learn, we can use OneHotEncoder to encode the values. After fitting, we can use get_feature_names_out (get_feature_names in older versions) to get the names of the encoded features.

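from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)  # learn the categories of each column
enc.transform([['Female', 1], ['Male', 4]]).toarray()  # 4 was not seen during fit, so its columns are all zero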

This code learns the categories with the fit function and encodes new values with the transform function. See the Scikit-learn documentation for OneHotEncoder for more details.
