Label encoding across multiple columns in scikit-learn

Question

asked Jun 19, 2019 in Machine Learning by ParasSharma1 (19k points)

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the data frame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder raises the error below. Please bear in mind that I'm using dummy data here; in reality, I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.

Code:

import pandas
from sklearn import preprocessing
df = pandas.DataFrame({
   'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
   'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
   'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                'New_York']
})
le = preprocessing.LabelEncoder()
le.fit(df)

Output:

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

4 Answers

Anurag · Answer 1 · 2019-06-19T08:58:34+0000

LabelEncoder and OneHotEncoder are classes in Python's scikit-learn library.

Label Encoding

LabelEncoder converts categorical text data into numerical data that a model can work with. It operates on a single one-dimensional column at a time, which is why passing a whole DataFrame fails. To use it, import the LabelEncoder class from sklearn.preprocessing, then fit and transform your data.

For example:

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# x is assumed to be a 2-D NumPy array of dtype object; encode its first column in place
x[:, 0] = labelencoder.fit_transform(x[:, 0])

One Hot Encoder

OneHotEncoder takes a column of categorical data and splits it into multiple columns, one per category. Each value is replaced by 1s and 0s, depending on which column corresponds to it. Before scikit-learn 0.20 the column had to be label encoded first; from 0.20 onward, string input is accepted directly.

For example:

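from sklearn.preprocessing import OneHotEncoder
# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22.
# On newer versions, select the columns with ColumnTransformer instead.
onehotencoder = OneHotEncoder(categorical_features = [0])
x = onehotencoder.fit_transform(x).toarray()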

For your problem, you can use OneHotEncoder to encode features of your dataset.

OneHotEncoder().fit_transform(df)

Hope this answer helps.

This is a good way to transform data once, but what if I want to reuse the transform on a validation set? You would have to call fit_transform again, and issues could arise, such as the new data set not having all the categories for all variables. — Ashok, Aug 17, 2019
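
One possible way to handle the reuse concern raised in this comment is to fit the encoder once on the training data and call only transform on later data. The following is a minimal sketch, not a definitive recipe: the train and valid frames are made-up examples, and handle_unknown='ignore' is chosen here so that categories unseen during fitting do not raise an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data (a subset of the question's dummy columns).
train = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego'],
})
# Hypothetical validation data: has no 'monkey' and contains an unseen 'parrot'.
valid = pd.DataFrame({
    'pets': ['dog', 'parrot'],
    'location': ['New_York', 'San_Diego'],
})

# handle_unknown='ignore' encodes unseen categories as all zeros
# for that feature instead of raising at transform time.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)  # learn the categories once, from the training data only

train_encoded = encoder.transform(train).toarray()
valid_encoded = encoder.transform(valid).toarray()  # same column layout as train
```

Because the categories are learned only during fit, the validation output has the same number of columns as the training output even though its own set of categories differs.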

kodee · Answer 2 · 2019-08-07T14:12:27+0000

You can do it like this:

df.apply(LabelEncoder().fit_transform)

The recommended way for scikit-learn 0.20 is this:

OneHotEncoder().fit_transform(df)

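To support transform and inverse_transform later, keep one LabelEncoder per column in a dictionary:

from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)

Each column's LabelEncoder is now created on first access and retained in the dictionary d, keyed by column name.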

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Invert the encoding
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
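
For reference, here is a self-contained sketch of the same per-column dictionary idea using the question's dummy data. It is written with explicit dictionary comprehensions rather than df.apply, which is a stylistic substitution on my part; the names encoders and new_rows are illustrative only.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One LabelEncoder per column, created on first access and keyed by column name.
encoders = defaultdict(LabelEncoder)

# Fit each column's encoder and encode the column.
encoded = pd.DataFrame(
    {col: encoders[col].fit_transform(df[col]) for col in df.columns},
    index=df.index,
)

# Round trip: recover the original strings from the integer codes.
decoded = pd.DataFrame(
    {col: encoders[col].inverse_transform(encoded[col]) for col in encoded.columns},
    index=encoded.index,
)

# Reuse the fitted encoders on new rows (hypothetical data with known labels).
new_rows = pd.DataFrame({'pets': ['monkey'], 'owner': ['Ron'],
                         'location': ['New_York']})
new_encoded = pd.DataFrame(
    {col: encoders[col].transform(new_rows[col]) for col in new_rows.columns},
    index=new_rows.index,
)
```

Note that LabelEncoder.transform raises an error for labels it did not see during fitting, so this only works when new data contains known categories.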

Thanks for the explanation. However, for some classification tasks, you could use
pandas.get_dummies(input_df)
This takes an input data frame with categorical data and returns a data frame with binary values. — chandra, Aug 9, 2019
This is the right solution, but in this case, how can we apply the inverse transform? — Prabhpreet Kaur, Aug 10, 2019

vinita · Answer 3 · 2019-08-08T06:23:39+0000

You can easily do this through the following syntax:

df.apply(LabelEncoder().fit_transform)

In scikit-learn 0.20, the recommended way is the following:

OneHotEncoder().fit_transform(df)

as OneHotEncoder now supports string input. To apply OneHotEncoder only to certain columns, use ColumnTransformer.
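
A minimal sketch of applying OneHotEncoder to selected columns with ColumnTransformer might look like the following. The numeric age column is a hypothetical addition for illustration, and the choices of remainder='passthrough' and sparse_threshold=0 (to force a dense array) are assumptions made to keep the example easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'age': [3, 5, 2, 7, 4, 1],  # hypothetical numeric column, left unencoded
})

transformer = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), ['pets', 'owner'])],
    remainder='passthrough',  # keep the remaining columns unchanged
    sparse_threshold=0,       # always return a dense NumPy array
)

# One-hot columns come first (3 for pets, 4 for owner), then the passthrough column.
encoded = transformer.fit_transform(df)
```

The fitted transformer can then be reused on later data with transformer.transform, in the same way as a single encoder.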

Kasheeka · Answer 4 · 2019-08-09T11:43:01+0000

I don't think you'll need LabelEncoder in order to encode a pandas DataFrame.

Instead, you can convert the columns to categoricals and then retrieve their codes. The code below applies this process to all columns and wraps the result back into a DataFrame of the same shape, with an identical index and column names.

>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

To build a mapping dictionary from codes back to labels, enumerate the categories in a dictionary comprehension:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
...  for col in df}
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
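
As a rough sketch of how that mapping dictionary could be used, the codes can be translated back to the original labels with Series.map. The variable names codes, mapping, and decoded are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Integer codes for every column, as in the answer above.
codes = pd.DataFrame(
    {col: df[col].astype('category').cat.codes for col in df},
    index=df.index,
)

# Per-column mapping from integer code to original label.
mapping = {
    col: dict(enumerate(df[col].astype('category').cat.categories))
    for col in df
}

# Decode by looking each code up in its column's mapping.
decoded = pd.DataFrame(
    {col: codes[col].map(mapping[col]) for col in codes},
    index=codes.index,
)
```

This plays the role of inverse_transform for the categorical-codes approach, provided the mapping is built from the same data that produced the codes.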
