+2 votes
1 view
in Machine Learning by (12.5k points)

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the data frame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I'm using dummy data here; in actuality, I'm dealing with about 50 columns of string labeled data, so need a solution that doesn't reference any columns by name.

Code:

import pandas
from sklearn import preprocessing

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York']
})

le = preprocessing.LabelEncoder()
le.fit(df)

Output:

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

4 Answers

+3 votes
by (32.8k points)

LabelEncoder and OneHotEncoder are classes in the preprocessing module of the scikit-learn library for Python.

Label Encoding

LabelEncoder converts categorical text data into integer codes that a model can consume. Import the LabelEncoder class from sklearn.preprocessing, then fit and transform your data. It works on one 1-D column at a time, which is why passing a whole DataFrame raises the "bad input shape" error.

For example:

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

x[:, 0] = labelencoder.fit_transform(x[:, 0])
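The snippet above leaves `x` undefined. As a minimal sketch, assuming `x` is a 2-D object array whose first column holds the string labels (the array below is made up for illustration), the encoding step could look like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for `x`: first column is categorical, second is numeric.
x = np.array([['cat', 3], ['dog', 5], ['cat', 1], ['monkey', 2]], dtype=object)

labelencoder = LabelEncoder()
# LabelEncoder accepts only a 1-D array, so a single column is passed in.
x[:, 0] = labelencoder.fit_transform(x[:, 0])

# classes_ holds the original labels; a label's position is its integer code.
print(labelencoder.classes_)
print(x)
```

The codes follow the sorted order of the labels, so they carry no meaning beyond identity.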

One Hot Encoder

OneHotEncoder takes a column of categorical data and splits it into multiple columns, one per category. Each row gets a 1 in the column that matches its category and 0s in the others. Before scikit-learn 0.20 the column had to be label encoded first; from 0.20 onward, string input is accepted directly.

For example:

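from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects which columns get encoded.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)  # dense array: encoded columns first, then the rest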

For your problem, you can use OneHotEncoder to encode features of your dataset.

OneHotEncoder().fit_transform(df)
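As a rough sketch of what this call returns on the question's dummy data: the result is a sparse matrix with one column per distinct label. If integer codes are wanted instead (closer to what LabelEncoder produces), OrdinalEncoder, which this answer does not mention, accepts the whole DataFrame in the same way; it is shown here as an alternative, not as the answer's method.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One-hot: 3 pets + 4 owners + 2 locations = 9 indicator columns.
onehot = OneHotEncoder()
dense = onehot.fit_transform(df).toarray()  # fit_transform returns a sparse matrix

# Ordinal: one integer code per column, fitted on all columns at once.
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(df)
restored = ordinal.inverse_transform(codes)  # back to the original labels

print(dense.shape, codes.shape)
```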

Hope this answer helps.


by (36.7k points)
well explained.
by (36.7k points)
This is a good way to transform data once, but what if I want to reuse the transform on a validation set? You would have to call fit_transform again, and issues could arise, such as the new data set not containing every category for every variable.
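One possible way to handle this, sketched under the assumption that the encoder is fitted once on training data and then only `transform` is called on the validation set. The `train` and `valid` frames are invented for illustration, and `handle_unknown='ignore'` makes unseen categories encode as all zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and validation frames.
train = pd.DataFrame({'pets': ['cat', 'dog', 'monkey'],
                      'location': ['San_Diego', 'New_York', 'New_York']})
valid = pd.DataFrame({'pets': ['dog', 'parrot'],          # 'parrot' was never seen
                      'location': ['New_York', 'San_Diego']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)                       # learn the categories once

# Reuse the fitted encoder; the column layout stays fixed by the training data.
valid_encoded = encoder.transform(valid).toarray()
print(valid_encoded)
```

Because the columns come from the fitted categories, a validation set that lacks some categories still yields the same number of columns.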
+3 votes
by (42k points)

You can do it like this:

df.apply(LabelEncoder().fit_transform)
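A small sketch of what this one-liner does on the question's data. Note that the single LabelEncoder is refitted on each column in turn, so afterwards it only remembers the classes of the last column it saw.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

le = LabelEncoder()
# apply() passes each column (a 1-D Series) to fit_transform separately.
encoded = df.apply(le.fit_transform)
print(encoded)

# The encoder was overwritten on every column; only the last fit survives.
last_column = df.columns[-1]
print(last_column, list(le.classes_))
```

This is fine for a one-off encoding, but it cannot be inverted or reused per column.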

From scikit-learn 0.20 onward, the recommended way is this:

OneHotEncoder().fit_transform(df)

For inverse_transform and transform, you should do it like this:

from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

d = defaultdict(LabelEncoder)

This keeps one fitted LabelEncoder per column in a dictionary keyed by column name.

# Encode every column
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Invert the encoding
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Reuse the fitted encoders to label future data
df.apply(lambda x: d[x.name].transform(x))
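Put together as a runnable sketch on the question's dummy data. The `new_rows` frame is a made-up example of future data, and the name `encoders` replaces `d` only for readability.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# A new LabelEncoder is created the first time each column name is looked up.
encoders = defaultdict(LabelEncoder)

encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

# Hypothetical future data containing only labels seen during fitting.
new_rows = pd.DataFrame({'pets': ['monkey', 'cat'],
                         'owner': ['Ron', 'Brick'],
                         'location': ['New_York', 'San_Diego']})
new_encoded = new_rows.apply(lambda col: encoders[col.name].transform(col))
print(new_encoded)
```

`transform` raises a ValueError for labels that were not present during fitting, so future data must be limited to known labels or handled separately.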

by (23.6k points)
Thanks for the explanation.
However, for some classification tasks you could use
pandas.get_dummies(input_df)
It takes an input data frame with categorical data and returns a data frame of binary indicator columns.
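A quick sketch of that call, assuming `input_df` is the question's dummy data. Each original column is expanded into one indicator column per label, named `<column>_<label>`.

```python
import pandas as pd

input_df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# String columns are expanded automatically; no column names are referenced.
dummies = pd.get_dummies(input_df)
print(dummies.columns.tolist())
print(dummies.shape)
```

Unlike a fitted encoder, get_dummies keeps no state, so the columns it produces depend on whichever labels appear in the frame it is given.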
by (47.9k points)
This is the right solution, but in this case, how can we apply inverse transform?
by (7.4k points)
Thank you for the explanation.
by (66.5k points)
Looking for this type of explanation thanks
+2 votes
by (44.6k points)

You can easily do this through the following syntax:

df.apply(LabelEncoder().fit_transform)

In scikit-learn 0.20, the recommended way is the following:

OneHotEncoder().fit_transform(df)

since OneHotEncoder now supports string input. To apply OneHotEncoder to only certain columns, use ColumnTransformer.
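A minimal sketch of the ColumnTransformer route. The numeric `age` column and its values are hypothetical, added only so that there is something to leave unencoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'age': [3, 5, 2, 7, 4, 6],   # hypothetical numeric column
})

transformer = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['pets', 'owner'])],
    remainder='passthrough',   # keep the other columns unchanged
    sparse_threshold=0,        # always return a dense array
)
result = transformer.fit_transform(df)

# 3 pet columns + 4 owner columns + the untouched age column.
print(result.shape)
```

To avoid naming columns at all, the list could be built from dtypes, for example `df.select_dtypes(include='object').columns`.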

+2 votes
by (28.6k points)

I don't think you need LabelEncoder to encode a pandas DataFrame.

You can instead convert the columns to categoricals and retrieve their codes. The code below applies this to all columns and wraps the result back into a DataFrame of the same shape, with identical indices and column names.

>>> import pandas as pd
>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

To build a mapping dictionary from the codes back to the labels, enumerate the categories in a dictionary comprehension:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
     for col in df}
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
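As a sketch of how that mapping might be used to undo the encoding (the names `codes`, `mapping` and `decoded` are chosen here for illustration), each code column can be passed through its own dictionary with `Series.map`:

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Encode: integer category codes for every column.
codes = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df},
                     index=df.index)

# Per-column lookup table from code back to label.
mapping = {col: dict(enumerate(df[col].astype('category').cat.categories))
           for col in df}

# Decode: map each code column through its own lookup table.
decoded = pd.DataFrame({col: codes[col].map(mapping[col]) for col in codes},
                       index=codes.index)
print(decoded)
```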

by (27.7k points)
Nice explanation! thanks @kasheeka