in Machine Learning by (5.7k points)

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the data frame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather just have one big LabelEncoder object that works across all my columns of data.

Passing the entire DataFrame to LabelEncoder raises the error below. Please bear in mind that I'm using dummy data here; in reality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.

Code:

import pandas
from sklearn import preprocessing

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York']
})

le = preprocessing.LabelEncoder()
le.fit(df)

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

1 Answer

by (14.9k points)

LabelEncoder and OneHotEncoder are both classes in the preprocessing module of the scikit-learn library for Python.

Label Encoding

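Label encoding converts categorical text data into integer codes that a model can work with. To use it, import the LabelEncoder class from sklearn.preprocessing, then fit and transform your data. Note that LabelEncoder accepts only a 1-D array of labels, which is why fitting it on a whole DataFrame fails with "bad input shape (6, 3)"; it has to be applied one column at a time.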

For example:

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
# x is assumed to be a 2-D NumPy array of dtype object; this encodes its first column in place
x[:, 0] = labelencoder.fit_transform(x[:, 0])
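To make the behaviour concrete, here is a minimal sketch that label encodes a single column, using the 'pets' values from the question. The variable names (pets, encoder, codes) are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Sample values taken from the 'pets' column in the question.
pets = ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog']

encoder = LabelEncoder()
codes = encoder.fit_transform(pets)  # LabelEncoder expects 1-D input

print(list(encoder.classes_))  # classes are stored in sorted order: cat, dog, monkey
print(codes.tolist())          # [0, 1, 0, 2, 1, 1]

# The fitted encoder can map the integer codes back to the original strings.
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Each distinct string gets the index of its position in the sorted classes_ array, so the codes carry no meaning beyond identity.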

One Hot Encoder

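One-hot encoding takes a column of categorical data and splits it into multiple columns, one per category. Each row gets a 1 in the column that matches its category and 0s in the others. Older scikit-learn releases required the column to be label encoded first; since version 0.20, OneHotEncoder also accepts string categories directly.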

For example:

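from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer selects column 0 instead and passes the other columns through.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)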
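As a small illustration of what the one-hot output looks like, the sketch below encodes only the 'pets' values from the question. It assumes scikit-learn 0.20 or later (so that string categories are accepted directly); the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input, so the single column is reshaped to (6, 1).
pets = np.array(['cat', 'dog', 'cat', 'monkey', 'dog', 'dog']).reshape(-1, 1)

onehot = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense for printing.
encoded = onehot.fit_transform(pets).toarray()

print(list(onehot.categories_[0]))  # one output column per category, in sorted order
print(encoded.shape)                # (6, 3)
print(encoded[3].tolist())          # 'monkey' row: [0.0, 0.0, 1.0]
```

Every row contains exactly one 1, in the column of its category.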

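For your problem, you can pass the whole DataFrame to OneHotEncoder (scikit-learn 0.20 or later), which encodes every column at once without referencing any of them by name. Note that the result is a sparse matrix of 0/1 indicator columns rather than one integer code per original column: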

OneHotEncoder().fit_transform(df)
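If integer label codes per column are what you need rather than one-hot columns, one possible approach is to loop over the columns and fit a separate LabelEncoder for each, without ever naming a column. This is only a sketch under that assumption; the encoders dictionary and the variable names are hypothetical helpers, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Dummy data from the question.
df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York'],
})

encoders = {}        # hypothetical helper: column name -> fitted LabelEncoder
encoded = df.copy()

for name in df.columns:            # no column is referenced by a hard-coded name
    le = LabelEncoder()
    encoded[name] = le.fit_transform(df[name])
    encoders[name] = le            # kept so the codes can be decoded later

print(encoded['pets'].tolist())      # [0, 1, 0, 2, 1, 1]
print(encoded['location'].tolist())  # [1, 0, 0, 1, 1, 0]

# Decode one column back to its original strings.
decoded_owner = list(encoders['owner'].inverse_transform(encoded['owner']))
print(decoded_owner)
```

Keeping the fitted encoders around is what allows the same mapping to be reused on new data or inverted later. scikit-learn 0.20 and later also provides OrdinalEncoder, which accepts 2-D input and may be a simpler fit when the columns are features rather than target labels.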

Hope this answer helps.
