0 votes
in Data Science by (17.6k points)

I have a pandas DataFrame in which some columns hold text, and those text columns contain some NaN values. I want to impute the NaNs with sklearn.preprocessing.Imputer, replacing each NaN with the most frequent value in its column. The problem is the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. Once I run:

from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0) 

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome.

1 Answer

0 votes
by (41.4k points)

Imputer converts its input to floats, which is why it fails on text columns. You can use the code below to impute missing values in both categorical and numeric columns:

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        # One fill value per column: the most frequent value for object
        # columns, the mean for everything else.
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # fillna matches the fill values to the columns by label.
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print(X)
print(xt)

This gives the output (the original frame X first, then the imputed frame xt):

     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN

   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
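
If you do not need a scikit-learn transformer, the same idea can be sketched with pandas alone. This is only a minimal illustration, not part of the answer above: the function name impute_by_dtype and the example columns run and score are made up for the demo, and it assumes that "categorical" means any non-numeric column.

```python
import numpy as np
import pandas as pd


def impute_by_dtype(df):
    """Return a copy of df with NaNs filled per column.

    Numeric columns get their mean; all other columns get their most
    frequent value. (Hypothetical helper, named for this example only.)
    """
    fill = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            mean = df[col].mean()
            if pd.notna(mean):  # an all-NaN column has no mean; skip it
                fill[col] = mean
        else:
            modes = df[col].mode(dropna=True)
            if not modes.empty:  # an all-NaN column has no mode; skip it
                fill[col] = modes.iloc[0]
    # A dict passed to fillna is applied column by column.
    return df.fillna(value=fill)


# Made-up data: one text column and one numeric column, each with a gap.
df = pd.DataFrame({
    "run": ["run1", "run2", "run2", np.nan],
    "score": [1.0, 2.0, np.nan, 3.0],
})
filled = impute_by_dtype(df)
print(filled)
```

Here the missing run becomes 'run2' (the most frequent value) and the missing score becomes 2.0 (the column mean). When several values tie for most frequent, mode() returns them sorted and this sketch takes the first. Newer scikit-learn releases also provide sklearn.impute.SimpleImputer, which accepts strategy='most_frequent' for string columns.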
