I'm really new to machine learning and I'm taking an online course on the subject. In this course, the instructors showed the following piece of code:

imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0)

imputer = imputer.fit(X[:, 1:3])

X[:, 1:3] = imputer.transform(X[:, 1:3])

I don't really get why this imputer object needs to be fitted. I mean, I'm just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure, and shouldn't require a model that has to train on data.

Can someone please explain how this imputer works and why it requires training to replace some missing values with the column mean? I have read scikit-learn's documentation, but it just shows how to use the methods, and not why they're required.

Thank you.

by (33.1k points)

The Imputer fills missing values with a statistic of the data (e.g. the mean or median of each column). To avoid data leakage during cross-validation, it computes that statistic on the training data during fit, stores it, and applies the stored value to the test data during transform.

For example:

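import numpy as np

# Note: Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead.

from sklearn.preprocessing import Imputer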

obj = Imputer(strategy='mean')

obj.fit([[1, 2, 3], [2, 3, 4]])

print(obj.statistics_)

# array([ 1.5,  2.5, 3.5])

X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])

print(X)

# array([[ 4. ,  2.5, 6. ],

#        [ 5. , 6. ,  3.5]])
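As a rough illustration of why there are two steps, here is a minimal pure-Python sketch of a mean imputer. The MeanImputer class is hypothetical and far simpler than scikit-learn's implementation, and it assumes every column has at least one non-missing value. The point is only that fit computes and stores the column means, while transform uses the stored means without looking at any new statistics.

```python
import math


class MeanImputer:
    """Hypothetical, simplified mean imputer (not scikit-learn's code)."""

    def fit(self, rows):
        # Learn one mean per column, skipping missing (NaN) entries.
        # Assumption: each column has at least one non-missing value.
        self.statistics_ = []
        for column in zip(*rows):
            present = [v for v in column if not math.isnan(v)]
            self.statistics_.append(sum(present) / len(present))
        return self

    def transform(self, rows):
        # Replace each NaN with the mean stored for its column during fit.
        return [
            [self.statistics_[j] if math.isnan(v) else float(v)
             for j, v in enumerate(row)]
            for row in rows
        ]


imputer = MeanImputer().fit([[1, 2, 3], [2, 3, 4]])
means = imputer.statistics_
filled = imputer.transform([[4, math.nan, 6], [5, 6, math.nan]])

print(means)   # [1.5, 2.5, 3.5]
print(filled)  # [[4.0, 2.5, 6.0], [5.0, 6.0, 3.5]]
```

The filled values 2.5 and 3.5 come from the data passed to fit, not from the rows passed to transform, which is exactly the separation the real Imputer provides.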

You can perform both steps in a single call with fit_transform if your training and test data are identical:

X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])

print(X)

# array([[ 1. ,  2. , 4. ],

#        [ 2. , 3. ,  4. ]])

The data distribution may differ between the training data and the test data, and you don't want information from the test data to be present during the fit.
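To make the leakage point concrete, here is a small sketch with made-up numbers for a single column. The values and variable names are purely illustrative, and None stands in for a missing entry. It compares filling the test set with the mean learned from the training data against filling it with a mean that also saw the test values.

```python
from statistics import mean

# Made-up single-column data; None marks a missing value.
train = [1.0, 2.0, 3.0]
test = [100.0, None, 300.0]

# Correct: the fill value is learned from the training data only.
train_mean = mean(train)
test_filled = [train_mean if v is None else v for v in test]

# Leaky: the fill value is computed with the test values included.
pooled_mean = mean([v for v in train + test if v is not None])
test_filled_leaky = [pooled_mean if v is None else v for v in test]

print(test_filled)        # [100.0, 2.0, 300.0]
print(test_filled_leaky)  # [100.0, 81.2, 300.0]
```

In the leaky version the fill value already reflects the test distribution, so any score computed on that test set no longer measures performance on truly unseen data.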