
In scikit-learn, all estimators have a fit() method, and depending on whether they are supervised or unsupervised, they also have a predict() or transform() method.

I am in the process of writing a transformer for an unsupervised learning task and was wondering if there is a rule of thumb where to put which kind of learning logic. The official documentation is not very helpful in this regard:

fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.

In this context, what is meant by both fitting data and transforming data?


To center and scale the data (make it have zero mean and unit variance), you subtract the mean and then divide the result by the standard deviation:

x′ = (x − μ) / σ

You do that on the training set. But then you have to apply the same transformation to your test set (e.g. in cross-validation), or to newly obtained examples before prediction, and you have to use the same two parameter values μ and σ that you computed on the training set.
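This idea can be sketched in plain NumPy (the array values here are made up for illustration):

```python
import numpy as np

# "Training" data: the centering parameters are computed from it
X_train = np.array([1.0, 3.0, 5.0, 7.0])
mu = X_train.mean()       # 4.0
sigma = X_train.std()     # population standard deviation
X_train_scaled = (X_train - mu) / sigma

# New data: reuse the SAME mu and sigma learned from the training set,
# rather than recomputing them on the test data
X_test = np.array([2.0, 6.0])
X_test_scaled = (X_test - mu) / sigma
```

Recomputing μ and σ on the test set would leak information and make the two sets incomparable; reusing the training-set parameters keeps the transformation consistent.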

Hence, every sklearn transformer's fit() simply calculates the parameters (e.g. μ and σ in the case of StandardScaler) and saves them as internal state of the object. Afterwards, you call its transform() method to apply the transformation to a particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of the parameters on the training set X, but it also returns the transformed X′. Internally, it simply calls fit() and then transform() on the same data.

The following explanation is based on fit_transform of the Imputer class, but the idea is the same for fit_transform of other scikit-learn classes like MinMaxScaler.

________________________________________

transform replaces the missing values with a number. By default, this number is the mean of the corresponding column of the data the imputer was fitted on.

Consider the following example:

import numpy as np

from sklearn.preprocessing import Imputer

imp = Imputer()

# fit: calculate the column means

imp.fit([[1, 3], [np.nan, 2], [8, 5.5]])

Now the imputer has learned to use the mean (1 + 8)/2 = 4.5 for the first column and the mean (3 + 2 + 5.5)/3 = 3.5 for the second column when it is applied to two-column data:

X = [[np.nan, 11],
     [4, np.nan],
     [8, 2],
     [np.nan, 1]]

print(imp.transform(X))

we get

[[4.5, 11],
 [4, 3.5],
 [8, 2],
 [4.5, 1]]

So with fit the imputer calculates the means of the columns from some data, and with transform it applies those means to some data (which is simply replacing the missing values with those means). If both of these are the same data (i.e. the data used for computing the means and the data the means are applied to), you can use fit_transform, which is basically a fit followed by a transform.
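To check the arithmetic, the two steps can be reproduced directly in NumPy (a sketch of mean imputation, not the actual sklearn implementation):

```python
import numpy as np

# fit step: compute column means, ignoring the NaNs
X_fit = np.array([[1, 3], [np.nan, 2], [8, 5.5]])
means = np.nanmean(X_fit, axis=0)   # [4.5, 3.5]

# transform step: fill each NaN with its column's learned mean
X = np.array([[np.nan, 11],
              [4, np.nan],
              [8, 2],
              [np.nan, 1]])
X_out = np.where(np.isnan(X), means, X)
# X_out == [[4.5, 11], [4, 3.5], [8, 2], [4.5, 1]]
```

Note that recent scikit-learn versions removed Imputer; its replacement, SimpleImputer in sklearn.impute, behaves the same way for this example.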

