
When doing regression or classification, what is the correct (or better) way to preprocess the data?

1. Normalize the data -> PCA -> training

2. PCA -> normalize PCA output -> training

3. Normalize the data -> PCA -> normalize PCA output -> training

Which of the above is more correct, or is the "standard" way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.


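You should normalize the data before doing PCA (option 1). PCA is sensitive to the scale of the features: a feature with a much larger variance than the others will dominate the principal components, whether or not it is informative.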

For example, consider a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];

>> A = chol(C);

>> X = randn(100,2) * A;

Now run PCA to find the principal components (the directions along which the data vary the most):

>> wts=pca(X)

wts =

0.6659    0.7461

-0.7461    0.6659

Now scale the first feature of the data set by 100:

>> Y = X;

>> Y(:,1) = 100 * Y(:,1);

Now the principal components are almost aligned with the coordinate axes, because the variance of the scaled first feature dominates:

>> wts=pca(Y)

wts =

1.0000    0.0056

-0.0056    1.0000
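A short worked calculation shows why the components line up with the axes. It assumes the population covariance of X is exactly C (the sample above only approximates it), and D is notation introduced here for the scaling matrix:

```latex
\[
\operatorname{Cov}(Y) = D\,C\,D
= \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}
  \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} 10000 & 50 \\ 50 & 1 \end{pmatrix}
\]
\[
\lambda_{1,2} = \frac{10001 \pm \sqrt{10001^2 - 4 \cdot 7500}}{2}
\approx 10000.25,\; 0.75
\]
\[
v_1 \propto \begin{pmatrix} 50 \\ \lambda_1 - 10000 \end{pmatrix}
\approx \begin{pmatrix} 1 \\ 0.005 \end{pmatrix}
\]
```

The first component points almost entirely along the scaled feature, and its small off-axis part of about 0.005 is in line with the 0.0056 in the sample result above.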

There are two ways to resolve this:

First, rescale the data so that each feature has unit standard deviation:

>> Ynorm = bsxfun(@rdivide, Y, std(Y));

To get PCA results:

>> wts = pca(Ynorm)

wts =

-0.7125   -0.7016

0.7016   -0.7125

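These results differ somewhat from the PCA performed on the original data, because the rescaled features now have exactly unit standard deviation, which was only approximately true of the original sample.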

Second, perform PCA using the correlation matrix of the data instead of the covariance matrix:

>> wts = pca(Y,'corr')

wts =

0.7071    0.7071

-0.7071    0.7071

This is equivalent to standardizing the data by subtracting the mean and dividing by the standard deviation before running PCA.
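As a rough check of that equivalence, the sketch below compares the correlation matrix of the scaled data with the covariance matrix of the standardized data. It uses a plain eigendecomposition (eig) rather than the pca call above, and the names Yc, Z, R and S are chosen here only for illustration:

```matlab
% Build the same kind of data as above: correlated features, first one scaled by 100.
C = [1 0.5; 0.5 1];
X = randn(100, 2) * chol(C);
Y = X;
Y(:, 1) = 100 * Y(:, 1);

% Standardize: subtract the column means, then divide by the column standard deviations.
Yc = bsxfun(@minus, Y, mean(Y));
Z  = bsxfun(@rdivide, Yc, std(Y));

R = corrcoef(Y);   % correlation matrix of the scaled data
S = cov(Z);        % covariance matrix of the standardized data

% The columns of V1 and V2 are the principal directions (up to sign and order).
[V1, D1] = eig(R);
[V2, D2] = eig(S);

disp(max(abs(R(:) - S(:))));   % difference between R and S, at rounding-error level
disp(abs(V1));                 % every entry is close to 1/sqrt(2)
```

For two features, the eigenvectors of any correlation matrix with a nonzero off-diagonal entry are (1, 1)/sqrt(2) and (1, -1)/sqrt(2), which is why every weight has magnitude 0.7071 regardless of the scaling applied to the first feature.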