
When doing regression or classification, what is the correct (or better) way to preprocess the data?

  1. Normalize the data -> PCA -> training

  2. PCA -> normalize PCA output -> training

  3. Normalize the data -> PCA -> normalize PCA output -> training

Which of the above is more correct, or is the "standard" way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.

1 Answer


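You should normalize the data before doing PCA. PCA looks for the directions of greatest variance, so a feature measured on a much larger scale than the others will dominate the principal components regardless of how the features are actually correlated.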

For example, consider a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];

>> A = chol(C);

>> X = randn(100,2) * A;

Now run PCA to find the principal components (the directions of greatest variance in the data):

>> wts=pca(X)

wts =

    0.6659    0.7461

   -0.7461    0.6659

Next, scale the first feature of the data set by 100:

>> Y = X;

>> Y(:,1) = 100 * Y(:,1);

Now the principal components are almost aligned with the coordinate axes, because the rescaled first feature dominates the variance:

>> wts=pca(Y)

wts =

    1.0000    0.0056

   -0.0056    1.0000
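A rough way to see why, using the population covariance in place of the 100-point sample (an assumption made only for illustration; the symbols D, Σ_Y and v are introduced here and are not part of the original answer): scaling the first feature by 100 multiplies C on both sides by D = diag(100, 1).

```latex
\[
\Sigma_Y = D\,C\,D
= \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}
  \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}
= \begin{pmatrix} 10000 & 50 \\ 50 & 1 \end{pmatrix}
\]
\[
\lambda_{\max} \approx 10000 + \frac{50^2}{10000 - 1} \approx 10000.25,
\qquad
v \approx \begin{pmatrix} 1 \\ \dfrac{\lambda_{\max} - 10000}{50} \end{pmatrix}
  \approx \begin{pmatrix} 1 \\ 0.005 \end{pmatrix}
\]
```

The second entry, about 0.005, is in line with the 0.0056 in the output above; an exact match is not expected from a random sample.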

There are two ways to resolve this.

First, rescale the data so that each feature has unit standard deviation:

>> Ynorm = bsxfun(@rdivide, Y, std(Y));

Then run PCA on the rescaled data:

>> wts = pca(Ynorm)

wts =

   -0.7125   -0.7016

    0.7016   -0.7125

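These results differ slightly from the PCA on the original data because each feature now has exactly unit standard deviation, which was only approximately true of the sampled X. The overall sign of each component is arbitrary.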

Second, perform PCA using the correlation matrix of the data, instead of the outer product:

>> wts = pca(Y,'corr')

wts =

    0.7071    0.7071

   -0.7071    0.7071
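The 0.7071 entries are what one would expect. As a short check (with r standing for the off-diagonal correlation, a symbol introduced here for illustration), any 2-by-2 correlation matrix has the same eigenvectors whatever the value of r ≠ 0:

```latex
\[
\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ \pm 1 \end{pmatrix}
= (1 \pm r)
\begin{pmatrix} 1 \\ \pm 1 \end{pmatrix},
\qquad
\frac{1}{\sqrt{2}} \approx 0.7071
\]
```

After normalizing to unit length, the components are (1, 1)/√2 and (1, -1)/√2 up to sign, so the scale of the individual features no longer matters.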

This is equivalent to standardizing the data first, that is, subtracting the mean of each feature and dividing by its standard deviation.
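The link between standardizing and the correlation matrix can be checked numerically. The following is only a sketch: it uses the base MATLAB functions cov, corrcoef and eig instead of the pca call above (whose implementation is not shown in this answer), and the fixed seed and the names Z, gap, V and D are assumptions made for illustration.

```matlab
% Sketch only: base-MATLAB cov/corrcoef/eig, not the pca call used above.
rng(0);                          % fixed seed (arbitrary choice) for repeatability
C = [1 0.5; 0.5 1];              % target correlation matrix, as in the answer
X = randn(100,2) * chol(C);      % correlated sample
Y = X;
Y(:,1) = 100 * Y(:,1);           % badly scaled first feature

% Standardize: subtract each column mean, divide by each column std.
Z = bsxfun(@rdivide, bsxfun(@minus, Y, mean(Y)), std(Y));

% Covariance of the standardized data vs. correlation matrix of the raw data.
gap = max(max(abs(cov(Z) - corrcoef(Y))));
fprintf('largest difference: %g\n', gap);   % expected to be at round-off level

% Eigenvectors of the correlation matrix give the principal directions (up to sign).
[V, D] = eig(corrcoef(Y));
disp(V);
```

Because cov(Z) and corrcoef(Y) agree up to round-off, an eigendecomposition of either one yields the same directions.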

Hope this answer helps.
