PCA first or normalization first?

Question

1 Answer

Anurag · Answer 1 · 2019-07-17T14:57:05+0000

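You should normalize the data before doing PCA. PCA looks for the directions of greatest variance, so a feature measured on a much larger scale than the others will dominate the components, regardless of how the features are actually correlated.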

For example, consider a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;
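As a quick sanity check on the construction above, the sample correlation of the generated data can be compared with the target. This is only a sketch using base functions (`chol`, `randn`, `corrcoef`); the variable `R` is introduced here for illustration, and because the data is random the values will vary from run to run.

```matlab
% Sketch: check that the generated data has roughly the intended correlation.
C = [1 0.5; 0.5 1];        % target correlation matrix
A = chol(C);               % upper-triangular factor with A' * A == C
X = randn(100, 2) * A;     % rows are samples with covariance approximately C

R = corrcoef(X);           % sample correlation matrix of X
disp(R)                    % off-diagonal entries should be near 0.5, not exactly 0.5
```

With only 100 samples, the off-diagonal entry usually lands within roughly 0.1 or so of 0.5.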

Now run PCA to find the principal components (the directions of greatest variance in the data):

>> wts=pca(X)
wts =
0.6659 0.7461
-0.7461 0.6659

Now scale the first feature of the data set by 100:

>> Y = X;
>> Y(:,1) = 100 * Y(:,1);

The principal components are now aligned with the coordinate axes, because the variance of the first feature swamps everything else:

>> wts=pca(Y)
wts =
1.0000 0.0056
-0.0056 1.0000
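To see why this happens, it can help to look at the per-feature variances directly. The sketch below uses `eig` on `cov(Y)` as a stand-in for the `pca` call; that is an assumption about what `pca` computes here, so the signs and exact numbers may not match the output above.

```matlab
% Sketch: why scaling one feature by 100 pulls the components onto the axes.
C = [1 0.5; 0.5 1];
X = randn(100, 2) * chol(C);
Y = X;
Y(:, 1) = 100 * Y(:, 1);          % first feature now has ~10^4 times the variance

v = var(Y);                        % per-feature sample variances
[V, D] = eig(cov(Y));              % eigenvectors of the sample covariance matrix
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);                     % columns ordered by decreasing variance

disp(v)
disp(V)                            % first column is close to [1; 0] up to sign
```

The leading direction is almost exactly the first coordinate axis simply because that axis carries nearly all of the variance, not because the underlying relationship between the two features has changed.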

There are two ways to fix this.

First, rescale the data so that every feature has unit standard deviation:

>> Ynorm = bsxfun(@rdivide, Y, std(Y));

Then run PCA on the rescaled data:

>> wts = pca(Ynorm)
wts =
-0.7125 -0.7016
0.7016 -0.7125

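These results differ slightly from the PCA of the original data X, because the rescaled features are now guaranteed to have unit standard deviation, which was not exactly true of X.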
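The rescaling step can be checked on its own. The following is a minimal sketch that regenerates the data (so the numbers will not match the session above) and confirms that each column has unit sample standard deviation after dividing by `std(Y)`.

```matlab
% Sketch: rescale each column to unit standard deviation and confirm it.
C = [1 0.5; 0.5 1];
X = randn(100, 2) * chol(C);
Y = X;
Y(:, 1) = 100 * Y(:, 1);

Ynorm = bsxfun(@rdivide, Y, std(Y));   % divide each column by its sample std
disp(std(Y))                            % roughly [100 1] before rescaling
disp(std(Ynorm))                        % [1 1] after rescaling, up to rounding
```

In MATLAB R2016b or later, implicit expansion allows the same operation to be written as `Y ./ std(Y)`.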

Second, perform PCA using the correlation matrix of the data, instead of the outer product:

>> wts = pca(Y,'corr')
wts =
0.7071 0.7071
-0.7071 0.7071

This is equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation.
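That equivalence can be checked numerically. The sketch below assumes nothing about the `pca` function used above; it only compares the correlation matrix of the raw data with the covariance matrix of the standardized data, using base functions. The names `Z`, `R`, and `S` are introduced here for illustration.

```matlab
% Sketch: correlation-matrix PCA matches covariance PCA on standardized data.
C = [1 0.5; 0.5 1];
X = randn(100, 2) * chol(C);
Y = X;
Y(:, 1) = 100 * Y(:, 1);

% Standardize: subtract the column means, then divide by the column stds.
Z = bsxfun(@rdivide, bsxfun(@minus, Y, mean(Y)), std(Y));

R = corrcoef(Y);       % correlation matrix of the raw, badly scaled data
S = cov(Z);            % covariance matrix of the standardized data
disp(norm(R - S))      % difference is at the level of floating-point rounding

[V, ~] = eig(R);       % eigenvectors of a 2x2 correlation matrix
disp(abs(V))           % every entry has magnitude 1/sqrt(2), about 0.7071
```

For two features, the correlation matrix always has the form [1 r; r 1], whose eigenvectors are proportional to [1; 1] and [1; -1] whenever r is nonzero. That is why every entry in the output above has magnitude 0.7071.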

Hope this answer helps.