
I can perform PCA in scikit-learn with the code below. X_train has 279180 rows and 104 columns.

import numpy as np

from sklearn.decomposition import PCA

pca = PCA(n_components=30)

X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I must do the following:

comp = pca.components_ #30x104

com_tr = np.transpose(pca.components_) #104x30

proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 279180x30

But I am unsure about this step, because the scikit-learn documentation says:

components_: array, [n_components, n_features]

Principal axes in feature space, representing the directions of maximum variance in the data.

It seems to me that it is already projected, but when I checked the source code, it returns only the eigenvectors.

What is the right way to project it?

Ultimately, I am aiming to calculate the MSE of reconstruction.

# Reconstruct

recon = np.dot(proj,comp) #279180x30 * 30x104 = 279180x104

# MSE error

print("MSE = %.6G" % np.mean((X_train - recon)**2))

1 Answer


Use the following code:

proj = pca.inverse_transform(X_train_pca)

Here you do not have to worry about how to do the multiplications.

The output of pca.fit_transform or pca.transform is called the "loadings" here (many texts call these values the scores): for each sample, they say how much of each component is needed to describe it best as a linear combination of the rows of components_.

The projection you are aiming at is back in the original signal space. This means that you need to go back into signal space using the components and the loadings.
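In symbols, the round trip can be sketched as below. The notation is an assumption introduced for illustration: X is the n × d data matrix, μ is the row vector of per-feature means (pca.mean_), W is the k × d matrix pca.components_, 1 is a column of n ones, and whitening is assumed to be off (the default).

```latex
\begin{aligned}
Z &= (X - \mathbf{1}\mu)\,W^{\top} && \text{(transform)} \\
\hat{X} &= Z\,W + \mathbf{1}\mu && \text{(inverse\_transform)} \\
\mathrm{MSE} &= \frac{1}{n d}\,\lVert X - \hat{X} \rVert_F^{2}
\end{aligned}
```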

pca.fit estimates the components:

from sklearn.decomposition import PCA

import numpy as np

from numpy.testing import assert_array_almost_equal

# Synthetic data for the demonstration: 100 samples, 50 features

X_train = np.random.randn(100, 50)

pca = PCA(n_components=30)

pca.fit(X_train)

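# Rows of VT are the principal axes of the centered data

U, S, VT = np.linalg.svd(X_train - X_train.mean(0), full_matrices=False)

# Compare up to sign: scikit-learn normalizes the sign of each component

assert_array_almost_equal(np.abs(VT[:30]), np.abs(pca.components_))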

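pca.transform calculates the loadings as you describe, except that it first subtracts the per-feature mean (pca.mean_), which the code in the question omits: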

X_train_pca = pca.transform(X_train)

X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)

assert_array_almost_equal(X_train_pca, X_train_pca2)
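To see why the centering step matters, here is a small sketch on made-up data with a clearly non-zero mean. The data, the seed, and the variable names are assumptions chosen for illustration; only PCA, components_, mean_ and transform come from scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 10 features, each feature centered near 5
X = rng.normal(loc=5.0, scale=1.0, size=(200, 10))

pca = PCA(n_components=3).fit(X)

# Projection without centering, as in the question
uncentered = X @ pca.components_.T
# Projection with centering, which is what pca.transform does (whiten=False)
centered = (X - pca.mean_) @ pca.components_.T

# The two results differ by one constant row: the projection of the mean
offset = pca.mean_ @ pca.components_.T

matches_transform = np.allclose(centered, pca.transform(X))
differs_by_offset = np.allclose(uncentered - centered, offset)
uncentered_equals_transform = np.allclose(uncentered, pca.transform(X))
```

If the data already has zero mean in every feature, the two projections coincide; otherwise the uncentered version is shifted, and a reconstruction built from it will not match pca.inverse_transform.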

pca.inverse_transform maps the loadings back into signal space, which gives the projection onto the components that you are interested in:

X_projected = pca.inverse_transform(X_train_pca)

X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_

assert_array_almost_equal(X_projected, X_projected2)

You can now evaluate the projection loss:

loss = ((X_train - X_projected) ** 2).mean()
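As a cross-check that does not rely on scikit-learn, the same loss can be sketched with NumPy alone, reusing the SVD relationship shown earlier. The helper name reconstruction_mse and the demo data are hypothetical. The closed form used for comparison (the squared singular values that were discarded, divided by the number of matrix entries) follows from the SVD of the centered data.

```python
import numpy as np


def reconstruction_mse(X, n_components):
    """Return (mse, mse_from_svd) for a rank-n_components PCA reconstruction."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of VT are the principal axes, ordered by decreasing singular value
    U, S, VT = np.linalg.svd(Xc, full_matrices=False)
    W = VT[:n_components]          # shape (n_components, n_features)
    coeffs = Xc @ W.T              # plays the role of pca.transform(X)
    X_rec = coeffs @ W + mean      # plays the role of pca.inverse_transform
    mse = np.mean((X - X_rec) ** 2)
    # Closed form: energy of the discarded singular values per matrix entry
    mse_from_svd = np.sum(S[n_components:] ** 2) / X.size
    return mse, mse_from_svd


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 50))  # hypothetical stand-in for X_train
mse, mse_from_svd = reconstruction_mse(X_demo, 30)
```

The result does not depend on the sign of each component, because the reconstruction only involves the product of W with its own transpose.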

Hope this answer helps you!
