
I know that principal component analysis performs an SVD on a matrix and produces a matrix of eigenvalues. To select the principal components, we keep only the first few eigenvalues. How do we decide how many eigenvalues to take from that matrix?


There are many algorithms and statistical procedures for deciding how many principal components to keep. To decide how many eigenvalues/eigenvectors to retain, first consider your reason for doing Principal Component Analysis (PCA). Are you doing it to reduce storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don't have any strict constraints, I recommend plotting the cumulative sum of the eigenvalues (assuming they are sorted in descending order). If you divide each value by the total sum of the eigenvalues before plotting, the plot shows the fraction of total variance retained vs. the number of eigenvalues kept. The plot then gives a good indication of where you hit the point of diminishing returns (i.e., where little additional variance is gained by retaining more eigenvalues).
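As a rough illustration of the cumulative-variance approach described above, here is a minimal NumPy sketch. The function names, the synthetic data, and the 0.95 threshold are my own assumptions for the example, not anything prescribed by PCA; pick a threshold that suits your use case.

```python
import numpy as np

def cumulative_variance_ratio(X):
    """Fraction of total variance retained by the first k components, for each k."""
    Xc = X - X.mean(axis=0)                   # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    eigvals = s**2 / (X.shape[0] - 1)         # eigenvalues of the covariance matrix
    return np.cumsum(eigvals) / eigvals.sum()

def components_for_threshold(cum_ratio, threshold=0.95):
    """Smallest k whose cumulative ratio reaches the (assumed) threshold."""
    k = int(np.searchsorted(cum_ratio, threshold)) + 1
    return min(k, len(cum_ratio))

def plot_cumulative_variance(cum_ratio):
    """Optional plot; needs matplotlib, imported lazily so the rest runs without it."""
    import matplotlib.pyplot as plt
    plt.plot(np.arange(1, len(cum_ratio) + 1), cum_ratio, marker="o")
    plt.xlabel("number of eigenvalues kept")
    plt.ylabel("fraction of total variance retained")
    plt.show()

# Synthetic data (an assumption for illustration): 5 features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

cum = cumulative_variance_ratio(X)
k = components_for_threshold(cum, threshold=0.95)
print(cum, k)
```

Calling `plot_cumulative_variance(cum)` draws the curve; the "elbow" where it flattens is the point of diminishing returns.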

Another dimensionality reduction technique you can use is Linear Discriminant Analysis (LDA). Unlike PCA, LDA is supervised: it uses the class labels to find the directions that best separate the classes, and it yields at most one fewer component than the number of classes. The LDA model consists of statistical properties of your data, calculated for each class. For a single input variable (x), these are the mean of the variable for each class and its variance (in standard LDA, a single variance pooled across classes). For multiple variables, these are the same properties calculated over the multivariate Gaussian, namely the class means and the covariance matrix.

These statistical properties are estimated from your data and plugged into the LDA equations to make predictions. These are the model values that you would save to file for your model.
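To make the "estimate, then plug in" step concrete, here is a minimal NumPy sketch of LDA as a classifier. It assumes the standard formulation with Gaussian classes and one covariance matrix pooled across classes; the function names and the toy data are hypothetical, chosen only for this example.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the values an LDA model stores: class means, priors, pooled covariance."""
    classes = np.unique(y)
    n, d = X.shape
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    pooled = np.zeros((d, d))
    for c, mu in zip(classes, means):
        diff = X[y == c] - mu
        pooled += diff.T @ diff               # within-class scatter
    pooled /= (n - len(classes))              # unbiased pooled covariance
    return classes, means, priors, pooled

def predict_lda(X, classes, means, priors, cov):
    """Plug the estimates into the linear discriminant and pick the best class."""
    inv_mu = np.linalg.solve(cov, means.T)    # columns are cov^{-1} @ mean_k
    scores = (X @ inv_mu
              - 0.5 * np.sum(means.T * inv_mu, axis=0)
              + np.log(priors))
    return classes[np.argmax(scores, axis=1)]

# Toy data (an assumption for illustration): two well-separated 2-D Gaussian classes.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X1 = rng.normal(loc=[6.0, 6.0], size=(50, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

classes, means, priors, cov = fit_lda(X_train, y_train)
pred = predict_lda(X_train, classes, means, priors, cov)
accuracy = (pred == y_train).mean()
print(accuracy)
```

The four values returned by `fit_lda` are the ones you would save to file; prediction needs nothing else.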