What is Principal Component Analysis?
Principal Component Analysis (PCA) is a statistical technique for reducing the dimensionality of data while preserving its essential properties. It summarizes the variances and covariances of the original variables through a small number of linear combinations of those variables, without losing an important part of the original information. In other words, it finds a new set of orthogonal axes along which the data has the largest variance. Its main aim is to overcome the dimensionality of the problem: the reduction should be such that, when the higher dimensions are dropped, the loss of information is minimal.
Also, interpreting the principal components can reveal associations among variables that are not visible at first glance. It helps analyze how the observations are scattered and identify the variables responsible for that spread.
Now, we will use graphical representation to
understand PCA.
Let’s consider the graph below. Here, the points are scattered diagonally, showing a relationship between the quantities on the X and Y axes. Each point represents one observation of a dataset described by two attributes (Attribute 1 and Attribute 2).
Now, to reduce the complexity and dimensionality of this data, we apply PCA. We rotate the axes of the graph anticlockwise by an angle theta so that the new x-axis points along the direction of greatest spread. After rotating the axes, the graph looks like this:
The data points that were scattered over both the x and y axes are now concentrated along the x-axis alone. This shows that the important component lies along the new x-axis, so we can drop the attribute carried by the y-axis component with little loss of information. In this way, PCA performs dimensionality reduction and can improve the performance of a Machine Learning model.
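To make the rotation concrete, here is a minimal NumPy sketch (the synthetic data and variable names are illustrative, not from the original article): it builds diagonally scattered 2D points, rotates them onto the eigenvectors of their covariance matrix, and shows the variance concentrating on a single axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonally scattered 2D data: Attribute 2 is strongly
# correlated with Attribute 1, as in the graph above.
a1 = rng.normal(size=500)
a2 = 0.9 * a1 + 0.2 * rng.normal(size=500)
data = np.column_stack([a1, a2])     # shape (500, 2)
print(np.var(data, axis=0))          # variance is spread over both axes

# Rotating onto the eigenvectors of the covariance matrix
# concentrates the variance on one axis.
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
rotated = data @ eigvecs
print(np.var(rotated, axis=0))       # almost all variance lies on one axis
```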
Properties of Principal Components
In purely technical terms, PCA constructs new variables as linear combinations of the original data variables in order to reduce the dimension of the data. These new variables are the principal components, and their number is smaller than or equal to the number of primary variables.
Properties of Principal Components are:
1. They are projections of the original data variables onto new directions, and they retain properties similar to those of the original variables.
2. They are commonly used in Machine Learning and Data Science for dimensionality reduction.
3. They are orthogonal.
4. The variance of the principal components decreases as we move from the first component to the last: the 1st PC has the highest variance and the last PC has the least, as the sketch below illustrates.
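A quick check of property 4 (a sketch assuming scikit-learn is installed; the random data is illustrative): fitting PCA and printing the per-component variances shows them in decreasing order.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Mix 5 random features so they become correlated.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA().fit(X)
print(pca.explained_variance_)        # decreasing: PC1 largest, last PC smallest
print(pca.explained_variance_ratio_)  # same ordering, as fractions of total variance
```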
How to find the Principal Components?
Before moving on to the computation of Principal
Components, you should have the following knowledge:
1. Variance: Variance measures the variation of the data points distributed across the graph. Mathematically, it is the average squared deviation from the mean value. To calculate Var(X) we use the following formula:

Var(X) = Σ (xi − x̄)² / n

where xi is the value of x for the i-th observation, x̄ is the mean, and n is the number of observations.
2. Covariance: With covariance, we can estimate the degree to which corresponding components from two sets of grouped data move in the same direction. In simple words, it is used to identify dependencies and relationships between the features of a dataset. Below is the formula for calculating Cov(x, y):

Cov(x, y) = Σ (xi − x̄)(yi − ȳ) / n

where xi and yi are the values of x and y for the i-th observation, and x̄ and ȳ are their means.
3. Eigenvectors and Eigenvalues: Eigenvectors and eigenvalues make linear transformations of the data comprehensible. An eigenvector points in a direction that a transformation, such as expanding or contracting an X-Y graph, leaves unaltered; the corresponding eigenvalue indicates the amount of variance in that particular direction. (A small numeric check of these definitions follows this list.)
4. Principal Components: The new set of variables derived from the original data set are called Principal Components. These new variables are highly meaningful and independent of each other, and they retain the valuable information of the original variables.
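Here is a small numeric sketch of the definitions above (the toy values are our own), verifying the variance and covariance formulas against NumPy and checking the defining eigenpair property A v = λ v.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])
n = len(x)

# Variance: average squared deviation from the mean (divide by n).
var_x = np.sum((x - x.mean()) ** 2) / n
assert np.isclose(var_x, np.var(x))       # np.var also divides by n by default

# Covariance: do x and y move in the same direction?
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])

# An eigenpair of the covariance matrix satisfies A v = lambda * v.
A = np.cov(x, y, bias=True)
eigvals, eigvecs = np.linalg.eigh(A)
v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(A @ v, lam * v)
print("Var(x) =", var_x, " Cov(x, y) =", cov_xy, " eigenvalues:", eigvals)
```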
Steps For Calculating PCA
Follow the steps below to calculate PCA:
- Standardize the data
- Compute the covariance matrix for the data
variables
- Compute the eigenvectors and eigenvalues and order them in descending order of eigenvalue
- Then, calculate the Principal Components
- Perform dimensionality reduction on the data set
Let’s discuss each of the steps in detail:
Step 1: Standardize the data
Standardization is the first step in data analysis and processing. It rescales the data so that every variable has zero mean and unit variance, making the contribution of each variable unbiased.
Standardization (Z) is calculated as follows:

Z = (X − μ) / σ

where μ is the mean and σ is the standard deviation of the variable. Using this formula, all the variables are brought onto a common scale.
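A minimal sketch of this step (the height/weight values are hypothetical):

```python
import numpy as np

# Hypothetical raw data: height (cm) and weight (kg) for three people.
X = np.array([[170.0, 65.0],
              [160.0, 72.0],
              [180.0, 80.0]])

# Z = (X - mu) / sigma, applied column by column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~0 for every column
print(X_std.std(axis=0))    # 1 for every column
```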
Step 2: Compute the covariance matrix for the data variables
We calculate the covariance matrix to recognize the interdependencies between the variables, so that this redundancy can later be reduced to improve the performance of the model.
The following is the covariance matrix for two-dimensional data:

C = | Cov(x, x)  Cov(x, y) |
    | Cov(y, x)  Cov(y, y) |

Here, Cov(x, x) and Cov(y, y) are the covariances of x and y with themselves, which are simply Var(x) and Var(y).
Cov(x, y) is the covariance of ‘x’ w.r.t. ‘y’.
Cov(x, y) = Cov(y, x) (by the commutative property), so the matrix is symmetric.
Also, if Cov(x, y) is negative, then x and y move in opposite directions (x ∝ 1/y); and if Cov(x, y) is positive, they move in the same direction (x ∝ y).
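A minimal sketch of this step, reusing the hypothetical height/weight data from Step 1 (recomputed here so the snippet is self-contained):

```python
import numpy as np

X = np.array([[170.0, 65.0],
              [160.0, 72.0],
              [180.0, 80.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1 again

# rowvar=False: each column is a variable, each row an observation.
C = np.cov(X_std, rowvar=False)
print(C)                             # C[0,0] = Var(x), C[1,1] = Var(y)
print(np.isclose(C[0, 1], C[1, 0]))  # Cov(x, y) = Cov(y, x): symmetric
```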
Step 3: Compute the eigenvectors and eigenvalues
To determine the principal components, the eigenvectors and eigenvalues must be calculated from the covariance matrix. For each eigenvector there is a corresponding eigenvalue.
The number of eigenvectors to compute depends on the dimensions of the data: if the data is two-dimensional, we calculate two eigenvectors and their corresponding eigenvalues.
The eigenvectors of the covariance matrix point along the directions of largest variance in the dataset, and these directions define the Principal Components. The larger the variance, the greater the information content of the data points.
The eigenvalues give the magnitude of the variance along their eigenvectors. Together, they determine the Principal Components.
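As a small illustration (the 2x2 covariance matrix is a toy example), NumPy's eigendecomposition for symmetric matrices looks like this:

```python
import numpy as np

# Covariance matrix of two standardized, strongly correlated variables.
C = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# eigh is designed for symmetric matrices; eigenvalues return in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)    # [0.1, 1.9]: one weak direction, one strong direction
print(eigvecs)    # columns are unit-length, mutually orthogonal eigenvectors
```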
Step 4: Calculate the Principal Components
After finishing the calculation of eigenvectors and eigenvalues, we arrange them in descending order of eigenvalue. The first Principal Component is then the eigenvector with the largest eigenvalue. For the purpose of dimensionality reduction, we can eliminate the principal components with the smallest eigenvalues.
We then build the feature matrix, whose columns are the retained eigenvectors; it holds the directions that capture the most meaningful details of the data.
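A minimal sketch of this sorting-and-selection step, reusing the toy covariance matrix from Step 3 (the names order, W, and k are our own):

```python
import numpy as np

C = np.array([[1.0, 0.9],
              [0.9, 1.0]])
eigvals, eigvecs = np.linalg.eigh(C)

# Arrange eigenpairs in descending order of eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top-k eigenvectors: this is the feature matrix.
k = 1
W = eigvecs[:, :k]               # shape (2, 1): the 1st PC only
print(eigvals / eigvals.sum())   # share of variance each PC explains: [0.95, 0.05]
```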
Step 5: Dimensionality reduction of the data set
Finally, we re-express the original data in terms of the principal components. To replace the original dataset with the computed principal components, we multiply the transpose of the derived feature matrix by the transpose of the standardized data: FinalData = FeatureMatrixᵀ × StandardizedDataᵀ, which is equivalent to StandardizedData × FeatureMatrix. This is how we perform Principal Component Analysis.
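Putting the five steps together, here is a compact end-to-end sketch in Python (the synthetic dataset and all variable names are illustrative, not a definitive implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: 100 observations, 3 correlated variables.
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# Step 1: standardize.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix.
C = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues, sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top-k eigenvectors as the feature matrix W.
k = 2
W = eigvecs[:, :k]

# Step 5: project the standardized data onto the principal components.
# (W.T @ X_std.T).T is the same as X_std @ W.
X_reduced = X_std @ W   # shape (100, 2): dimensionality reduced from 3 to 2
print(X_reduced.shape, eigvals[:k] / eigvals.sum())
```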
Applications of Principal Component Analysis (PCA)
Principal Component Analysis (PCA) has broad applicability in the fields of Machine Learning and Data Science. It helps create efficient Machine Learning models because it reduces the complexity of the data through dimensionality reduction.
Some of the major application areas of Principal
Component Analysis are:
1. Face Recognition
2. Computer Vision
3. Image compression
4. Bioinformatics
5. Exploring high-dimensional data in banking and finance to reveal suspicious activities
This covers Principal Component Analysis (PCA) and the main areas where it is used.