What is Principal Component Analysis?
Principal Component Analysis (PCA) is a statistical technique for reducing the dimensionality of data while preserving as much of its information as possible. It summarizes the variances and covariances of the original variables through a small number of linear combinations of them, without losing an important part of the original information. In other words, it finds a new set of orthogonal axes along which the data has the largest variance. Its main aim is to overcome the curse of dimensionality: the reduction should be carried out so that dropping the higher dimensions causes minimal loss of information.
Also, interpreting the principal components can reveal associations among variables that are not visible at first glance. It helps analyze how the observations scatter and recognize the variables responsible for that distribution.
Now, let's use a graphical representation to understand PCA.
Consider the graph below. The points are scattered diagonally, showing the relationship between the X- and Y-axis components. Each point represents the two features (Attribute 1 and Attribute 2) of an observation in a specific dataset.
Now, to reduce the complexity and dimensions of the graph, we can apply PCA, which reduces the complexity of data through dimensionality reduction. We rotate the axes of the graph anticlockwise by an angle theta. After rotating the axes, the graph looks like this:
After the rotation, the points that were scattered over both the x- and y-axes are concentrated along the new x-axis alone. This shows that the important components of the data lie along the x-axis, so we can drop the attribute that holds the characteristics of the y-axis component with little loss of information. In this way, PCA performs dimensionality reduction and helps improve the performance of a Machine Learning model.
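To see this idea end to end, here is a minimal sketch using scikit-learn's PCA. The data is synthetic and made up purely for illustration; it is not the dataset from the graphs above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data scattered diagonally, like the graph described above
rng = np.random.default_rng(0)
attr1 = rng.normal(size=200)
attr2 = 0.8 * attr1 + rng.normal(scale=0.3, size=200)
data = np.column_stack([attr1, attr2])

# Reducing to one component keeps the direction of largest variance
pca = PCA(n_components=1)
reduced = pca.fit_transform(data)
print(reduced.shape)                  # (200, 1): one attribute dropped
print(pca.explained_variance_ratio_)  # close to 1, so little information is lost
```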
Properties of Principal Components
In purely technical terms, PCA forms precise combinations of the data variables, chosen so as to reduce the dimension of the data. To reduce the dimensionality, we find the principal components: new variables whose number is smaller than or equal to the number of original variables.
Properties of Principal Components are:
1. They are projections of the original data variables onto new directions, retaining the essential properties of the original variables.
2. They are commonly used in Machine Learning and Data Science for dimensionality reduction.
3. They are orthogonal.
4. If we extract the principal components one by one, their variance decreases with each one: the 1st PC has the highest variance and the last PC has the least variance.
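These properties are easy to verify numerically. Here is a small sketch with randomly generated, correlated data (so the exact numbers will vary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data with correlated features
rng = np.random.default_rng(1)
data = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

pca = PCA().fit(data)

# Property 3: the components are orthogonal, so this prints the identity matrix
print(np.round(pca.components_ @ pca.components_.T, 6))

# Property 4: the variance decreases from the first component to the last
print(pca.explained_variance_)
```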
How to Find the Principal Components?
Before moving on to the computation of the principal components, you should be familiar with the following concepts:
1. Variance: Variance measures how the data points spread out across the dimensionality graph. Mathematically, it is the average squared deviation from the mean value. To calculate Var(X), we use the following formula (a worked NumPy example follows this list):

Var(X) = Σᵢ (xᵢ − x̄)² / n
2. Covariance: Covariance estimates the degree to which corresponding components from a pair of grouped datasets move in the same direction. In simple words, it is used to identify the dependencies and relationships between the features of a dataset. Below is the formula for calculating Cov(x, y) (using the sample convention, with n − 1 in the denominator):

Cov(x, y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
where xᵢ and yᵢ are the values of x and y in the ith dimension, and x̄ and ȳ denote their means.
3. Eigenvectors and Eigenvalues: These make the transformations applied to data comprehensible. A transformation can be understood as expanding or contracting an X-Y graph without altering the eigenvectors' directions; for a matrix A, an eigenvector v and its eigenvalue λ satisfy A·v = λ·v. In PCA, an eigenvalue indicates the amount of variance in a particular direction.
4. Principal Components: The new set of variables derived from the original dataset is called the principal components. These new variables are highly meaningful, independent of one another, and retain most of the valuable information of the original variables.
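To make the first three of these prerequisites concrete, here is a minimal NumPy sketch; the numbers are made up purely for illustration:

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

# 1. Variance: the average squared deviation from the mean
var_x = np.mean((x - x.mean()) ** 2)
print(var_x, np.var(x))  # the manual value matches NumPy's built-in

# 2. Covariance: the co-deviation of x and y from their means (n - 1 denominator)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_xy, np.cov(x, y)[0, 1])  # positive, so x and y move together

# 3. Eigenvectors and eigenvalues: A @ v = lambda * v for each eigenpair
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # [3. 1.]
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True: the defining property holds
```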
Steps For Calculating PCA
Follow the below steps to calculate PCA:
- Standardize the data
- Compute the covariance matrix for the data variables
- Compute the eigenvectors and eigenvalues and order them in descending order
- Calculate the principal components
- Perform dimensionality reduction on the data set
Let’s discuss each of the steps in detail:
Step 1: Standardize the data
Standardization is the first step in data analysis and processing. It scales the data to a common range so that the contribution of each variable to the output is unbiased.
Standardization (Z) is calculated as follows:

Z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the variable. Using this formula, all the variables are brought to a common scale.
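A short sketch of this step on a tiny made-up dataset; scikit-learn's StandardScaler applies the same formula column by column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[170.0, 65.0],
                 [160.0, 58.0],
                 [180.0, 75.0]])

# Z = (x - mean) / standard deviation, applied to each column
z_manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(z_manual)

print(StandardScaler().fit_transform(data))  # same result
```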
Step 2: Compute the covariance matrix for the data variables
We calculate the covariance matrix to recognize the interdependencies between the variables; reducing these redundancies improves the performance of the model.
The following is the covariance matrix for two-dimensional data:

C = | Cov(x, x)  Cov(x, y) |
    | Cov(y, x)  Cov(y, y) |

Here, Cov(x, x) and Cov(y, y) are the covariances of x and y with themselves, i.e., Var(x) and Var(y). Cov(x, y) is the covariance of x with respect to y, and Cov(x, y) = Cov(y, x) (by the commutative property).
Also, if Cov(x, y) is negative, then x ∝ 1/y, meaning the variables move in opposite directions; if Cov(x, y) is positive, then x ∝ y and they move in the same direction.
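In NumPy, np.cov builds this matrix directly. A small sketch with made-up values:

```python
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

# The 2 x 2 covariance matrix described above
cov_matrix = np.cov(x, y)
print(cov_matrix)

# The diagonal holds Var(x) and Var(y); the off-diagonal entries are equal,
# since Cov(x, y) = Cov(y, x), and their positive sign means x and y move together
```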
Step 3: Compute the eigenvectors and eigenvalues
To determine the principal components, the eigenvectors and eigenvalues must be calculated from the covariance matrix; each eigenvector has a corresponding eigenvalue.
The number of eigenvectors to compute depends on the dimensions of the data: if the data is two-dimensional, we calculate two eigenvectors and their corresponding eigenvalues. The purpose of the eigenvectors is to locate the directions of largest variance in the dataset, from which the principal components are calculated. The larger the variance along a direction, the greater the information content of the data points in that direction.
The eigenvalues, in turn, give the magnitude of the variance along each eigenvector's direction. Together, the two are used to calculate the principal components.
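A sketch of this step on synthetic two-dimensional data; np.linalg.eigh suits a covariance matrix because it is symmetric:

```python
import numpy as np

# Synthetic, strongly correlated 2-D data
rng = np.random.default_rng(2)
attr1 = rng.normal(size=100)
data = np.column_stack([attr1, 0.9 * attr1 + rng.normal(scale=0.2, size=100)])

cov_matrix = np.cov(data, rowvar=False)  # 2 x 2, so two eigenpairs

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print(eigenvalues)   # the variance along each eigenvector's direction
print(eigenvectors)  # the columns are the eigenvectors
```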
Step 4: Calculate the principal components
After we have finished calculating the eigenvectors and eigenvalues, we arrange them in descending order of eigenvalue. The very first principal component is then the eigenvector with the largest eigenvalue. For the purpose of dimensionality reduction, we can eliminate the principal components of minor significance.
We then form the feature vector, a matrix whose columns are the eigenvectors we keep; it holds the directions that carry the most meaningful details of the data.
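Continuing the same synthetic example, here is a sketch of ordering the eigenpairs and keeping only the top one (the variable names are chosen for this illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
attr1 = rng.normal(size=100)
data = np.column_stack([attr1, 0.9 * attr1 + rng.normal(scale=0.2, size=100)])

cov_matrix = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # eigh returns ascending order

# Arrange the eigenpairs in descending order of eigenvalue (variance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# The feature vector keeps only the top-k eigenvectors (here k = 1)
feature_vector = eigenvectors[:, :1]
print(eigenvalues[0] / eigenvalues.sum())  # share of the total variance kept by PC1
```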
Step 5: Perform dimensionality reduction on the data set
Finally, we reorganize the original data along the principal components. To put the computed principal components in place of the original dataset, we take the transpose of the derived feature vector and multiply it by the transpose of the standardized data:

FinalData = FeatureVectorᵀ × StandardizedDataᵀ

This is how Principal Component Analysis is performed.
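Putting all five steps together, here is a minimal end-to-end sketch on the same kind of synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
attr1 = rng.normal(size=100)
data = np.column_stack([attr1, 0.9 * attr1 + rng.normal(scale=0.2, size=100)])

# Step 1: standardize the data
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Steps 2-4: covariance matrix, eigenpairs, feature vector of the top eigenvector
cov_matrix = np.cov(data_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order][:, :1]

# Step 5: project -- FinalData = FeatureVector^T x StandardizedData^T
final_data = (feature_vector.T @ data_std.T).T
print(final_data.shape)  # (100, 1): the data set reduced to one dimension
```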
Applications of Principal Component Analysis (PCA)
Principal Component Analysis (PCA) has broad applicability in the fields of Machine Learning and Data Science. It is used to build highly efficient Machine Learning models because it reduces the complexity of a system through dimensionality reduction.
Some of the major application areas of Principal
Component Analysis are:
1. Face recognition
2. Computer vision
3. Image compression
4. Bioinformatics
5. Exploring high-dimensional data in banking and finance to reveal suspicious activities
This is all about Principal Component Analysis (PCA) and the areas where it is used.