A Brief Introduction to Principal Component Analysis

What is Principal Component Analysis?

Principal Component Analysis (PCA) is a statistical technique for reducing the number of variables in a dataset while retaining most of its information. It summarizes the variances and covariances of the original variables through a small number of linear combinations of those variables. Put another way, it finds a new set of orthogonal axes, ordered so that the data has the largest variance along the first axis, the next largest along the second, and so on. Its main aim is to reduce the dimensionality of a problem in such a way that dropping the later dimensions loses as little information as possible.

Also, interpreting the principal components can reveal associations among variables that are not visible at first glance. It helps analyze how the observations are scattered and identify the variables responsible for that spread.

Now, we will use graphical representation to understand PCA.

Let’s consider the graph below. Here, the points are scattered along a diagonal, which shows that the values on the x-axis and the y-axis are related. Each point is an observation described by two attributes (Attribute 1 and Attribute 2) of a specific dataset.

PCA Graph

Now, to reduce the complexity and dimensions of the graph, we can apply PCA, which simplifies data through dimensionality reduction. To do so, we rotate the axes of the graph anti-clockwise by an angle theta. After rotating the axes, the graph looks like this:

Reduced graph of PCA

The points that were scattered across both the x and y-axes now lie mostly along the new x-axis. In other words, the new x-axis captures most of the variation in the data, while the new y-axis captures very little. So, we can drop the y-axis component and keep only the x-axis component. In this way, PCA reduces dimensionality, which can improve the performance of a Machine Learning model.
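The rotation described above can be tried numerically. The following is a minimal sketch, assuming NumPy is available and using made-up data in which Attribute 2 roughly follows Attribute 1; the variable names and the noise level are illustrative assumptions, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Attribute 2 roughly follows Attribute 1,
# so the points lie along a diagonal.
attr1 = rng.normal(size=500)
attr2 = attr1 + 0.2 * rng.normal(size=500)
points = np.column_stack([attr1, attr2])
points = points - points.mean(axis=0)          # center the cloud at the origin

# Angle of the direction of largest variance (closed form for 2-D data).
cov = np.cov(points, rowvar=False)
theta = 0.5 * np.arctan2(2 * cov[0, 1], cov[0, 0] - cov[1, 1])

# Rotating the axes anti-clockwise by theta gives new coordinates for each point.
rotation = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])
rotated = points @ rotation.T

var_before = points.var(axis=0)                # variance along the old x and y axes
var_after = rotated.var(axis=0)                # variance along the rotated axes
share_on_new_x = var_after[0] / var_after.sum()
print(var_before, var_after, share_on_new_x)
```

With data like this, almost all of the variance ends up on the new x-axis, which is why the new y-axis can be dropped with little loss.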

Properties of Principal Component

In technical terms, PCA reduces the dimension of data by finding its principal components. The principal components are new variables built as linear combinations of the original variables, and their number is less than or equal to the number of original variables.

Properties of Principal Components are:

1. Each one is a linear combination of the original variables, that is, the original data projected onto a new direction.

2. They are commonly used in Machine Learning and Data Science for dimensionality reduction.

3. They are orthogonal to one another.

4. If we find the PCs one by one, the variance captured decreases with each successive component. This means that the 1st PC has the highest variance and the last PC has the least variance.
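The orthogonality and decreasing-variance properties listed above can be checked on a small example. This is only a sketch, assuming NumPy and a hypothetical dataset of three correlated variables; it obtains the components from the eigenvectors of the covariance matrix, which is one common way to compute them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 200 observations of 3 correlated variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
data = rng.normal(size=(200, 3)) @ mixing

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
variances = eigenvalues[order]
components = eigenvectors[:, order]            # one principal direction per column

# Property 3: the directions are orthogonal (and unit length), so this is the identity.
gram = components.T @ components

# Property 4: the variance of the projected data falls from the first PC to the last.
scores = centered @ components
score_variances = scores.var(axis=0, ddof=1)
print(np.round(gram, 6))
print(score_variances)
```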

How to Find Principal Components?

Before moving on to the computation of Principal Components, you should have the following knowledge:

1. Variance: Variance measures how far the data points are spread out around their mean. Mathematically, it is the average squared deviation from the mean value. To calculate Var(X) we use the following formula:

Var(X) = (1/n) Σ (xi − x̄)²

2. Covariance: Covariance measures the degree to which corresponding values of two variables move in the same direction. In simple words, it is used to identify the dependencies and relationships between the features of a dataset. Below is the formula for calculating Cov(x, y):

Cov(x, y) = (1/n) Σ (xi − x̄)(yi − ȳ)

where xi and yi are the values of x and y for the ith observation, and x̄ and ȳ are the means of x and y.

3. Eigenvectors and Eigenvalues: These describe how a matrix transforms data. An eigenvector of a matrix is a direction that the matrix only stretches or shrinks, without rotating it, and the eigenvalue is the factor by which it is stretched or shrunk. For a covariance matrix, each eigenvalue gives the variance of the data along the direction of its eigenvector.

4. Principal Components: The new variables derived from the original data set are called Principal Components. They are uncorrelated with one another, and together they retain all of the variance of the original variables, with most of it concentrated in the first few components.
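The first three concepts above can be computed by hand on a tiny example. The sketch below assumes NumPy and four made-up paired measurements; dividing by n (rather than n - 1) follows the "average" wording of the variance definition and is an assumption.

```python
import numpy as np

# Hypothetical paired measurements.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Variance: average squared deviation from the mean (dividing by n is an assumption).
var_x = np.mean((x - x.mean()) ** 2)

# Covariance: average product of the paired deviations from the two means.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# 2x2 covariance matrix; bias=True divides by n, matching the two lines above.
cov_matrix = np.cov(x, y, bias=True)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# eigh sorts eigenvalues in ascending order, so the last pair is the largest.
v = eigenvectors[:, -1]
lam = eigenvalues[-1]

# Defining property: multiplying by the matrix only rescales the eigenvector.
print(var_x, cov_xy)                 # 5.0 3.5
print(np.allclose(cov_matrix @ v, lam * v))
```

Note that NumPy's `np.cov` divides by n - 1 unless `bias=True` is passed, so its default output differs slightly from the hand calculation.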

Steps For Calculating PCA

Follow the below steps to calculate PCA:

  1. Standardize the data
  2. Compute the covariance matrix for the data variables
  3. Compute the eigenvectors and eigenvalues and arrange them in descending order of eigenvalue
  4. Then, calculate the Principal Components
  5. Perform ‘dimensionality reduction’ of the data set

Let’s discuss each of the steps in detail:

Step 1: Standardize the data

Standardization is the first step in data analysis and processing. It rescales each variable to have a mean of zero and a standard deviation of one, so that variables measured on large scales do not dominate the result.

Standardization (Z) is calculated as follows:

Z = (x − μ) / σ

where μ is the mean of the variable and σ is its standard deviation.

Using this formula, all the variables are brought onto a common scale.
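As a rough illustration of this step, the sketch below standardizes a small made-up table by subtracting each column's mean and dividing by its standard deviation. NumPy, the sample values, and the use of the population standard deviation (ddof=0) are all assumptions for the example.

```python
import numpy as np

# Hypothetical data: rows are observations, columns are variables
# measured on very different scales.
data = np.array([[170.0, 65000.0],
                 [160.0, 48000.0],
                 [180.0, 72000.0],
                 [175.0, 51000.0]])

mean = data.mean(axis=0)
std = data.std(axis=0)               # population standard deviation (ddof=0), an assumption

# Subtract the column mean and divide by the column standard deviation.
standardized = (data - mean) / std

# Each column now has mean 0 and standard deviation 1.
print(standardized.mean(axis=0), standardized.std(axis=0))
```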

Step 2: Compute the covariance matrix for the data variables

We calculate the covariance matrix to identify the interdependencies between the variables. Strongly dependent variables carry redundant information, and removing that redundancy is what improves the performance of the model.

The following is the formula to calculate the covariance of two-dimensional data:

C = | Cov(x, x)  Cov(x, y) |
    | Cov(y, x)  Cov(y, y) |

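Here, Cov(x, x) and Cov(y, y) are the covariances of x and y with themselves, which are simply the variances of x and y.

Cov(x, y) is the covariance of ‘x’ w.r.t. ‘y’.

Cov(x, y) = Cov(y, x) {covariance is symmetric}

Also,

If Cov(x, y) is -ve, then x tends to decrease as y increases.

And if Cov(x, y) is +ve, then x tends to increase as y increases.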
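The symmetry and sign behaviour described above can be seen in a small experiment. This is a sketch only, assuming NumPy and three made-up variables (one that rises with x and one that falls as x rises), so the matrix here is 3x3 rather than the 2x2 case in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical variables: y rises with x, w falls as x rises.
x = rng.normal(size=300)
y = 0.8 * x + 0.3 * rng.normal(size=300)
w = -0.8 * x + 0.3 * rng.normal(size=300)
data = np.column_stack([x, y, w])
data = (data - data.mean(axis=0)) / data.std(axis=0)   # standardize as in Step 1

# rowvar=False tells NumPy that each column is a variable.
cov_matrix = np.cov(data, rowvar=False)
print(np.round(cov_matrix, 3))
```

The entry for (x, y) comes out positive, the entry for (x, w) comes out negative, and the matrix equals its own transpose.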

Step 3: Computing the eigenvectors and eigenvalues 

To determine the principal components, the eigenvectors and eigenvalues must be calculated from the covariance matrix. Each eigenvector has a corresponding eigenvalue.

Also, the number of eigenvectors depends on the dimensions of the data.

If the data is two-dimensional, then we would have to calculate two eigenvectors and their corresponding eigenvalues. The eigenvectors give the directions along which the data varies the most, and these directions define the Principal Components. The larger the variance along a direction, the more information that direction carries about the data points.

Now, each eigenvalue is a scalar that measures the amount of variance along the direction of its eigenvector. Both help to calculate the Principal Components.
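A minimal sketch of this step follows, assuming NumPy and a made-up covariance matrix for two standardized variables. It uses `np.linalg.eigh`, which is intended for symmetric matrices, and then reorders the results so the largest eigenvalue comes first.

```python
import numpy as np

# Hypothetical covariance matrix of two standardized variables.
cov_matrix = np.array([[1.0, 0.8],
                       [0.8, 1.0]])

# eigh returns eigenvalues in ascending order and unit-length eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)                   # approximately [1.8 0.2]
print(eigenvectors)                  # columns are the directions; their signs are arbitrary
```

The eigenvectors all have length one, while the eigenvalues differ: the variance information lives in the eigenvalues, not in the length of the vectors.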

Step 4: Calculating the Principal Components 

After we have finished with the calculation of eigenvectors and eigenvalues, we will arrange them in descending order of eigenvalue. Then, the very first Principal Component would be the eigenvector with the largest eigenvalue. For the purpose of dimensionality reduction, we can eliminate the principal components of minor significance, that is, those with the smallest eigenvalues.

We then form the feature matrix (also called the feature vector), whose columns are the eigenvectors we decided to keep. It holds the directions that carry most of the information in the data.
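One common way to decide which components are of minor significance is to look at the share of total variance each eigenvalue accounts for. The sketch below assumes NumPy, made-up eigenvalues that are already sorted, placeholder eigenvectors, and an arbitrary 95% threshold; none of these values come from the text.

```python
import numpy as np

# Hypothetical eigenvalues and eigenvectors, already sorted in descending order.
eigenvalues = np.array([2.5, 0.4, 0.1])
eigenvectors = np.eye(3)             # placeholder directions, one per column

# Share of the total variance explained by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Keep the fewest components that reach the threshold (0.95 is an arbitrary choice).
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

# The feature matrix holds the retained eigenvectors as columns.
feature_matrix = eigenvectors[:, :k]
print(np.round(explained_ratio, 3), k, feature_matrix.shape)
```

Here the first two components together explain about 97% of the variance, so k is 2 and the third component is dropped.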

Step 5: Dimensionality reduction of the Data Set

Finally, re-express the original data in terms of the principal components. To put the computed principal components in place of the original dataset, we multiply the transpose of the feature vector by the transpose of the standardized data. This is how we do the Principal Component Analysis.
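The projection in this last step can be sketched as follows, assuming NumPy, a made-up dataset of three variables, and a decision to keep two components. The example shows the transposed form alongside the equivalent row-oriented form, which is often more convenient when each row is an observation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 100 observations (rows) of 3 correlated variables (columns).
raw = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.9, 0.1],
                                            [0.0, 0.4, 0.2],
                                            [0.0, 0.0, 0.1]])
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Eigen-decomposition of the covariance matrix, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(standardized, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
feature_matrix = eigenvectors[:, order][:, :2]    # keep the top 2 components

# Transposed form: feature matrix transposed times data transposed, shape (2, 100).
reduced_t = feature_matrix.T @ standardized.T

# Equivalent row-oriented form, one observation per row, shape (100, 2).
reduced = standardized @ feature_matrix
print(reduced.shape)
```

The two forms hold the same numbers, and the two retained columns are uncorrelated with each other.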

Applications of Principal Component Analysis (PCA)

Principal Component Analysis (PCA) has broad applicability in the field of Machine Learning and Data Science. It is used to create highly efficient Machine Learning models because it minimizes the complexity of the system by dimensionality reduction. 

Some of the major application areas of Principal Component Analysis are:

1. Face Recognition 

2. Computer Vision 

3. Image Compression

4. Bioinformatics

5. Exploring high-dimensional data in the field of banking and finance to reveal suspicious activities.

This is all about Principal Component Analysis (PCA) and the areas where it is used.

About the Author

Senior Associate - Digital Marketing

Shailesh is a Senior Editor in Digital Marketing with a passion for storytelling. His expertise lies in crafting compelling brand stories; he blends his expertise in marketing with a love for words to captivate audiences worldwide. His projects focus on innovative digital marketing ideas with strategic thought and accuracy.