This technique is frequently used to find patterns and correlations in large data sets in a variety of disciplines, including biology, social sciences, and computer science.
Let’s dive deeper to understand hierarchical clustering through the following sub-topics:
Introduction to Hierarchical Clustering
Hierarchical clustering is one of the oldest and most traditional methods for grouping related data objects in data science. Because the method is unsupervised, it is well suited to exploratory data analysis, requiring no prior knowledge of labels or of the structure of the data.
It begins by treating each data point as an independent cluster and then merges or splits clusters according to a similarity or distance metric. The procedure repeats until a stopping condition is met, usually when the desired number of clusters or a chosen similarity threshold is reached.
A key advantage of hierarchical clustering is the dendrogram, a tree-like structure that shows the hierarchical links between clusters. Users can inspect the dendrogram to visualize the clustering result and decide how many clusters to use in further analysis.
Consider a simple dendrogram for three data points A, B, and C. The points appear at the bottom, while the branches above them represent the clusters each belongs to; the vertical axis shows the distance (or dissimilarity) at which clusters merge. If A and B are the closest pair, they are merged first into a new cluster, shown by the branch connecting them. That cluster is then merged with C to form the final cluster, represented by the top branch of the dendrogram.
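Since the figure may not be visible here, the following minimal sketch shows how such a three-point dendrogram could be produced with SciPy; the coordinates chosen for A, B, and C are illustrative assumptions, picked so that A and B are the closest pair.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Hypothetical coordinates: A and B lie close together, C is farther away
points = np.array([[1.0, 1.0],   # A
                   [1.5, 1.2],   # B
                   [5.0, 4.0]])  # C

# Single linkage merges A and B first, then joins C at a larger distance
Z = linkage(points, method='single')
dendrogram(Z, labels=['A', 'B', 'C'])
plt.ylabel('Distance')
plt.show()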
Why do we need Hierarchical Clustering?
Hierarchical clustering is popular because it supports exploratory data analysis without requiring any prior information or labeling of the data. This makes it especially valuable for large and complex datasets, where it lets researchers discover patterns and relationships without imposing preconceptions.
It also produces a dendrogram as a visual summary of the clustering results. Users can examine the dendrogram to analyze the hierarchical structure of the data and decide how many clusters to use in further analysis. This is particularly helpful when the right number of clusters is not obvious in advance, which is one of the main reasons hierarchical clustering remains so widely used.
How Does Hierarchical Clustering Work?
Hierarchical clustering is an unsupervised machine learning approach that groups similar items based on their proximity or resemblance. It works by repeatedly merging or splitting clusters until a stopping condition is satisfied.
First, the algorithm treats each data point as its own cluster. At each iteration it then merges the two closest clusters into one, until a single cluster contains all of the data points. This procedure yields a dendrogram, a tree-like diagram showing the hierarchical connections between the clusters.
In hierarchical clustering, the choice of distance or similarity metric is crucial. Euclidean distance, Manhattan distance, and cosine similarity are three common choices. Which metric to use depends on the type of data and the research question being addressed.
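As a quick illustration of these metrics, the following sketch computes each distance between two arbitrary sample vectors using SciPy (the vectors themselves are made-up values):
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

# Two arbitrary sample vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(euclidean(a, b))   # Euclidean (straight-line) distance
print(cityblock(a, b))   # Manhattan (city-block) distance
print(cosine(a, b))      # cosine distance = 1 - cosine similarity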
Code for creating a Dendrogram
1. Install and Import the Required Libraries
!pip install numpy scipy matplotlib
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
2. Create a Sample Dataset
# Create a sample dataset of 10 two-dimensional points
X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 30], [85, 70], [71, 80], [60, 78], [70, 55], [80, 91]])
3. Perform Hierarchical Clustering
# Perform hierarchical clustering with Ward linkage
Z = linkage(X, 'ward')
4. Plot the Dendrogram
# Plot the dendrogram
fig = plt.figure(figsize=(10, 5))
dn = dendrogram(Z)
plt.show()
To demonstrate the process of hierarchical clustering, we generated a dataset X of 10 two-dimensional data points. We then performed hierarchical clustering by calling SciPy’s linkage function with the “ward” method, which at each step merges the pair of clusters that minimizes the increase in within-cluster variance.
After that, the dendrogram function plots the hierarchical clustering result, where the height of each merge represents the distance between the merged clusters. The dendrogram plot provides an informative visualization of the clustering result.
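If flat cluster labels are also needed, SciPy’s fcluster function can cut the same linkage matrix at a requested number of clusters; asking for two clusters below is simply an illustrative assumption.
from scipy.cluster.hierarchy import fcluster

# Cut the hierarchy so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # cluster label (1 or 2) for each of the 10 points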
Types of Hierarchical Clustering
Agglomerative and divisive clustering are the two basic forms of hierarchical clustering. Let’s discuss each of them in detail:
Agglomerative clustering
Agglomerative clustering is the most common form of hierarchical clustering. It starts with each data point in its own cluster and iteratively unites clusters into larger ones based on how similar or close they are to one another.
At each step, the two closest clusters are combined to form a new cluster, and this continues until all data points belong to a single cluster. The outcome is a dendrogram that shows the clusters’ hierarchical connections.
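As a concrete sketch, the agglomerative approach can also be run through scikit-learn’s AgglomerativeClustering on the same small dataset used earlier; the choice of two clusters and Ward linkage are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 30],
              [85, 70], [71, 80], [60, 78], [70, 55], [80, 91]])

# Merge clusters bottom-up with Ward linkage until 2 clusters remain
model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)
print(labels)  # 0/1 cluster assignment for each point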
Divisive clustering
Divisive clustering works top-down: it begins with one big cluster containing all the data points and repeatedly splits it into smaller clusters based on how dissimilar the points are. This approach is used less often than the agglomerative method because it tends to be more computationally expensive and can produce unstable results.
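SciPy does not provide a divisive algorithm out of the box, so the following is only a rough sketch of the top-down idea, approximating each split with a two-way k-means; the split heuristic and stopping rule are assumptions made for illustration.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, max_clusters=3):
    # Start with one big cluster holding every point
    clusters = [np.arange(len(X))]
    while len(clusters) < max_clusters:
        # Heuristic: split the largest remaining cluster next
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # Approximate the split with two-way k-means
        halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
        clusters.append(members[halves == 0])
        clusters.append(members[halves == 1])
    return clusters

X = np.random.rand(20, 2)  # arbitrary sample data
for i, members in enumerate(divisive_clustering(X)):
    print(f"Cluster {i}: points {members.tolist()}")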
Advantages of Hierarchical Clustering
Hierarchical clustering provides the following benefits:
- Creates a dendrogram: A dendrogram is a visual representation of the results of hierarchical clustering. The dendrogram demonstrates the hierarchical links between the clusters, enabling researchers to decide on the ideal number of clusters and proceed with additional analysis with confidence.
- Flexibility: Hierarchical clustering can be used with many kinds of data, including categorical, binary, and continuous data.
- No need to specify the number of clusters: Unlike many other clustering algorithms, hierarchical clustering does not require the number of clusters in advance. Researchers can instead cut the dendrogram at a chosen distance threshold to obtain an appropriate number of clusters, as shown in the sketch after this list.
- Robust against noise: With a suitable linkage criterion, hierarchical clustering can tolerate noise and outliers in the data reasonably well, still recognizing and grouping related data points.
- Results can be understood easily: The dendrogram produced by hierarchical clustering is simple to read and offers insight into the underlying structure of the data. Researchers can additionally label the clusters and give the findings a meaningful interpretation.
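Following up on the point about not fixing the number of clusters in advance, here is a hedged sketch of cutting the dendrogram from the earlier example at a distance threshold; the threshold value of 40 is an arbitrary assumption for that 10-point dataset.
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram wherever a merge would exceed the chosen distance;
# the number of clusters falls out of the cut instead of being preset
labels = fcluster(Z, t=40, criterion='distance')
print(labels)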
Use Cases of Hierarchical Clustering
Hierarchical clustering is a flexible, popular method with many practical applications. A few of them are listed below:
- Market segmentation: Hierarchical clustering can divide customers or products into groups based on their similarities. This helps companies identify distinct customer segments and tailor their marketing strategies accordingly.
- Image segmentation: Hierarchical clustering can divide images into regions based on the similarity of their pixels, which is useful in image processing and computer vision applications.
- Gene expression analysis: Hierarchical clustering can analyze gene expression data and identify patterns of gene expression across samples. This can aid in understanding the biology of diseases and the development of novel therapies.
- Social network analysis: Hierarchical clustering can be applied to social network data to find groups of people who share common interests or activities, which is useful for targeted marketing.
Conclusion
Hierarchical clustering is an effective data analysis tool that enables companies to uncover subtle patterns, relationships, and insights in their data. By grouping similar points, organizations gain a clearer view of their operations and can make better-informed decisions that improve business results.