Clustering is a fundamental technique in unsupervised machine learning: it groups similar data points together into clusters. K-Means stands out among clustering algorithms because of its simplicity and efficiency.
In this blog, we will cover the basics of the K-Means clustering algorithm, analyze how it works, look at its applications, and implement it in Python. So, let's dive in!
Understanding K-Means Clustering
K-Means Clustering is a technique used to organize data into separate groups based on similarity. It divides the data into K clusters, each with a central point called a centroid, which represents the average of all the points in that group. The algorithm keeps adjusting the clusters by moving the centroids until they stop changing. The main aim of K-Means Clustering is to place similar data points in the same cluster.
Working of K-Means Clustering
Suppose we are given a dataset of items, each described by a set of feature values (similar to vectors), and the task is to categorize those items into groups. To do this, we will use the K-Means clustering algorithm. The "K" in the name represents the number of groups/clusters we want to classify our items into.
The algorithm categorizes the items into k clusters of similar items/data points, using the Euclidean distance as the similarity measure. The working of the algorithm is given below:
- First, initialize k points randomly; these are called means or cluster centroids.
- Next, assign each item to its nearest mean, then recompute each mean's coordinates as the average of the items currently falling into that cluster.
- Repeat this process for a given number of iterations; at the end, we have our clusters. (A compact sketch of one such iteration follows below.)
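Before walking through the full implementation, here is a compact, vectorized NumPy sketch of a single iteration of the loop above. It is illustrative only: it assumes a data array X of shape (n_samples, n_features) and a centers array of shape (k, n_features), and it does not handle the edge case of an empty cluster.

import numpy as np

def kmeans_step(X, centers):
    # Distance from every point to every center, shape (n_samples, k)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Assign each point to its nearest center
    labels = dists.argmin(axis=1)
    # Move each center to the mean of the points assigned to it
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
    return labels, new_centers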
Implementation of K-Means Clustering in Python
For the implementation, we will build K-Means step by step in Python so that each part of the algorithm is visible, using the Scikit-learn library only to generate a synthetic dataset here (and, later, for its built-in KMeans in the Elbow Method section). Below is a step-by-step implementation of K-Means Clustering in Python.
Step 1: Importing the necessary libraries
Here, we import NumPy for numerical computations, Matplotlib for plotting graphs, and make_blobs from sklearn.datasets for generating a sample dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Step 2: Create a custom dataset
In this step, we create a custom dataset with make_blobs and then plot it on a graph.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
The above code is used to generate a synthetic dataset that contains 500 samples, 2 features, and 3 clusters. It uses make_blobs and visualizes the data points in a 2D scatter plot with a grid.
Step 3: Initialize the Random Centroids
k = 3
clusters = {}
np.random.seed(23)

for idx in range(k):
    # Random center in the range [-2, 2) for each feature
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster

print(clusters)
The above code initializes k = 3 clusters for K-Means clustering. It sets a random seed for reproducibility, generates a random center for each cluster in the range [-2, 2) for each feature, and gives each cluster an empty list to hold its assigned points.
Step 4: Plotting the randomly initialized centers with the data points
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)

for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')

plt.show()
In the output below, the plot displays a scatter plot of the data points (X[:,0], X[:,1]) with grid lines, along with the initial cluster centers (red stars) generated for K-Means clustering.
Step 5: Defining Euclidean Distance
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
Here, the function distance(p1, p2) is used to calculate the Euclidean distance between two points p1 and p2. It computes the square root of the sum of the squared differences between the coordinates of p1 and p2. The above code does not generate an output because the function is only defined, but not called.
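As a quick sanity check, we can call the function on two sample points (this snippet is just for illustration):

p1 = np.array([0, 0])
p2 = np.array([3, 4])
print(distance(p1, p2))  # prints 5.0, since sqrt(3^2 + 4^2) = 5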
Step 6: Creating the functions to assign points and update the cluster centers
In this step, each data point is assigned to its nearest cluster center (the E-step), and each center is then updated to the mean of its assigned points (the M-step), mirroring the expectation-maximization view of K-Means clustering.
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters
Here, the function assign_clusters(X, clusters) assigns each data point in X to the nearest cluster center by computing the Euclidean distance and appending the point to that cluster's point list, while update_clusters(X, clusters) recalculates each cluster center as the mean of its assigned points and then resets the point list for the next iteration. This code does not generate an output because both functions are only defined, not yet called with any input data.
Step 7: Creating the function to predict the cluster for the data points
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Here, the function pred_cluster(X, clusters) predicts the cluster for each data point in X by calculating its Euclidean distance to every cluster center and assigning it to the nearest one. The above code does not generate an output because the function is only defined and not called with actual input data.
Step 8: Assign, update, and predict the cluster centers
# Repeat the assign and update steps so the centroids can converge
for _ in range(10):
    clusters = assign_clusters(X, clusters)
    clusters = update_clusters(X, clusters)

pred = pred_cluster(X, clusters)
The above code repeatedly assigns the data points in X to the nearest cluster (assign_clusters) and updates the cluster centers (update_clusters). The algorithm calls for several such rounds, so we run 10 iterations here rather than a single pass, and then predict the cluster label for each data point (pred_cluster). This code does not generate an output because there is no print statement to display the results.
Step 9: Plotting the data points with their predicted cluster center
plt.scatter(X[:, 0], X[:, 1], c=pred)

for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')

plt.show()
Here, the plot shows the data points colored by their predicted clusters. The red markers represent the updated cluster centers after the assign/update (E-M) steps of the K-Means clustering algorithm.
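For comparison, scikit-learn's built-in KMeans class produces the same kind of clustering in just a few lines. This is a minimal sketch using the standard sklearn.cluster.KMeans API on the same dataset:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=23)
labels = km.fit_predict(X)  # fit the model and get a cluster label per point

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker='^', c='red')
plt.show()

In practice, the library version is usually preferred because it uses smarter initialization (k-means++) and keeps the best of several restarts by default.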
Elbow Method in K-Means Clustering
The Elbow Method is an important technique used to determine the optimal number of clusters (k) in K-Means Clustering. It identifies the right balance between the number of clusters and the within-cluster variance. Selecting the number of clusters carefully matters: too few clusters may oversimplify the data, whereas too many may lead to overfitting.
1. Understanding the Elbow Method
The main idea behind the Elbow Method is to calculate the WCSS (Within-Cluster Sum of Squares) for different values of K and identify the point where the rate of decrease in WCSS slows down sharply. This point is called the elbow point, and it indicates the optimal number of clusters.
- WCSS (Within-Cluster Sum of Squares): A measure of the compactness of the clusters; a lower WCSS means the clusters are denser and better defined (see the formula after this list).
- As the value of K increases, WCSS decreases because the clusters become smaller and more compact. However, after a certain point the decrease in WCSS becomes minimal, which forms an elbow-like shape in the graph.
- The value of K at the elbow point is the optimal K; beyond it, adding more clusters does not reduce the WCSS significantly.
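Formally, for clusters C_1, ..., C_K with centroids mu_1, ..., mu_K, the WCSS (called inertia in scikit-learn) is the total squared distance of each point to its assigned centroid:

$$\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$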
2. Steps to Implement the Elbow Method
- Perform K-Means clustering for different values of K (e.g., K = 1 to 10).
- Calculate the WCSS for each value of K.
- Plot K vs. WCSS to visualize the elbow point.
- Select the value of K at the elbow point.
3. Python Code Implementation of the Elbow Method
The code below demonstrates how we can implement the Elbow Method.
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
# List to store WCSS values
wcss = []
# Try different values of K (from 1 to 10)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # Inertia is the WCSS
# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal K')
plt.show()
Output
The above code implements the Elbow Method: it generates synthetic data using make_blobs, performs K-Means clustering for each value of K from 1 to 10, computes the WCSS for each K, and plots the Elbow Curve, which helps determine the optimal number of clusters.
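The elbow is usually read off the plot by eye, but a rough programmatic heuristic, continuing from the code above, is to pick the K where the curve bends most sharply, i.e., where the second difference of the WCSS values is largest. This is a hedged sketch, not a built-in scikit-learn feature:

# The largest second difference marks the sharpest bend (the "elbow")
second_diff = np.diff(wcss, n=2)
elbow_k = int(np.argmax(second_diff)) + 2  # +2 offset: differencing twice shifts the index
print("Estimated elbow at K =", elbow_k)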
Applications of K-Means Clustering
Some of the applications of K-Means Clustering are given below:
- Customer Segmentation: K-Means clustering is used for grouping customers based on their purchasing behavior for targeted marketing.
- Image Compression: K-Means clustering is also used for reducing the number of colors in an image by clustering similar colors (a quick sketch is given after this list).
- Anomaly detection: It is also used for the detection of unusual data points that do not fit into any cluster.
- Document clustering: It is also used for organizing documents into topics based on the similarity of the content.
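As an illustration of the image-compression use case mentioned above, here is a minimal sketch of color quantization with K-Means. It assumes scikit-learn's bundled sample image (load_sample_image) is available; fitting on every pixel can be slow, so real code often fits on a random subsample:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image("china.jpg")        # shape (H, W, 3), uint8
pixels = image.reshape(-1, 3).astype(float)   # one row per pixel

# Cluster the pixel colors into 16 groups
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster's centroid color: 16 colors instead of thousands
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape).astype(np.uint8)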
Advantages and Limitations of K-Means Clustering
1. Advantages of K-Means Clustering
- It is simple and easy to implement.
- It provides efficient computation for large datasets.
- It works well when the clusters are roughly spherical in shape.
2. Limitations of K-Means Clustering
- It requires prior knowledge of the number of clusters.
- It is sensitive to the initial placement of the centroids, which can lead to different results across runs (a common mitigation is shown after this list).
- It is not suitable for clusters with varying sizes or non-convex shapes.
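The initialization issue in particular has a standard mitigation: scikit-learn's KMeans uses the k-means++ seeding strategy by default and can restart from several random initializations, keeping the best run. A minimal sketch:

from sklearn.cluster import KMeans

# k-means++ spreads the initial centroids apart; n_init=10 keeps the best of 10 restarts
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)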
Conclusion
From this blog, we can conclude that the K-Means Clustering algorithm is a powerful unsupervised learning algorithm, mainly used for partitioning datasets into meaningful clusters. With a good understanding of its working mechanism, applications, and limitations, you can apply it effectively to various data analysis tasks. Properly selecting the number of clusters and carefully initializing the centroids are key to achieving optimal results.
If you are interested in learning more about similar algorithms, we recommend checking out our Data Science Course today!