What Is the K-Means Clustering Algorithm in Python?

Clustering is a fundamental technique in unsupervised machine learning. It groups data points that are similar to each other into clusters. K-Means stands out among clustering algorithms because of its simplicity and efficiency.

In this blog, we discuss the basics of the K-Means clustering algorithm, how it works, and where it is applied. We also implement it in Python. So, let’s dive in!

Table of Contents

Understanding K-Means Clustering
Working of K-Means Clustering
Implementation of K-Means Clustering in Python
Elbow Method in K-Means Clustering
Applications of K-Means Clustering
Advantages and Limitations of K-Means Clustering
Conclusion

Understanding K-Means Clustering

K-Means Clustering is a technique that organizes data into separate groups based on similarity. It divides the data into K clusters, each represented by a central point called a centroid, which is the average of all the points in that cluster. The algorithm alternates between assigning points to the nearest centroid and recomputing the centroids, and it stops when the centroids no longer change. The main aim of K-Means Clustering is to place similar data points in the same cluster.

Working of K-Means Clustering

Suppose we are given a dataset of items, where each item is described by a set of feature values (that is, a vector). The task is to categorize those items into groups. To do this, we will use the K-Means clustering algorithm. The “K” in the name of the algorithm is the number of groups/clusters that we want to divide our items into.

This algorithm will categorize the items into k groups or clusters that contain similar items/data points. For calculating the similarity, we will use the Euclidean distance as a measurement. The working of the algorithm is given below:

First, we initialize k points randomly. These are called means or cluster centroids.
Next, we assign each item to its nearest mean and then move each mean to the average of the items currently assigned to its cluster.
We repeat this process for a given number of iterations, or until the centroids stop changing, and at the end, we have our clusters.
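The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in this blog: the function name kmeans_sketch, the choice of picking initial centroids from the data points, and the tiny X_demo dataset are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain K-Means: random initial centroids, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    # (an assumption; any random initialization would do).
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 2a: Euclidean distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 2b: move each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 3: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Two tight, well-separated groups of 2D points (hypothetical demo data).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(labels)
```

The first three points should end up with one label and the last three with the other. Which group is called 0 and which is called 1 depends on the random initialization.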

Implementation of K-Means Clustering in Python

For this implementation, we build K-Means from scratch with NumPy and use the Scikit-learn library only to generate a sample dataset. Below is a step-by-step implementation of K-Means Clustering in Python.

Step 1: Importing the necessary libraries

Here, we import NumPy for numerical computations, Matplotlib for plotting the graph, and make_blobs from sklearn.datasets for generating sample data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

Step 2: Create a custom dataset

In this step, we create a custom dataset with make_blobs and then plot it on a graph.

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()

The above code is used to generate a synthetic dataset that contains 500 samples, 2 features, and 3 clusters. It uses make_blobs and visualizes the data points in a 2D scatter plot with a grid.

Step 3: Initialize the Random Centroids

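k = 3  # number of clusters

clusters = {}
np.random.seed(23)  # fixed seed so the random centers are reproducible

for idx in range(k):
    # Random center with each coordinate drawn uniformly from [-2, 2)
    center = 2*(2*np.random.random((X.shape[1],))-1)
    cluster = {
        'center' : center,
        'points' : []
    }

    clusters[idx] = cluster

clusters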

The above code initializes 3 clusters for K-Means clustering. It sets a random seed and generates a random center for each cluster, with each coordinate drawn from the range [-2, 2). It also creates an empty list of points for each cluster.

Step 4: Plotting the randomly initialized centers with the data points

plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '*',c = 'red')
plt.show()

The resulting plot shows the data points (X[:,0], X[:,1]) on a grid, along with the initial cluster centers (red stars) generated for K-Means clustering.

Step 5: Defining Euclidean Distance

def distance(p1,p2):
    return np.sqrt(np.sum((p1-p2)**2))

Here, the function distance(p1, p2) is used to calculate the Euclidean distance between two points p1 and p2. It computes the square root of the sum of the squared differences between the coordinates of p1 and p2. The above code does not generate an output because the function is only defined, but not called.
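As a quick check of this function, the sketch below calls it on a 3-4-5 triangle and compares the result with NumPy's built-in np.linalg.norm, which computes the same Euclidean distance. The points p and q and the centers array are hypothetical values chosen for the example.

```python
import numpy as np

def distance(p1, p2):
    # Same formula as in Step 5: square root of the summed squared differences.
    return np.sqrt(np.sum((p1 - p2) ** 2))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

d_manual = distance(p, q)
d_norm = np.linalg.norm(p - q)  # NumPy's built-in Euclidean (L2) norm
print(d_manual, d_norm)         # both are 5.0 for this 3-4-5 triangle

# Distances from q to several candidate centers in one call.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
d_all = np.linalg.norm(centers - q, axis=1)
nearest = int(np.argmin(d_all))  # index of the closest center
print(d_all, nearest)
```

Computing all distances in one call, as in the last lines, is usually faster than a Python loop when there are many points or centers.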

Step 6: Creating the function to assign and update the cluster center

This step defines the two halves of each K-Means iteration. The E-step assigns each data point to the nearest cluster center, and the M-step updates each center to the mean of the points assigned to it.

def assign_clusters(X, clusters):
    # E-step: add each point to the cluster whose center is nearest
    for idx in range(X.shape[0]):
        dist = []

        curr_x = X[idx]

        for i in range(k):
            dis = distance(curr_x,clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    # M-step: move each center to the mean of its assigned points
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis =0)
            clusters[i]['center'] = new_center

            # Empty the point list so the next iteration starts fresh
            clusters[i]['points'] = []
    return clusters

Here, the function assign_clusters(X, clusters) assigns each data point in X to the nearest cluster center. It does this by computing the Euclidean distance to every center and appending the point to the point list of the closest cluster. The function update_clusters(X, clusters) then recalculates each cluster center as the mean of its assigned points and resets the cluster's point list for the next iteration. This code does not generate an output because both functions are only defined but not called with any input data.

Step 7: Creating the function to predict the cluster for the data points

def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i],clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred

Here, the function pred_cluster(X, clusters) predicts the cluster for each data point in X. It calculates the Euclidean distance from the point to every cluster center and assigns the point to the nearest one. The above code does not generate an output because the function is only defined and not called with actual input data.

Step 8: Assign, Update, and predict the cluster center

clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)

The above code runs one iteration of the algorithm. It assigns the data points in X to the nearest cluster (assign_clusters), updates the cluster centers (update_clusters), and predicts the cluster label for each data point (pred_cluster). To refine the clusters further, the assign and update calls can be repeated until the centers stop changing. This code does not display an output because the results are only stored in variables.

Step 9: Plotting the data points with their predicted cluster center

plt.scatter(X[:,0],X[:,1],c = pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.show()

Here, the plot shows the data points colored by their predicted clusters. The red markers represent the updated cluster centers after one round of E-M steps in the K-Means clustering algorithm.
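For comparison, the same kind of clustering can be obtained with scikit-learn's built-in KMeans class, which repeats the assign and update steps until convergence. The sketch below is one possible setup, not a replacement for the steps above: it regenerates the dataset with the same make_blobs arguments, and the n_init and random_state values are assumptions chosen for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic dataset as in Step 2.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

# n_init=10 runs the algorithm from 10 different initializations
# and keeps the run with the lowest within-cluster sum of squares.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=23)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid coordinates, shape (3, 2)
print(kmeans.n_iter_)           # iterations used by the best run
```

The labels array plays the same role as pred in Step 8, so it can be passed to plt.scatter through the c argument in the same way.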

Elbow Method in K-Means Clustering

The Elbow Method is a technique used to choose a suitable number of clusters (K) for K-Means Clustering. It looks for the right balance between the number of clusters and the within-cluster variance. Selecting the number of clusters matters because too few clusters may oversimplify the data, whereas too many clusters may lead to overfitting.

1. Understanding the Elbow Method

The main idea behind the Elbow Method is to calculate the WCSS (Within-Cluster Sum of Squares) for different values of K and then identify the point where the decrease in WCSS slows down sharply. This point is called the elbow point, and it indicates a good choice for the number of clusters.

WCSS (Within-Cluster Sum of Squares): It measures the compactness of the clusters as the sum of squared distances between each point and the centroid of its cluster. A lower WCSS means that the clusters are denser and more well-defined.
As the value of K increases, WCSS decreases because clusters become smaller and more compact. However, after a certain point, the decrease in WCSS becomes minimal. This forms an elbow-like shape in the graph.
The value of K at the elbow point is the optimal K. Beyond it, adding more clusters does not reduce the WCSS significantly.
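To make the WCSS definition concrete, the sketch below computes it by hand and compares it with the inertia_ attribute that scikit-learn's KMeans reports. The dataset parameters and the choice of 5 clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (hypothetical example values).
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to.
own_centers = kmeans.cluster_centers_[kmeans.labels_]

# WCSS: sum of squared distances between each point and its own centroid.
wcss_manual = np.sum((X - own_centers) ** 2)

print(wcss_manual, kmeans.inertia_)  # the two values should agree
```

Any small difference between the two numbers comes from floating-point rounding.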

2. Steps to Implement the Elbow Method

You have to perform K-Means clustering for different values of K (e.g., K = 1 to 10).
You then have to calculate the WCSS for each value of K.
Then you have to plot K vs. WCSS for visualizing the elbow point.
Lastly, select the value of K at the elbow point.

3. Python Code Implementation of the Elbow Method

The following code demonstrates how we can implement the Elbow Method.

Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

# List to store WCSS values
wcss = []

# Try different values of K (from 1 to 10)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # Inertia is the WCSS

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal K')
plt.show()

Output

The above code is used to implement the Elbow Method. This is done by generating synthetic data by using make_blobs, and then performing K-Means clustering for different values of K (1 to 10). After that, it computes the WCSS for each K, and then plots the Elbow Curve, which helps to determine the optimal number of clusters.

Applications of K-Means Clustering

Some of the applications of K-Means Clustering are given below:

Customer Segmentation: K-Means clustering is used for grouping customers based on their purchasing behavior for targeted marketing.
Image Compression: K-Means clustering is also used for reducing the number of colors in an image by clustering similar colors.
Anomaly Detection: It is also used for detecting unusual data points that lie far from every cluster centroid and therefore do not fit into any cluster.
Document Clustering: It is also used for organizing documents into topics based on the similarity of their content.
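As an illustration of the anomaly detection use case, the sketch below fits K-Means on normal data and flags new points whose distance to the nearest centroid is unusually large. The 99th-percentile threshold, the dataset parameters, and the test point at (50, 50) are assumptions made for the example, not recommended settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Fit K-Means on data that is assumed to contain only normal points.
X_train, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# transform() returns the distance from each point to every centroid;
# the row-wise minimum is the distance to the nearest centroid.
train_dist = kmeans.transform(X_train).min(axis=1)
threshold = np.percentile(train_dist, 99)  # hypothetical cut-off

# Score two new points: one sitting on a centroid, one far from all clusters.
new_points = np.vstack([kmeans.cluster_centers_[0], [50.0, 50.0]])
new_dist = kmeans.transform(new_points).min(axis=1)
is_anomaly = new_dist > threshold

print(is_anomaly)
```

The model is fitted on the normal data only. If an extreme outlier were included during fitting, K-Means could place a centroid on it, and its distance would then look small.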

Advantages and Limitations of K-Means Clustering

1. Advantages of K-Means Clustering

It is simple and easy to implement.
It provides efficient computation for large datasets.
It works well when the clusters are roughly spherical.

2. Limitations of K-Means Clustering

It requires prior knowledge of the number of clusters.
It is sensitive to the initial placement of the centroids, so different runs can produce different results.
It is not suitable for clusters with varying sizes or non-convex shapes.
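The last limitation can be seen on scikit-learn's two-moons dataset, where the true groups are non-convex. The sketch below is a small demonstration under assumed parameters (sample count, noise level, seeds). It uses the adjusted Rand index, where 1.0 means the predicted clusters match the true groups perfectly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a classic non-convex cluster shape.
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Compare the K-Means labels with the true moon membership.
ari = adjusted_rand_score(y_moons, labels)
print(round(ari, 2))
```

A score well below 1.0 here reflects that K-Means splits the data with a straight boundary between the two centroids, which cuts through both moons instead of following their curved shapes.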

Conclusion

From this blog, we can conclude that the K-Means Clustering algorithm is a powerful unsupervised learning algorithm. It is mainly used for partitioning datasets into meaningful clusters. Having a good understanding of its working mechanism, applications, and limitations, you can implement it effectively for various data analysis tasks. It is important to properly select the number of clusters and carefully initialize the centroids to achieve the optimal results.
