Introduction to K means Clustering in Python
K means clustering algorithm is a very common unsupervised learning algorithm. This algorithm clusters n objects into k clusters, where each object belongs to a cluster with the nearest mean.
Here’s the table of contents for this module:
Here is a video from Intellipaat on this topic:
Without much delay, let’s get started.
What Is Clustering Algorithms?
Clustering is nothing but dividing a set of data into groups of similar points or features, where data points in the same group are as similar as possible and data points in different groups are as dissimilar as possible. We use clustering in our day-to-day lives; for instance, in a supermarket, all vegetables are grouped as one group and all fruits are grouped as another group. This clustering helps customers fasten their shopping process.
Another clustering example we might have come across is the Amazon or Flipkart product recommendation. Amazon or Flipkart recommend us products based on our previous search. How do they do it? Well, the concept behind this is clustering.
Now that we know what clustering is, let us discuss the categories that we have in clustering.
Types of Clustering Algorithms
There are three main types of clustering techniques, they are as follow:
- Exclusive clustering: In exclusive clustering, data are grouped in an exclusive way so that a certain datum belongs to only one definite cluster.
- Overlapping clustering: In overlapping clustering, each point may belong to two or more clusters.
- Hierarchical clustering: In this technique, the first step is to assign all data points clusters of their own. The second step is to merge two nearer clusters into one cluster. The third step is to compute distances between the new cluster and each of the old clusters. Again, repeat the second and third steps until only one cluster is left.
Alright, now that we know the types of clustering, let us move ahead with the real topic of discussion, K means clustering.
Get 100% Hike!
Master Most in Demand Skills Now!
What Is K means Clustering Algorithm?
K means clustering is an algorithm, where the main goal is to group similar data points into a cluster. In K means clustering, k represents the total number of groups or clusters. K means clustering runs on Euclidean distance calculation. Now, let us understand K means clustering with the help of an example.
Say, we have a dataset consisting of height and weight information of 10 players. We need to group them into two clusters based on their height and weight.
Height | Weight |
180 | 80 |
172 | 73 |
178 | 69 |
189 | 82 |
164 | 70 |
186 | 71 |
180 | 69 |
170 | 76 |
166 | 71 |
180 | 72 |
Step 1: Initialize a cluster centroid
Initial Clusters | Height | Weight |
K1 | 185 | 70 |
K2 | 170 | 80 |
Step 2: Calculate the Euclidean distance from each observation to the initial clusters
Observation | Height | Weight | Distance from Cluster 1 | Distance from Cluster 2 | Assign Clusters |
1 | 180 | 80 | 11.18 | 10 | 2 |
2 | 172 | 73 | 13.3 | 7.28 | 2 |
3 | 178 | 69 | 7.07 | 13.6 | 1 |
4 | 189 | 82 | 12.64 | 19.10 | 1 |
5 | 164 | 70 | 21 | 11.66 | 2 |
6 | 186 | 71 | 1.41 | 18.35 | 1 |
7 | 180 | 69 | 5.09 | 14.86 | 1 |
8 | 170 | 76 | 16.15 | 4 | 2 |
9 | 166 | 71 | 19.02 | 9.84 | 2 |
10 | 180 | 72 | 5.38 | 12.80 | 1 |
Step 3: Find the new cluster centroid
Observation | Height | Weight | Assign Clusters |
1 | 180 | 80 | 2 |
2 | 172 | 73 | 2 |
3 | 178 | 69 | 1 |
4 | 189 | 82 | 1 |
5 | 164 | 70 | 2 |
6 | 186 | 71 | 1 |
7 | 180 | 69 | 1 |
8 | 170 | 76 | 2 |
9 | 166 | 71 | 2 |
10 | 180 | 72 | 1 |
New Cluster 1 | (178+189+186+180+180)/5 | (69+82+71+69+72)/5 | |
New Cluster 2 | (180+172+164+170+166)/5 | (80+73+70+76+76+71)/5 | |
New Cluster 1 = (182.6, 72.6)
New Cluster 1 = (170.4, 89.2)
Step 4: Again, calculate the Euclidean distance
Calculate the Euclidean distance from each observation to both Cluster 1 and Cluster 2
Repeat Steps 2, 3, and 4, until cluster centers don’t change any more
Now, let us look at the hands-on given below to have a deeper understanding of K-means algorithm.
Hands-on: K-means Clustering Algorithm Using Sklearn in Python- Iris Dataset
Dataset
We will be using the famous Iris Dataset, collected in the 1930s by Edgar Anderson. In this example, we are going to train a random forest classification algorithm to predict the class in the test data.
Let us see how we can implement K means clustering using Python, in this problem statement. To implement K-means we will be using sklearn library in Python. Let’s get started.
Step 1: Load the Iris Dataset
Step 2: Have a glance at the shape
Step 3: Have a glance at the features
Step 4: Have a glance at the targets
Step 5: Build the model
Step 6: Set up the number of clusters
Step 7: Fit the features into the model
Step 8: Predict and label the data
Step 9: Have a look at the cluster centers
Step 10: Evaluate the algorithm
From this cross-tabulation, we can conclude that all the 50 items in Category 0 are predicted correctly; 48 items in Category 1 are predicted correctly, but 2 items are predicted incorrectly to be of Category 2. 36 items in Category 2 are predicted correctly, but 14 items are predicted incorrectly.
What Did We Learn?
In this module, we started off by talking about the need of K-means algorithm, and then we discussed on what K-means algorithm is and how it works. At the end, we built one K-means algorithm model using Sklearn in Python. Hope you enjoyed this tutorial!