K-Nearest Neighbour ( KNN ) Algorithm in Machine Learning.

In this blog, we will learn about the KNN algorithm, why we need it, and the types of distance metrics it uses. Along with these topics, we will also cover the implementation and explain why the KNN algorithm is called a lazy algorithm.

What is KNN Algorithm in Machine Learning?

The K-Nearest Neighbors (KNN) algorithm is a versatile supervised learning approach used for classification and regression tasks. In KNN, data points are classified based on the majority class of their nearest neighbors. The “k” represents the number of nearest neighbors considered when making a prediction. KNN’s simplicity and effectiveness make it valuable in various applications, although it can be sensitive to outliers and requires careful selection of k for optimal performance.

Why do we need KNN Algorithm?

KNN is intuitive and easy to implement, which makes it a great tool for beginners and professionals alike. Overall, KNN is a flexible yet effective way to make predictions based on the similarity between data points.

Suppose you have two classes, say A and B, and you encounter a new observation, denoted by x. The task is to determine which class x falls into, and that is where the K-Nearest Neighbors algorithm comes into play. KNN is a robust classification tool for exactly this kind of problem: it identifies the category or class a new data point is likely to belong to based on its proximity to its neighbors. The idea behind KNN is intuitive: an unknown data point is categorized or labeled according to how close it is to known data points, through the collective influence of its closest neighbors.

Some of the other reasons why the KNN algorithm is essential are given below:

  • Firstly, KNN is a simple, versatile algorithm that can be applied to both classification and regression tasks. It is particularly useful when the underlying data distribution is not well-defined or linear. 
  • Secondly, KNN doesn’t require assumptions about the underlying data, making it applicable in various scenarios. 
  • Thirdly, it excels in cases where the decision boundaries are complex and non-linear. 

Types of Distance Metrics Used in KNN Algorithm

To identify which data points should be considered alike, KNN measures how far apart different data points are. We use this measurement to group similar points together, much like judging how close or distant houses are within a neighborhood: the number of blocks between two houses is their distance. The distance matters because it tells us which group a new point belongs to. If a new house is close to many houses in one group, it joins that group. Measuring distance in KNN is like finding out who your closest neighbors are; it makes it easy to group similar things together.

Euclidean distance measures straight-line distance, and Manhattan distance measures the sum of absolute differences along each dimension. Minkowski distance is a generalized form with a parameter (p) that allows you to adjust the sensitivity to different dimensions. The choice between these metrics depends on the nature of the data and the problem being solved. Let’s step through explanations of the three distance metrics: Euclidean Distance, Manhattan Distance, and Minkowski Distance.

1. Euclidean Distance

Formula:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Explanation: Euclidean distance is the straight-line distance between two points in Euclidean space. For simplicity, in a two-dimensional space, if you have two points (x1, y1) and (x2, y2), the Euclidean distance d between them is given by the Pythagorean theorem: d = sqrt((x2 - x1)^2 + (y2 - y1)^2). This distance metric is sensitive both to the magnitude of the vectors and to their direction.
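As a quick check of the formula, here is a minimal sketch in plain Python (the sample points are made up purely for illustration):

import math

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0, since sqrt(3^2 + 4^2) = 5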

2. Manhattan Distance 

Formula:

d = |x2 - x1| + |y2 - y1|

Explanation: Manhattan distance, also called L1 distance or taxicab distance, measures the distance between two points (x1, y1) and (x2, y2) as the sum of the absolute differences of their coordinates. Imagine navigating a city grid: the distance to a particular point is the combined length of the horizontal and vertical routes along the grid lines. Likewise, the Manhattan distance is the sum of the absolute differences over all dimensions.
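A minimal sketch of the same idea in plain Python (the sample points are illustrative only):

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences (city-block distance)
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan_distance((1, 2), (4, 6)))  # 7, i.e. |4 - 1| + |6 - 2|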

3. Minkowski Distance

Formula:

d = (|x2 - x1|^p + |y2 - y1|^p)^(1/p)

Explanation: Minkowski distance is a generalization of both Euclidean and Manhattan distances. The parameter p allows you to adjust the formula. So, when p = 2, it is just the Euclidean distance; when p = 1, it turns out to be the Manhattan distance. For other values of p, it simply represents a more general form. Minkowski distance is used when you want to control the sensitivity to different dimensions based on problem requirements.
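The generalization is easy to see in code; this is a small sketch in plain Python, with the sample points chosen only for illustration:

def minkowski_distance(p, q, power=2):
    # power = 1 gives Manhattan distance, power = 2 gives Euclidean distance
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1 / power)

print(minkowski_distance((1, 2), (4, 6), power=1))  # 7.0 (Manhattan)
print(minkowski_distance((1, 2), (4, 6), power=2))  # 5.0 (Euclidean)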

How to Choose “k” for KNN

k is perhaps the most important parameter of the k-Nearest Neighbors (k-NN) algorithm, and it largely determines how a new observation is classified. The choice of k controls the trade-off between an overfitting model and an underfitting model.

A smaller k, such as 1 or 3, tends to increase the sensitivity of the algorithm to noise and outliers. This may lead to overfitting. A higher k, such as 10 or 20, smoothes out the decision boundaries, which may eventually result in underfitting.

To determine the optimal k:

1. Consider Dataset Characteristics

Evaluate the nature of the dataset. Noisy or small datasets might call for a smaller k, while larger datasets can make use of a larger k.

2. Odd Values for Binary Classification

For binary classification tasks, an odd value of k is preferred so that votes for the majority class cannot end in a tie.

3. Cross-Validation

Use cross-validation approaches, for example k-fold cross-validation, to analyze the model’s performance for different values of k. This helps determine the k that gives a good balance between bias (underfitting) and variance (overfitting).

4. Grid Search

Define a grid search over a list of candidate k values, analyzing the effect of each k on overall model performance. Select the value of k that yields the maximum accuracy, or the best score on your chosen evaluation metric; a short sketch of this follows the list.

5. Visualization

Plot the decision boundaries for varying values of k to understand their implications on the ability of a model to grasp underlying data relationships.
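Here is a minimal sketch of steps 3 and 4 using scikit-learn’s GridSearchCV; the Iris dataset and the list of candidate k values are assumptions made only for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a few odd values of k and keep the one with the best 5-fold CV accuracy
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)  # the k with the highest mean cross-validated accuracy
print(search.best_score_)   # that mean accuracy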

Working of KNN Algorithm in Machine Learning

The K-Nearest Neighbors algorithm classifies or forecasts a new data point using the majority class (for classification) or the average target value (for regression) of its k nearest neighbors. The procedure is elaborated step by step below, followed by a condensed code sketch.

Step 1: Choose the Value of k

Decide the number of neighbors (k) to consider. This is a crucial parameter that can impact the algorithm’s performance.

Step 2: Calculate Distances

Compute the distance between the new data point and every point in the training dataset. Typically, Euclidean distance, Manhattan distance, or another metric is used, depending on the type of problem.

Step 3: Identify Neighbors

Select the k data points from the training set that are closest to the new point, based on the distances computed in Step 2.

Step 4: Majority Voting (Classification) or Weighted Averaging (Regression)

For classification tasks, the majority class among k nearest neighbors is determined and assigned to the new data point. 

For regression tasks, the mean of the target values of the k nearest neighbors is calculated and becomes the predicted value for the new data point.

Step 5: Make Prediction

Assign the new data point to the class, or give it the value, predicted by the majority voting or averaging in Step 4.

Step 6: Output

In the end, the KNN algorithm provides the final classification or prediction for the new sample.

Step 7: Evaluate Performance

If the data is labeled, evaluate the algorithm using metrics such as accuracy, precision, recall, or F1 score.
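The steps above can be condensed into a short from-scratch sketch; the sample points and the choice of k = 3 are assumptions made purely for illustration:

from collections import Counter
import math

def knn_predict(X_train, y_train, new_point, k=3):
    # Step 2: distance from the new point to every training point (Euclidean)
    distances = [math.dist(new_point, x) for x in X_train]
    # Step 3: indices of the k nearest neighbors
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Steps 4 and 5: majority vote among the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(X_train, y_train, [2, 2]))  # 'A' -- all three nearest points are class A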

Why is KNN a Lazy Algorithm?

The KNN algorithm is termed a “lazy” algorithm because it generalizes nothing during the training process. In a lazy algorithm, the model is not actually trained on the dataset. It instead memorizes all of the data. All that is done is postponing the processing of training data until a new, unseen data point needs to be classified or predicted.

For KNN, during the training phase, it simply loads the training data into memory. To make a prediction for a new data point, the algorithm calculates the distances between that point and all points in the training set. It then identifies the k nearest neighbors based on those distances and predicts according to the most common class of those neighbors (classification) or the mean of their target values (regression).

The term “lazy” is used to highlight that the algorithm doesn’t actively learn a model during the training phase; it defers the learning until the prediction phase, when the specific instance needs to be classified. This characteristic makes KNN simple and flexible but can also lead to higher computational costs during prediction, especially with large datasets.

Implementation of KNN Algorithm in Machine Learning

Refer to the code below to understand the implementation of KNN algorithm in machine learning:

Step 1 – Import the Libraries

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Step 2 – Data Loading

# X represents features, and y represents labels
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = ['A', 'A', 'A', 'B', 'B', 'B']

Step 3 – Splitting the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4 – Create a KNN Classification Model

knn_classifier = KNeighborsClassifier(n_neighbors=3)

Step 5 – Fit the data and Make Prediction

knn_classifier.fit(X_train, y_train)
y_pred = knn_classifier.predict(X_test)

Step 6 – Evaluate the Model

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Step 7 – Testing the Model

new_data_point = [[4, 4]]
predicted_class = knn_classifier.predict(new_data_point)
print(f'Predicted class for {new_data_point}: {predicted_class}')

Step 8 – Output: 

With a dataset this small, the exact accuracy depends on which points land in the test split; a typical run prints something like:

Accuracy: 1.0

Predicted class for [[4, 4]]: ['B']

This runs a complete classification example using the KNN algorithm. To reuse it, replace the sample data with your own, where X holds the features and y holds the corresponding labels. The n_neighbors parameter determines the number of neighbors taken into consideration.

Advantages and Disadvantages of KNN Algorithm 

Understanding these pros and cons is essential when deciding whether KNN is suitable for a particular task and dataset. Below are some of the pros and cons of the KNN algorithm.

Advantages 

  • KNN is straightforward and easy to understand, which makes it accessible to beginners.
  • The model simply memorizes the entire dataset, which is why it is called a lazy learning algorithm. KNN has no explicit training phase and is very quick to set up.
  • KNN is non-parametric because it does not assume anything about the underlying data distribution. Thus, it can be applied to almost any type of dataset.
  • The algorithm is adaptive to changes in the data during runtime, making it appropriate for dynamic environments where the data distribution may shift.
  • It tends to work best when the dataset is relatively small or has a relatively simple structure.

Disadvantages 

  • The algorithm memorizes the entire dataset, which in turn leads to high memory usage. For large datasets, it becomes impractical.
  • Noise or outliers in the data can strongly affect predictions, since KNN relies on the majority class or the average of the k-nearest neighbors.
  • The performance of KNN depends on the scale of the features, as features with larger scales can dominate the distance calculations (see the sketch after this list).
  • The choice of the parameter k (number of neighbors) is critical, and choosing an inappropriate value can lead to suboptimal results. Cross-validation may be needed to find the best k for a given dataset.
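Because raw feature scales can distort the distance calculations, KNN is often combined with feature scaling. Below is a minimal sketch using a scikit-learn pipeline with a StandardScaler; the synthetic data is an assumption used only to illustrate the point:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Feature 1 is in the thousands, feature 2 in single digits; without scaling,
# feature 1 would dominate every distance computation.
X = [[1000, 1], [1100, 2], [5000, 8], [5200, 9]]
y = ['A', 'A', 'B', 'B']

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[1050, 1.5]]))  # ['A']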

Conclusion

It goes without saying that the KNN algorithm is one of the simplest and most easily comprehensible tools in the sphere of machine learning. It is easy to understand and implement, and it provides a basis for creating various classification and regression models, making it an important tool for both new and experienced data scientists. We hope you now have a much better idea of what KNN is. If you want to learn more about algorithms like this, please check out our Data Science Course.

About the Author

Principal Data Scientist

Meet Aakash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Aakash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.