In this blog, we will learn about the KNN algorithm, why we need it, and the types of distance metrics used. Along with these topics, we will also cover the implementation and explain why the KNN algorithm is called a lazy algorithm.
What is KNN Algorithm in Machine Learning?
The K-Nearest Neighbors (KNN) algorithm is a versatile supervised learning approach used for classification and regression tasks. In KNN, data points are classified based on the majority class of their nearest neighbors. The “k” represents the number of nearest neighbors considered when making a prediction. KNN’s simplicity and effectiveness make it valuable in various applications, although it may be sensitive to outliers and requires careful selection of the optimal k value for optimal performance.
Why do we need KNN Algorithm?
KNN is intuitive and easy to implement, which makes it a great tool for beginners and professionals alike. Overall, KNN is a flexible yet effective way of making predictions based on the similarity between points in the data.
Suppose you have two classes, say A and B, and you encounter a new observation, denoted by x. The task is to determine which class x falls into, and that is where the K-Nearest Neighbors algorithm comes into play. KNN is a robust classification tool for exactly this kind of problem: it identifies the category or class a new data point is likely to belong to based on its proximity to its neighbors. The idea behind KNN is intuitive: unknown data is categorized or labeled according to how close it is to known data points, through the collective influence of its closest neighbors.
Some of the other reasons why the KNN algorithm is essential are given below:
- Firstly, KNN is a simple, multipurpose algorithm that can be applied to both classification and regression tasks. It’s particularly useful when the underlying data distribution is not well-defined or linear.
- Secondly, KNN doesn’t require assumptions about the underlying data, making it applicable in various scenarios.
- Thirdly, it excels in cases where the decision boundaries are complex and non-linear.
Types of Distance Metrics Used in KNN Algorithm
To identify which data points are to be considered alike, KNN measures how far apart different data points are. We use such a measurement to group similar points together, much like judging how close or distant houses are within a neighborhood, where the number of blocks between houses serves as the distance. This matters because the distance tells us which group a new point belongs to: if a new house is close to many others, it joins their group. Distance measurement in KNN is like finding out who your closest neighbors are; it makes it easy to group similar things together.
Euclidean distance measures straight-line distance, and Manhattan distance measures the sum of absolute differences along each dimension. Minkowski distance, meanwhile, is a generalized form with a parameter (p) that allows you to adjust the sensitivity to different dimensions. The choice between these metrics depends on the nature of the data and the problem being solved. Let’s look at each of the three distance metrics: Euclidean Distance, Manhattan Distance, and Minkowski Distance.
1. Euclidean Distance
Formula: d = √((x2 − x1)² + (y2 − y1)²)
Explanation: Euclidean distance is the straight-line distance between two points in Euclidean space. In a two-dimensional space, if you have two points (x1, y1) and (x2, y2), the Euclidean distance d between them is given by the Pythagorean theorem: d = √((x2 − x1)² + (y2 − y1)²). This distance metric is sensitive both to the magnitude of the vectors and to their direction.
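As a quick illustration, here is a minimal Python sketch of this formula; the two example points are arbitrary values chosen for demonstration, not taken from the article.

import math

# Two arbitrary example points (x1, y1) and (x2, y2)
p1 = (1, 2)
p2 = (4, 6)

# Straight-line (Pythagorean) distance between them
d = math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)
print(d)  # 5.0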
2. Manhattan Distance
Formula: d = |x2 − x1| + |y2 − y1|
Explanation: Manhattan distance, also called L1 distance or taxicab distance, measures the distance between two points with coordinates (x1, y1) and (x2, y2) by summing the absolute differences of their values along the x axis and the y axis. Imagine navigating a city grid: the distance is the combined length of the horizontal and vertical routes along the grid lines needed to reach a particular point. In general, the Manhattan distance is the summation of the absolute differences over all dimensions.
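A minimal Python sketch of the same idea, reusing the two arbitrary example points from the Euclidean sketch above:

# Sum of absolute differences along each axis ("city blocks" walked)
p1 = (1, 2)
p2 = (4, 6)
d = abs(p2[0] - p1[0]) + abs(p2[1] - p1[1])
print(d)  # 7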
3. Minkowski Distance
Formula: d = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)
Explanation: Minkowski distance is a generalization of both Euclidean and Manhattan distances. The parameter p allows you to adjust the formula. So, when p = 2, it is just the Euclidean distance; when p = 1, it turns out to be the Manhattan distance. For other values of p, it simply represents a more general form. Minkowski distance is used when you want to control the sensitivity to different dimensions based on problem requirements.
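The following rough sketch shows how the parameter p works: p = 1 reproduces the Manhattan distance and p = 2 reproduces the Euclidean distance. The helper function and the value p = 3 are illustrative assumptions, not part of the original article.

# Generalized Minkowski distance for two points of any dimension
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

p1, p2 = (1, 2), (4, 6)
print(minkowski(p1, p2, 1))  # 7.0 -> Manhattan distance
print(minkowski(p1, p2, 2))  # 5.0 -> Euclidean distance
print(minkowski(p1, p2, 3))  # ~4.5 -> a more general form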
How to Choose “k” for KNN
k is the single most important parameter of the k-Nearest Neighbors (k-NN) algorithm, and it determines how a new observation is classified to a large extent. Choosing k amounts to navigating the trade-off between an overfitting model and an underfitting model.
A smaller k, such as 1 or 3, tends to increase the sensitivity of the algorithm to noise and outliers, which may lead to overfitting. A higher k, such as 10 or 20, smooths out the decision boundaries, which may eventually result in underfitting.
To determine the optimal k:
1. Consider Dataset Characteristics
Evaluate the nature of the dataset. Noisy or small datasets might call for a smaller k, while larger datasets can make use of a larger k.
2. Odd Values for Binary Classification
For binary classification tasks, an odd value of k is appropriate to break any ties in voting for the majority class.
3. Cross-Validation
Use cross-validation approaches, for example k-fold cross-validation, to analyze the model’s performance for different values of k. This helps determine the k that leads to a good balance between bias and variance (a short sketch of this idea appears after this list).
4. Grid Search
Define a grid search over a list of candidate k values, analyzing the effect of each k on overall model performance. Select the value of k that yields the maximum accuracy or the best result on your chosen evaluation metric.
Plot the decision boundaries for varying values of k to understand their implications for the model’s ability to capture the underlying data relationships.
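Here is a minimal sketch, assuming scikit-learn is available, of selecting k with k-fold cross-validation over a grid of odd values; the built-in Iris dataset stands in for your own features X and labels y.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and record the mean 5-fold cross-validation accuracy
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Keep the k with the best cross-validated accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])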
Working of KNN Algorithm in Machine Learning
K-Nearest Neighbors classifies or forecasts a new data point using the majority class (for classification) or the average of the target values (for regression) of its k nearest neighbors. This is elaborated step by step below, and a from-scratch sketch of the whole procedure follows the steps.
Step 1: Choose the Value of k
Decide the number of neighbors (k) to consider. This is a crucial parameter that can impact the algorithm’s performance.
Step 2: Calculate Distances
Compute the distance between the new data point and every point in the training dataset. Typically, Euclidean distance, Manhattan distance, or another metric is used, depending on the type of problem.
Step 3: Identify Neighbors
Select the k data points from the training set that are closest to the new point, based on the distances computed in Step 2.
Step 4: Majority Voting (Classification) or Weighted Averaging (Regression)
For classification tasks, the majority class among k nearest neighbors is determined and assigned to the new data point.
For regression tasks, the mean of the target values of the k nearest neighbors is calculated and becomes the predicted value for the new data point.
Step 5: Make Prediction
Label the new data point as belonging to the class, or as having the value, determined by the majority vote or the averaging.
Step 6: Output
In the end, the KNN algorithm provides the final classification or prediction for the new sample.
Step 7: Evaluate Performance
If the data is labeled, evaluate the algorithm with metrics such as accuracy, precision, recall, or F1 score.
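To make the steps concrete, here is a rough from-scratch sketch of KNN classification. The small dataset mirrors the one used in the implementation section below, and the helper name knn_predict is an assumption for illustration, not a library function.

import math
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Step 2: compute the Euclidean distance to every training point
    distances = [(math.dist(new_point, x), label) for x, label in zip(X_train, y_train)]
    # Step 3: pick the k closest neighbors
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 4 and 5: majority vote among the neighbors' labels
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(X_train, y_train, [4, 4], k=3))  # 'A' when all six points are used for training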
Why is KNN a Lazy Algorithm?
The KNN algorithm is termed a “lazy” algorithm because it generalizes nothing during the training process. In a lazy algorithm, the model is not actually trained on the dataset; it instead memorizes all of the data. All that happens is that the processing of the training data is postponed until a new, unseen data point needs to be classified or predicted.
For KNN, during the training phase, it just loads the training data into memory. To make a prediction for a new data point, the algorithm calculates the distances between that point and all points in the training set. It then identifies the k nearest neighbors based on those distances and predicts according to the mode of the neighbors’ classes (for classification) or the mean of their target values (for regression).
The term “lazy” is used to highlight that the algorithm doesn’t actively learn a model during the training phase; it defers the learning until the prediction phase, when the specific instance needs to be classified. This characteristic makes KNN simple and flexible but can also lead to higher computational costs during prediction, especially with large datasets.
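The following rough sketch (not from the original article) illustrates this laziness with scikit-learn: fitting essentially just stores the data, so almost all of the work happens at prediction time. The dataset sizes and the brute-force setting are arbitrary choices for the demonstration.

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((50_000, 10))          # 50,000 random training points
y = rng.integers(0, 2, size=50_000)   # random binary labels

knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute')

t0 = time.perf_counter()
knn.fit(X, y)                          # essentially just memorizes X and y
t1 = time.perf_counter()
knn.predict(rng.random((1_000, 10)))   # computes distances to all 50,000 points
t2 = time.perf_counter()

print(f"fit: {t1 - t0:.3f}s, predict: {t2 - t1:.3f}s")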
Implementation of KNN Algorithm in Machine Learning
Refer to the code below to understand the implementation of KNN algorithm in machine learning:
Step 1 – Import the Libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Step 2 – Data Loading
# X represents features, and y represents labels
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = ['A', 'A', 'A', 'B', 'B', 'B']
Step 3 – Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4 – Create the KNN Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
Step 5 – Fit the data and Make Prediction
knn_classifier.fit(X_train, y_train)
y_pred = knn_classifier.predict(X_test)
Step 6 – Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Step 7 – Testing the Model
new_data_point = [[4, 4]]
predicted_class = knn_classifier.predict(new_data_point)
print(f'Predicted class for {new_data_point}: {predicted_class}')
Step 8 – Output:
Accuracy: 67.9
Predicted class for [[4, 4]]: ['B']
This runs an instance of a classification model using the KNN algorithm. To adapt it, replace the sample data with your own: anything that has corresponding features (X) and labels (y) will work. The n_neighbors parameter determines the number of neighbors taken into consideration. Note that with only six sample points the test split is tiny, so the reported accuracy is not very meaningful.
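As a slightly more realistic (and purely illustrative) variant, the same pipeline can be run on scikit-learn’s built-in Iris dataset; the choice of dataset and of k = 5 here are assumptions, not part of the original example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a real, built-in dataset instead of the six hand-written points
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))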
Advantages and Disadvantages of KNN Algorithm
Understanding these pros and cons is essential when deciding whether KNN is suitable for a particular task and dataset. Below are some of the pros and cons of the KNN algorithm.
Advantages
- KNN is straightforward and easy to understand, which makes it accessible for beginners.
- The model just memorizes the entire dataset, and hence this is a lazy learning algorithm. KNN does not involve an explicit training phase and is very quick to implement.
- KNN is non-parametric because it does not assume anything about the underlying data distribution. Thus, it can be applied to almost any type of dataset.
- The algorithm is adaptive to changes in the data during runtime, making it appropriate for dynamic environments where the data distribution may shift.
- It tends to work best when the dataset is relatively small or has a relatively simple structure.
Disadvantages
- The algorithm memorizes the entire dataset, which in turn leads to high memory usage. For large datasets, it becomes impractical.
- Noise or outliers in the data can have a real impact on predictions, since KNN relies on the majority class or the average of the k nearest neighbors.
- The performance of KNN depends on the scale of the features, as features with larger scales may dominate the distance calculations (a short scaling sketch follows this list).
- The choice of the parameter k (number of neighbors) is critical, and choosing an inappropriate value can lead to suboptimal results. Cross-validation may be needed to find the best k for a given dataset.
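As a minimal sketch, assuming scikit-learn, the feature-scale problem is usually addressed by standardizing the features before KNN, for example with a pipeline like the one below.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Rescale every feature to zero mean and unit variance before the distance step
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# scaled_knn.fit(X_train, y_train) is then used exactly like a plain KNeighborsClassifier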
Conclusion
It goes without saying that the KNN algorithm is one of the simplest and most easily comprehensible tools in the sphere of machine learning. It is easy to understand and implement, provides a basis for creating various classification and regression models, and is an important tool for both new and experienced data scientists. We hope you now have a much better idea of what KNN is. If you want to learn more about algorithms like this, please check out our Data Science Course.