The K-Nearest Neighbors (KNN) algorithm is a general-purpose supervised learning technique applicable to both classification and regression problems. It works by finding the ‘k’ nearest data points to a new input and predicting based on the majority class (in the case of classification) or the mean value (in the case of regression) among those neighbors. KNN is popular for its ease of use and effectiveness in diverse applications.
In this blog, we will learn about the KNN algorithm, why we need it, and the types of distance metrics used. Along with these topics, we will also cover its implementation and explain why the KNN algorithm is called a lazy algorithm.
Understanding the Need for KNN Algorithm
KNN is easy to understand and simple to use, making it a great tool for novices as well as experts. It is especially helpful in situations where the distribution of data is unknown or complicated since it does not make any assumptions about the data beforehand. By analyzing the closeness of data points, KNN efficiently deals with classification problems where the decision boundaries are non-linear or irregular.
Suppose you have two classes, say A and B, and you encounter a new observation, denoted by x. The task is to determine which class x falls into, and that is where the K-Nearest Neighbors algorithm comes into play. KNN is a robust classification tool for exactly this kind of problem: it identifies the category or class the new data point is likely to belong to based on its proximity to its neighbors. The idea is intuitive, categorizing or labeling unknown data according to how close it is to known data points, through the collective influence of its closest neighbors.
Some of the other reasons why the KNN algorithm is essential are given below:
- Firstly, KNN is a versatile and simple algorithm that can be applied to both classification and regression tasks. It’s particularly useful when the underlying data distribution is not well-defined or not linear.
- Secondly, KNN doesn’t require assumptions about the underlying data, making it applicable in various scenarios.
- Thirdly, it excels in cases where the decision boundaries are complex and non-linear.
Distance Metrics in KNN
KNN uses distance to determine whether data points are similar, much as we judge how close or far houses are in a neighborhood: the blocks between houses represent the distance. This matters because distance lets us determine the group, or “cluster,” to which a new point belongs; if a new house is close to many others, it joins their group. Measuring distance in KNN is comparable to working out who your nearest neighbors are, which makes it easy to group related items.
The Euclidean distance measures straight-line distance, whereas the Manhattan distance measures the total of the absolute differences across each dimension. Minkowski distance is a generalized form with a parameter (p) that lets you adjust how sensitive the metric is to differences along each dimension. The choice between these measures depends on the nature of the data and the problem being addressed. Let’s look at the three distance metrics: Euclidean Distance, Manhattan Distance, and Minkowski Distance.
1. Euclidean Distance
Formula: d = sqrt((x2 − x1)^2 + (y2 − y1)^2) for two points (x1, y1) and (x2, y2)
Explanation: This metric determines the straight-line distance between two points in Euclidean space. It is sensitive to the magnitude of the difference along each dimension. For simplicity, in a two-dimensional space, if you have two points (x1, y1) and (x2, y2), the Euclidean distance d between them is given by the Pythagorean theorem: d = sqrt((x2 − x1)^2 + (y2 − y1)^2). This distance metric is sensitive both to the magnitude of the vectors and to their direction.
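For example, here is a minimal NumPy sketch (the two points are arbitrary values chosen for illustration):
import numpy as np
# Two example points in 2-D space
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p1 - p2) ** 2))
print(euclidean)  # 5.0, since sqrt(3^2 + 4^2) = 5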
2. Manhattan Distance
Formula: d = |x2 − x1| + |y2 − y1| for two points (x1, y1) and (x2, y2)
Explanation: Manhattan distance (also called L1 distance or taxicab distance) between two points (x1, y1) and (x2, y2) is found by summing the absolute differences of their coordinates along the x-axis and the y-axis. It is like measuring distance in a grid city: you add the horizontal and vertical lengths of the paths along the grid lines needed to reach a specific point. In general, Manhattan distance is the sum of the absolute differences across all dimensions.
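A minimal NumPy sketch using the same arbitrary example points as above:
import numpy as np
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
# Manhattan (L1) distance: sum of absolute differences per dimension
manhattan = np.sum(np.abs(p1 - p2))
print(manhattan)  # 7.0, since |4 - 1| + |6 - 2| = 3 + 4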
3. Minkowski Distance
Formula: d = (|x2 − x1|^p + |y2 − y1|^p)^(1/p) for two points (x1, y1) and (x2, y2)
Explanation: Minkowski distance generalizes both Euclidean and Manhattan distances. You can vary the formula with the parameter p: when p = 2, it is simply the Euclidean distance; when p = 1, it becomes the Manhattan distance. For other values of p, it is a more general expression. You can tune p to adjust the metric’s sensitivity according to the needs of the problem.
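A minimal NumPy sketch showing how the parameter p recovers the two metrics above (the example points are the same arbitrary ones used earlier):
import numpy as np

def minkowski_distance(a, b, p):
    # General Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
print(minkowski_distance(p1, p2, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(p1, p2, p=2))  # 5.0 (Euclidean)
print(minkowski_distance(p1, p2, p=3))  # ~4.5 (a more general case)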
Selecting the Optimal ‘K’ Value
The value of k is perhaps the most important parameter of the k-Nearest Neighbors (k-NN) algorithm, and it largely determines how a new object is classified. Choosing k is, in effect, choosing between an overfitting model and an underfitting model.
- Small ‘k’ values (e.g., 1 or 3): Can lead to models sensitive to noise, potentially causing overfitting.
- Large ‘k’ values (e.g., 10 or 20): May smooth out noise but risk underfitting by oversimplifying the decision boundary.
To determine the optimal k:
1. Consider Dataset Characteristics
Evaluate the nature of the dataset. Noisy or small datasets might call for a smaller k, while larger datasets can make use of a larger k.
2. Odd Values for Binary Classification
For binary classification tasks, an odd value of k is appropriate to break any ties in voting for the majority class.
3. Cross-Validation
Use cross-validation approaches, for example, k-fold cross-validation, to analyze the model’s performance with different values of k. This helps determine the k that gives a good balance between bias and variance.
4. Grid Search
Define a grid search over a list of candidate ‘k’ values, analyzing the effect of each k on overall model performance. Select the value of ‘k’ that gives the maximum accuracy or the best score on your chosen evaluation metric (a short cross-validation and grid-search sketch follows this list).
5. Visualization
Plot the decision boundaries for several ‘k’ values to see which one best represents the data patterns.
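Here is a minimal sketch of points 3 and 4, assuming scikit-learn; the built-in Iris dataset and the candidate k values are arbitrary choices made just so the example runs end to end:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# A small built-in dataset so the example is self-contained
X, y = load_iris(return_X_y=True)

# Grid search over candidate 'k' values with 5-fold cross-validation
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11, 15]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)  # the 'k' with the best cross-validated accuracy
print(grid.best_score_)   # its mean cross-validation accuracy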
Working Mechanism of KNN
K-Nearest Neighbors classifies or predicts a new data point using the majority class (for classification) or the average value (for regression) of its k nearest neighbors. Let’s examine this step by step.
Step 1: Choose the Value of k
Determine the number of neighbors to consider. This is a crucial parameter that can impact the algorithm’s performance.
Step 2: Calculate Distances
Calculate the distance between the new data point and all points in the training set using a chosen metric. Normally, Euclidean distance, Manhattan distance, or another distance is used, depending on the type of problem.
Step 3: Identify Neighbors
Select the ‘k’ data points from the training set that are closest to the new point according to the distances computed in Step 2.
Step 4: Majority Voting (Classification) or Weighted Averaging (Regression)
For classification tasks, the majority class among the k nearest neighbors is determined and assigned to the new data point.
For regression tasks, the mean (optionally distance-weighted) of the target values of the k nearest neighbors is calculated to predict the value for the new data point.
Step 5: Make Prediction
Label the new point with the class, or the value, produced by the majority vote or averaging.
- Classification: Assign the class most common among the ‘k’ neighbors.
- Regression: Compute the average of the target values of the ‘k’ neighbors.
Step 6: Output
In the end, the KNN algorithm provides the final classification or prediction for the new sample.
Step 7: Evaluate Performance
If the data is labeled, you can evaluate the algorithm using accuracy, precision, recall, or the F1 score.
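To make the steps concrete, here is a minimal from-scratch classification sketch in NumPy; the toy points and the value k=3 are assumptions chosen for illustration:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among their labels
    votes = [y_train[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

# Toy training data: two clusters labelled 'A' and 'B'
X_train = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y_train = ['A', 'A', 'A', 'B', 'B', 'B']

print(knn_predict(X_train, y_train, np.array([2, 2])))  # 'A' (close to the first cluster)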
Implementing KNN in Machine Learning
Refer to the code below to understand the implementation of the KNN algorithm in machine learning:
Step 1 – Import the Libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Step 2 – Load and prepare the data:
# X represents features, and y represents labels
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = ['A', 'A', 'A', 'B', 'B', 'B']
Step 3 – Split the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4 – Initialize and train the KNN classifier:
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X_train, y_train)
Step 5 – Make predictions on the test set:
y_pred = knn_classifier.predict(X_test)
Step 6 – Evaluate the model’s accuracy:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
Step 7 – Make predictions on a new data point:
new_data_point = [[4, 4]]
predicted_class = knn_classifier.predict(new_data_point)
print(f'Predicted class for {new_data_point}: {predicted_class}')
Step 8 – Output (accuracy is printed as a fraction between 0 and 1; the exact values depend on how this tiny dataset is split), for example:
Accuracy: 1.0
Predicted class for [[4, 4]]: ['B']
This runs a complete classification example using the KNN algorithm. To use it with your own data, replace the sample arrays with anything that provides features (X) and corresponding labels (y). The n_neighbors parameter determines the number of neighbors taken into consideration.
Advantages and Disadvantages of KNN
Understanding these pros and cons is essential when deciding whether KNN is suitable for a particular task and dataset. Below are some of the pros and cons of the KNN algorithm.
Advantages
- Simple and easy to understand: The algorithm is very easy to implement, so it is a perfect option for newcomers.
- No training phase required: Unlike other machine learning models, KNN does not involve an explicit training stage and is therefore a “lazy learner.”
- Non-parametric nature: KNN has no assumption regarding the underlying data distribution, hence is very versatile across various datasets.
- Works well with non-linear data: It can capture intricate decision boundaries well, as opposed to linear models.
- Adaptability to real-time changes: Since KNN retains all training data, it is easy to adapt to new data without having to retrain the model from scratch.
Disadvantages
- Computationally expensive: Because KNN retains every training sample, prediction involves calculating the distance to every training sample and is slow on big data sets.
- Memory-intensive: Having to hold all the training data makes it not suitable for datasets with millions of rows.
- Sensitive to irrelevant features: If the dataset contains irrelevant or redundant features, they will distort the distance computation and decrease accuracy.
- Impact of outliers: KNN is not resistant to noise and outliers because they have a great impact on the nearest neighbors.
- Difficulty in choosing ‘k’: It is important to select the appropriate number of neighbors (‘k’), as selecting too small or too big a value can result in overfitting or underfitting.
Why is KNN a Lazy Algorithm?
The KNN algorithm is termed a “lazy” algorithm because it does not build a generalized model during training. In a lazy algorithm, the model is not trained on the dataset. It instead memorizes all of the data. Training data is processed only when a new, unseen data point needs to be classified or predicted.
For KNN, the training period amounts to loading the training data into memory. To make a prediction, the algorithm calculates the distances between the new data point and all points in the training set, identifies the k nearest neighbors from those distances, and predicts using the mode of their classes (classification) or the mean of their target values (regression).
The term “lazy” is used to highlight that the algorithm doesn’t actively learn a model during the training phase; it defers the learning until the prediction phase when the specific instance needs to be classified. This characteristic makes KNN simple and flexible but can also lead to higher computational costs during prediction, especially with large datasets.
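A small sketch of this behavior, assuming scikit-learn and a synthetic dataset generated with make_classification (the sizes are arbitrary): fit() does comparatively little work, since it mostly stores the data (and may build a search index), while the distance computations show up only when predict() is called.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, sized just large enough to make the effect visible
X, y = make_classification(n_samples=50000, n_features=20, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)

start = time.perf_counter()
knn.fit(X, y)  # "training" mostly stores the data
fit_time = time.perf_counter() - start

start = time.perf_counter()
knn.predict(X[:1000])  # distances are computed only now
predict_time = time.perf_counter() - start

print(f"fit: {fit_time:.3f}s, predict: {predict_time:.3f}s")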
Conclusion
K-Nearest Neighbors (KNN) is a basic yet effective machine learning algorithm for classification and prediction by similarity. Its performance is highly dependent on the distance metric, k (number of neighbors), and dataset size. Although simple to implement, KNN’s computational expense can be high for large datasets. Despite this, it’s a basic algorithm that every data scientist should know. If you want to learn more about these kinds of algorithms, please check out our Data Science Course.