Classification is a fundamental aspect of machine learning, enabling systems to categorize data into predefined classes based on input features. This process is pivotal in various applications, from spam detection to medical diagnosis. In this blog, we’ll delve into the essence of classification, explore prominent algorithms, discuss evaluation metrics, and highlight real-world applications.
What is Classification in Machine Learning?
Classification is a supervised machine learning technique used for categorizing data into predefined classes. It determines which class a given input belongs to based on historical data and patterns. Classification models are trained using labeled datasets, where each data point is associated with a specific category.

Before diving into the notion of classification itself, we will first distinguish between the two categories of learners, eager and lazy, and then clear up the common confusion between classification and regression.
1. Eager Learner
Eager learning describes models that do the bulk of their work during training: they examine the entire dataset and generalize patterns before making any predictions. These models build an explicit internal representation of the data and store it in an organized form. Most machine learning algorithms are eager learners. A few examples:
- Logistic Regression
- Support Vector Machines
- Decision Trees
- Artificial Neural Networks
2. Lazy Learner
Lazy learning refers to models that defer learning until prediction time. These models store the training data as-is and make predictions by comparing incoming data points to the stored examples. Some examples:
- K-Nearest Neighbors (KNN)
- Case-Based Reasoning
Different Types of Classification
Classification can be divided into several categories based on how the data is labeled. Understanding these categories is important for selecting the best algorithm for your specific use case.
1. Binary Classification
Binary classification is the most basic type of classification, in which the model divides inputs into exactly two categories (e.g., Yes/No, Spam/Not Spam, Fraud/Not Fraud). In such cases, the training data is labeled in a binary format: true and false, positive and negative, 0 and 1, spam and not spam, and so on, depending on the task at hand. Algorithms commonly used for binary classification include the following (a minimal sketch follows the list):
- Logistic Regression
- Support Vector Machines
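As a minimal sketch of binary classification in scikit-learn (the synthetic dataset from make_classification and the variable names here are illustrative choices, not part of any fixed recipe):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Synthetic two-class dataset (illustrative)
X, y = make_classification(n_samples=500, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a binary classifier and score it on held-out data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))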

2. Multi-Class Classification
Multi-class classification divides inputs into three or more classes, with each input belonging to exactly one. In this scenario, the goal is to figure out which class a given input example belongs to. Most binary classification algorithms can also be used for multi-class classification, for example:
- Random Forest
- Neural Networks
- k-Nearest Neighbors (KNN)
- Gradient Boosting Algorithms (XGBoost, LightGBM)
3. Multi-Label Classification
In multi-label classification problems, we attempt to predict zero or more classes for each input example. Here there is no mutual exclusion: a single input can carry several labels at once. Such scenarios appear in a variety of disciplines, including auto-tagging in Natural Language Processing, where a single text can cover several themes. The most commonly used algorithms here are (a minimal sketch follows the list):
- Multi-label Decision Trees
- Multi-label Gradient Boosting
- Multi-label Random Forests
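Here is a minimal multi-label sketch using scikit-learn's MultiOutputClassifier to wrap a random forest; the synthetic dataset from make_multilabel_classification is purely illustrative:
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
# Synthetic dataset where each sample can carry any subset of 3 labels
X, Y = make_multilabel_classification(n_samples=300, n_classes=3, random_state=42)
# Wrap a base classifier so that one forest is fit per label
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
clf.fit(X, Y)
print(clf.predict(X[:2]))  # one 0/1 flag per label for each sample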
Evaluation Metrics for Classification Models
When building classification models in machine learning, choosing the right evaluation metric is essential for understanding how well the model performs. Let's look at them one by one.
1. Accuracy
Accuracy is the most basic evaluation metric: the proportion of predictions the model got right. It is most useful when the class distribution is roughly balanced and misclassifying any class carries the same cost. It is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision
Precision measures how many of the model's positive predictions are actually correct. It is especially important when false positives are costly. For example, in email spam detection, you want to minimize the number of legitimate emails wrongly categorized as spam.

Precision = TP / (TP + FP)
3. Recall (Sensitivity)
Recall, also known as sensitivity, reflects how many of the actual positive cases the model correctly identified. It answers the question, "How many of the true positives did the model capture?" Recall is critical when the cost of false negatives is significant.

Recall = TP / (TP + FN)
4. F1-Score
The F1-Score is the harmonic mean of precision and recall. It strikes a balance between the two, which is particularly useful when precision and recall must be traded off and there is a class imbalance. It is commonly employed in classification settings where false positives and false negatives carry different costs.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
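All four metrics are available directly in scikit-learn. A minimal sketch, using toy label arrays made up purely for illustration:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Toy ground-truth labels and model predictions (illustrative values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total = 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print("F1-Score :", f1_score(y_true, y_pred))         # harmonic mean = 0.75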
5. ROC-AUC Curve
The ROC curve is a graphical representation of a classifier's performance across all classification thresholds. The area under the ROC curve (AUC) indicates how well the model distinguishes between classes. AUC values range from 0 to 1, with 1 indicating perfect separation. A short computation sketch follows the axis notes below.

- The Y-axis shows the True Positive Rate (TPR), i.e., recall.
- The X-axis shows the False Positive Rate (FPR).
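Here is a minimal sketch of computing the ROC curve and AUC with scikit-learn; y_true and y_scores are illustrative stand-ins for real labels and predicted probabilities:
from sklearn.metrics import roc_auc_score, roc_curve
# Illustrative labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points for plotting the curve
print("AUC:", roc_auc_score(y_true, y_scores))      # 0.75 on this toy example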
6. Confusion Matrix
A confusion matrix is a table that summarizes the model’s performance by displaying the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). It helps visualize how the model is performing in terms of the correct and incorrect predictions across all classes.
Popular Classification Algorithms
There are many classification algorithms, each with its own advantages, limitations, and best-use scenarios. Let's look at some of the most popular ones.
1. Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression model. It is typically applied to binary classification problems. It works by estimating probabilities with the logistic (sigmoid) function; if the probability exceeds a threshold (usually 0.5), the input is labeled as class 1, otherwise as class 0.
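As a minimal NumPy sketch of that thresholding step (the score value and the 0.5 cutoff are illustrative):
import numpy as np
def sigmoid(z):
    # Squash a raw linear score into a probability in (0, 1)
    return 1 / (1 + np.exp(-z))
score = 1.2                       # illustrative linear score w·x + b
prob = sigmoid(score)             # ~0.77
print(1 if prob > 0.5 else 0)     # probability above the threshold, so class 1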

2. Decision Tree
A Decision Tree is a tree-like model that splits the data according to feature conditions, making it easy to interpret. The algorithm chooses the best feature to split the dataset on using criteria such as Gini impurity or entropy (information gain), and recursively divides the data into subsets until a stopping condition is reached.
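For intuition, here is a minimal NumPy sketch of the Gini impurity a tree computes when scoring a candidate split (the label arrays are illustrative):
import numpy as np
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)
print(gini([0, 0, 0, 0]))  # 0.0 (a pure node)
print(gini([0, 0, 1, 1]))  # 0.5 (maximally mixed for two classes)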
3. Random Forest
Random Forest is an ensemble learning technique that builds multiple decision trees and combines their predictions to improve accuracy and generalization. Each tree is trained on a different random subset of the data, and the trees' predictions are combined by majority voting for classification.
4. K-Nearest Neighbor
KNN is a lazy learning method that makes predictions by identifying the K closest data points (neighbors) in the training set. It classifies an input according to the most common class among those neighbors: we compute the distance (Euclidean, Manhattan, etc.) between the input and all training points, select the K nearest neighbors, and assign the class label that occurs most frequently among them, as shown in the sketch below.
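A minimal from-scratch sketch of that procedure (NumPy only; the training points and k=3 are illustrative):
import numpy as np
from collections import Counter
def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k nearest neighbors
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote among those neighbors
    return Counter(nearest).most_common(1)[0][0]
X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # 0, the nearby cluster wins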
5. Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes' Theorem that assumes the features are independent of one another, which is often not the case (hence "naïve"). It uses Bayes' Theorem to compute the likelihood of each class given an input and assigns the class with the highest probability.
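In symbols, the classifier picks the class c that maximizes the posterior under the independence assumption:

P(c | x1, ..., xn) ∝ P(c) × P(x1 | c) × P(x2 | c) × ... × P(xn | c)

The product over individual features is exactly where the "naïve" assumption enters: each feature contributes its own conditional probability, independently of the others.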
6. Support Vector Machine
SVM is a powerful classification technique that finds the best hyperplane to separate data points into categories. It maximizes the margin between classes by identifying the optimal decision boundary (hyperplane), and thanks to the kernel trick (e.g., polynomial or radial basis function kernels) it works effectively for both linearly and non-linearly separable data.
Python Implementation of Classification Algorithm
In this section, we'll use Python and scikit-learn to implement several popular classification algorithms. We will use the Iris dataset, a well-known benchmark for classification tasks.
1. Import Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Importing classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
2. Loading the Dataset
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Labels
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardizing the data (for better performance of some algorithms)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
3. Implementing Classification Algorithms
3.1. Logistic Regression
# Initialize and train the model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_test)
# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
3.2. Decision Tree Classifier
# Initialize and train the model
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
# Make predictions
y_pred = dt.predict(X_test)
# Evaluate the model
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
3.3. Random Forest Classifier
# Initialize and train the model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
# Make predictions
y_pred = rf.predict(X_test)
# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
3.4. Naïve Bayes Classifier
# Initialize and train the model
nb = GaussianNB()
nb.fit(X_train, y_train)
# Make predictions
y_pred = nb.predict(X_test)
# Evaluate the model
print("Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
3.5. Support Vector Machine (SVM)
# Initialize and train the model
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
# Make predictions
y_pred = svm.predict(X_test)
# Evaluate the model
print("SVM Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
3.6. K-Nearest Neighbors
# Initialize and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Evaluate the model
print("KNN Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
4. Visualizing the Model's Performance
# Compute confusion matrix
cm = confusion_matrix(y_test, rf.predict(X_test))
# Plot confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Random Forest')
plt.show()
Conclusion
Classification in machine learning is constantly evolving, improving accuracy and efficiency across a wide range of applications. Whether you're a beginner or an expert, understanding these algorithms and their real-world applications is critical for developing intelligent systems. If you want to dig deeper into these techniques, head over to our Data Science Course!