Logistic regression is a fundamental classification method in machine learning, widely used in fields such as finance, healthcare, and marketing. It powers predictive tasks such as spam identification, medical diagnosis, customer churn prediction, and credit risk assessment. Unlike linear regression, which predicts continuous values, logistic regression estimates probabilities, making it an essential tool for binary and multi-class classification.
In this article, we’ll look at the structure, assumptions, types, and implementation of logistic regression with Python’s Scikit-learn module. By the conclusion, you will have a good understanding of logistic regression and how to apply it effectively in practical situations.
What is Logistic Regression?
Logistic regression is a supervised machine learning technique used primarily for classification problems. It predicts the likelihood that an instance belongs to a specific class and is most often applied to binary classification (for example, Yes/No, Spam/Not Spam). The model outputs probabilities, which are mapped to discrete classes using a threshold (e.g., 0.5).
1. Real World Applications
- Medical Diagnosis: Predicting if a patient has an illness (positive or negative).
- Credit Scoring: Determining if a loan applicant is likely to default.
- Spam Detection: Categorizing emails as spam or non-spam.
- Customer Churn Prediction: Determining which customers are likely to leave a subscription service.
- Fraud Detection: Identifying fraudulent transactions based on user behavior.
Assumptions of Logistic Regression
Certain assumptions must be met for logistic regression to make accurate predictions. Understanding these assumptions is critical to ensuring the model is used correctly.
- Binary or Multi-Class Output: The target variable must be categorical.
- No Multicollinearity: Independent variables should not be strongly correlated.
- Independent Observations: Every data item should be independent of the others.
- Linearity of Log-Odds: Independent variables should be linearly related to the dependent variable’s log-odds.
- Large Sample Size: The model performs best with a sufficiently large dataset.
Sigmoid function in Logistic Regression
The sigmoid function is a non-linear function used to transform the output of the logistic regression model into a probability. It is the core of logistic regression, mapping any real-valued input to a probability:
σ(z) = 1 / (1 + e^(-z))
where z represents the weighted sum of input features:
z = w0 + w1x1 + w2x2 + ... + wnxn
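As a minimal NumPy sketch, here is the sigmoid applied to a weighted sum (the weights and inputs below are made-up values, purely for illustration):
import numpy as np

def sigmoid(z):
    # Squash any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# Hypothetical weights w1..wn, intercept w0, and one input vector
w = np.array([0.5, -1.2, 0.8])
w0 = 0.1
x = np.array([1.0, 0.5, 2.0])

z = w0 + np.dot(w, x)   # weighted sum of input features
print(sigmoid(z))       # a probability between 0 and 1, here ≈ 0.83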
1. Why is the Sigmoid Function Important?
- Ensures outputs remain within the range 0 to 1.
- Facilitates classification by setting a probability threshold.
- Helps interpret probabilities in a meaningful way.
How Does Logistic Regression Work?
Logistic regression belongs to the regression family because it predicts outcomes based on quantifiable relationships among variables. However, unlike linear regression, it accepts both continuous and discrete inputs and produces a categorical result, predicting a discrete class such as "Yes/No" or "Customer/Non-Customer".
In action, logistic regression analyzes the relationships between variables. It uses the sigmoid function to assign probabilities to the possible outcomes, converting numerical outputs into probabilities ranging from 0 to 1. The resulting probability always lies between 0 and 1. To make binary predictions, a threshold (commonly 0.5) splits the outputs into two categories: everything above 0.5 is classified as one class, and everything below it as the other.
1. Step-by-Step Process
- Calculate Weighted Sum: The linear combination of input attributes and weights.
- Apply the Sigmoid Function: Convert the weighted total to a probability.
- Set a Threshold: Classify the outcome using a predefined probability threshold.
- Optimize Weights: Use gradient descent to minimize the cost function (steps 1-3 are sketched in code right after this list).
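Here is a minimal sketch of steps 1-3 using made-up weights; training via gradient descent (step 4) is what would produce such weights in practice:
import numpy as np

def predict(X, w, b, threshold=0.5):
    z = X @ w + b                            # Step 1: weighted sum
    probs = 1 / (1 + np.exp(-z))             # Step 2: sigmoid
    return (probs >= threshold).astype(int)  # Step 3: apply threshold

# Hypothetical learned weights and two sample rows
w = np.array([0.4, -0.7])
b = 0.2
X = np.array([[1.0, 0.5],
              [0.2, 2.0]])
print(predict(X, w, b))  # [1 0]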
2. Cost Function in Logistic Regression
Logistic regression uses the log-loss (cross-entropy loss) function to measure error:
J(w) = -(1/m) Σ [y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i)]
where m is the number of training examples, y_i is the true label, and ŷ_i is the predicted probability.
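A minimal NumPy sketch of this loss; the epsilon clipping to avoid log(0) is a common implementation detail, not part of the formula itself:
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    y_prob = np.clip(y_prob, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7])
print(log_loss(y_true, y_prob))  # ≈ 0.228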
Types of Logistic Regression
1. Binary Logistic Regression
A binary logistic regression model represents the relationship between a collection of independent variables and a binary dependent variable. It is used when there are exactly two possible outcomes (for example, spam vs. non-spam).
2. Multinomial Logistic Regression
Multinomial logistic regression (also known as “multinomial regression”) predicts a nominal dependent variable from one or more independent variables. It is often thought of as an extension of binomial logistic regression that allows for a dependent variable with more than two categories. It is used for three or more unordered categories.
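As a quick sketch, scikit-learn's LogisticRegression handles multinomial targets directly; here it is on the three-class Iris dataset (illustrative only, separate from the implementation walkthrough below):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)           # three flower species as classes
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))                   # predicted class labels, e.g. [0 0 0]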
3. Ordinal Logistic Regression
Ordinal logistic regression can be used to determine the relationship between predictors and an ordinal result. An ordinal variable is a categorical variable whose values follow a natural ordering (for example, depression is classified as Minimal, Mild, Moderate, Moderately Severe, and Severe). It is used for three or more ordered categories.
Difference Between Linear and Logistic Regression
| Feature | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Output Type | Continuous | Categorical |
| Application | Predicting numerical values | Classification problems |
| Error Metric | Mean Squared Error (MSE) | Log Loss (Cross-Entropy) |
Implementation of Logistic Regression using sklearn
Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Step 2: Load Dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
Step 3: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train the Model
model = LogisticRegression(max_iter=10000)  # raise max_iter so the solver converges on unscaled features
model.fit(X_train, y_train)
Step 5: Make Predictions
y_pred = model.predict(X_test)
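Optionally, you can inspect the underlying probabilities and apply a custom threshold instead of the default 0.5 (the 0.3 below is an arbitrary illustration):
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class
y_pred_custom = (y_prob >= 0.3).astype(int)  # a lower threshold favors recall over precision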
Step 6: Evaluate the Model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Evaluation of Logistic Regression
1. Accuracy
The most widely used statistic is accuracy, which measures the overall correctness of the model’s predictions. It is calculated as (TP + TN) / (TP + TN + FP + FN).
2. Precision
Precision measures the proportion of predicted positives that are truly positive. It is calculated as TP/(TP + FP). Precision is most valuable when the cost of false positives is high.
3. Recall (Sensitivity/True Positive Rate)
Recall measures the model’s ability to accurately identify positive class instances. It is calculated as TP/(TP + FN). Recall is critical when the cost of false negatives is significant.
4. Specificity
Specificity measures the model’s ability to accurately detect negative class occurrences. It is calculated as TN/(TN + FP). Specificity is critical when the cost of false positives is large.
5. F1 Score
The F1 score combines precision and recall into one statistic. It is the harmonic mean of precision and recall, giving a balanced metric. The F1 score is derived as 2 * (Precision * Recall)/(Precision + Recall).
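As a worked illustration, all five metrics can be computed directly from the confusion-matrix counts, reusing y_test and y_pred from the implementation above:
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1)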
Advantages and Limitations of Logistic Regression
1. Advantages of Logistic Regression
- Logistic regression is straightforward to apply and interpret. The coefficients give information on feature importance, making them useful for explainable AI.
- It is extremely effective at dealing with binary classification challenges, such as spam detection or medical diagnosis.
- Compared to complex models such as neural networks, logistic regression requires fewer computational resources, making it suitable for large datasets.
- If the dataset is linearly separable, logistic regression performs extremely well and rarely requires further adjustments.
2. Limitations of Logistic Regression
- Logistic regression is based on the assumption that independent variables have a linear connection with the log-odds of an outcome. It struggles with nonlinear patterns unless feature engineering is used.
- Outliers can have a major impact on the performance of logistic regression because it relies on maximum likelihood estimation. Regularization techniques such as L1 (Lasso) and L2 (Ridge) can help address this issue (see the sketch after this list).
- Small datasets can result in overfitting or erroneous estimates. Logistic regression works best on large, well-balanced datasets.
- Logistic regression is fundamentally binary, necessitating additions such as One-vs-All (OvA) or Softmax regression for multi-class issues.
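For reference, here is a minimal sketch of enabling L1 or L2 regularization in scikit-learn (the liblinear solver supports L1; C is the inverse regularization strength, and the values here are illustrative):
from sklearn.linear_model import LogisticRegression

l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)  # L1 (Lasso)
l2_model = LogisticRegression(penalty='l2', C=1.0)                      # L2 (Ridge), the default
l1_model.fit(X_train, y_train)  # reuses the train/test split from the implementation section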
Conclusion
Logistic regression is a powerful and interpretable classification algorithm widely used in machine learning. Understanding its sigmoid function, cost function, assumptions, and implementation equips you to apply it effectively in real-world scenarios.