
 Regularization in Machine Learning

In this blog, we’ll look at the problems of overfitting and underfitting, build a solid grasp of bias and variance, and then explore Regularization in Machine Learning: how it works, the main techniques, and how to implement it with Python code.


What are Overfitting and Underfitting?

Overfitting and underfitting are two common problems encountered when building machine learning models. Let’s break down these concepts:

  • Overfitting: Overfitting happens when a machine learning model learns the training data too closely. It picks up not only the genuine patterns but also the noise and errors in the data, so it performs very well on the training set yet poorly on data it has never seen. In effect, the model memorizes the training examples rather than learning the underlying relationships, which is why it struggles in real-life situations.
  • Underfitting: Underfitting occurs when a model is too simple to capture the main patterns in the training data. It fails to learn the important relationships and therefore performs poorly not just on the training data but also on new data. Such a model is too limited to represent the structure of the data, which makes it perform badly overall. Both failure modes are illustrated in the short sketch after this list.
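
As an illustration of both problems, here is a minimal sketch that fits polynomial models of increasing degree to noisy synthetic data (the dataset, degrees, and noise level are assumptions made purely for this example): the low-degree model typically underfits, while the very high-degree model typically overfits, showing a low training error but a higher test error.

# Illustrative sketch: underfitting vs. overfitting with polynomial models.
# The synthetic data and the chosen degrees are example assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
for degree in (1, 4, 15):  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")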

What are Bias and Variance?

Bias and variance are two basic concepts in understanding the predictive performance and behavior of machine learning models:

Bias: Bias is the mistake made by a model when it tries to simplify a real problem. It shows how much the model consistently misses the actual values. A model with high bias oversimplifies the underlying relationships in the data and tends to make strong assumptions. This often leads to consistent but inaccurate predictions. In simpler terms, high bias implies that the model is too basic to capture the complexities present in the data. For instance, a linear regression model applied to a highly nonlinear dataset might exhibit high bias.

Variance: Variance describes how much a model’s predictions change when it is trained on different training data. It reflects how sensitive the model is to noise in a particular training set. A model with high variance is overly sensitive to the training set and captures both the underlying patterns and the noise. Such models might perform exceptionally well on the training data but fail to generalize to new, unseen data. High variance indicates that the model is too complex and has learned specific patterns from the training data that don’t apply to other datasets.

What is Regularization in Machine Learning?

Regularization in machine learning is a way to stop models from getting too focused on the training data. It’s like adding some rules or penalties while the model learns, making sure it doesn’t get too complex. The aim is to strike a balance so that the model works well not just on the training data but also on new and unfamiliar data.

How Regularization Works in Machine Learning

Here’s a step-by-step procedure explaining how regularization operates in machine learning:

  • Model Training Initiation: Start with a machine learning model (e.g., linear regression, neural network) that needs training on a dataset.
  • Standard Cost Function: Initially, the model uses a standard cost function, aiming to minimize errors between predicted and actual values in the training data.
  • Introducing Regularization: Add regularization to the cost function by appending penalty terms.
  • Penalty Term Addition: Include penalties based on model parameters to control complexity.
  • Parameter Modification: During model training, update parameters (weights or coefficients) iteratively and minimize the combined original error term and the added regularization term.
  • Types of Regularization: Choose the type of regularization:
    • L1 (Lasso): Penalizes based on absolute parameter values.
    • L2 (Ridge): Penalizes based on squared parameter values.
  • Hyperparameter Tuning: Set the hyperparameter (λ for L1/L2) to control the regularization strength and tune the hyperparameter through techniques like cross-validation to find the optimal value.
  • Bias-Variance Control: Regularization aims to manage the bias-variance trade-off by preventing models from being too simple (high bias) or too complex (high variance) and seeks an optimal balance between fitting the training data and generalizing to new data.
  • Training Completion: Continue the training until convergence, where the model reaches a point of minimal error on the training data while controlling complexity.
  • Algorithm-Agnostic Application: Regularization is adaptable across various machine learning algorithms. It ensures that models generalize well to new data, maintaining a balance between complexity and accuracy.

These steps summarize how regularization operates in machine learning; the minimal sketch below makes the idea concrete.
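
Here is a minimal NumPy sketch (written for illustration, with made-up data, learning rate, and λ) of gradient descent on an L2-regularized cost function, i.e., RSS plus λ times the sum of squared weights:

# Minimal sketch: gradient descent on an L2-regularized cost function.
# The data, learning rate, and lambda value are illustrative assumptions.
import numpy as np
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)
lam = 0.5         # regularization strength (lambda)
lr = 0.001        # learning rate
w = np.zeros(5)   # model parameters (weights)
for _ in range(1000):
    error = X @ w - y
    # Gradient of RSS plus gradient of the penalty term lambda * sum(w^2)
    grad = 2 * X.T @ error + 2 * lam * w
    w -= lr * grad
cost = np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)
print("learned weights:", np.round(w, 3))
print("regularized cost:", round(cost, 3))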


Regularization Techniques in Machine Learning

Here are some common regularization techniques to prevent overfitting and improve a model’s generalization to new, unseen data:

L1 Regularization (Lasso)

  • L1 regularization adds a penalty to the model’s coefficients proportional to their absolute values.
  • Encourages sparsity and feature selection by pushing some coefficients to zero.
  • Useful for feature selection in datasets with many irrelevant or redundant features.

Equation for L1 regularization in linear regression:

Cost function = RSS (Residual Sum of Squares) + λ * Σ|β|

Here, RSS represents the standard sum of squared errors, Σ|β| is the sum of absolute values of the model coefficients (β), and λ (lambda) controls the strength of regularization.

L2 Regularization (Ridge)

  • L2 regularization adds a penalty term based on the squared magnitudes of the model’s coefficients.
  • Penalizes large coefficients, promoting more balanced and stable weights across features.
  • Effective in reducing overfitting by preventing extreme parameter values.

Equation for L2 regularization in linear regression:

Cost function = RSS + λ * Σ(β^2)

Here, Σ(β^2) represents the sum of squared coefficients, and λ controls the strength of regularization.

Elastic Net Regularization

  • Combines L1 and L2 regularization by adding both penalties to the model’s cost function.
  • Helps in addressing the limitations of L1 and L2 by incorporating their advantages.
  • Particularly useful when dealing with multicollinearity among features.

Equation for Elastic Net regularization in linear regression:

Cost function = RSS + λ1 * Σ|β| + λ2 * Σ(β^2)

Here, RSS represents the sum of squared errors, Σ|β| is the sum of absolute values of coefficients, and λ1 and λ2 control the strengths of L1 and L2 regularization, respectively.
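
To see how the three cost functions above differ numerically, here is a small illustrative computation (the residuals, coefficients, and λ values are made-up numbers, not from a real model):

# Illustrative computation of the Lasso, Ridge, and Elastic Net cost functions.
# Residuals, coefficients, and lambda values are made-up example numbers.
import numpy as np
residuals = np.array([0.5, -1.2, 0.3, 0.8])   # y - y_hat for four samples
beta = np.array([1.5, -0.4, 0.0, 2.1])        # model coefficients
lam, lam1, lam2 = 0.1, 0.1, 0.05
rss = np.sum(residuals ** 2)                  # Residual Sum of Squares
l1_penalty = np.sum(np.abs(beta))             # Σ|β|
l2_penalty = np.sum(beta ** 2)                # Σ(β^2)
print("Lasso cost       :", rss + lam * l1_penalty)
print("Ridge cost       :", rss + lam * l2_penalty)
print("Elastic Net cost :", rss + lam1 * l1_penalty + lam2 * l2_penalty)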

Difference Between Ridge Regression and Lasso Regression

Here’s a comparison between Ridge Regression and Lasso Regression:

Criteria | Ridge Regression | Lasso Regression
Type of Regularization | L2 (Euclidean norm) | L1 (Manhattan norm)
Penalty Term | Σ(β^2) | Σ|β|
Encourages Sparsity | No | Yes
Feature Selection | Less likely to perform feature selection | Likely to perform feature selection
Solution Behavior | Tends to shrink coefficients moderately | Can force some coefficients to be exactly zero
Computational Complexity | Typically computationally less expensive | Can be more computationally expensive

Regularization Using Python in Machine Learning

Python provides various libraries and tools to implement regularization techniques. Here’s an example using scikit-learn, a popular machine learning library, to demonstrate regularization:

Ridge Regression Example:

from sklearn.linear_model import Ridge
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Load the California housing dataset
# (load_boston has been removed from recent scikit-learn releases)
data = fetch_california_housing()
X, y = data.data, data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train Ridge Regression model with regularization parameter (alpha)
alpha = 0.1  # Adjust this value to control regularization strength
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
# Evaluate the model
train_preds = ridge.predict(X_train)
test_preds = ridge.predict(X_test)
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")

Lasso Regression Example:

from sklearn.linear_model import Lasso
# Train Lasso Regression model with regularization parameter (alpha)
alpha = 0.1  # Adjust this value to control regularization strength
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
# Evaluate the model
train_preds_lasso = lasso.predict(X_train)
test_preds_lasso = lasso.predict(X_test)
train_mse_lasso = mean_squared_error(y_train, train_preds_lasso)
test_mse_lasso = mean_squared_error(y_test, test_preds_lasso)
print(f"Train MSE (Lasso): {train_mse_lasso:.2f}")
print(f"Test MSE (Lasso): {test_mse_lasso:.2f}")

These examples illustrate how to use Ridge and Lasso Regression from scikit-learn in Python. You can adjust the alpha parameter to control the strength of regularization, thereby observing the impact on model performance.
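
For completeness, Elastic Net can be used in much the same way. The sketch below is illustrative (the alpha and l1_ratio values are assumptions you should tune for your data) and reuses the standardized X_train, X_test, y_train, and y_test from the Ridge example:

from sklearn.linear_model import ElasticNet
# Train an Elastic Net model; l1_ratio balances the L1 and L2 penalties
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
test_preds_elastic = elastic.predict(X_test)
test_mse_elastic = mean_squared_error(y_test, test_preds_elastic)
print(f"Test MSE (Elastic Net): {test_mse_elastic:.2f}")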

When to Use Which Regularization Technique?

Choosing between regularization techniques like Ridge and Lasso depends on the characteristics of your dataset and the trade-offs you’re willing to make in your model. Here’s a guide to help decide when to use each technique:

Use Ridge Regression when:

  • Dealing with Multicollinearity: When there’s multicollinearity among features (high correlation), Ridge Regression handles it better by shrinking coefficients but not eliminating them entirely.
  • Stability of Coefficients: If you prefer more stable coefficients across different samples, Ridge tends to keep all coefficients at smaller values.

Use Lasso Regression when:

  • Feature Selection is Essential: When you have many features and you want some to be completely eliminated, Lasso tends to force some coefficients to zero, effectively performing feature selection.
  • Simplifying the Model: If interpretability is vital and you desire a simpler model with fewer features, Lasso can provide a sparse model with fewer non-zero coefficients.

Consider Elastic Net when:

  • Combining Ridge and Lasso Benefits: When you want to use the advantages of both Ridge and Lasso, Elastic Net combines their effects, offering a balance between feature selection and coefficient stability.
  • Dealing with Multicollinearity and Feature Selection: It’s useful when facing multicollinearity and desiring some level of feature selection, providing a middle ground between Ridge and Lasso.
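
Whichever technique you choose, the regularization strength itself is usually selected empirically, as noted in the hyperparameter-tuning step earlier. One possible sketch uses scikit-learn’s cross-validated estimators (the alpha grid is an illustrative assumption, and the snippet reuses X_train and y_train from the earlier examples):

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
# Search a grid of candidate regularization strengths with cross-validation
alphas = np.logspace(-3, 2, 30)
ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
print("Best alpha (Ridge):", ridge_cv.alpha_)
print("Best alpha (Lasso):", lasso_cv.alpha_)
print("Non-zero Lasso coefficients:", np.sum(lasso_cv.coef_ != 0))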

Summing up

Regularization in machine learning has proven to be a fundamental pillar in constructing models that strike a delicate balance between complexity and generalization. Its role in preventing overfitting and enhancing a model’s ability to generalize to new data is pivotal in the field of machine learning.

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Akash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.