Gradient Descent Using Python and NumPy

Gradient Descent is one of the most important optimization algorithms in machine learning and deep learning. It is an iterative approach used to minimize a function, typically a loss function, by updating the parameters in the direction of steepest descent (the negative gradient).
In this blog, we will explain how Gradient Descent works, implement it from scratch using NumPy, and then visualize the results.

What is Gradient Descent?

Gradient Descent is an optimization algorithm that adjusts the parameters of a model (such as weights and biases) in order to minimize the loss function. The process of Gradient Descent consists of the following steps:

  1. Compute the Gradient: You have to calculate the derivative (slope) of the loss function with respect to each parameter.
  2. Update the Parameters: To reduce the loss, you have to move in the opposite direction of the gradient.
  3. Repeat Until Convergence: You have to continue updating the parameters iteratively until the loss stabilizes or reaches a minimum.

Mathematically, gradient descent updates the parameters using the formula

                                              𝛳 = 𝛳 − 𝛼 · ∂J/∂𝛳

where:

  • 𝛳 represents the parameters (weights, biases, etc.).
  • 𝛼 represents the learning rate (step size).
  • J represents the cost (loss) function.
  • ∂J/∂𝛳 is the gradient (derivative) of the loss function with respect to the parameters.

Implementation of Gradient Descent from Scratch

Now let’s have a look at the implementation of Gradient Descent using NumPy. Here, we will use a simple linear regression problem where our aim will be to find the best-fit line:

                                              y = mx + b

  1. Generating Data

Here, we have to create a dataset of points that follow a linear relationship with some noise.

Example:

Python

Output:

Generating Data

Explanation:

The above code generates a synthetic dataset for linear regression, y = 4 + 3x + noise, and plots the data points as a scatter plot.
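Since the code block itself is missing above, here is a minimal sketch of the data-generation step it describes (the seed value and exact noise scale are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # assumed seed, for reproducibility

# 100 points following y = 4 + 3x plus Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

plt.scatter(X, y)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Generated Data")
plt.show()
```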

  2. Implementing the Cost Function

For the implementation of the cost function, we use Mean Squared Error:

Example:

Python

Explanation:

The function compute_cost(X, y, m, b) calculates the Mean Squared Error (MSE): it computes predictions using the linear equation y = mX + b, measures the squared differences from the actual values, averages them, and returns the cost.

On its own, this code produces no output, because the function has not yet been called with input values.
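A sketch of the cost function described above (the code itself is omitted from the post):

```python
import numpy as np

def compute_cost(X, y, m, b):
    """Mean Squared Error for the line y = m*X + b."""
    predictions = m * X + b       # model predictions
    errors = predictions - y      # differences from actual values
    return np.mean(errors ** 2)   # average of squared differences
```

For example, compute_cost(np.array([1.0, 2.0]), np.array([2.0, 4.0]), 2.0, 0.0) returns 0.0, since the line y = 2x fits those points exactly.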

  3. Implementation of the Gradient Descent Algorithm

The implementation of gradient descent for the optimization of m and b is given below.

Example:

Python

Explanation:

The function gradient_descent(X, y, m, b, learning_rate, iterations) optimizes the parameters m and b by updating them iteratively based on their partial derivatives. It stores the cost at each step and prints progress after every 10 iterations. The code above does not generate output by itself, because the function has not yet been called with actual input values.
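The function itself is not shown above; a plausible version matching the description (storing the cost per step and printing progress every 10 iterations) is:

```python
import numpy as np

def gradient_descent(X, y, m, b, learning_rate, iterations):
    n = len(X)
    cost_history = []
    for i in range(iterations):
        predictions = m * X + b
        error = predictions - y
        # Partial derivatives of the MSE cost with respect to m and b
        dm = (2 / n) * np.sum(error * X)
        db = (2 / n) * np.sum(error)
        m -= learning_rate * dm
        b -= learning_rate * db
        cost = np.mean(error ** 2)
        cost_history.append(cost)
        if i % 10 == 0:
            print(f"Iteration {i}: cost = {cost:.4f}, m = {m:.4f}, b = {b:.4f}")
    return m, b, cost_history
```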

  4. Running Gradient Descent

Now, you have to initialize m and b randomly and run gradient descent to optimize them.

Example:

Python

Output:

Running Gradient Descent

Explanation:

The above code initializes the parameters and runs the gradient_descent function to optimize m and b for a linear regression model, given a dataset (X, y), a learning rate, and an iteration count.
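Putting the pieces together, the driver code probably looked something like the sketch below (the update loop is repeated inline so the snippet runs on its own; the hyperparameter values are assumptions):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)

# Random initialization of the parameters
m = np.random.randn()
b = np.random.randn()
learning_rate = 0.1
iterations = 100

n = len(X)
cost_history = []
for i in range(iterations):
    error = (m * X + b) - y
    m -= learning_rate * (2 / n) * np.sum(error * X)
    b -= learning_rate * (2 / n) * np.sum(error)
    cost_history.append(np.mean(error ** 2))

print(f"Learned m = {m:.3f}, b = {b:.3f}")  # should approach m ≈ 3, b ≈ 4
```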

  5. Visualizing the Cost Reduction

Example:

Python

Output:

Visualizing the Cost Reduction

Explanation:

The above code is used to plot the cost function’s reduction over iterations during gradient descent. This shows how the model’s error decreases as training progresses.
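A self-contained sketch of this plot (re-running a small gradient-descent loop inline, since the earlier code blocks are not shown):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)

m, b, lr, n = 0.0, 0.0, 0.1, len(X)
cost_history = []
for _ in range(100):
    error = (m * X + b) - y
    m -= lr * (2 / n) * np.sum(error * X)
    b -= lr * (2 / n) * np.sum(error)
    cost_history.append(np.mean(error ** 2))

plt.plot(cost_history)
plt.xlabel("Iteration")
plt.ylabel("Cost (MSE)")
plt.title("Cost Reduction over Iterations")
plt.show()
```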

  6. Plotting the Final Regression Line

Example:

Python

Output:

Plotting the Final Regression Line

Explanation:

The above code is used to visualize the dataset as a scatter plot. It also overlays the best-fit regression line (learned using gradient descent), which helps to illustrate the linear relationship.
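The plotting code is omitted above; a sketch of what it likely looked like (the fitting loop is repeated so the snippet is self-contained):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)

# Fit m and b with a short gradient-descent loop
m, b, lr, n = 0.0, 0.0, 0.1, len(X)
for _ in range(1000):
    error = (m * X + b) - y
    m -= lr * (2 / n) * np.sum(error * X)
    b -= lr * (2 / n) * np.sum(error)

plt.scatter(X, y, label="Data")
xs = np.linspace(0, 2, 100)
plt.plot(xs, m * xs + b, color="red", label="Best-fit line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```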

Implementing Stochastic, Batch, and Mini-Batch Gradient Descent

Here, we are going to talk about the different variations of Gradient Descent that handle data in different ways.

  1. Batch Gradient Descent (BGD)

This variation of gradient descent computes the gradients using the entire dataset in each iteration.

Example:

Python

Output:

Explanation:

The above code implements Batch Gradient Descent to optimize the parameters (m and b) of a simple linear regression model. It minimizes the cost function iteratively over 100 iterations with a learning rate of 0.1, printing the cost along with the updated values of m and b after every 10 iterations.
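A sketch of the Batch Gradient Descent described above (the hyperparameter defaults follow the explanation; the dataset is illustrative):

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, iterations=100):
    n = len(X)
    m, b = 0.0, 0.0
    for i in range(iterations):
        # Gradients are computed over the ENTIRE dataset in each iteration
        error = (m * X + b) - y
        m -= learning_rate * (2 / n) * np.sum(error * X)
        b -= learning_rate * (2 / n) * np.sum(error)
        if i % 10 == 0:
            cost = np.mean(error ** 2)
            print(f"Iteration {i}: cost = {cost:.4f}, m = {m:.4f}, b = {b:.4f}")
    return m, b

np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)
m, b = batch_gradient_descent(X, y)
```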

  2. Stochastic Gradient Descent (SGD)

Unlike BGD, SGD updates the weights after processing each data point, which makes it faster but noisier.

Example:

Python

Output:

Stochastic Gradient Descent (SGD)

Explanation:

The above code implements Stochastic Gradient Descent (SGD) for linear regression: it iteratively updates the slope (m) and intercept (b) using randomly selected data points to minimize the cost function, and prints progress as it goes.
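A sketch of the SGD variant described above (one randomly chosen sample per update; here progress is printed every 100 steps, and the learning rate and iteration count are assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.05, iterations=1000):
    n = len(X)
    m, b = 0.0, 0.0
    for i in range(iterations):
        # Update using ONE randomly chosen data point per step
        idx = np.random.randint(n)
        error = (m * X[idx] + b) - y[idx]
        m -= learning_rate * 2 * error * X[idx]
        b -= learning_rate * 2 * error
        if i % 100 == 0:
            cost = np.mean(((m * X + b) - y) ** 2)
            print(f"Iteration {i}: cost = {cost:.4f}")
    return m, b

np.random.seed(0)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + 0.1 * np.random.randn(100)
m, b = stochastic_gradient_descent(X, y)
```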

  3. Choosing the Optimal Learning Rate

It is important to choose the right learning rate in gradient descent.

Example:

Python

Output:

effect of learning rate on convergence

Explanation:

The above code compares how different learning rates (0.001, 0.01, 0.1) affect the convergence of Batch Gradient Descent by plotting the cost reduction over 100 iterations for each rate in a different color. Note that plt.show() must be called at the end, or the plot will not be displayed.
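A sketch of the comparison described above (the rates follow the text; colors are assumptions, and plt.show() is included so the figure displays):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)
n = len(X)

histories = {}
for lr, color in [(0.001, "red"), (0.01, "green"), (0.1, "blue")]:
    m, b = 0.0, 0.0
    costs = []
    for _ in range(100):
        error = (m * X + b) - y
        m -= lr * (2 / n) * np.sum(error * X)
        b -= lr * (2 / n) * np.sum(error)
        costs.append(np.mean(error ** 2))
    histories[lr] = costs
    plt.plot(costs, color=color, label=f"learning rate = {lr}")

plt.xlabel("Iteration")
plt.ylabel("Cost (MSE)")
plt.legend()
plt.show()
```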

  4. Gradient Descent with Multiple Features

For multiple variables, the model is written in matrix form as

                          y = X𝛳

and gradient descent updates all the parameters at once using 𝛳 = 𝛳 − (𝛼/n) · Xᵀ(X𝛳 − y), where n is the number of samples.

Example:

Python

Explanation:

In the above code, the function multi_variate_gradient_descent does not print anything by itself, which is why the code does not generate any output.
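Since the function's code is not shown, here is a plausible vectorized sketch, with an illustrative dataset added so the result can be inspected (the defaults and the bias-column convention are assumptions):

```python
import numpy as np

def multi_variate_gradient_descent(X, y, learning_rate=0.1, iterations=1000):
    """Vectorized gradient descent for y = X @ theta (X includes a bias column)."""
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        gradient = (2 / n) * X.T @ (X @ theta - y)  # ∂J/∂θ for MSE
        theta -= learning_rate * gradient
    return theta

np.random.seed(42)
features = 2 * np.random.rand(100, 2)
X = np.c_[np.ones(100), features]      # prepend a bias column of ones
true_theta = np.array([4.0, 3.0, 1.5])
y = X @ true_theta + 0.1 * np.random.randn(100)

theta = multi_variate_gradient_descent(X, y)
print(theta)  # approaches [4, 3, 1.5]
```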

  5. Implementing Gradient Descent in Real-World Applications

Gradient descent is used in real-world applications such as:

  • Deep Learning: Used for optimizing neural networks.
  • Logistic Regression: Used for classification problems.
  • Support Vector Machines (SVMs): Used for finding hyperplanes.

Conclusion

From this blog, we can conclude that Gradient Descent is the backbone of machine learning optimization. The key to building effective models is understanding its variations (Batch, SGD, Mini-Batch) and hyperparameter tuning (learning rate, batch size). By using Python and NumPy, you can efficiently implement and visualize the way gradient descent minimizes errors and optimizes predictions. If you want to learn more about this technology, then check out our Comprehensive Data Science Course.

FAQs

1. What is Gradient Descent, and why is it important?

Gradient descent is an optimization algorithm used to minimize the cost function of machine learning and deep learning models. It iteratively updates the model's parameters by moving in the direction of the steepest descent (negative gradient), which helps reach the minimum cost. It is used mostly in regression, neural networks, and optimization problems.

2. What are the different types of gradient descent?

The 3 types of Gradient Descent are:

  • Batch Gradient Descent (BGD): It uses the entire dataset to compute the gradients that update the parameters in each iteration.

  • Stochastic Gradient Descent (SGD): It updates the parameters using one random data point at a time, which makes updates faster but noisier.

  • Mini-Batch Gradient Descent: It is a compromise between BGD and SGD that updates the parameters using small batches of data, and is therefore able to balance efficiency and stability.

3. How do we implement Gradient Descent using Python and NumPy?

Given below is a simple implementation of Batch Gradient Descent:

Example:

Python

Output:

Batch Gradient Descent

Explanation:

The above code is a simple implementation of Batch Gradient Descent: it iteratively updates the parameters using gradients computed over the entire dataset until the cost is minimized.
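The code block itself is missing from this answer; a minimal Batch Gradient Descent sketch (all values are illustrative):

```python
import numpy as np

np.random.seed(1)
X = 2 * np.random.rand(50)
y = 4 + 3 * X + np.random.randn(50)

m, b, lr, n = 0.0, 0.0, 0.1, len(X)
for _ in range(200):
    error = (m * X + b) - y      # predictions minus targets
    m -= lr * (2 / n) * np.sum(error * X)
    b -= lr * (2 / n) * np.sum(error)

print(f"m = {m:.2f}, b = {b:.2f}")
```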

4. What is the role of the learning rate in Gradient Descent?

The role of learning rate is to control how much the parameters (weights) are updated in each step of Gradient Descent.

  • If the algorithm has a high learning rate, it will cause the algorithm to overshoot and diverge.

  • A very low learning rate results in slow convergence of the model. This forces the model to take too many iterations to reach the minimum.

  • By choosing an optimal learning rate, it ensures faster and stable convergence.

Example:

Python

Output:

learning rate convergence

Explanation:

The above code is used to compare the impact of different learning rates (0.001, 0.01, 0.1) on the convergence of batch gradient descent. This is done by plotting the cost function over 100 iterations for each learning rate by using different colors.
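A compact sketch of the kind of comparison described (printed rather than plotted, to keep it short; the dataset is illustrative):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100)
y = 4 + 3 * X + np.random.randn(100)
n = len(X)

results = {}
for lr in (0.001, 0.01, 0.1):
    m, b = 0.0, 0.0
    for _ in range(100):
        error = (m * X + b) - y
        m -= lr * (2 / n) * np.sum(error * X)
        b -= lr * (2 / n) * np.sum(error)
    results[lr] = np.mean(((m * X + b) - y) ** 2)
    print(f"learning rate {lr}: final cost = {results[lr]:.4f}")
```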

5. How can we improve the performance of Gradient Descent?

To improve the performance of Gradient Descent, you can consider the following steps:

  • Feature Scaling: Normalize the data using standardization (Z-score) or Min-Max scaling, which helps to speed up convergence.
  • Adaptive Learning Rates: Use optimizers like Adam, Momentum, or RMSprop, which adjust the learning rate dynamically.
  • Choosing an Optimal Learning Rate: Use techniques like learning rate scheduling or grid search.
  • Using Mini-Batch Gradient Descent: Combine the benefits of both BGD and SGD for faster learning.
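As an example of the first point, Z-score standardization is a one-liner with NumPy; this sketch (with an illustrative large-scale feature) applies it before running gradient descent:

```python
import numpy as np

np.random.seed(0)
# A feature on a large scale, which slows down unscaled gradient descent
X = 1000 * np.random.rand(100)
y = 4 + 0.003 * X + np.random.randn(100)

# Z-score standardization: zero mean, unit variance
X_scaled = (X - X.mean()) / X.std()

m, b, lr, n = 0.0, 0.0, 0.1, len(X)
for _ in range(200):
    error = (m * X_scaled + b) - y
    m -= lr * (2 / n) * np.sum(error * X_scaled)
    b -= lr * (2 / n) * np.sum(error)

print(f"cost after scaling: {np.mean(((m * X_scaled + b) - y) ** 2):.4f}")
```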

About the Author

Principal Data Scientist

Meet Akash, a Principal Data Scientist with expertise in advanced analytics, machine learning, and AI-driven solutions. With a master’s degree from IIT Kanpur, Akash combines technical knowledge with industry insights to deliver impactful, scalable models for complex business challenges.
