Introduction to Linear Regression in Python
Linear regression is a supervised machine learning algorithm that is used to predict a continuous value based on a set of independent variables. What is regression? Regression is a simple yet powerful technique that can be used to solve a variety of problems, such as predicting house prices, sales figures, and customer behavior.
Here’s an interesting video on What is Linear Regression:
Without much delay, let’s get started.
What is Linear Regression?
As mentioned above, linear regression is a predictive modeling technique. It is used whenever there is a linear relation between the dependent and independent variables.
Y = b0 + b1* x
It is used to estimate exactly how much of y will change when x changes a certain amount.
As we see in the picture, a flower’s sepal length is mapped onto the x-axis, and the petal length is mapped onto the y-axis. Let us try to understand how the petal length changes with respect to the sepal length with the help of linear regression. Let us have a better understanding of linear regression with another example given below.
Example:
Let’s assume there is a telecom network called Neo. Its delivery manager wants to find out if there’s a relationship between the monthly charges of a customer and the tenure of the customer. So, he collects all customer data and implements linear regression by taking monthly charges as the dependent variable and tenure as the independent variable. After implementing the algorithm, what he understands is that there is a relationship between the monthly charges and the tenure of a customer. As the tenure of the customer increases, the monthly charges also increase. Now, the best-fit line helps the delivery manager find out more interesting insights from the data. With this, he can predict the value of y for every new value of x.
Let us say the tenure of a customer is 45 months, and with the help of the best-fit line, the delivery manager can predict that the customer’s monthly charges would be around $64.
Similarly, if the tenure of a customer is 69 months, with the help of the best-fit line, the delivery manager can predict that the customer’s monthly charges would be around $110.
This is how linear regression works. Now, the question is how to find the best-fit line.
Linear Regression Line of Best Fit
The line of best fit is nothing but the line that best expresses the relationship between the data points. Let us see how to find the best fit line in linear regression.
This is where the residual concept comes into play that is shown in the image below:
The red lines in the above image denote residual values, which are the differences between the actual values and the predicted values. How does residual help in finding the best-fit line?
To find out the best-fit line, we have something called the residual sum of squares (RSS). In RSS, we take the square of the residuals and sum them up.
The line with the lowest value of RSS is the best-fit line.
Now, let us see how the coefficient of x influences the relationship between the independent and dependent variables.
Regression Coefficient
In simple linear regression, if the coefficient of x is positive, we can conclude that the relationship between the independent and dependent variables is positive.
Here, if the value of x increases, the value of y also increases.
Now, if the coefficient of x is negative, we can say that the relationship between the independent and dependent variables is negative.
Here, if the value of x increases, the value of y decreases.
Now, let us see how we can apply these concepts to build linear regression models. In the below given Python linear regression examples, we will be building two machine learning models for simple and multiple linear regression. Let’s begin.
Practical Application: Linear Regression with Python’s Scikit-learn
- Dataset in Focus: Boston Housing Price Records
- Environment: Python 3 and Jupyter Notebook
- Library: Pandas
- Module: Scikit-learn
Dataset Overview
Before diving into the linear regression exercise using Python, it’s crucial to familiarize ourselves with the dataset. We’ll be analyzing the Boston Housing Price Dataset, which comprises 506 entries and 13 attributes, along with a target column. Let’s briefly inspect this dataset.
Let’s take a quick look at the dataset columns:
- Crim: Crime rate per capita by town
- Zn: Fraction of residential land allocated for large plots (over 25,000 sq. ft.)
- Indus: Fraction of non-retail business acres in town
- Chas: Indicator for Charles River proximity (1 if close; 0 if not)
- Nox: PPM (parts per 10 million) concentration of nitrogen oxides
- Rm: Typical number of rooms in a residence
- Age: Fraction of homes built before 1940
- Dis: Average distance to five major Boston workplaces
- Rad: Proximity index to major highways
- Tax: Property tax rate (per $10,000)
- Ptratio: Student-to-teacher ratio in town
- Black: Value calculated as 1000(Bk – 0.63)^2, where Bk represents the fraction of Black residents in town
- Lstat: Percentage of the population with lower status
- Medv: Median price of homes (in $1000s)
In this linear regression tutorial, our objective is to develop two predictive models for housing prices.
Model Development
With a clear understanding of our dataset, let’s proceed to construct our linear regression models in Python.
Univariate Linear Regression in Python
Take ‘lstat’ as independent and ‘medv’ as dependent variables or Using ‘lstat’ as the predictor and ‘medv’ as the response:
Step 1: Initialize the Boston dataset
Step 2: Examine dataset dimensions
Step 3: Preview predictor and response variables
Step 4: Visualize variable trends
Step 5: Segregate data into predictors and responses
Step 6: Partition data for training and testing
Step 7: Review dimensions of training and test datasets
Step 8: Start the model training
Step 9: Extract the y-intercept
Step 10: Extract the regression coefficient
Step 11: Generate predictions
Step 12: Compare with actual values
Step 13: Assess model performance
Multivariate Linear Regression in Python
Here, consider ‘medv’ as the dependent variable and the rest of the attributes as independent variables or using ‘medv’ as the response and all other attributes as predictors:
Step 1: Initialize the Boston dataset
Step 2: Define response and predictor variables
Step 3: Preview predictor variables
Step 4: Preview response variable
Step 5: Partition data for training and testing
Step 6: Review dimensions of training and test datasets
Step 7: Commence model training
Step 8: Examine chosen model coefficients
Step 9: Contrast predicted with actual values
Step 10: Comparing the predicted value to the actual value:
Step 11: Assess model performance
What Did We Learn?
In this module, we have covered the basics of linear regression in Python, including the best-fit line, the coefficient of x, and how to build simple and multiple linear regression models using sklearn. In the next module, we will discuss logistic regression, which is a type of regression analysis that is used to predict the probability of an event occurring.