Linear Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It fits a straight line to predict outcomes based on input data. Commonly used in trend analysis and forecasting, it helps in making data-driven decisions across various industries like finance, healthcare, and marketing.
Introduction to Linear Regression
Linear regression is a predictive modeling technique. It is used whenever there is a linear relation between the dependent and independent variables.
It estimates how much the dependent variable "y" will change when the independent variable "x" changes by a certain amount. The equation of the line is:
Y = b0 + b1*x
where b0 is the intercept (the value of Y when x is 0) and b1 is the slope (the change in Y for a one-unit change in x).
For example, consider a scatter plot of iris flowers in which each flower's sepal length is mapped onto the x-axis and its petal length onto the y-axis; linear regression fits a straight line through those points to predict petal length from sepal length.
1. Understanding Linear Regression
Neo, a telecom network, wants to analyze the relationship between customer tenure and monthly charges. The delivery manager applies linear regression, using tenure as the independent variable and monthly charges as the dependent variable. The results show a positive correlation: longer tenure is associated with higher charges. The best-fit line helps predict future charges based on tenure.
Let us say the tenure of a customer is 45 months, and with the help of the best-fit line, the delivery manager can predict that the customer’s monthly charges would be around $64.
Similarly, if the tenure of a customer is 69 months, with the help of the best-fit line, the delivery manager can predict that the customer’s monthly charges would be around $110.
This is how linear regression works. Now, the question is how to find the best-fit line.
What is the Best Fit Line?
The line of best fit is nothing but the line that best expresses the relationship between the data points. Let us see how to find the best fit line in linear regression.
This is where the concept of residuals comes into play. A residual is the difference between an actual value and the value predicted by the regression line.
1. How do residuals help in finding the best-fit line?
To find the best-fit line, we use a quantity called the residual sum of squares (RSS): we square each residual and then sum the squares.
The line with the lowest value of RSS is the best-fit line.
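To make this concrete, here is a minimal sketch (with made-up data and two hypothetical candidate lines) that computes the RSS for each line:
import numpy as np
# Made-up sample data, purely for illustration
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
# Two hypothetical candidate lines of the form y = b0 + b1*x
pred_a = 0.2 + 2.0 * x
pred_b = 1.0 + 1.5 * x
# RSS: square each residual (actual - predicted) and sum the squares
rss_a = np.sum((y - pred_a) ** 2)
rss_b = np.sum((y - pred_b) ** 2)
print(rss_a, rss_b)  # line A has the lower RSS, so it is the better fit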
2. How do the coefficients influence the relationship between the independent and dependent variables?
In simple linear regression, if the coefficient of x is positive, we can conclude that the relationship between the independent and dependent variables is positive.
Here, if the value of x increases, the value of y also increases.
Now, if the coefficient of x is negative, we can say that the relationship between the independent and dependent variables is negative.
Here, if the value of x increases, the value of y decreases.
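As a quick illustration with synthetic data (not from any dataset used in this tutorial), fitting a simple linear regression to positively and negatively related data shows the sign of the coefficient:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.arange(10).reshape(-1, 1)  # independent variable as a column vector
y_up = 3 + 2 * x.ravel()          # y increases as x increases
y_down = 50 - 4 * x.ravel()       # y decreases as x increases
print(LinearRegression().fit(x, y_up).coef_)    # positive coefficient, about 2
print(LinearRegression().fit(x, y_down).coef_)  # negative coefficient, about -4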
Cost Function of Linear Regression
A cost function quantifies the error between the predicted and actual values in a model. In Linear Regression, the most commonly used cost function is Mean Squared Error (MSE).
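A minimal sketch of MSE as a Python function (the data values are made up for illustration):
import numpy as np
def mse(y_true, y_pred):
    # average of the squared differences between actual and predicted values
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
print(mse([3, 5, 7], [2.8, 5.4, 6.9]))  # 0.07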
Evaluation Metrics for Linear Regression
Evaluation metrics measure the quality of a statistical or machine learning model. Key metrics include R-squared, Adjusted R-squared, MSE, RMSE, and MAE, helping assess model accuracy and predictive performance.
1. R-squared (R²)
R² measures the proportion of variance in the dependent variable explained by the independent variables. A value closer to 1 indicates a better fit, while 0 means the model explains none of the variance. However, adding more variables can artificially inflate R².
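In terms of the residual sum of squares introduced earlier, R² = 1 − RSS/TSS, where TSS (the total sum of squares) measures the variation of the dependent variable around its mean.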
2. Adjusted R-squared
Unlike R², Adjusted R-squared accounts for the number of predictors in the model. It penalizes unnecessary variables, making it more reliable for multiple regression.
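The standard formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of predictors.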
3. MSE (Mean Squared Error)
MSE calculates the average of the squared differences between actual and predicted values. A lower MSE means better model performance; because the errors are squared, large errors are penalized more heavily than small ones.
4. RMSE (Root Mean Squared Error)
RMSE is the square root of MSE, making it more interpretable as it is in the same unit as the target variable. Lower RMSE indicates a better model.
5. MAE (Mean Absolute Error)
MAE measures the average absolute difference between actual and predicted values. Unlike MSE, it treats all errors equally and is less sensitive to outliers.
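R², MSE, RMSE, and MAE can all be computed with scikit-learn; here is a quick sketch using made-up actual and predicted values:
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_true = np.array([22.0, 30.5, 18.2, 27.9])  # made-up actual values
y_pred = np.array([21.4, 31.0, 19.0, 26.5])  # made-up predictions
print("R-squared:", r2_score(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:", mean_absolute_error(y_true, y_pred))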
Types of Linear Regression
Linear Regression can be categorized into different types based on the number of independent variables and the nature of the data.
1. Simple Linear Regression
Simple linear regression models the relationship between a single independent variable and a single dependent variable, and is useful for prediction and for understanding how the two variables are related.
Y = m*x + c, where m is the slope of the line and c is the intercept.
2. Multiple Linear Regression
Multiple linear regression is similar to simple linear regression, but it includes more than one independent variable, meaning we attempt to predict a value based on two or more variables.
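With n independent variables, the equation extends to Y = b0 + b1*x1 + b2*x2 + … + bn*xn. Here is a minimal scikit-learn sketch with two synthetic predictors (the data and coefficients are invented for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.random((100, 2))                 # two synthetic predictors
y = 1.5 + 3.0 * X[:, 0] - 2.0 * X[:, 1]  # a known linear relationship
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)     # recovers about 1.5 and [3.0, -2.0]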
3. Polynomial Regression
Polynomial regression is a type of regression analysis that fits the data using higher-degree functions of the independent variable, such as squares and cubes. This allows it to capture more complex, non-linear relationships between the variables than simple linear regression.
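As a minimal sketch, polynomial regression can be implemented in scikit-learn by expanding the input into polynomial features and then fitting an ordinary linear regression (the quadratic data below is synthetic):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 0.2, 50)
# Expand x into [x, x^2], then fit a linear model on the expanded features
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)  # roughly 1 and [2, 0.5]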
Python Implementation of Linear Regression
Before diving into the linear regression exercise using Python, it’s crucial to familiarize ourselves with the dataset. We’ll be analyzing the Boston Housing Price Dataset, which comprises 506 entries and 13 attributes, along with a target column. Let’s briefly inspect this dataset.
1. Data Description
- Crim: Crime rate per capita by town
- Zn: Fraction of residential land allocated for large plots (over 25,000 sq. ft.)
- Indus: Fraction of non-retail business acres in town
- Chas: Indicator for Charles River proximity (1 if close; 0 if not)
- Nox: Nitrogen oxide concentration (parts per 10 million)
- Rm: Typical number of rooms in a residence
- Age: Fraction of homes built before 1940
- Dis: Weighted distances to five Boston employment centers
- Rad: Proximity index to major highways
- Tax: Property tax rate (per $10,000)
- Ptratio: Student-to-teacher ratio in town
- Black: Value calculated as 1000(Bk – 0.63)^2, where Bk represents the fraction of Black residents in town
- Lstat: Percentage of the population with lower status
- Medv: Median price of homes (in $1000s)
In this linear regression tutorial, our objective is to develop two predictive models for housing prices.
2. Model Development
With a clear understanding of our dataset, let’s proceed to construct our linear regression models in Python.
Univariate Linear Regression in Python
We use 'lstat' as the predictor (independent variable) and 'medv' as the response (dependent variable):
Step 1: Load the Boston Dataset
import pandas as pd
data = pd.read_csv("Boston1.csv")
Step 2: Data Inspection Phase
data.head()
Step 3: Select the predictor and response columns
data = data.loc[:, ["lstat", "medv"]]
Step 4: Visualize variable trends
import matplotlib.pyplot as plt
data.plot(x = "lstat", y = "medv", style = "o")
plt.xlabel("lstat")
plt.ylabel("medv")
plt.show()
Step 5: Segregate data into predictors and responses
x = pd.DataFrame(data["lstat"])
y = pd.DataFrame(data["medv"])
Step 6: Partition data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
Step 7: Review dimensions of training and test datasets
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
Step 8: Start the model training
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
Step 9: Extract the y-intercept
print(regressor.intercept_)
Step 10: Extract the regression coefficient
print(regressor.coef_)
Step 11: Generate predictions
y_pred = regressor.predict(x_test)  # predict on the test set
print(y_pred)
Step 12: Compare with actual values
print(y_test)
Step 13: Assess model performance
import numpy as np
from sklearn import metrics
print("Mean Absolute Error", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error", np.sqrt(metrics.mean_absolute_error(y_test, y_pred)))
Applications of Linear Regression
Linear Regression is widely used in various domains for predictive analysis and decision-making. Here are some key applications:
1. Evaluating trends and sales estimates
Businesses use Linear Regression to examine past sales data and forecast future trends. By understanding how factors such as pricing, promotions, and customer demand affect sales, they can optimize inventory, marketing efforts, and revenue estimates.
2. Market analysis
Market analysts use Linear Regression to analyze consumer behavior, competitive pricing, and demand variations. It assists organizations in determining how variables such as advertising expenditure, economic conditions, and customer preferences affect product success in the market.
3. Healthcare
In medicine, Linear Regression is used to forecast disease development, patient recovery rates, and treatment results based on age, lifestyle, and medical history. It is also useful for calculating healthcare expenses and optimizing resource allocation.
4. Forecasting consumer spending
Economists and businesses use Linear Regression to forecast consumer spending based on income, inflation, and interest rates. This aids budgeting, pricing strategies, and financial planning to keep up with market demand.
Advantages and Disadvantages of Linear Regression
1. Advantages of Linear Regression
- Linear Regression is easy to understand and interpret, making it a great starting point for statistical modeling.
- It requires minimal computational power, making it ideal for large datasets.
- If there is a linear relationship between the independent and dependent variables, Linear Regression performs well.
- Helps identify the impact of each independent variable on the dependent variable using coefficients.
- Frequently used in forecasting sales, trends, and other business metrics.
2. Disadvantages of Linear Regression
- Linear Regression assumes a straight-line relationship, which may not hold for complex, non-linear data.
- Outliers can significantly affect the regression line, leading to inaccurate predictions.
- When independent variables are highly correlated, it can distort coefficient estimates and reduce model reliability.
- It does not handle categorical data well unless properly encoded.
- With too many features, the model may fit the training data too well but perform poorly on new data.
Conclusion
Linear Regression is a powerful yet simple technique for analyzing relationships and making predictions. While it has limitations, its efficiency and interpretability make it a key tool in data science. If you want to master Linear Regression and other ML techniques, then you should join our Data Science course today!