Linear Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It fits a straight line to predict outcomes based on input data. Commonly used in trend analysis and forecasting, it helps in making data-driven decisions across various industries like finance, healthcare, and marketing.
Introduction to Linear Regression
Linear regression is a predictive modeling technique. It is used whenever there is a linear relation between the dependent and independent variables.
It estimates how much the dependent variable "y" will change when the independent variable "x" changes by a certain amount. The equation of the line is:
Y = b0 + b1*x
where b0 is the intercept (the value of Y when x is 0) and b1 is the slope (the change in Y for a one-unit change in x).
For example, consider a scatter plot of iris flowers in which each flower's sepal length is mapped onto the x-axis and its petal length onto the y-axis; linear regression fits a straight line through those points to predict petal length from sepal length.
1. Understanding Linear Regression
Neo, a telecom network, wants to analyze the relationship between customer tenure and monthly charges. The delivery manager applies linear regression, using tenure as the independent variable and monthly charges as the dependent variable. The results show a positive correlation: longer tenure is associated with higher charges. The best-fit line helps predict future charges based on tenure.
Let us say the tenure of a customer is 45 months, and with the help of the best-fit line, the delivery manager can predict that the customer’s monthly charges would be around $64.
Similarly, if the tenure of a customer is 69 months, with the help of the best-fit line, the delivery manager can predict that the customer’s monthly charges would be around $110.
This is how linear regression works. Now, the question is how to find the best-fit line.
What is the Best Fit Line?
The line of best fit is nothing but the line that best expresses the relationship between the data points. Let us see how to find the best fit line in linear regression.
This is where the concept of residuals comes into play. A residual is the difference between an actual value and the value predicted by the regression line.
1. How do residuals help in finding the best-fit line?
To find the best-fit line, we use a quantity called the residual sum of squares (RSS): we square each residual and then sum the squares.
The line with the lowest value of RSS is the best-fit line.
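To make this concrete, here is a minimal sketch (with made-up data and two hypothetical candidate lines) that computes the RSS for each line:
import numpy as np
# Made-up sample data, purely for illustration
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
# Two hypothetical candidate lines of the form y = b0 + b1*x
pred_a = 0.2 + 2.0 * x
pred_b = 1.0 + 1.5 * x
# RSS: square each residual (actual - predicted) and sum the squares
rss_a = np.sum((y - pred_a) ** 2)
rss_b = np.sum((y - pred_b) ** 2)
print(rss_a, rss_b)  # line A has the lower RSS, so it is the better fit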
2. How do the coefficients influence the relationship between the independent and dependent variables?
In simple linear regression, if the coefficient of x is positive, we can conclude that the relationship between the independent and dependent variables is positive.
Here, if the value of x increases, the value of y also increases.
Now, if the coefficient of x is negative, we can say that the relationship between the independent and dependent variables is negative.
Here, if the value of x increases, the value of y decreases.
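As a quick illustration with synthetic data (not from any dataset used in this tutorial), fitting a simple linear regression to positively and negatively related data shows the sign of the coefficient:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.arange(10).reshape(-1, 1)  # independent variable as a column vector
y_up = 3 + 2 * x.ravel()          # y increases as x increases
y_down = 50 - 4 * x.ravel()       # y decreases as x increases
print(LinearRegression().fit(x, y_up).coef_)    # positive coefficient, about 2
print(LinearRegression().fit(x, y_down).coef_)  # negative coefficient, about -4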
Cost Function of Linear Regression
A cost function quantifies the error between the predicted and actual values in a model. In Linear Regression, the most commonly used cost function is Mean Squared Error (MSE).
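A minimal sketch of MSE as a Python function (the data values are made up for illustration):
import numpy as np
def mse(y_true, y_pred):
    # average of the squared differences between actual and predicted values
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
print(mse([3, 5, 7], [2.8, 5.4, 6.9]))  # 0.07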
Evaluation Metrics for Linear Regression
Evaluation metrics measure the quality of a statistical or machine learning model. Key metrics include R-squared, Adjusted R-squared, MSE, RMSE, and MAE, helping assess model accuracy and predictive performance.
1. R-squared (R²)
R² measures the proportion of variance in the dependent variable explained by the independent variables. A value closer to 1 indicates a better fit, while 0 means the model explains none of the variance. However, adding more variables can artificially inflate R².
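In terms of the residual sum of squares introduced earlier, R² = 1 − RSS/TSS, where TSS (the total sum of squares) measures the variation of the dependent variable around its mean.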
2. Adjusted R-squared
Unlike R², Adjusted R-squared accounts for the number of predictors in the model. It penalizes unnecessary variables, making it more reliable for multiple regression.
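The standard formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of predictors.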
3. MSE (Mean Squared Error)
MSE calculates the average of the squared differences between actual and predicted values. A lower MSE means better model performance; because the errors are squared, large errors are penalized more heavily than small ones.
4. RMSE (Root Mean Squared Error)
RMSE is the square root of MSE, making it more interpretable as it is in the same unit as the target variable. Lower RMSE indicates a better model.
5. MAE (Mean Absolute Error)
MAE measures the average absolute difference between actual and predicted values. Unlike MSE, it treats all errors equally and is less sensitive to outliers.
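R², MSE, RMSE, and MAE can all be computed with scikit-learn; here is a quick sketch using made-up actual and predicted values:
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_true = np.array([22.0, 30.5, 18.2, 27.9])  # made-up actual values
y_pred = np.array([21.4, 31.0, 19.0, 26.5])  # made-up predictions
print("R-squared:", r2_score(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:", mean_absolute_error(y_true, y_pred))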
Types of Linear Regression
Linear Regression can be categorized into different types based on the number of independent variables and the nature of the data.
1. Simple Linear Regression
Simple linear regression models the relationship between a single independent variable and a single dependent variable, and is useful for prediction and for understanding how the two variables are related.
Y = m*x + c, where m is the slope of the line and c is the intercept.
2. Multiple Linear Regression
Multiple linear regression is similar to simple linear regression, but it includes more than one independent variable, meaning we attempt to predict a value based on two or more variables.
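With n independent variables, the equation extends to Y = b0 + b1*x1 + b2*x2 + … + bn*xn. Here is a minimal scikit-learn sketch with two synthetic predictors (the data and coefficients are invented for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.random((100, 2))                 # two synthetic predictors
y = 1.5 + 3.0 * X[:, 0] - 2.0 * X[:, 1]  # a known linear relationship
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)     # recovers about 1.5 and [3.0, -2.0]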
3. Polynomial Regression
Polynomial regression is a type of regression analysis that fits the data using higher-degree functions of the independent variable, such as squares and cubes. This allows it to capture more complex, non-linear relationships between the variables than simple linear regression.
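As a minimal sketch, polynomial regression can be implemented in scikit-learn by expanding the input into polynomial features and then fitting an ordinary linear regression (the quadratic data below is synthetic):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 0.2, 50)
# Expand x into [x, x^2], then fit a linear model on the expanded features
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)  # roughly 1 and [2, 0.5]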
Python Implementation of Linear Regression
Before diving into the linear regression exercise using Python, it’s crucial to familiarize ourselves with the dataset. We’ll be analyzing the Boston Housing Price Dataset, which comprises 506 entries and 13 attributes, along with a target column. Let’s briefly inspect this dataset.
1. Data Description
- Crim: Crime rate per capita by town
- Zn: Fraction of residential land allocated for large plots (over 25,000 sq. ft.)
- Indus: Fraction of non-retail business acres in town
- Chas: Indicator for Charles River proximity (1 if close; 0 if not)
- Nox: Nitrogen oxide concentration (parts per 10 million)
- Rm: Typical number of rooms in a residence
- Age: Fraction of homes built before 1940
- Dis: Weighted distances to five Boston employment centers
- Rad: Proximity index to major highways
- Tax: Property tax rate (per $10,000)
- Ptratio: Student-to-teacher ratio in town
- Black: Value calculated as 1000(Bk – 0.63)^2, where Bk represents the fraction of Black residents in town
- Lstat: Percentage of the population with lower status
- Medv: Median price of homes (in $1000s)
In this linear regression tutorial, our objective is to develop two predictive models for housing prices.
2. Model Development
With a clear understanding of our dataset, let’s proceed to construct our linear regression models in Python.
Univariate Linear Regression in Python
We use 'lstat' as the predictor (independent variable) and 'medv' as the response (dependent variable):
Step 1: Load the Boston Dataset
import pandas as pd
data = pd.read_csv("Boston1.csv")
Step 2: Data Inspection Phase
data.head()
Step 3: Select the predictor and response columns
data = data.loc[:, ["lstat", "medv"]]
Step 4: Visualize variable trends
import matplotlib.pyplot as plt
data.plot(x = "lstat", y = "medv", style = "o")
plt.xlabel("lstat")
plt.ylabel("medv")
plt.show()
Step 5: Segregate data into predictors and responses
x = pd.DataFrame(data["lstat"])
y = pd.DataFrame(data["medv"])
Step 6: Partition data for training and testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
Step 7: Review dimensions of training and test datasets
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
Step 8: Start the model training
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
Step 9: Extract the y-intercept
print(regressor.intercept_)
Step 10: Extract the regression coefficient
print(regressor.coef_)
Step 11: Generate predictions
y_pred = regressor.predict(x_test)  # predict on the test set
print(y_pred)
Step 12: Compare with actual values
print(y_test)
Step 13: Assess model performance
import numpy as np
from sklearn import metrics
print("Mean Absolute Error", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error", np.sqrt(metrics.mean_absolute_error(y_test, y_pred)))
Applications of Linear Regression
Linear Regression is widely used in various domains for predictive analysis and decision-making. Here are some key applications:
1. Evaluating trends and sales estimates
Businesses use Linear Regression to examine past sales data and forecast future trends. By understanding how factors such as pricing, promotions, and customer demand affect sales, they can optimize inventory, marketing efforts, and revenue estimates.
2. Market analysis
Market analysts use Linear Regression to analyze consumer behavior, competitive pricing, and demand variations. It assists organizations in determining how variables such as advertising expenditure, economic conditions, and customer preferences affect product success in the market.
3. Healthcare
In medicine, Linear Regression is used to forecast disease development, patient recovery rates, and treatment results based on age, lifestyle, and medical history. It is also useful for calculating healthcare expenses and optimizing resource allocation.
4. Forecasting consumer spending
Economists and businesses use Linear Regression to forecast consumer spending based on income, inflation, and interest rates. This aids budgeting, pricing strategies, and financial planning to keep up with market demand.
Advantages and Disadvantages of Linear Regression
1. Advantages of Linear Regression
- Linear Regression is easy to understand and interpret, making it a great starting point for statistical modeling.
- It requires minimal computational power, making it ideal for large datasets.
- If there is a linear relationship between the independent and dependent variables, Linear Regression performs well.
- Helps identify the impact of each independent variable on the dependent variable using coefficients.
- Frequently used in forecasting sales, trends, and other business metrics.
2. Disadvantages of Linear Regression
- Linear Regression assumes a straight-line relationship, which may not hold for complex, non-linear data.
- Outliers can significantly affect the regression line, leading to inaccurate predictions.
- When independent variables are highly correlated, it can distort coefficient estimates and reduce model reliability.
- It does not handle categorical data well unless properly encoded.
- With too many features, the model may fit the training data too well but perform poorly on new data.
Conclusion
Linear Regression is a powerful yet simple technique for analyzing relationships and making predictions. While it has limitations, its efficiency and interpretability make it a key tool in data science. If you want to master Linear Regression and other ML techniques, then you should join our Data Science course today!