Introduction to Logistic Regression Using Scikit-learn
Logistic regression is a widely used statistical model for estimating the probability of a certain event occurring based on previous data. It works with binary data. Now, what is binary data? Binary data is data with only two possible outcomes: either the event happens or it does not.
Without much delay, let’s get started.
Before we dive into understanding what logistic regression is and how we can build a model of Logistic Regression in Python, let us see two scenarios and try and understand where to apply linear regression and where to apply logistic regression.
Let us see the first example.
Say, Sam is 20 years old and earns $50,000; Bob is 35 years old and earns $75,000; and a third employee, Matt, is 50 years old and earns $100,000.
Now if I introduce a new employee, named Tom, aged 28, can we predict his salary?
What we can do is establish a relationship between age and earnings and, based on the data given, determine whether earnings increase or decrease with age. Here, salary is the dependent variable and age is the independent variable. This is a case of linear regression.
Now, let us look at another example:
Here, we have two students, Rachel and Ross. Rachel manages to pass the exam, but Ross fails.
Now, what if another student, Monica, takes the same test? Would she be able to clear the exam? Let us look at the data provided to us. Rachel, being a girl, cleared the exam, but Ross, being a boy, could not. All we can say is that there is a good probability that Monica will clear the exam as well. This, again, is an example of regression: we are trying to predict whether a student will clear the exam depending on gender. Here, the result is the dependent variable and gender is the independent variable. Since the result is binary (pass or fail), this is an example of logistic regression.
Now that we have understood when to apply logistic regression, let us try and understand what logistic regression exactly is.
What Is Logistic Regression?
Logistic regression is a regression technique where the dependent variable is categorical. Let us look at an example where we are trying to predict whether it is going to rain or not, based on the independent variables: temperature and humidity.
Here, the question is: how do we find out whether it is going to rain or not? Let us take a step back and recall what happened in linear regression: we fitted a straight line based on the relationship between the dependent and independent variables. In logistic regression, however, the dependent variable is categorical and can take only two values, 0 or 1. The logistic regression model takes the attribute values and outputs the probability of "yes" or "no", and plotting this probability gives an S-shaped curve.
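The S-shaped curve mentioned above comes from the logistic (sigmoid) function, which maps any real-valued score to a probability between 0 and 1. A minimal sketch:

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The curve is S-shaped: large negative scores map close to 0,
# large positive scores map close to 1, and 0 maps to exactly 0.5.
print(sigmoid(-5))  # close to 0
print(sigmoid(0))   # 0.5
print(sigmoid(5))   # close to 1
```

Scikit-learn applies this function internally, so we never have to code it ourselves; it is shown here only to explain where the S-shape comes from.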
Now, the question is: how do we measure the accuracy of such a model? This is where the confusion matrix comes into the picture.
Evaluating the Logistic Regression Model with the Scikit-learn Confusion Matrix
One very common way of assessing a classification model is the confusion matrix. What does it do? The confusion matrix shows the number of correct and incorrect predictions made by the model compared with the actual outcomes in the data.
For a binary classifier, the confusion matrix is a 2×2 table whose entries are TP, FP, FN, and TN. Now, let's see what TP, FP, FN, and TN are.
- TP or True Positive value defines the number of positive classes predicted correctly as a positive class.
- FP or False Positive value defines the number of negative classes predicted incorrectly as a positive class.
- FN or False Negative value defines the number of positive classes predicted incorrectly as a negative class.
- TN or True Negative value defines the number of negative classes predicted correctly as a negative class.
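From these four counts we can derive common evaluation metrics such as accuracy, precision, and recall. A small sketch (the counts below are hypothetical, chosen to match the worked example later in this module):

```python
# Hypothetical counts read off a confusion matrix
tp, fp, fn, tn = 20, 6, 10, 25

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                  # of the predicted positives, how many are truly positive
recall = tp / (tp + fn)                     # of the actual positives, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Accuracy alone can be misleading on imbalanced data, which is why the confusion matrix and the metrics derived from it are worth looking at together.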
Now, we are all set to get started with the hands-on part of logistic regression. The example below is in the Python programming language, and we will use Scikit-learn to build the logistic regression model.
Hands-on: Logistic Regression Using Scikit-learn in Python: Heart Disease Dataset
- Environment: Python 3 and Jupyter Notebook
- Libraries: Pandas and Scikit-learn
Understanding the Dataset
Before we get started with the hands-on part, let us explore the dataset. We will be using the Heart Disease Dataset, which has 303 rows and 13 attributes plus a target column.
In this example, we will build a classifier to predict if a patient has heart disease or not. Let us take a quick look at the dataset.
This data frame contains the following columns:
- Age: Age in years
- Sex: Sex (1 = male; 0 = female)
- Cp: Chest pain type (value 1: typical angina; value 2: atypical angina; value 3: non-anginal pain; value 4: asymptomatic)
- Trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
- Chol: Serum cholesterol in mg/dl
- Fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- Restecg: Resting electrocardiographic results (value 0: normal; value 1: having ST-T wave abnormality [T wave inversions and/or ST elevation or depression of > 0.05 mV]; value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria)
- Thalach: Maximum heart rate achieved
- Exang: Exercise-induced angina (1 = yes; 0 = no)
- Oldpeak: ST depression induced by exercise relative to rest
- Slope: The slope of the peak exercise ST segment (value 1: upsloping; value 2: flat; value 3: downsloping)
- ca: Number of major vessels (0–3) coloured by fluoroscopy
- target: Value 1: Heart disease; and value 0: No heart disease
Now that we are familiar with the dataset, let us build the logistic regression model step by step using the Scikit-learn library in Python.
Step 1: Load the Heart disease dataset using Pandas library
Step 2: Have a glance at the first few records of the dataset
Step 3: Have a look at the shape of the dataset
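Steps 1 to 3 (load the data, glance at it, and check its shape) can be sketched as below. The file name heart.csv is an assumption; so that the snippet runs on its own, a tiny hand-made frame with a few of the dataset's columns stands in for the real file:

```python
import pandas as pd

# Step 1: in practice, load the real file, e.g.:
#   df = pd.read_csv('heart.csv')   # file name is an assumption
# Self-contained stand-in with a subset of the real columns:
df = pd.DataFrame({
    'age':      [63, 37, 41, 56],
    'sex':      [1, 1, 0, 1],
    'cp':       [3, 2, 1, 1],
    'trestbps': [145, 130, 130, 120],
    'chol':     [233, 250, 204, 236],
    'thalach':  [150, 187, 172, 178],
    'target':   [1, 1, 1, 0],
})

print(df.head())   # glance at the first few records
print(df.shape)    # (rows, columns); the real dataset is (303, 14)
```

Checking the shape early is a quick sanity test that the file loaded completely and no columns were dropped.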
Step 4: Visualize the change in the variables
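One way to sketch Step 4 is a Matplotlib bar chart of the target counts, which shows how balanced the two classes are (the frame below is a stand-in for the real data):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the target column of the heart disease frame
df = pd.DataFrame({'target': [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]})

counts = df['target'].value_counts().sort_index()
counts.plot(kind='bar')
plt.xlabel('target (0 = no heart disease, 1 = heart disease)')
plt.ylabel('number of patients')
plt.title('Class balance of the target column')
plt.savefig('target_counts.png')
print(counts.to_dict())
```

On the real dataset the same few lines reveal whether the classifier will see roughly as many diseased as healthy patients during training.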
Step 5: Divide the data into independent and dependent variables
Step 6: Split the data into train and test sets using scikit learn train_test_split module
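Steps 5 and 6 can be sketched as follows, again on a small stand-in frame; with the real dataset, `df` would come from the loaded CSV and `X` would have all 13 attribute columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the heart disease frame
df = pd.DataFrame({
    'age':     [63, 37, 41, 56, 57, 44, 52, 60, 54, 48],
    'chol':    [233, 250, 204, 236, 354, 263, 199, 258, 239, 275],
    'thalach': [150, 187, 172, 178, 163, 173, 162, 141, 160, 139],
    'target':  [1, 1, 1, 0, 0, 1, 1, 0, 1, 0],
})

# Step 5: everything except 'target' is an independent variable
X = df.drop('target', axis=1)
y = df['target']

# Step 6: hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```

Fixing `random_state` makes the split reproducible, so the accuracy numbers reported later can be recreated exactly.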
Step 7: Train the algorithm using scikit learn linear model
Step 8: Predict the test set results
Step 9: Calculate accuracy
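Steps 7 to 9 can be sketched with `LogisticRegression` from `sklearn.linear_model`. Synthetic data stands in for the heart dataset here, so the accuracy printed will differ from what you get on the real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the label depends linearly on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 7: train the classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 8: predict on the held-out set
y_pred = model.predict(X_test)

# Step 9: compare predictions with the true labels
print('accuracy:', accuracy_score(y_test, y_pred))
```

`max_iter=1000` simply gives the solver room to converge; on the real dataset, scaling the features first often helps as well.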
Step 10: Evaluate the model using confusion matrix from scikit learn confusion matrix module
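Step 10 can be sketched with `sklearn.metrics.confusion_matrix`. Note that for labels {0, 1} scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]; the labels below are made up so the counts are easy to verify by eye:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # scikit-learn's row order: [[TN, FP], [FN, TP]]
print(cm)
print('TP:', tp, 'FP:', fp, 'FN:', fn, 'TN:', tn)
```

Running the same two lines on the predictions from the heart disease model yields the four counts discussed next.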
Here, the observed values are:
True Positive = 20
False Positive = 6
False Negative = 10
True Negative = 25
In other words,
- The number of positive classes predicted correctly as the positive class is 20.
- The number of negative classes predicted incorrectly as the positive class is 6.
- The number of positive classes predicted incorrectly as the negative class is 10.
- The number of negative classes predicted correctly as the negative class is 25.
What Did We Learn?
In this module, we have discussed where logistic regression applies, what logistic regression is, and how to evaluate a model with the confusion matrix. Toward the end, we built a logistic regression model using Scikit-learn in Python. In the next module, we will talk about other algorithms. Let's meet there!