Confusion Matrix in Machine Learning Using Python with Example

What Is a Confusion Matrix?

A confusion matrix is one of the simplest and most intuitive tools for evaluating a classification model whose output can fall into two or more categories. It is commonly used to evaluate logistic regression models.

Without much delay, let’s get started.

The confusion matrix in Python helps us describe the performance of a classification model. In order to build a confusion matrix, all we need to do is create a table of actual values and predicted values.
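
As a minimal sketch of such a table, the snippet below cross-tabulates actual against predicted labels with pandas. The labels and the variable names `actual` and `predicted` are made up for illustration; they are not taken from the tutorial's data set.

```python
import pandas as pd

# Hypothetical labels for eight patients (illustrative only)
actual = pd.Series(
    ["cancer", "cancer", "healthy", "healthy", "cancer", "healthy", "healthy", "cancer"],
    name="Actual",
)
predicted = pd.Series(
    ["cancer", "healthy", "healthy", "cancer", "cancer", "healthy", "healthy", "cancer"],
    name="Predicted",
)

# Rows are the actual classes, columns are the predicted classes
table = pd.crosstab(actual, predicted)
print(table)
```

Each cell counts how many patients had that combination of actual and predicted label, which is all a confusion matrix is.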

A confusion matrix is quite simple, but the related terminology can be a bit confusing. Let us understand the terms with the help of an example.

Say we have a data set containing the records of all patients in a hospital, and we build a logistic regression model to predict whether a patient has cancer. Each prediction falls into one of four possible outcomes. Let us look at all four.

True Positive

A true positive is the case where both the actual value and the predicted value are true. The patient has cancer, and the model also predicted that the patient has cancer.

False Negative

In a false negative, the actual value is true, but the predicted value is false: the patient has cancer, but the model predicted that the patient did not have cancer. This is also known as a Type 2 error.

False Positive

This is the case where the predicted value is true, but the actual value is false. Here, the model predicted that the patient had cancer, but in reality, the patient didn’t have cancer. This is also known as a Type 1 error.

True Negative

This is the case where the actual value is false and the predicted value is also false. In other words, the patient does not have cancer, and our model predicted that the patient did not have cancer.
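
The four outcomes can be counted directly from a list of actual values and a list of predictions. This is a minimal sketch with made-up values, where `True` stands for "has cancer"; the lists are hypothetical and only serve to show the bookkeeping.

```python
# Hypothetical ground truth and predictions; True means "has cancer"
actual    = [True, True,  True, False, False, False, True, False]
predicted = [True, False, True, True,  False, False, True, False]

counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
for a, p in zip(actual, predicted):
    if a and p:
        counts["TP"] += 1  # has cancer, predicted cancer
    elif a and not p:
        counts["FN"] += 1  # has cancer, predicted no cancer
    elif not a and p:
        counts["FP"] += 1  # no cancer, predicted cancer
    else:
        counts["TN"] += 1  # no cancer, predicted no cancer

print(counts)
```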

Understanding Various Performance Metrics

We will use the confusion matrix given below to compute various performance metrics.

[Figure: example confusion matrix]

Alright, let us start with accuracy:

Accuracy or Classification Accuracy:

  • What: In classification problems, ‘accuracy’ is the number of correct predictions made by the model divided by the total number of predictions.
  • How: Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • When to use: When the target classes in the data are nearly balanced
  • When not to use: When one class makes up the large majority of the target variable
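
To see why accuracy can mislead on imbalanced data, consider this small sketch. The data is hypothetical: 95 healthy patients, 5 with cancer, and a "model" that simply predicts the majority class every time.

```python
# Hypothetical, heavily imbalanced data: 95 healthy patients (0), 5 with cancer (1)
actual = [0] * 95 + [1] * 5

# A useless "model" that always predicts the majority class
predicted = [0] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Number of actual cancer cases the model caught
caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy = {accuracy:.2f}")
print(f"cancer cases detected = {caught} of {sum(actual)}")
```

The accuracy looks excellent even though the model never detects a single cancer case, which is why the other metrics in this section matter.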
 

Precision

  • What: ‘Precision’ is the proportion of positive predictions made by the model that are actually positive.
  • How: Precision = TP / (TP + FP)
  • It means that when our model predicts that a patient has cancer, it is correct 76 percent of the time.

Recall or Sensitivity:

  • What: ‘Recall’ is the proportion of patients who actually had cancer that the model also predicted as having cancer. It answers the question, “How sensitive is the classifier in detecting positive instances?”
  • How: Recall = TP / (TP + FN)
  • It means that 80 percent of all cancer patients are correctly predicted by the model to have cancer.

Specificity:

  • What: ‘Specificity’ is the proportion of patients without cancer that the model correctly predicted as not having cancer. It answers the question, “How specific or selective is the classifier in identifying negative instances?”
  • How: Specificity = TN / (TN + FP)
  • A specificity of 0.61 means 61 percent of all patients that didn’t have cancer are predicted correctly.

F1 Score:

  • What: The F1 score is the harmonic mean of precision and recall.
  • How: F1 = 2 × (Precision × Recall) / (Precision + Recall)
  • The F1 score here is high, which indicates that both precision and recall of the classifier are good.
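
The metrics in this section can be computed from the four counts alone. The sketch below uses the standard formulas with the counts reported in the Step 8 note later in this tutorial (TP = 10, TN = 7, FP = 1, FN = 2). The helper name `classification_metrics` is hypothetical, and the function does not guard against division by zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return common metrics computed from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the note in Step 8 of this tutorial
metrics = classification_metrics(tp=10, tn=7, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```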

Implementing Confusion Matrix in Python Sklearn – Breast Cancer

Dataset: In this example, we will use a subset of the well-known Breast Cancer Wisconsin (Diagnostic) data set. Some key points about this data set are mentioned below:

  • Four real-valued measures of each cancer cell nucleus are taken into consideration here.
    • Radius_mean represents the mean radius of the cell nucleus
    • Texture_mean represents the mean texture of the cell nucleus
    • Perimeter_mean represents the mean perimeter of the cell nucleus
    • Area_mean represents the mean area of the cell nucleus
  • Based on these measures the diagnosed result is divided into two categories, malignant and benign.
    • Diagnosis column consists of two categories, malignant (M) and benign (B)

Take a look at the dataset:

[Figure: preview of the data set]

Step 1: Load the data set

Step 2: Take a glance at the data set

Step 3: Take a look at the shape of the data set

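Steps 1 to 3 can be sketched as follows. The tutorial's own data file is not shown, so this sketch assumes scikit-learn's bundled copy of the Breast Cancer Wisconsin (Diagnostic) data as a stand-in and renames four of its columns to match the names listed above. The bundled data contains the full data set rather than the tutorial's subset, so the shape printed here will differ from the tutorial's.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Step 1: load the data (assumption: scikit-learn's bundled copy, not the tutorial's file)
raw = load_breast_cancer()
full = pd.DataFrame(raw.data, columns=raw.feature_names)

# Keep the four measures described above and rename them to the tutorial's column names
renames = {
    "mean radius": "radius_mean",
    "mean texture": "texture_mean",
    "mean perimeter": "perimeter_mean",
    "mean area": "area_mean",
}
df = full[list(renames)].rename(columns=renames)

# scikit-learn encodes the target as 0 = malignant, 1 = benign
df["diagnosis"] = pd.Series(raw.target).map({0: "M", 1: "B"})

# Step 2: glance at the first rows
print(df.head())

# Step 3: number of rows and columns
print(df.shape)
```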

Step 4: Split the data into features (X) and target (y) label sets

Take a look at the feature set:

[Figure: preview of the feature set]

Take a look at the target set:

[Figure: preview of the target set]

Step 5: Split the data into training and test sets using scikit-learn

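Steps 4 and 5 might look like the sketch below. It again assumes scikit-learn's bundled data as a stand-in for the tutorial's file; `test_size=0.2`, `random_state=0`, and `stratify=y` are illustrative choices, not values confirmed by the tutorial.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): the first four columns of scikit-learn's bundled data set
# are mean radius, mean texture, mean perimeter, and mean area
raw = load_breast_cancer()
feature_names = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean"]
df = pd.DataFrame(raw.data[:, :4], columns=feature_names)
df["diagnosis"] = pd.Series(raw.target).map({0: "M", 1: "B"})

# Step 4: separate the features (X) from the target label (y)
X = df[feature_names]
y = df["diagnosis"]

# Step 5: hold out 20% of the rows for testing (assumed split, stratified by class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratifying keeps the proportion of malignant and benign cases similar in both sets, which makes the later evaluation less sensitive to an unlucky split.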

Step 6: Create and train the model

Step 7: Predict the test set results

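A minimal sketch of Steps 6 and 7, under the same assumptions as before: scikit-learn's bundled data stands in for the tutorial's file, and the split parameters are illustrative. Here malignant is encoded as 1 (the positive class), and `max_iter` is raised only to help the solver converge on unscaled features; neither choice is confirmed by the tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): first four columns are the mean radius, texture,
# perimeter, and area measures
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :4]
y = (y == 0).astype(int)  # scikit-learn uses 0 = malignant; recode so 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 6: create and train the logistic regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Step 7: predict the labels of the held-out test set
y_pred = model.predict(X_test)
print(y_pred[:10])
```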

Step 8: Evaluate the model with a confusion matrix using sklearn

Note: Here,

  • True positive is 10.
  • True negative is 7.
  • False positive is 1.
  • False negative is 2.
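
To show how sklearn lays out the matrix, the sketch below rebuilds label arrays that reproduce the four counts in the note above and passes them to `confusion_matrix`. The arrays are reconstructed from the counts, not the tutorial's actual test rows, and encoding the positive class as 1 is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels rebuilt from the counts above; 1 = positive class, 0 = negative class
#                  10 TP      2 FN       1 FP       7 TN
y_test = np.array([1] * 10 + [1] * 2 + [0] * 1 + [0] * 7)
y_pred = np.array([1] * 10 + [0] * 2 + [1] * 1 + [0] * 7)

# Rows are actual classes, columns are predicted classes, both in sorted label order
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For binary labels 0/1, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Note that sklearn puts the true negatives in the top-left cell, which is the reverse of many textbook diagrams that show true positives there.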

Step 9: Evaluate the model using other performance metrics

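scikit-learn also provides a function for each of the common metrics. This sketch applies them to label arrays rebuilt from the Step 8 counts (10 true positives, 7 true negatives, 1 false positive, 2 false negatives); the arrays are a reconstruction for illustration, with the positive class encoded as 1 by assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Labels rebuilt from the Step 8 counts; 1 = positive class, 0 = negative class
y_test = np.array([1] * 12 + [0] * 8)
y_pred = np.array([1] * 10 + [0] * 2 + [1] * 1 + [0] * 7)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1 score:  {f1:.3f}")
```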

Note: A confusion matrix gives you a complete picture of how the classifier is performing. It also lets you compute various classification metrics, and these metrics can guide your model selection.

What Did We Learn So Far?

In this tutorial, we discussed the use of the confusion matrix in machine learning and its terminology. We covered performance metrics such as accuracy, precision, recall, specificity, and F1 score, and then worked through a confusion matrix example using sklearn. In the next module, we will improve precision and accuracy with the help of the ROC curve and threshold adjustment.

About the Author

Senior Consultant Analytics & Data Science, Eli Lilly and Company

Sahil Mattoo, a Senior Software Engineer at Eli Lilly and Company, is an accomplished professional with 14 years of experience in languages such as Java, Python, and JavaScript. Sahil has a strong foundation in system architecture, database management, and API integration. 
