Classification Report – Precision and F-score are ill-defined

When working with classification problems in machine learning, evaluation metrics like Precision, Recall, and F-score play an important role in understanding model performance. In some cases, you might also encounter a warning that says:

“Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.”

This message often confuses beginners. In this blog, we will explain why it appears, what it means, and how to handle the warning effectively. So let’s get started!

Understanding the Classification Report

A Classification Report provides a performance summary of a classifier. It contains the key evaluation metrics that help assess how well the model predicts the different classes. It includes three key metrics, described below.

  • Precision: The proportion of correctly predicted positive samples out of all samples predicted as positive. It measures how accurate the positive classifications are.
  • Recall: The proportion of actual positive samples that the model successfully identified. This makes it useful for applications where missing positive cases is costly.
  • F1-score: The harmonic mean of Precision and Recall. It provides a single measure of the model’s effectiveness that takes both false positives and false negatives into account.

By using these 3 metrics together, you can have a better understanding of whether a model is biased towards a particular class or performs consistently across all categories.

Example: Generating a Classification Report

To understand how a Classification Report works, we will build a simple classification model using scikit-learn. We will train the classifier on a sample dataset, make predictions, and generate the classification report to evaluate the model’s performance.

Example:

Python
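
What follows is a minimal sketch of such an example, assuming a synthetic dataset from make_classification and a LogisticRegression classifier; both are illustrative choices rather than requirements:

# Minimal sketch: synthetic data, a simple classifier, and a classification report.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Generate a small two-class dataset (illustrative values).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the classifier and make predictions.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Print precision, recall, F1-score, and support for each class.
print(classification_report(y_test, y_pred))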

Output:

Generating a Classification Report Output

Explanation:

The report provides the precision, recall, and F1-score for each class (0 and 1). The support column denotes the number of samples belonging to each class. The accuracy row shows the overall proportion of correct predictions. Lastly, the macro avg is the unweighted average of the per-class metrics, while the weighted avg averages them weighted by each class’s support.

Why do we get this warning?

During the generation of a classification report, you may encounter a message which says:

“Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.”

This warning appears when the model fails to predict one or more classes in the dataset. It typically happens when there is a severe class imbalance or when the classifier is biased towards a dominant class.

For a better understanding, consider a scenario where the model predicts all samples as class 0, completely ignoring class 1. In that case, precision for class 1 is undefined, because precision is calculated as:

Precision = True Positives (TP) / (True Positives (TP) + False Positives (FP))

If class 1 is never predicted, both the TP (True Positives) and FP (False Positives) for that class are 0. The denominator then becomes 0, so the precision is mathematically undefined. Since the F1-score is computed from precision and recall, F1 = 2 × (Precision × Recall) / (Precision + Recall), it also becomes undefined.

To handle this gracefully, scikit-learn sets the precision and F1-score to 0.0 in such cases and raises a warning to inform the user. Although the warning does not cause an error, it highlights a critical issue: the model fails to recognize certain classes. This may require adjustments such as balancing the dataset, tuning the model, or using different evaluation metrics.
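
To make this concrete, here is a minimal sketch that forces a model to predict only class 0; DummyClassifier with a constant strategy is used purely to reproduce the situation and the labels are made up for illustration:

# Illustrative sketch: force a model to predict only class 0 so that
# class 1 has no predicted samples and the warning is raised.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X = np.zeros((len(y_true), 1))          # features are irrelevant here

model = DummyClassifier(strategy="constant", constant=0)
model.fit(X, y_true)
y_pred = model.predict(X)               # always predicts 0

# Precision for class 1 has TP = FP = 0, so scikit-learn reports 0.0
# and emits an UndefinedMetricWarning.
print(classification_report(y_true, y_pred))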

How to Handle this Issue?

Below are some methods you can use to handle this issue.

1. Use the zero_division Parameter

In scikit-learn, you can control how undefined values are handled with the zero_division parameter. It determines what happens when precision or F-score becomes undefined because a class receives no predictions.

  • Setting zero_division=0 replaces any undefined precision or F-score with 0.0 and silences the warning. By default, scikit-learn also reports 0.0 for such values, but it emits the warning as well. A class that is never predicted therefore shows 0.0 instead of causing an error.
  • Setting zero_division=1 replaces undefined values with 1 instead of 0. This can be useful when you want to avoid a zero value that would distort downstream calculations.

Example: Handling undefined scores

Python
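
A minimal sketch, assuming a small hand-made set of labels in which class 1 never gets predicted, might look like this:

# Illustrative sketch: compare zero_division settings when one class
# receives no predictions.
import numpy as np
from sklearn.metrics import classification_report

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0])   # class 1 is never predicted

# Undefined precision/F-score for class 1 is reported as 0.0, without a warning.
print(classification_report(y_true, y_pred, zero_division=0))

# The same undefined values are reported as 1.0 instead.
print(classification_report(y_true, y_pred, zero_division=1))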

Output:

Use the zero_division Parameter Output

2. Check if the Model is Biased

If the model predicts only one class, it might be biased due to:

  • Imbalanced dataset: If one class has more samples than the other class, the minority class may be ignored by the model.
  • Poor Model Training: The model may not have been trained properly and therefore has not learned enough from the dataset.

The solution is to balance the dataset, for example through oversampling or undersampling. You can also train the model on more data to improve its learning.
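
As one hedged illustration, the minority class can be oversampled with sklearn.utils.resample; the data below is made up, and using class_weight='balanced' on the estimator is another common option:

# Illustrative sketch: oversample the minority class with sklearn's
# resample utility before training.
import numpy as np
from sklearn.utils import resample

X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)        # 90/10 imbalance (made-up data)

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Duplicate minority samples until both classes have the same size.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_balanced))           # now 90 samples per class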

3. Adjust the Decision Threshold

Most classification models use a default probability threshold of 0.5: if the predicted probability of a class is greater than or equal to 0.5, the sample is classified as positive (1); otherwise it is classified as negative (0). However, if the model predicts only one class (e.g., always predicting 0), you can try adjusting the threshold to improve predictions, especially on imbalanced datasets. Lowering or raising the threshold improves sensitivity (recall) or specificity, depending on the problem.

Example:

Python
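
A minimal sketch, assuming an imbalanced synthetic dataset and an arbitrary illustrative threshold of 0.3 (the exact numbers in your output will differ from those in the image below), might look like this:

# Illustrative sketch: classify with a custom probability threshold
# instead of the default 0.5.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Imbalanced synthetic data: roughly 90% class 0, 10% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Lower the threshold to 0.3 so that more samples are labelled as class 1,
# which typically raises recall for the minority class.
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)

print(classification_report(y_test, y_pred, zero_division=0))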

Output:

Adjust the Decision Threshold Output

Explanation:

The above classification report shows that the model has high accuracy (98%). Class 0 is predicted very well (99% precision and recall), while class 1 has slightly lower precision (86%) and recall (92%). This is a consequence of the class imbalance.

Alternative Metrics for Imbalanced Data

When dealing with imbalanced datasets, traditional evaluation metrics like accuracy and ROC AUC may not give a clear picture of model performance, because they are biased towards the majority class. In such cases, you can use alternative metrics such as Precision-Recall AUC, Matthews Correlation Coefficient (MCC), and Balanced Accuracy.
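
As a hedged sketch, all three are available in sklearn.metrics and can be computed from the true labels, predicted labels, and predicted scores; the tiny arrays below are made up purely for illustration:

# Illustrative sketch: alternative metrics for imbalanced data.
import numpy as np
from sklearn.metrics import (
    average_precision_score,   # average precision, commonly used as PR AUC
    matthews_corrcoef,
    balanced_accuracy_score,
)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.4])
y_pred = (y_score >= 0.5).astype(int)

print("PR AUC (average precision):", average_precision_score(y_true, y_score))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))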

Why Precision-Recall AUC is More Useful Than ROC AUC in Some Cases

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold values. The Area Under the Curve (AUC) measures the model’s overall ability to distinguish between the classes.

However, ROC AUC is not always the ideal metric for imbalanced datasets. This is because:

  1. FPR can be misleading: If the dataset is highly skewed (e.g., 99% class 0 and 1% class 1), even a large number of false positives translates into a small FPR, because FPR is computed relative to the huge pool of actual negatives. The ROC curve can therefore look good even when most positive predictions are wrong.
  2. Class Imbalance affects AUC: As the model predicts the majority class well, it may appear to have high performance, even though it fails on the minority class.

Precision-Recall (PR) AUC as an Alternative

When the positive class is rare, the Precision-Recall (PR) curve is more informative. This is because it focuses on Precision (Positive Predictive Value) and Recall (Sensitivity), which are more meaningful in such scenarios.

  • Precision: It is used to calculate what proportion of the instances that are predicted as positive are actually correct.
  • Recall: It is used to calculate the proportion of true positive instances that are correctly predicted out of all actual positive samples.
  • PR AUC provides a better indication of how well the model is able to identify the minority class.

Comparing ROC AUC and PR AUC

Now, let us generate an imbalanced dataset, train a classifier, and compare ROC AUC vs. PR AUC.

Example:

Python
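
A minimal sketch of this comparison, assuming an imbalanced synthetic dataset and a LogisticRegression classifier (the exact scores depend on the generated data, so they will not match the 0.66 and 0.83 in the output image exactly), might look like this:

# Illustrative sketch: imbalanced dataset, one classifier, ROC AUC vs PR AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# Heavily imbalanced synthetic data: about 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]

# ROC AUC is based on the FPR, so it can look optimistic on imbalanced data;
# PR AUC (average precision) focuses on the minority class.
print("ROC AUC:", roc_auc_score(y_test, y_score))
print("PR AUC :", average_precision_score(y_test, y_score))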

Output:

Comparing ROC AUC and PR AUC Output

Explanation:

In the above output, the Precision-Recall (PR) curve shows the trade-off between precision and recall at different thresholds. Its PR AUC score is 0.66, which reflects how well the model identifies the minority class. The ROC AUC score is 0.83, which looks considerably stronger, illustrating how ROC AUC can paint an optimistic picture on imbalanced data while PR AUC gives a more conservative one.

Best Practices

  1. Prefer Precision-Recall AUC over ROC AUC for imbalanced datasets.
  2. Adjust probability thresholds to improve recall or precision based on the use case.
  3. Always handle class imbalance using techniques like SMOTE, undersampling, or class weighting.
  4. To understand model predictions better, always check the confusion matrix.
  5. To avoid undefined metric warnings, always set the zero_division parameter in classification_report.
  6. You can use multiple evaluation metrics to get a holistic view of the performance of the model.
  7. Hyperparameter tuning is important to optimize precision-recall trade-offs effectively.

Conclusion

Evaluation of a classification model goes beyond accuracy. Metrics such as precision, recall, and F1-score play an important role in understanding model performance, especially for imbalanced datasets. The classification report provides important insights, but when precision and F-score become ill-defined due to missing predictions for certain classes, it is important to interpret the results carefully. Adjusting probability thresholds, using Precision-Recall AUC instead of ROC AUC for imbalanced data, and using metrics like the Matthews Correlation Coefficient (MCC) all help provide a more reliable assessment. By following the best practices mentioned in this blog and considering multiple evaluation techniques, you can build robust and fair classification models that perform well in real-life scenarios.

FAQs

1. Why does the warning “Precision and F-score are ill-defined” appear in the classification report?

This warning appears when no prediction is made by the model for a particular class. This leads to undefined precision and F-score values as the denominator in their formulas becomes zero.

2. How can I handle the issue of ill-defined precision and F-score?

You can handle ill-defined precision and F-score by setting zero_division=0 or zero_division=1 in classification_report() from scikit-learn, which explicitly sets undefined values to 0 or 1. You can also adjust the decision threshold, which may help improve the predictions for the affected classes.

3. Why is Precision-Recall AUC preferred over ROC AUC for imbalanced datasets?

Precision-Recall AUC is preferred over ROC AUC for imbalanced datasets because it focuses on the minority class and evaluates how well the model identifies positive instances, whereas ROC AUC can be misleading on highly imbalanced datasets.

4. What alternative metrics can be used for evaluating imbalanced classification models?

You can use alternative metrics like the Matthews Correlation Coefficient (MCC) and Balanced Accuracy, because they provide a more holistic evaluation of model performance even when the classes are imbalanced.

5. How can I improve my model when it predicts only one class?

To improve a model that predicts only one class, you can try adjusting the decision threshold, using a different loss function (e.g., focal loss for imbalanced data), or applying oversampling/undersampling methods such as SMOTE to balance the dataset and improve class predictions.

About the Author

Senior Consultant Analytics & Data Science

Sahil Mattoo, a Senior Software Engineer at Eli Lilly and Company, is an accomplished professional with 14 years of experience in languages such as Java, Python, and JavaScript. Sahil has a strong foundation in system architecture, database management, and API integration. 
