I am working on a supervised learning task to train a binary classifier.

I have a dataset with a large class imbalance: 8 negative instances for every positive one.

I use the F-measure, i.e. the harmonic mean of precision and recall (sensitivity), to assess the performance of a classifier.

I plotted the ROC curves of several classifiers and all of them show a high AUC, which suggests that the classification is good. However, when I test the classifier and compute the F-measure, I get a really low value. I know that this issue is caused by the class skew of the dataset and, so far, I have found two options to deal with it:

Adopting a cost-sensitive approach by assigning weights to the dataset's instances (see this post)

Thresholding the predicted probabilities returned by the classifiers, to reduce the number of false positives and false negatives.

I went for the first option and that solved my issue (the F-measure is now satisfactory). But now my question is: which of these methods is preferable, and what are the differences?

1 Answer


Both weighting and thresholding are valid forms of cost-sensitive learning.

Weighting

Weighting treats the cost (loss) of misclassifying the rare class as higher than the cost of misclassifying the common class. It is applied at the algorithmic level, during training, in algorithms such as SVM, ANN, and Random Forest.
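To make the idea concrete, here is a minimal sketch in plain Python of how class weights change a loss. It assumes the common "balanced" heuristic, where each class gets the weight n / (k * n_c), and a binary cross-entropy loss. The function names and the toy data are only illustrative and are not taken from the question. Libraries such as scikit-learn expose the same idea through a class_weight parameter on many estimators.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Mean binary cross-entropy, each instance scaled by its class weight."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)


# Toy data (assumption): 8 negatives for every positive, as in the question.
labels = [0] * 8 + [1]
# A lazy classifier that gives every instance a low positive probability.
probs = [0.1] * len(labels)

weights = balanced_class_weights(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)

print("class weights:", weights)
print("unweighted loss:", round(unweighted, 4))
print("weighted loss:  ", round(weighted, 4))
```

With these weights the single positive counts as much as the eight negatives together, so a model that ignores the rare class is penalized much more heavily during training.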

Thresholding

If the model returns probabilities, thresholding can be applied after the model has been built. Generally, you move the classification threshold away from the default of 0.5 to a level that gives an appropriate trade-off between false positives and false negatives.
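As a rough illustration, the sketch below picks a threshold by sweeping a few candidate values and keeping the one with the highest F-measure. The labels, the predicted probabilities, and the candidate grid are made up for this example. In practice you would tune the threshold on a held-out validation set, not on the test set.

```python
def f1_at_threshold(labels, probs, threshold):
    """F-measure obtained when predicting positive for p >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(labels, probs, candidates):
    """Return the candidate threshold with the highest F-measure."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, probs, t))


# Hypothetical validation data: 16 negatives and 2 positives (8:1).
labels = [0] * 16 + [1] * 2
probs = [0.02, 0.03, 0.05, 0.05, 0.08, 0.10, 0.10, 0.12,
         0.15, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.42,
         0.35, 0.48]  # the last two scores belong to the positives

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
chosen = best_threshold(labels, probs, candidates)

print("F-measure at 0.5:", f1_at_threshold(labels, probs, 0.5))
print("chosen threshold:", chosen)
print("F-measure at chosen threshold:", f1_at_threshold(labels, probs, chosen))
```

On this toy data no instance scores above 0.5, so the default threshold predicts no positives at all, while a lower threshold recovers both positives at the price of a few false positives. The trained model itself is left unchanged, which is the main practical difference from weighting.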

Model Building

When building models with imbalanced data, keep in mind which metric you use to evaluate the model. For example, the F-measure is computed from precision and recall, so it does not take true negatives into account.
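A small sketch of that last point, using made-up confusion-matrix counts: the two cases below differ only in the number of true negatives, so the F-measure is identical while specificity (the true negative rate) changes a lot. This is also why a high AUC and a low F-measure can occur together on skewed data.

```python
def f1_score(tp, fp, fn, tn):
    """F-measure from confusion-matrix counts; tn is accepted but never used."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def specificity(tp, fp, fn, tn):
    """True negative rate: share of actual negatives predicted negative."""
    return tn / (tn + fp)


# Hypothetical counts: same TP, FP, FN, very different TN.
few_tn = dict(tp=50, fp=50, fn=50, tn=50)
many_tn = dict(tp=50, fp=50, fn=50, tn=5000)

for name, cm in (("few TN", few_tn), ("many TN", many_tn)):
    print(name, "F-measure:", f1_score(**cm),
          "specificity:", round(specificity(**cm), 3))
```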

Hope this answer helps you!
