
I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability.

In a binary classification problem, does scikit-learn's classifier.predict() use a threshold of 0.5 by default? If it doesn't, what is the default method? If it does, how do I change it?

In scikit, some classifiers have the class_weight='auto' option, but not all do. With class_weight='auto', would .predict() use the actual population proportion as a threshold?

How would I do this in a classifier like MultinomialNB that doesn't support class_weight, other than using predict_proba() and then calculating the classes myself?

by (33.1k points)

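For a binary problem, classifier.predict() in scikit-learn effectively uses a threshold of 0.5: a probabilistic classifier returns the class with the highest predicted probability. (Classifiers that expose decision_function() threshold that score at 0 instead.) predict() has no parameter for changing this, so to use a different cutoff you have to threshold the output of predict_proba() yourself. With only 5% positives, lowering the threshold usually catches more of the rare class at the cost of more false positives. Do this with care and check the result on held-out data, because you are moving the decision boundary.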
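
A minimal sketch of manual thresholding is below. The synthetic data set, the choice of LogisticRegression, the helper name predict_with_threshold, and the cutoff of 0.2 are all assumptions made for illustration; none of them come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 5% of samples are class 1.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is P(y = 1 | x), because clf.classes_ is [0, 1].
proba_pos = clf.predict_proba(X_test)[:, 1]


def predict_with_threshold(proba, threshold):
    """Hypothetical helper: label a sample 1 when P(y = 1 | x) > threshold."""
    return (proba > threshold).astype(int)


# A cutoff of 0.5 should reproduce what clf.predict() returns.
pred_default = predict_with_threshold(proba_pos, 0.5)

# A lower cutoff (0.2 is an arbitrary illustrative value) flags more samples
# as class 1; pick the real value by validating on held-out data.
pred_low = predict_with_threshold(proba_pos, 0.2)

print("positives at 0.5:", int(pred_default.sum()))
print("positives at 0.2:", int(pred_low.sum()))
```

Lowering the cutoff can only add predicted positives, never remove them, so the useful question is whether the extra true positives are worth the extra false positives for your problem.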

For naive Bayes classifiers such as MultinomialNB, you can also pass class_prior, which sets the prior probability P(y) for each class y. Changing the prior shifts the decision boundary as well, without any manual thresholding.

For example:

>>> from sklearn.naive_bayes import MultinomialNB
>>> X = [[1, 0], [1, 0], [0, 1]]
>>> y = [0, 0, 1]
>>> MultinomialNB().fit(X, y).predict([[1, 1]])
array([0])
>>> MultinomialNB(class_prior=[.1, .9]).fit(X, y).predict([[1, 1]])
array([1])
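
To see why the prior flips the prediction, the same toy example can be worked through by hand. The sketch below is plain Python, not scikit-learn's implementation: the helper names feature_log_probs and predict_one are hypothetical, and it assumes additive smoothing with alpha = 1 (MultinomialNB's default) and the usual multinomial naive Bayes score, log P(y) plus the count-weighted sum of log P(feature | y).

```python
import math

X = [[1, 0], [1, 0], [0, 1]]
y = [0, 0, 1]
ALPHA = 1.0  # additive smoothing; assumed equal to MultinomialNB's default


def feature_log_probs(X, y, alpha=ALPHA):
    """Return {class: [log P(feature_j | class)]} from smoothed counts."""
    n_features = len(X[0])
    log_probs = {}
    for c in sorted(set(y)):
        # Total count of each feature over the samples belonging to class c.
        counts = [
            sum(row[j] for row, label in zip(X, y) if label == c)
            for j in range(n_features)
        ]
        total = sum(counts) + alpha * n_features
        log_probs[c] = [math.log((n + alpha) / total) for n in counts]
    return log_probs


def predict_one(x, class_prior, log_probs):
    """Pick the class maximizing log P(y) + sum_j x_j * log P(feature_j | y)."""
    scores = {
        c: math.log(class_prior[c]) + sum(xj * lp for xj, lp in zip(x, lps))
        for c, lps in log_probs.items()
    }
    return max(scores, key=scores.get)


log_probs = feature_log_probs(X, y)

# Empirical prior from the training labels: [2/3, 1/3].
empirical_prior = [y.count(c) / len(y) for c in sorted(set(y))]

pred_empirical = predict_one([1, 1], empirical_prior, log_probs)
pred_skewed = predict_one([1, 1], [0.1, 0.9], log_probs)
print(pred_empirical, pred_skewed)  # prints "0 1"
```

With the empirical prior the unnormalized posteriors are 2/3 × 3/4 × 1/4 = 0.125 for class 0 and 1/3 × 1/3 × 2/3 ≈ 0.074 for class 1, so class 0 wins. With class_prior=[.1, .9] they become about 0.019 and 0.2, so class 1 wins.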