Using the predict_proba() function of RandomForestClassifier in the safe and right way

Question

asked Jul 18, 2019 in Machine Learning by ParasSharma1 (19k points)

I'm using scikit-learn to apply a machine learning algorithm to my datasets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. For example, instead of labeling emails as Spam/Not Spam, I want something like: a 0.78 probability that a given email is Spam.

For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:

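from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# min_samples_split must be at least 2 in current scikit-learn releases
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())
classifier = clf.fit(X, y)
predictions = classifier.predict_proba(Xtest)
print(predictions)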

And I got these results:

[ 0.4 0.6]
[ 0.1 0.9]
[ 0.2 0.8]
[ 0.7 0.3]
[ 0.3 0.7]
[ 0.3 0.7]
[ 0.7 0.3]
[ 0.4 0.6]

Here the second column is the class Spam. However, I have two issues with the results that I am not confident about. First, do the results represent the probabilities of the labels without being affected by the size of my data? Second, the results show only one digit, which is not specific enough in cases where a probability of 0.701 is very different from 0.708. Is there any way to get, for example, the next 5 digits?

Many thanks in advance for your time in reading these two issues and their questions.

1 Answer

Anurag · Answer 1 · 2019-07-19T07:40:03+0000

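A Random Forest classifier is an ensemble of decision trees. When a tree is grown to full depth (max_depth=None, as in your code), its leaves are typically pure, so each individual tree assigns probability 1 to one class and probability 0 to the other classes.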
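As a rough illustration of the single-tree behaviour described above, here is a minimal sketch. It uses synthetic data from make_classification as a stand-in for the email dataset (an assumption, since the asker's X and y are not shown), and the names tree, proba and is_one_hot are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# max_depth=None lets the tree keep splitting until its leaves are pure.
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)
proba = tree.predict_proba(X[:5])
print(proba)

# Check that every entry is either 0.0 or 1.0, i.e. each row is a hard vote.
is_one_hot = bool(np.all((proba == 0.0) | (proba == 1.0)))
print(is_one_hot)
```

If the tree is restricted (for example with a small max_depth or a larger min_samples_leaf), its leaves can be mixed and a single tree can then return fractional probabilities.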

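The Random Forest then combines the results of its trees. In scikit-learn, predict_proba() returns the average of the per-tree class probabilities, which for fully grown trees is the number of votes for each class divided by the number of trees in the forest. The resolution of the output is therefore exactly 1/n_estimators, which is why n_estimators=10 gives values in steps of 0.1. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators.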
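The relationship between the forest's output and its trees can be checked with a short sketch. It assumes synthetic data from make_classification in place of the asker's emails, and the variable names (forest, manual, on_grid) are illustrative only. It averages the per-tree probabilities by hand via the estimators_ attribute and compares the result with predict_proba().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the email data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

n_trees = 10
forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
forest.fit(X_train, y_train)
proba = forest.predict_proba(X_test)

# Average the per-tree probabilities by hand and compare with the forest output.
manual = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
matches = bool(np.allclose(proba, manual))
print(matches)

# With fully grown trees each tree casts a 0/1 vote, so every probability
# is a multiple of 1/n_trees and proba * n_trees is a whole number.
scaled = proba * n_trees
on_grid = bool(np.allclose(scaled, np.round(scaled)))
print(on_grid)
```

Raising n_estimators makes the grid finer (steps of 1/n_estimators), but a finer grid only changes the resolution of the output. It does not by itself make the estimated probabilities more accurate.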

