
I have a logistic regression and a random forest and I'd like to combine them (ensemble) for the final classification probability calculation by taking an average.

Is there a built-in way to do this in scikit-learn? Some way where I can use the ensemble of the two as a classifier itself? Or would I need to roll my own classifier?

by (6.8k points)

Combining probabilities/scores arbitrarily is problematic, because the performance of your different classifiers can differ (for example, an SVM with two different kernels, plus a random forest, plus another classifier trained on a different training set).

One potential way to weight the various classifiers is to use their Jaccard score as a weight. (But be warned: as far as I know, the scores of different classifiers are not all created equal. A Gradient Boosting classifier I have in my ensemble gives all its scores as 0.97, 0.98, 1.00 or 0.41/0, i.e. it's very overconfident.) For now, see the following example.

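import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    """Average the predicted probabilities of several classifiers."""

    def __init__(self, classifiers=None):
        self.classifiers = classifiers

    def fit(self, X, y):
        # Fit every base classifier on the same training data.
        for classifier in self.classifiers:
            classifier.fit(X, y)
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        # Collect each classifier's class probabilities, then average them.
        self.predictions_ = [clf.predict_proba(X) for clf in self.classifiers]
        return np.mean(self.predictions_, axis=0)

    def predict(self, X):
        # Pick the class with the highest averaged probability.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]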
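The weighting idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the synthetic data from make_classification, the hold-out split, and the names models, weights and weighted_proba are made up for the example, and in practice the weights would ideally come from a split that is not reused for the final evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split

# Synthetic binary data, only for illustration (an assumption, not from the question).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
for model in models:
    model.fit(X_train, y_train)

# Weight each model by its Jaccard score on the held-out split.
weights = np.array([jaccard_score(y_val, m.predict(X_val)) for m in models])

# Weighted average of the per-model class probabilities.
probas = np.stack([m.predict_proba(X_val) for m in models])
weighted_proba = np.average(probas, axis=0, weights=weights)
```

Because np.average normalises by the sum of the weights, each row of weighted_proba still sums to 1. This does not fix the overconfidence problem described above; a badly calibrated model still contributes badly calibrated probabilities.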

Also have a look at the documentation for sklearn.ensemble.VotingClassifier, which is the built-in way to do this. With voting='soft' it averages the predicted probabilities of its estimators:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
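For the logistic regression plus random forest case from the question, a minimal sketch with VotingClassifier might look like this. The synthetic data and the estimator names "lr" and "rf" are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of majority vote
)
ensemble.fit(X, y)

proba = ensemble.predict_proba(X[:5])  # averaged class probabilities
labels = ensemble.predict(X[:5])       # class with the highest averaged probability
```

The fitted ensemble behaves like any other scikit-learn classifier, so it can be passed to cross_val_score or placed in a Pipeline. VotingClassifier also accepts a weights argument if a weighted rather than plain average is wanted.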