class sklearn.ensemble.RandomForestClassifier(n_estimators=10,
                                              criterion='gini',
                                              max_depth=None,
                                              min_samples_split=2,
                                              min_samples_leaf=1,
                                              min_weight_fraction_leaf=0.0,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              bootstrap=True,
                                              oob_score=False,
                                              n_jobs=1,
                                              random_state=None,
                                              verbose=0,
                                              warm_start=False,
                                              class_weight=None)

I'm using a random forest model with 9 samples and about 7000 attributes. The samples fall into 3 categories that my classifier recognizes.

I know these are far from ideal conditions, but I'm trying to figure out which attributes are the most important for the predictions. Which parameters would be the best to tweak for optimizing feature importance?

I tried different values of n_estimators and noticed that the number of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.

I've read through the documentation, but if anyone has experience with this, I would like to know which parameters are the best to tune, with a brief explanation of why.

1 Answer

In the scikit-learn implementation, RandomForestClassifier has three parameters that are worth tuning first:

  • n_estimators

  • max_features

  • criterion

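n_estimators: more trees generally make the predictions and the feature importances more stable, at the cost of training time and with diminishing returns beyond a certain point. This also explains what you observed: each tree splits on only a handful of features, so with about 7000 attributes, adding trees gives more features a chance to receive a nonzero importance.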

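max_features: the number of features considered at each split. Try several values (for example 'sqrt', 'log2', or a fraction of the total) and compare the resulting accuracy.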

criterion: the measure of split quality ('gini' or 'entropy'). It usually has only a small impact, so the default is fine in most cases.
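As a rough illustration of the n_estimators effect on feature_importances_, here is a minimal sketch. The data is synthetic (make_classification with made-up sizes), not the dataset from the question, and the helper name count_nonzero_importances is only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a "few samples, many features" problem.
# The sizes below are assumptions chosen for illustration only.
X, y = make_classification(
    n_samples=30, n_features=500, n_informative=10,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

def count_nonzero_importances(n_trees):
    """Fit a forest with n_trees trees and count features with nonzero importance."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X, y)
    return int(np.count_nonzero(forest.feature_importances_))

# Compare a small, medium, and large forest on the same data.
counts = {n: count_nonzero_importances(n) for n in (10, 100, 500)}
for n, c in counts.items():
    print(f"n_estimators={n}: {c} features with nonzero importance")
```

A larger count here does not mean more features are truly informative; it mostly reflects that more trees have had the chance to split on more features. Ranking by importance is usually more useful than counting nonzero entries.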

You can use scikit-learn's GridSearchCV, which tries every combination in a parameter grid with cross-validation and reports the best-performing one.
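A grid search over the three parameters above might look like the following sketch. The dataset is synthetic and the grid values are arbitrary starting points, not recommendations for the data in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data; sizes are assumptions for illustration only.
X, y = make_classification(
    n_samples=90, n_features=200, n_informative=10,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Candidate values for each parameter; every combination is evaluated.
param_grid = {
    "n_estimators": [50, 200],
    "max_features": ["sqrt", "log2", 0.1],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold stratified cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))

# The refitted best model exposes feature_importances_ as usual.
best_forest = search.best_estimator_
top = best_forest.feature_importances_.argsort()[::-1][:10]
print("Indices of the 10 most important features:", top)
```

Note that with only 9 samples, any cross-validated score will be very noisy, so the "best" parameters found this way should be treated with caution.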

Hope this answer helps.
