How do I solve overfitting in random forest of Python sklearn?

Question

asked Jul 12, 2019 in Machine Learning by ParasSharma1 (19k points)

I am using RandomForestClassifier from the Python sklearn package to build a binary classification model. Below are the results of cross-validation:

Fold 1 : Train: 164 Test: 40
Train Accuracy: 0.914634146341
Test Accuracy: 0.55
Fold 2 : Train: 163 Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.707317073171
Fold 3 : Train: 163 Test: 41
Train Accuracy: 0.889570552147
Test Accuracy: 0.585365853659
Fold 4 : Train: 163 Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.756097560976
Fold 5 : Train: 163 Test: 41
Train Accuracy: 0.883435582822
Test Accuracy: 0.512195121951

I am using the "Price" feature to predict "quality", which is an ordinal value. In each cross-validation fold, there are about 163 training examples and 41 test examples.

Apparently, overfitting occurs here. Are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.

Thanks in advance!

1 Answer

Anurag · Answer 1 · 2019-07-15T07:14:37+0000

Your dataset is quite small for training a machine learning model reliably. If you can, collect more data, which lowers the chance of overfitting. With enough data, the model can separate real patterns from noise more easily.

Some random forest parameters can also be tuned to reduce overfitting:

n_estimators: Adding more trees does not make the forest overfit more; averaging over more trees reduces variance, so try increasing this parameter. The lower this number, the closer the model is to a single decision tree with a restricted feature set.
max_features: Try reducing this number. It sets how many randomly chosen features are considered when looking for the best split at each node. With a single input feature such as "Price", it has no effect.
max_depth: Limiting this parameter reduces the complexity of the learned trees, which lowers the risk of overfitting.
min_samples_leaf: Try setting this to a value greater than one. The effect is similar to max_depth: a split is only made if it leaves at least this many samples in each resulting leaf.
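
As a rough sketch of how the parameters listed above could be tuned, the example below runs a small grid search with 5-fold cross-validation and reports train and test accuracy for the best setting. The data is a synthetic stand-in generated with make_classification (204 rows, to match the question's sample size), and the grid values are arbitrary starting points, not recommendations; replace X and y with your own data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the asker's data (assumption: 204 rows, binary target).
X, y = make_classification(
    n_samples=204,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    flip_y=0.2,      # inject label noise so overfitting is visible
    random_state=0,
)

# Candidate values are illustrative only; widen or narrow them for real data.
param_grid = {
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": [1, "sqrt"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
    return_train_score=True,  # keep train scores to inspect the train/test gap
)
search.fit(X, y)

results = search.cv_results_
best = search.best_index_
print("Best parameters:", search.best_params_)
print("Mean train accuracy:", results["mean_train_score"][best])
print("Mean test accuracy:", results["mean_test_score"][best])
```

A large gap between the mean train and mean test accuracy suggests the trees are still too flexible; a smaller max_depth or a larger min_samples_leaf usually narrows it. With only about 200 rows the fold scores are noisy, so small differences between settings should not be over-interpreted.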

Hope this answer helps.

