
I am using the RandomForestClassifier from Python's sklearn package to build a binary classification model. Below are the cross-validation results:

Fold 1 : Train: 164  Test: 40
Train Accuracy: 0.914634146341
Test Accuracy: 0.55

Fold 2 : Train: 163  Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.707317073171

Fold 3 : Train: 163  Test: 41
Train Accuracy: 0.889570552147
Test Accuracy: 0.585365853659

Fold 4 : Train: 163  Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.756097560976

Fold 5 : Train: 163  Test: 41
Train Accuracy: 0.883435582822
Test Accuracy: 0.512195121951

I am using the "Price" feature to predict "quality", which is an ordinal value. Each cross-validation fold has about 163 training examples and 41 test examples.

Apparently, overfitting occurs here: training accuracy is around 0.87 to 0.91, while test accuracy ranges from about 0.51 to 0.76. Are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters here, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.

Thanks in advance!

1 Answer


Your dataset is quite small (about 200 examples) for training a machine learning model reliably. If you can, collect more data: with more examples the model is less likely to overfit, and real patterns are easier to separate from noise.

Several random forest parameters can be tuned to reduce overfitting:

  • n_estimators: Adding trees does not make a random forest overfit more; averaging over more trees reduces the variance of its predictions. So try increasing this parameter. The lower this number, the closer the model is to a single decision tree with a restricted feature set.
  • max_features: Try reducing this number. It sets how many features are randomly considered at each split (not per tree). With a single input feature such as "Price", there is nothing to reduce, so this parameter will not help here.
  • max_depth: Limiting the depth reduces the complexity of each tree, which lowers the risk of overfitting. The default, None, lets trees grow until the leaves are pure or too small to split.
  • min_samples_leaf: Try setting this to a value greater than one. It has a similar effect to max_depth: a split is only made if it leaves at least this many samples in each resulting leaf.
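
One way to tune the parameters above is a cross-validated grid search. The following is a minimal sketch, not a definitive recipe: the data is synthetic (a single noisy "price" column and a binary label, 204 rows to match the fold sizes in the question), and the candidate values in param_grid are illustrative assumptions rather than recommended settings. max_features is left out because there is only one feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real data (assumption): one noisy numeric
# feature and a binary label, 204 rows in total.
rng = np.random.default_rng(0)
price = rng.uniform(5, 100, size=204)
quality = (price + rng.normal(0, 30, size=204) > 50).astype(int)
X = price.reshape(-1, 1)

# Illustrative candidate values; adjust them to your own data.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    return_train_score=True,  # keep train scores to inspect the train/test gap
)
search.fit(X, quality)

# Compare mean train and test accuracy for the best parameter combination.
best = search.best_index_
print("Best parameters:", search.best_params_)
print("Mean train accuracy:", search.cv_results_["mean_train_score"][best])
print("Mean test accuracy:", search.cv_results_["mean_test_score"][best])
```

A smaller gap between the mean train and test accuracy suggests less overfitting. With only about 200 examples the test scores will still vary a lot from fold to fold, so treat small differences between parameter settings with caution.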

Hope this answer helps.
