I am using the RandomForestClassifier implemented in the Python sklearn package to build a binary classification model. Below are the results of 5-fold cross-validation:
Fold 1 : Train: 164 Test: 40
Train Accuracy: 0.914634146341
Test Accuracy: 0.55
Fold 2 : Train: 163 Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.707317073171
Fold 3 : Train: 163 Test: 41
Train Accuracy: 0.889570552147
Test Accuracy: 0.585365853659
Fold 4 : Train: 163 Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.756097560976
Fold 5 : Train: 163 Test: 41
Train Accuracy: 0.883435582822
Test Accuracy: 0.512195121951
I am using the "Price" feature to predict "quality" which is an ordinal value. In each cross-validation, there are 163 training examples and 41 test examples.
Overfitting clearly occurs here: training accuracy is high (around 0.87 to 0.91) while test accuracy is much lower and unstable (0.51 to 0.76). Are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters in the documentation, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.
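From the docs, I gather that tuning would look something like the sketch below with GridSearchCV (assuming a recent scikit-learn where it lives in sklearn.model_selection; the candidate values are just guesses), but I am not sure what ranges make sense for such a small dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values are guesses: larger min_samples_* and smaller max_depth
# both restrict how far each tree can grow, which should reduce overfitting.
param_grid = {
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)  # X, y as in the snippet above
print(search.best_params_)
print(search.best_score_)
```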
Thanks in advance!