
I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.

X = df[[my_features]] #all my features

y = df['gold_standard'] #labels

clf = RandomForestClassifier(random_state = 42, class_weight="balanced")

rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')

rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)


I am also running GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.

X = df[[my_features]] #all my features

y = df['gold_standard'] #labels

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')

param_grid = { 

    'n_estimators': [200, 500],

    'max_features': ['auto', 'sqrt', 'log2'],

    'max_depth' : [4,5,6,7,8],

    'criterion': ['gini', 'entropy']

}

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')

CV_rfc.fit(x_train, y_train)




pred = CV_rfc.predict_proba(x_test)[:,1]

print(roc_auc_score(y_test, pred))

However, I am not clear on how to combine feature selection (RFECV) with GridSearchCV.


When I ran the answer suggested by @Gambit, I got the following error:

ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),

   estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',

            criterion='gini', max_depth=None, max_features='auto',

            max_leaf_nodes=None, min_impurity_decrease=0.0,

            min_impurity_split=None, min_samples_leaf=1,

            min_samples_split=2, min_weight_fraction_leaf=0.0,

            n_estimators='warn', n_jobs=None, oob_score=False,

            random_state=42, verbose=0, warm_start=False),

   min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,

   verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.

I resolved this by prefixing the classifier's parameter names with estimator__ in param_grid, so that they are routed to the RandomForestClassifier wrapped inside RFECV.
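A minimal sketch of that fix, assuming GridSearchCV is wrapped around RFECV. The dataset (load_breast_cancer), the small grid, and the coarse step and fold counts are stand-ins chosen so the example runs quickly; they are not the values from the question.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in data; replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)

rfc = RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced")

# RFECV wraps the forest, so the forest is reachable as the "estimator" parameter.
rfecv = RFECV(estimator=rfc, step=5, cv=StratifiedKFold(3), scoring="roc_auc")

# Forest hyperparameters must be addressed as estimator__<name>;
# a bare "criterion" key raises the "Invalid parameter" ValueError shown above.
param_grid = {
    "estimator__max_depth": [2, 3],
    "estimator__criterion": ["gini", "entropy"],
}

search = GridSearchCV(rfecv, param_grid=param_grid, cv=StratifiedKFold(3), scoring="roc_auc")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Features kept by the refitted RFECV:", search.best_estimator_.n_features_)
```

The valid keys can be listed with rfecv.get_params().keys(), as the error message suggests.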

My question now is: how do I apply the selected features and parameters to x_test to verify that the model works on unseen data? How can I obtain the best features and train the model with the optimal hyperparameters?

I am happy to provide more details if needed.

1 Answer


To tune the classifier's hyperparameters with cross-validation after selecting features with recursive feature elimination with cross-validation, use a Pipeline object: it chains the data transformation (here, the feature selection) with the final estimator, so both are fitted together on the training data.

This is how the code looks:

from sklearn.datasets import load_breast_cancer

from sklearn.feature_selection import RFECV

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.pipeline import Pipeline

# this is the classifier used for feature selection

clf_featr_sele = RandomForestClassifier(n_estimators=30, random_state = 42, class_weight="balanced") 

rfecv = RFECV(estimator=clf_featr_sele, step=1, cv=5, scoring = 'roc_auc')

# you can use a different classifier as the final classifier

clf = RandomForestClassifier(n_estimators=10, random_state = 42, class_weight="balanced") 

CV_rfc = GridSearchCV(clf, param_grid={'max_depth':[2,3]}, cv= 5, scoring = 'roc_auc')

pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])

pipeline.fit(X_train, y_train)
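To check the fitted pipeline on unseen data and see what it selected, something like the following sketch may help. It rebuilds the same two-step pipeline so that it runs on its own. The variable names (selector, search, selected_mask, test_auc) are illustrative, and step=5 is an assumption made only to shorten the run time; step=1 as above works the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Step 1: feature selection (step=5 is an assumption to keep the example fast).
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=42, class_weight="balanced"),
    step=5, cv=5, scoring="roc_auc",
)
# Step 2: hyperparameter search for the final classifier.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42, class_weight="balanced"),
    param_grid={"max_depth": [2, 3]}, cv=5, scoring="roc_auc",
)
pipeline = Pipeline([("feature_sele", selector), ("clf_cv", search)])
pipeline.fit(X_train, y_train)  # only the training split is used for selection and tuning

# Inspect the fitted steps by the names given in the Pipeline.
fitted_selector = pipeline.named_steps["feature_sele"]
fitted_search = pipeline.named_steps["clf_cv"]
selected_mask = fitted_selector.support_   # boolean mask over the input columns
best_params = fitted_search.best_params_   # hyperparameters chosen by the grid search

# The pipeline applies the same feature mask to X_test before predicting.
test_scores = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)

print("Number of selected features:", fitted_selector.n_features_)
print("Best hyperparameters:", best_params)
print("Test ROC AUC:", test_auc)
```

With a pandas DataFrame as input, the selected column names could be recovered as X.columns[selected_mask]. Because GridSearchCV refits the best model by default, the pipeline's final step is already trained on the selected features with the chosen hyperparameters.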

