classifiers in scikit-learn that handle nan/null

Question

asked Jul 11, 2019 in Machine Learning by ParasSharma1 (19k points)

I was wondering if there are classifiers that handle nan/null values in scikit-learn. I thought a random forest regressor handles this, but I got an error when I called predict.

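import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor().fit(X_train, y_train)  # data goes to fit(), not the constructor
X_test = np.array([[7, 8, np.nan]])  # predict expects a 2-D array (one row per sample)
y_pred = clf.predict(X_test) # Fails!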

Can I not call predict with any scikit-learn algorithm with missing values?

Now that I think about this, it makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
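To illustrate the k-NN idea, here is a minimal NumPy sketch of a nearest-neighbour vote whose distance skips missing coordinates. The helper names `nan_aware_distances` and `knn_predict` are made up for this example and are not scikit-learn functions. Rescaling by the number of shared coordinates is one possible design choice, not the only one.

```python
import numpy as np

def nan_aware_distances(X, x):
    """Euclidean distance from each row of X to x, using only the
    coordinates that are present in both (hypothetical helper)."""
    diff = X - x                          # NaN wherever either side is missing
    present = ~np.isnan(diff)
    sq = np.where(present, diff ** 2, 0.0).sum(axis=1)
    n_present = present.sum(axis=1)
    n_features = X.shape[1]
    # Rescale so rows compared on fewer coordinates are not favoured;
    # rows sharing no coordinate with x get an infinite distance.
    with np.errstate(divide="ignore", invalid="ignore"):
        scaled = np.where(n_present > 0, sq * n_features / n_present, np.inf)
    return np.sqrt(scaled)

def knn_predict(X_train, y_train, x, k=1):
    """Majority vote among the k nearest training rows (hypothetical helper)."""
    d = nan_aware_distances(X_train, x)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]], dtype=float)
y_train = np.array([1, 2])
x_test = np.array([7, 8, np.nan])
pred = knn_predict(X_train, y_train, x_test, k=1)
```

With the data from the question, each training row shares exactly one coordinate with the test point, so the prediction rests on very little information. That is the main weakness of ignoring nulls in the distance.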

Some GBM libraries (such as xgboost) handle this inside the tree itself: each split also learns a default direction, so a missing value is sent to one of the two children instead of needing a value to compare against the threshold. sklearn uses a plain binary tree.
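To make the idea of a dedicated route for missing values concrete, here is a minimal sketch of a hand-built tree in which every split stores, next to its two ordinary children, the node to follow when the feature is NaN. That node can be a separate third child or simply one of the two existing children used as a default. The `Leaf`, `Split` and `predict_one` names are invented for illustration and do not reflect the internals of xgboost or sklearn.

```python
import math
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    value: float

@dataclass
class Split:
    feature: int
    threshold: float
    left: "Node"      # taken when x[feature] <= threshold
    right: "Node"     # taken when x[feature] > threshold
    missing: "Node"   # taken when x[feature] is NaN

Node = Union[Leaf, Split]

def predict_one(node, x):
    """Walk from the root to a leaf, using the 'missing' route on NaN."""
    while isinstance(node, Split):
        v = x[node.feature]
        if math.isnan(v):
            node = node.missing
        elif v <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value

# A one-split tree on feature 2; the leaf values are arbitrary.
tree = Split(feature=2, threshold=4.0,
             left=Leaf(1.0), right=Leaf(2.0), missing=Leaf(1.5))
y = predict_one(tree, [7.0, 8.0, float("nan")])
```

Because the route for NaN is fixed when the tree is built, prediction never has to compare a missing value against a threshold.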

1 Answer

Anurag · Answer 1 · 2019-07-12T06:14:40+0000

You can use the SimpleImputer class in the scikit-learn library. It imputes (fills in) the missing values with the mean, median, or most frequent value (mode) of each column.

For example:

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)
X_train_imputed = imp.transform(X_train)  # NaNs replaced by the column means
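As a sketch of how this could fit the question's setup, the imputer can be chained with the model in a pipeline, so the column means learned from the training data are also applied at predict time. The estimator settings (`n_estimators=10`, `random_state=0`) are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
X_test = np.array([[7, 8, np.nan]])  # 2-D: one row per sample

# The imputer fills NaNs with training-set column means before the
# forest sees the data, both in fit() and in predict().
model = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="mean"),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Keeping the imputer inside the pipeline avoids fitting it separately on the test data, which would leak test statistics into the fill values.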

Hope this answer helps.
