in Machine Learning by (13.5k points)

I was wondering if there are estimators in scikit-learn that handle NaN/null values. I thought a random forest regressor handled this, but I got an error when I called predict.

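import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3],[np.nan, 5, 6]])

y_train = np.array([1, 2])

# Data is passed to fit(), not to the constructor.
# On scikit-learn releases whose forests reject NaN, a ValueError is raised as soon as the NaN values reach the estimator.
clf = RandomForestRegressor().fit(X_train, y_train)

X_test = np.array([[7, 8, np.nan]])  # predict() expects a 2D array

y_pred = clf.predict(X_test) # Fails!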

Can I not call predict with any scikit-learn algorithm with missing values?

Now that I think about this, it makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
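The k-NN idea can be sketched with scikit-learn's nan_euclidean_distances, which computes distances only over the coordinates present in both rows and rescales them. The helper knn_predict below is a hypothetical name written for this illustration, not a scikit-learn API, and it is only a minimal sketch of the idea.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def knn_predict(X_train, y_train, X_test, k=1):
    """Hypothetical helper: average the targets of the k nearest training rows."""
    # Distances skip coordinates that are NaN in either row
    dist = nan_euclidean_distances(X_test, X_train)
    # Column indices of the k smallest distances for each test row
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1.0, 2.0])
X_test = np.array([[7, 8, np.nan]])

y_pred = knn_predict(X_train, y_train, X_test, k=1)
print(y_pred)
```

Here the test row shares only its first coordinate with the first training row and only its second coordinate with the second, and the second row ends up closer, so the prediction is that row's target, 2.0.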

Some GBM libraries (such as XGBoost) handle missing values inside the tree itself: each split keeps its two children for the yes/no decision and also learns a default direction that rows with a missing value follow. sklearn uses a plain binary tree.

1 Answer

by (33.1k points)

You can simply use the SimpleImputer class in the scikit-learn library. It imputes missing values, replacing each one with the mean, median, or most frequent value of its column.

For example:

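import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column, learned from the training data
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imp = imp.fit(X_train)

# Fill the gaps before the data is handed to the model
X_train_imputed = imp.transform(X_train)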
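To connect this to the question, here is a minimal end-to-end sketch that chains the imputer and the forest in a Pipeline, so the column means learned from the training data are also applied at predict time. The arrays are the ones from the question; n_estimators and random_state are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
X_test = np.array([[7, 8, np.nan]])  # 2D: one row, three features

# The imputer runs first, so the forest never sees a NaN
model = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="mean"),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred)
```

Keeping both steps in one pipeline avoids forgetting the transform on new data, which is the situation that triggers the error in the question.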

Hope this answer helps.
