Feature selection using scikit-learn

Question

asked Jul 19, 2019 in Machine Learning by ParasSharma1 (19k points)

I'm new to machine learning. I'm preparing my data for classification using a scikit-learn SVM. To select the best features, I used the following method:

SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset contains negative values, I get the following error:

ValueError Traceback (most recent call last)
/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1
      2
      3
      4
      5
/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427 else:
    428 # fit method of arity 2 (supervised transformation)
--> 429 return self.fit(X, y, **fit_params).transform(X)
    430
    431
/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300 self._check_params(X, y)
    301
--> 302 self.scores_, self.pvalues_ = self.score_func(X, y)
    303 self.scores_ = np.asarray(self.scores_)
    304 self.pvalues_ = np.asarray(self.pvalues_)
/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190 X = atleast2d_or_csr(X)
    191 if np.any((X.data if issparse(X) else X) < 0):
--> 192 raise ValueError("Input X must be non-negative.")
    193
    194 Y = LabelBinarizer().fit_transform(y)
ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

1 Answer

Anurag · Answer 1 · 2019-07-20T04:24:23+0000

The error message you got:

Input X must be non-negative

tells you that Pearson's chi-square test (goodness of fit) does not apply to negative values. The chi-square test assumes a frequency distribution, and a frequency can't be a negative number. That is why sklearn.feature_selection.chi2 requires the input to be non-negative.

The features in your data are "min, max, mean, median and FFT of accelerometer signal". In many cases it is quite safe to simply shift each feature so that all of its values are positive, or even to normalize each feature to the [0, 1] interval.
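As a minimal sketch of the rescaling approach, assuming a reasonably recent scikit-learn: the arrays A1 and A2 below are synthetic stand-ins for the asker's data (random numbers, not real accelerometer features), and MinMaxScaler is one possible way to map each feature to [0, 1] before chi2 scoring.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for A1 (features, including negative values)
# and A2 (class labels); replace these with your own data.
rng = np.random.default_rng(0)
A1 = rng.normal(size=(100, 20))
A2 = rng.integers(0, 2, size=100)

# Rescale every feature to [0, 1] so that chi2 accepts it,
# then keep the 10 highest-scoring features.
selector = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=10))
A1_selected = selector.fit_transform(A1, A2)

print(A1_selected.shape)
```

Putting the scaler and the selector in one pipeline means the same scaling is reapplied when the fitted pipeline later transforms new samples.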

If data transformation is for some reason not possible (e.g. a negative value is an important factor), then you should pick another statistic to score your features:

sklearn.feature_selection.f_classif computes the ANOVA F-value
sklearn.feature_selection.mutual_info_classif computes the mutual information
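Both score functions can be passed to SelectKBest in place of chi2, and neither requires non-negative input. The sketch below assumes a scikit-learn version that provides mutual_info_classif (it is not available in very old releases), and again uses synthetic stand-ins for A1 and A2.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic stand-ins for A1 (features, including negative values)
# and A2 (class labels); replace these with your own data.
rng = np.random.default_rng(0)
A1 = rng.normal(size=(100, 20))
A2 = rng.integers(0, 2, size=100)

# ANOVA F-value scoring: works directly on negative values.
A1_f = SelectKBest(f_classif, k=10).fit_transform(A1, A2)

# Mutual information scoring: also works on negative values.
A1_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(A1, A2)

print(A1_f.shape, A1_mi.shape)
```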

Since this step only prepares the features for another method, it's not a big deal which statistic you pick; the end results are usually the same or very close.

For more details, study the Scikit Learn Cheat Sheet and Datasets For Machine Learning.

Hope this answer helps.
