in Machine Learning by (19k points)

I'm new to machine learning. I'm preparing my data for classification with a scikit-learn SVM. To select the best features, I used the following method:

SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset contains negative values, I get the following error:

ValueError                                Traceback (most recent call last)
/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 
      2 
      3 
      4 
      5 

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)
--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X)
    191     if np.any((X.data if issparse(X) else X) < 0):
--> 192         raise ValueError("Input X must be non-negative.")
    193 
    194     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

1 Answer

0 votes
by (33.1k points)

The error message you got:

Input X must be non-negative

tells you that Pearson's chi-square test (goodness of fit) does not apply to negative values. The chi-square test assumes that the inputs are frequencies (counts), and a frequency can't be negative. That is why sklearn.feature_selection.chi2 checks that the input is non-negative and raises this error otherwise.

Your features are "min, max, mean, median and FFT of accelerometer signal". In many cases it is quite safe to simply shift each feature so that all of its values are non-negative, or even to normalize each feature to the [0, 1] interval.
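As a minimal sketch of the rescaling approach, the snippet below normalizes each feature to [0, 1] with MinMaxScaler before running the chi-square selection. The random arrays are only a stand-in for your data (the names A1 and A2 are taken from the question; the shapes and values are assumptions made for illustration).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data (assumption: 100 samples, 20 features).
rng = np.random.RandomState(0)
A1 = rng.randn(100, 20)            # features, includes negative values
A2 = rng.randint(0, 2, size=100)   # binary class labels

# Rescale every feature column to the [0, 1] interval so chi2 accepts it.
A1_scaled = MinMaxScaler().fit_transform(A1)

# Keep the 10 features with the highest chi-square scores.
selector = SelectKBest(chi2, k=10)
A1_selected = selector.fit_transform(A1_scaled, A2)

print(A1_selected.shape)
```

If you later evaluate the classifier on held-out data, it is usually safer to fit the scaler and the selector on the training split only and reuse them on the test split.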

If transforming the data is for some reason not possible (e.g. the sign of a value carries important information), then you should pick another statistic to score your features:

sklearn.feature_selection.f_classif computes the ANOVA F-value

sklearn.feature_selection.mutual_info_classif computes the mutual information
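Both scoring functions accept negative inputs, so they can be passed to SelectKBest in place of chi2 without rescaling. The following is a rough sketch on synthetic data (the arrays are an assumption, standing in for A1 and A2 from the question); it prints which feature indices each method keeps so you can compare them on your own data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic stand-in for the real data (assumption: 100 samples, 20 features).
rng = np.random.RandomState(0)
A1 = rng.randn(100, 20)            # features, includes negative values
A2 = rng.randint(0, 2, size=100)   # binary class labels

# Fit one selector per scoring function; neither requires non-negative input.
anova = SelectKBest(f_classif, k=10).fit(A1, A2)
mi = SelectKBest(mutual_info_classif, k=10).fit(A1, A2)

# Column indices of the features each method keeps.
anova_idx = anova.get_support(indices=True)
mi_idx = mi.get_support(indices=True)

print("ANOVA F-value picks:", anova_idx)
print("Mutual information picks:", mi_idx)
print("Chosen by both:", np.intersect1d(anova_idx, mi_idx))
```

Note that mutual_info_classif involves some randomness, so its picks can vary between runs unless its random_state is fixed.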

Since the goal is only to prepare the features for another method, it is not a big deal which one you pick; the end results are usually the same or very close.

For more details, see the Scikit Learn Cheat Sheet and Datasets For Machine Learning.

Hope this answer helps.
