Use case:

I have a small dataset with about 3-10 samples in each class, and I am using sklearn's SVC with an RBF kernel to classify them. I need the confidence of each prediction along with the predicted class, so I used the predict_proba method of SVC. I was getting weird results with it, and after some searching I found that predict_proba is only meaningful for larger datasets.
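For reference, a minimal sketch of what I am doing (the toy data here is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up stand-in for my data: 2 classes, 5 samples each.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
              [2.0, 2.1], [2.2, 2.0], [1.9, 2.3], [2.3, 1.8], [2.1, 2.4]])
y = np.array([0] * 5 + [1] * 5)

# probability=True enables predict_proba via Platt scaling, which is fit
# with an internal five-fold cross-validation -- problematic with 3-10
# samples per class.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X, y)

print(clf.predict(X[:1]))        # predicted class
print(clf.predict_proba(X[:1]))  # "confidence" -- unreliable on tiny datasets
```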

I found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.

The author of that question verified this by duplicating the dataset many times over.

My questions:

1) If I multiply my dataset by, say, 100, so that each sample appears 100 times, the "correctness" of predict_proba improves. What side effects will this have? Overfitting?

2) Is there any other way I can calculate the confidence of the classifier, such as the distance from the hyperplanes? (A sketch of what I mean follows after this list.)

3) For this small sample size, is SVM a recommended algorithm, or should I choose something else?
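Regarding question 2, I know SVC exposes decision_function, which returns a signed score whose magnitude grows with the distance from the separating hyperplane (it is not a calibrated probability). A minimal sketch on the same made-up toy data as above:

```python
import numpy as np
from sklearn.svm import SVC

# Same made-up toy data as in the sketch above.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
              [2.0, 2.1], [2.2, 2.0], [1.9, 2.3], [2.3, 1.8], [2.1, 2.4]])
y = np.array([0] * 5 + [1] * 5)

clf = SVC(kernel="rbf")  # probability=True is not needed for decision_function
clf.fit(X, y)

# Signed score: the sign gives the predicted side of the boundary, the
# magnitude grows with distance from it (uncalibrated, not a probability).
scores = clf.decision_function(X)
confidence = np.abs(scores)  # rough, uncalibrated confidence proxy
print(scores[:3], confidence[:3])
```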

1 Answer


Oversampling your data will do little to improve accuracy with an SVM. An SVM is built on the concept of support vectors: the boundary samples of a class that define what is in the class and what is not. Oversampling cannot construct a new support vector, because exact copies add no new point locations (I am assuming you are already using the training set as the test set).
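You can check this with a quick experiment. A minimal sketch on made-up toy data, showing that duplicating every sample leaves the support-vector locations where they were:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data: two well-separated classes, 5 samples each.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
              [2.0, 2.1], [2.2, 2.0], [1.9, 2.3], [2.3, 1.8], [2.1, 2.4]])
y = np.array([0] * 5 + [1] * 5)

clf = SVC(kernel="rbf").fit(X, y)

# Duplicate every sample 100 times and refit.
X_dup = np.repeat(X, 100, axis=0)
y_dup = np.repeat(y, 100)
clf_dup = SVC(kernel="rbf").fit(X_dup, y_dup)

# Support vectors of the duplicated model can only sit at locations already
# present in the original data: exact copies add no new geometry.
sv_orig = np.unique(clf.support_vectors_, axis=0)
sv_dup = np.unique(clf_dup.support_vectors_, axis=0)
print(sv_orig)
print(sv_dup)  # on toy data like this, typically the same point locations
```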

In this situation, plain oversampling will not give you any new information about the confidence, apart from artifacts introduced by unbalanced oversampling: the instances are exact copies, so the data distribution does not change.

You may get somewhere with SMOTE (Synthetic Minority Oversampling Technique), which generates synthetic instances interpolated from the ones you have. In theory these are new instances rather than exact copies, and they may therefore fall slightly outside the original decision regions. Note, however, that by construction all these synthetic examples lie between the original examples in your input space. That does not mean they lie between them in the projected SVM space, so the classifier may learn effects that are not real.
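A minimal SMOTE sketch, assuming the imbalanced-learn package (imblearn, separate from sklearn) and the same made-up toy data as above. Note that k_neighbors must be smaller than your smallest class, which matters with only 3-10 samples per class:

```python
import numpy as np
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed
from sklearn.svm import SVC

# Same made-up toy data: 5 samples per class.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
              [2.0, 2.1], [2.2, 2.0], [1.9, 2.3], [2.3, 1.8], [2.1, 2.4]])
y = np.array([0] * 5 + [1] * 5)

# k_neighbors must be < samples in the smallest class (here 5, so at most 4).
# Passing sampling_strategy as a dict grows *both* classes to 50 samples;
# with balanced classes the default ('auto') would generate nothing.
smote = SMOTE(k_neighbors=3, sampling_strategy={0: 50, 1: 50}, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

clf = SVC(kernel="rbf", probability=True).fit(X_res, y_res)
print(X_res.shape)  # (100, 2): the 10 original points plus 90 synthetic ones
```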
