
Can someone explain when to use LinearSVC vs. SVC(kernel="linear")?

From my research, I found conflicting results. It seems like LinearSVC is marginally better than SVC but also finickier. But if scikit-learn decided to spend time implementing a special case for linear classification, why wouldn't LinearSVC outperform SVC?


Mathematically, training an SVM is a convex optimization problem, usually with a unique minimizer, so there is only one solution to the mathematical optimization problem.

The differences in results come from several aspects. SVC and LinearSVC are supposed to optimize the same problem, but in fact all liblinear estimators penalize the intercept, whereas libsvm ones don't (IIRC). This results in a completely different mathematical optimization problem and thus different results. There may also be other subtle differences, such as scaling and the default loss function (edit: make sure you set loss='hinge' in LinearSVC). Next, in multiclass classification, liblinear does one-vs-rest by default, whereas libsvm does one-vs-one.
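As a rough sketch of how to bring the two closer together: you can set loss='hinge' so LinearSVC uses the same loss as SVC, and increase intercept_scaling so liblinear's penalty on the intercept matters less. The dataset here is synthetic and illustrative only; the fitted coefficients should come out close, but not identical.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# Synthetic binary toy data (illustrative only)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Align LinearSVC with SVC(kernel='linear'): regular hinge loss, and a
# large intercept_scaling so the (penalized) intercept term matters less.
lin = LinearSVC(loss='hinge', intercept_scaling=10.0, max_iter=10000).fit(X, y)
svc = SVC(kernel='linear').fit(X, y)

# Both expose one coefficient vector per hyperplane; compare them directly.
print(lin.coef_)
print(svc.coef_)
```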

SGDClassifier(loss='hinge') is different from the other two in the sense that it uses stochastic gradient descent rather than exact gradient-based optimization, so it will not converge to the identical solution. However, the obtained solution may generalize better.

Between SVC and LinearSVC, one important decision criterion is that LinearSVC tends to converge faster as the number of samples grows. This is because the linear kernel is a special case that is optimized for in liblinear, but not in libsvm.
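A rough timing sketch of this (synthetic data; absolute numbers depend on the machine, so no expected output is claimed):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# Moderately large synthetic set (illustrative only)
X, y = make_classification(n_samples=5000, random_state=0)

t0 = time.perf_counter()
LinearSVC(max_iter=10000).fit(X, y)
t_linear = time.perf_counter() - t0

t0 = time.perf_counter()
SVC(kernel='linear').fit(X, y)
t_svc = time.perf_counter() - t0

# LinearSVC usually scales better with n_samples; verify on your machine.
print('LinearSVC: %.3fs, SVC: %.3fs' % (t_linear, t_svc))
```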

Indeed, LinearSVC and SVC(kernel='linear') yield different results, i.e. different metric scores and decision boundaries, because they use different approaches. The toy example below illustrates it:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# Toy data so the example runs end to end
X, y = make_classification(n_samples=200, random_state=0)

clf_1 = LinearSVC().fit(X, y)  # possible to pass loss='hinge'
clf_2 = SVC(kernel='linear').fit(X, y)

score_1 = clf_1.score(X, y)
score_2 = clf_2.score(X, y)

print('LinearSVC score %s' % score_1)
print('SVC score %s' % score_2)

The key reasons for this difference are the following:

1. By default, LinearSVC minimizes the squared hinge loss, while SVC minimizes the regular hinge loss. You can set loss='hinge' in LinearSVC to use the regular hinge loss instead.
2. LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass reduction, while SVC uses the One-vs-One multiclass reduction. So for a multiclass classification problem, SVC fits N * (N - 1) / 2 models, where N is the number of classes.
3. LinearSVC, by contrast, simply fits N models. If the classification problem is binary, only one model is fit in both cases. The multi_class and decision_function_shape parameters are unrelated: the second is an aggregator that transforms the results of the decision function into the convenient shape (n_samples, n_classes), while multi_class determines the algorithmic approach used to obtain a solution.
4. The underlying estimator for LinearSVC is liblinear, which does, in fact, penalize the intercept; SVC uses libsvm estimators, which do not. liblinear estimators are optimized for the linear (special) case and therefore converge more quickly on large amounts of data than libsvm. That is why LinearSVC takes less time to solve the problem.
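The multiclass points above can be sketched on a 4-class toy problem (synthetic data, illustrative only): LinearSVC fits N = 4 one-vs-rest hyperplanes, while SVC with decision_function_shape='ovo' exposes N * (N - 1) / 2 = 6 pairwise columns.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# 4-class synthetic toy problem (illustrative only)
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=0)

lin = LinearSVC(max_iter=10000).fit(X, y)
svc = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)

print(lin.coef_.shape)                 # one-vs-rest: N = 4 hyperplanes
print(svc.decision_function(X).shape)  # one-vs-one: N*(N-1)/2 = 6 columns
```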
