
I'm trying to obtain the most informative features from a text corpus. From this well-answered question I know that this can be done as follows:

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print(classlabel, feat, coef)

Then:

most_informative_feature_for_class(tfidf_vect, clf, 5)

For this classifier:

# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

The problem is the output of most_informative_feature_for_class:

5 a_base_de_bien bastante   (0, 2451) -0.210683496368

  (0, 2074) 0.310556919237

  (0, 5262) 0.176400451433

  (0, 6373) 0.290124806858

  (0, 8593) 0.290124806858

  (0, 12002)    0.282832270298

  (0, 15008)    0.290124806858

  (0, 19207)    0.326774799211

It returns neither the label nor the words. Why is this happening, and how can I print the words and the labels? Do you think this happens because I am using pandas to read the data? Another thing I tried is the following, from this question:

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

print_top10(tfidf_vect,clf,5)

But I get this traceback:

Traceback (most recent call last):

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>

    print_top10(tfidf_vect,clf,5)

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10

    for i, class_label in enumerate(class_labels):

TypeError: 'int' object is not iterable

Any idea how to solve this, so that I get the features with the highest coefficient values?

1 Answer


To solve this for a linear SVM specifically, it helps to understand how sklearn formulates the SVM and how that differs from MultinomialNB.

most_informative_feature_for_class works for MultinomialNB because its coef_ is essentially the log probability of each feature given a class, with shape [n_classes, n_features], which follows from the formulation of the Naive Bayes problem. (Newer scikit-learn releases expose these values as feature_log_prob_ and no longer provide coef_ on MultinomialNB.)

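For an SVM, coef_ is less straightforward. For a (linear) SVC it has shape [n_classes * (n_classes - 1) / 2, n_features], because a separate binary model is fitted for every pair of classes (one-vs-one). When the classifier is trained on a sparse matrix, coef_ is itself sparse, so it has to be converted with toarray() before its rows can be zipped with the feature names.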
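To see which pair of classes each row of coef_ belongs to, the pairs can be enumerated with the standard library. This is only a sketch: it assumes the rows follow scikit-learn's documented one-vs-one ordering ("0 vs 1", "0 vs 2", ..., "1 vs 2", ...) over the sorted labels in classifier.classes_. ovo_row_pairs is a hypothetical helper name, and the labels are the ones used in the answer's example.

```python
from itertools import combinations

def ovo_row_pairs(classes):
    """Pair of classes for each row of a one-vs-one coef_ matrix.

    Assumes rows are ordered "0 vs 1", "0 vs 2", ..., "1 vs 2", ...
    over the sorted class labels (hypothetical helper, not a scikit-learn API).
    """
    return list(combinations(sorted(classes), 2))

pairs = ovo_row_pairs(['bs', 'pt', 'es', 'sr'])
for row, (first, second) in enumerate(pairs):
    print(row, first, 'vs', second)
```

With four classes this gives 4 * 3 / 2 = 6 rows. Under the ordering assumption, row 3 would be the es vs pt model.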

For example:

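def most_informative_feature_for_class_svm(vectorizer, classifier, labelid, n=10):
    # labelid: index of the row of coef_ (one row per pair of classes) we're interested in.
    # get_feature_names() on scikit-learn releases before 1.0
    feature_names = vectorizer.get_feature_names_out()
    # coef_ is sparse when the SVC was trained on sparse input, so make it dense
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print(feat, coef)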
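The question's output is what happens when the row taken from coef_ is still two-dimensional: iterating over a 1 x n row yields the whole row once, so zip pairs it with only the first feature name. The sketch below imitates this with plain nested lists. They are a stand-in for the sparse row, not scikit-learn itself, and the weights and feature names are made up.

```python
feature_names = ['casa', 'de', 'que']   # made-up vocabulary
flat_row = [0.31, -0.21, 0.18]          # made-up weights, one per feature
row_2d = [flat_row]                     # 1 x n, like a row sliced from a sparse matrix

# Iterating over the 2-D row yields a single item: the whole row.
broken = sorted(zip(row_2d, feature_names))
print(broken)   # one (row, first feature name) pair

# Iterating over the flat row yields one weight per feature.
fixed = sorted(zip(flat_row, feature_names))[-2:]
for coef, feat in fixed:
    print(feat, coef)
```

In the answer's function, classifier.coef_.toarray() plays the flattening role: indexing the dense array with labelid gives a one-dimensional row.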

You can simply print out the labels and the top n features according to the coefficient vector:

For example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

trainfile = 'train.txt'

# Vectorizing data (train.txt is assumed to hold one document per line, one line per tag).
word_vectorizer = CountVectorizer(analyzer='word')
with open(trainfile, encoding='utf8') as f:
    trainset = word_vectorizer.fit_transform(f)
tags = ['bs', 'pt', 'es', 'sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Training a linear SVM on the same data
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    # get_feature_names() on scikit-learn releases before 1.0
    feature_names = vectorizer.get_feature_names_out()
    # feature_log_prob_ holds the values that older releases also exposed as coef_
    topn = sorted(zip(classifier.feature_log_prob_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print(classlabel, feat, coef)

def most_informative_feature_for_class_svm(vectorizer, classifier, n=10):
    labelid = 3  # this is the row of coef_ we're interested in
    feature_names = vectorizer.get_feature_names_out()
    svm_coef = classifier.coef_.toarray()  # dense copy of the sparse coef_
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print(feat, coef)

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
print()
most_informative_feature_for_class_svm(word_vectorizer, svcc)

Output:

pt teve -4.63472898823

pt tive -4.63472898823

pt todas -4.63472898823

pt vida -4.63472898823

pt de -4.22926388012

pt foi -4.22926388012

pt mais -4.22926388012

pt me -4.22926388012

pt as -3.94158180767

pt que -3.94158180767

no 0.0204081632653

parecer 0.0204081632653

pone 0.0204081632653

por 0.0204081632653

relación 0.0204081632653

una 0.0204081632653

visto 0.0204081632653

ya 0.0204081632653

es 0.0408163265306

lo 0.0408163265306
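The TypeError from print_top10 in the question has a separate cause: the function loops over class_labels, so it needs an iterable with one label per row of coef_, not the single integer 5. Here is a minimal sketch of the same idea with plain lists. top_features_per_row is a hypothetical name, and the weights, row labels and feature names are made up for illustration. With a real linear SVC the rows would come from clf.coef_.toarray().

```python
import heapq

def top_features_per_row(weight_rows, row_labels, feature_names, n=10):
    """Map each row label to its n highest-weighted features, best first."""
    result = {}
    for label, row in zip(row_labels, weight_rows):
        best = heapq.nlargest(n, zip(row, feature_names))
        result[label] = [feat for _, feat in best]
    return result

feature_names = ['de', 'foi', 'lo', 'una']    # made-up vocabulary
weight_rows = [[0.1, 0.4, -0.2, -0.3],        # made-up weights, row 0
               [-0.1, -0.4, 0.3, 0.2]]        # made-up weights, row 1
row_labels = ['row 0', 'row 1']               # one label per row of weights

top = top_features_per_row(weight_rows, row_labels, feature_names, n=2)
for label, feats in top.items():
    print("%s: %s" % (label, " ".join(feats)))
```

For a multiclass SVC the row labels would be class pairs rather than single classes, since each row of coef_ belongs to one binary model.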

Hope this answer helps.
