Im triying to obtain the most informative features from a textual corpus. From this well answered question I know that this task could be done as follows:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
labelid = list(classifier.classes_).index(classlabel)
feature_names = vectorizer.get_feature_names()
topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
for coef, feat in topn:
print classlabel, feat, coef
Then:
most_informative_feature_for_class(tfidf_vect, clf, 5)
For this classfier:
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)
The problem is the output of most_informative_feature_for_class:
5 a_base_de_bien bastante (0, 2451) -0.210683496368
(0, 2074) 0.310556919237
(0, 5262) 0.176400451433
(0, 6373) 0.290124806858
(0, 8593) 0.290124806858
(0, 12002) 0.282832270298
(0, 15008) 0.290124806858
(0, 19207) 0.326774799211
It is not returning the label nor the words. Why this is happening and how can I print the words and the labels?. Do you guys this is happening since I am using pandas to read the data?. Another thing I tried is the following, form this question:
def print_top10(vectorizer, clf, class_labels):
"""Prints features with the highest coefficient values, per class"""
feature_names = vectorizer.get_feature_names()
for i, class_label in enumerate(class_labels):
top10 = np.argsort(clf.coef_[i])[-10:]
print("%s: %s" % (class_label,
" ".join(feature_names[j] for j in top10)))
print_top10(tfidf_vect,clf,y)
But I get this traceback:
Traceback (most recent call last):
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
print_top10(tfidf_vect,clf,5)
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable
Any idea of how to solve this, in order to get the features with the highest coefficient values?.