
The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None          ok : spam = 4.5 : 1.0
hello = True           ok : spam = 4.5 : 1.0
hello = None           spam : ok = 3.3 : 1.0
viagra = True          spam : ok = 3.3 : 1.0
casino = True          spam : ok = 2.0 : 1.0
casino = None          ok : spam = 1.5 : 1.0

My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does anybody know a workaround to get those values?

Thanks a lot!

1 Answer


The classifiers themselves don't record feature names, but you can recover them if you extract your features with a CountVectorizer, TfidfVectorizer, or DictVectorizer and classify with a linear model. For binary classification, you can use this code:

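def show_most_informative_features(vectorizer, clf, n=20):
    # Feature names in column order. get_feature_names() was removed in
    # scikit-learn 1.2; on versions before 1.0 call get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    # Pair each coefficient with its feature name, sorted ascending by coefficient.
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Left columns: n most negative coefficients; right columns: n most positive.
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))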
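
If you would rather get the ranked features back as data than have them printed, the same sorting idea can be sketched without any scikit-learn dependency. This is only an illustration: the helper name most_informative_features and the coefficient values below are made up, and the comments about classes_ assume a binary scikit-learn linear classifier such as LogisticRegression, where positive coefficients favour clf.classes_[1].

```python
def most_informative_features(coefs, feature_names, n=20):
    """Return (most_negative, most_positive) lists of (coef, name) pairs.

    coefs and feature_names are parallel sequences; for a fitted binary
    linear model they would be clf.coef_[0] and the vectorizer's feature names.
    """
    # Sort by coefficient only, so ties never fall back to comparing names.
    ranked = sorted(zip(coefs, feature_names), key=lambda pair: pair[0])
    most_negative = ranked[:n]        # strongest evidence for clf.classes_[0]
    most_positive = ranked[::-1][:n]  # strongest evidence for clf.classes_[1]
    return most_negative, most_positive


# Hypothetical coefficients, invented for illustration (not from a real model).
names = ["casino", "hello", "meeting", "viagra"]
coefs = [1.2, -0.8, -1.5, 2.1]

negative, positive = most_informative_features(coefs, names, n=2)
for coef, name in negative:
    print(f"{coef:+.4f}  {name}")
for coef, name in positive:
    print(f"{coef:+.4f}  {name}")
```

With a real model you would pass clf.coef_[0] and vectorizer.get_feature_names_out() in place of the toy lists. Check clf.classes_ to see which label the positive side corresponds to.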

Hope this answer helps.
