
I am using scikit-learn for text classification. I want to calculate the Information Gain of each attribute with respect to the class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy.

In Weka, this can be done with InfoGainAttributeEval, but I haven't found this measure in scikit-learn.

However, it has been suggested that the formula above for Information Gain is the same measure as mutual information. This also matches the definition on Wikipedia.
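To make the claimed equivalence concrete, here is a small sketch that computes both quantities directly from their definitions on a toy attribute/class pair and compares them. The data and the function names (`entropy`, `information_gain`, `mutual_information`) are made up for illustration and are not part of scikit-learn or Weka.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attribute, labels):
    """H(Class) - H(Class | Attribute), in bits."""
    n = len(labels)
    h_cond = 0.0
    for value, count in Counter(attribute).items():
        # Class labels of the documents where the attribute takes this value
        subset = [y for a, y in zip(attribute, labels) if a == value]
        h_cond += (count / n) * entropy(subset)
    return entropy(labels) - h_cond


def mutual_information(attribute, labels):
    """Sum over (a, c) of p(a, c) * log2(p(a, c) / (p(a) * p(c))), in bits."""
    n = len(labels)
    joint = Counter(zip(attribute, labels))
    n_a = Counter(attribute)
    n_c = Counter(labels)
    return sum(
        (n_ac / n) * math.log2((n_ac * n) / (n_a[a] * n_c[c]))
        for (a, c), n_ac in joint.items()
    )


# Toy data (hypothetical): does a term occur in each of 8 documents, and the class
term_present = [1, 1, 1, 0, 0, 0, 1, 0]
doc_class = [1, 1, 1, 0, 0, 0, 0, 1]

ig = information_gain(term_present, doc_class)
mi = mutual_information(term_present, doc_class)
print(ig, mi)  # the two values should agree up to floating-point error
```

Both functions use log base 2 here, so the results are in bits; a library that uses the natural logarithm would report the same quantity scaled by ln 2.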

Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this task?

1 Answer


You can use scikit-learn's mutual_info_classif. Here is an example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

# Restrict the 20 newsgroups corpus to three categories
categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
X, Y = newsgroups_train.data, newsgroups_train.target

# Build a sparse document-term matrix of term counts
cv = CountVectorizer(max_df=0.95, min_df=2, max_features=10000, stop_words='english')
X_vec = cv.fit_transform(X)

# Mutual information of each term with respect to the class.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
res = dict(zip(cv.get_feature_names_out(), mutual_info_classif(X_vec, Y, discrete_features=True)))
print(res)


This outputs a dictionary whose keys are the attributes (the items in the vocabulary) and whose values are their information gain.
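A dictionary of 10000 terms is hard to read as printed, so a small helper to rank it may be useful. The sketch below uses a placeholder `res` with made-up scores rather than real output, and the helper name `top_terms` is hypothetical. It also converts the scores to bits, on the assumption that mutual_info_classif reports them in nats (natural logarithm), as its documentation states.

```python
import math

# Placeholder scores standing in for the `res` dict built from the
# mutual_info_classif output; these values are made up for illustration.
res = {"space": 0.21, "graphics": 0.18, "god": 0.15, "the": 0.001}


def top_terms(scores, k=3):
    """Return the k highest-scoring (term, bits) pairs, converting nats to bits."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(term, nats / math.log(2)) for term, nats in ranked[:k]]


for term, bits in top_terms(res):
    print(f"{term}: {bits:.4f} bits")
```

Dividing by ln 2 only matters if you want to compare the numbers with a tool that reports bits; the ranking itself is unchanged. Note also that with raw counts and discrete_features=True, each distinct count value is treated as its own category. Passing binary=True to CountVectorizer gives presence/absence attributes instead, if that is the notion of "attribute" you want.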

The full result can be downloaded from the given link. On the first run, the output begins with the dataset download messages:

Downloading 20news dataset. This may take a few minutes.

Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
...