+2 votes
1 view
in Machine Learning by (4.2k points)

I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier uses only whether a word occurs in a document as a feature. It doesn't use the frequency of each word as the feature ("bag-of-words").

One of the answers seems to suggest this can't be done with the built-in NLTK classifiers. Is that the case? How can I do frequency/bag-of-words NB classification with NLTK?

1 Answer

+2 votes
by (6.8k points)

scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes for this situation. A support vector machine (SVM) would probably work better, though.
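To make the difference between the two feature styles concrete, here is a minimal sketch in plain Python. The helper names presence_features and count_features are made up for illustration and are not part of NLTK or scikit-learn; the sample sentence is invented too.

```python
from collections import Counter


def presence_features(tokens):
    """Occurrence-only features: each word maps to True if it appears at all."""
    return {word: True for word in set(tokens)}


def count_features(tokens):
    """Bag-of-words features: each word maps to how many times it appears."""
    return dict(Counter(tokens))


# Invented sample text, for illustration only.
tokens = "good good good plot but bad ending".split()

print(presence_features(tokens))  # every word looks the same: True
print(count_features(tokens))     # "good" now carries its frequency
```

A multinomial model can make use of the counts in the second dictionary, while the first one throws that information away.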

As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat complicated one that applies TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes the result into a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)

import numpy as np
from nltk.classify import SklearnClassifier
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF weighting, then the 1000 best features by chi2, then multinomial NB
pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# One word-frequency distribution per review
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]

# Train on the first 100 reviews of each class
add_label = lambda lst, lab: [(x, lab) for x in lst]
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

# Classify the remaining reviews and count the hits and misses per class
l_pos = np.array(classif.classify_many(pos[100:]))
l_neg = np.array(classif.classify_many(neg[100:]))
print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
    (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
    (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
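As for the SVM mentioned at the top, a minimal sketch of swapping it into the same kind of wrapper might look like the following. It assumes scikit-learn's LinearSVC as the SVM and uses a tiny invented toy dataset in place of the movie reviews corpus, so the documents, labels, and variable names here are illustrative only. The chi2 step is left out because the toy vocabulary is far smaller than 1000 words.

```python
from nltk.classify import SklearnClassifier
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same wrapper idea, with a linear SVM as the final step instead of naive Bayes.
svm_pipeline = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('svm', LinearSVC(random_state=0)),
])
svm_classif = SklearnClassifier(svm_pipeline)

# Invented toy documents, for illustration only.
train_docs = [
    ("great great fun movie", 'pos'),
    ("fun great acting", 'pos'),
    ("boring boring plot", 'neg'),
    ("dull boring movie", 'neg'),
]

# Each document becomes a word-frequency distribution, as in the code above.
svm_classif.train(
    [(FreqDist(text.split()), label) for text, label in train_docs])

test_docs = ["great fun", "boring dull"]
predictions = svm_classif.classify_many(
    [FreqDist(text.split()) for text in test_docs])

for text, label in zip(test_docs, predictions):
    print(text, '->', label)
```

Whether this actually beats naive Bayes on the real corpus is something to check by running both on the same train/test split.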

