0 votes
in Machine Learning by (19k points)

I am working on a prediction problem using a large textual dataset. I am implementing the Bag of Words Model.

What is the best way to build the bag of words? Right now I have tf-idf scores for the individual words, but the number of words is too large to use in further processing. If I filter by tf-idf, what threshold should I use to select the words? Or should I use a different algorithm? I am using Python.

1 Answer

0 votes
by (33.1k points)

You can simply use the collections.Counter class from Python's standard library.

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
...          'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
...                for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
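Since the question mentions that the vocabulary is too large, one possible way to shrink it is to keep only the most frequent words from the summed counts and turn each text into a fixed-length count vector. The following is a minimal sketch, not a definitive recipe: the lowercasing step, the variable names, and the cutoff top_n = 4 are assumptions chosen for illustration and are not part of the session above.

```python
import collections
import re

texts = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]

# Assumption: lowercase before tokenizing so that case variants of a word
# are counted together.
bags = [collections.Counter(re.findall(r"\w+", txt.lower())) for txt in texts]

# Corpus-wide counts; Counter.update adds the counts in place.
total = collections.Counter()
for bag in bags:
    total.update(bag)

# Keep only the top_n most frequent words (top_n is an arbitrary,
# illustrative cutoff; tune it for a real dataset).
top_n = 4
vocabulary = [word for word, _ in total.most_common(top_n)]

# Represent each text as a count vector over the reduced vocabulary.
# A Counter returns 0 for words it has not seen.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
print(vectors)
```

A frequency cutoff like this is only one option. Very common words such as "to" often carry little information, so in practice a stop-word list or a document-frequency filter may be a better fit for a prediction task.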

Hope this answer helps.

