in Machine Learning by (15.7k points)

I am working on a prediction problem using a large textual dataset. I am implementing the bag-of-words model.

What is the best way to get the bag of words? Right now, I have the tf-idf of the various words, but the number of words is too large to use in later steps. If I use tf-idf as the criterion, what should the tf-idf threshold be for including a word in the bag? Or should I use some other algorithm? I am using Python.

1 Answer

by (33.2k points)

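You can simply use the collections.Counter class from Python's standard library. It counts how often each word occurs in a text, which gives you one bag of words per text.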

>>> import collections, re
>>> texts = ['John likes to watch movies. Mary likes too.',
...          'John also likes to watch football games.']
>>> bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
...                for txt in texts]
>>> bagsofwords[0]
Counter({'likes': 2, 'watch': 1, 'Mary': 1, 'movies': 1, 'John': 1, 'to': 1, 'too': 1})
>>> bagsofwords[1]
Counter({'watch': 1, 'games': 1, 'to': 1, 'likes': 1, 'also': 1, 'John': 1, 'football': 1})
>>> sumbags = sum(bagsofwords, collections.Counter())
>>> sumbags
Counter({'likes': 3, 'watch': 2, 'John': 2, 'to': 2, 'games': 1, 'football': 1, 'Mary': 1, 'movies': 1, 'also': 1, 'too': 1})
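If the vocabulary is too large, one possible way to shrink it is to keep only words that appear in enough texts, and then turn each bag into a fixed-length count vector. The following is only a sketch using the same two example texts. The names MIN_DOCS and MAX_WORDS and their values are hypothetical cut-offs chosen for illustration, not recommended thresholds, and lowercasing the text is an extra assumption that the example above does not make.

```python
import collections
import re

texts = ['John likes to watch movies. Mary likes too.',
         'John also likes to watch football games.']

# Lowercase before tokenizing so 'John' and 'john' count as one word
# (an assumption; the original example keeps the case as-is).
bags = [collections.Counter(re.findall(r'\w+', txt.lower())) for txt in texts]

# Document frequency: the number of texts in which each word appears.
doc_freq = collections.Counter()
for bag in bags:
    doc_freq.update(bag.keys())

# Hypothetical cut-offs; tune them on your own data.
MIN_DOCS = 2      # drop words seen in fewer than this many texts
MAX_WORDS = 1000  # keep at most this many words overall

kept = [word for word, n in doc_freq.most_common(MAX_WORDS) if n >= MIN_DOCS]
vocabulary = sorted(kept)

# One fixed-length count vector per text, in vocabulary order.
# A Counter returns 0 for words it has not seen.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
print(vectors)
```

With only two texts the cut-off keeps just the words shared by both. On a real dataset you would experiment with the two values and check how the downstream predictions change.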

Hope this answer helps.
