Create sparse word matrix in Python (bag-of-words)

Question

asked Jul 20, 2019 in Data Science by sourav (17.6k points)

I have a list of text files in a directory.

I'd like to create a matrix with the frequency of each word in the entire corpus in every file. (The corpus is every unique word in every file in the directory.)

Example:

File 1 - "aaa", "xyz", "cccc", "dddd", "aaa"
File 2 - "abc", "aaa"
Corpus - "aaa", "abc", "cccc", "dddd", "xyz"

Output matrix:

[[2, 0, 1, 1, 1],
[1, 1, 0, 0, 0]]

My solution is to run collections.Counter over every file to get a dictionary with the count of every word, then initialize a list of lists of size n × m (n = number of files, m = number of unique words in the corpus). Then I iterate over every file again, look up the frequency of every corpus word in that file's Counter, and fill each list with it.
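
For reference, the two-pass approach described above could be sketched roughly as follows. The function name count_matrix, the alphabetical ordering of the corpus, and the use of in-memory word lists instead of real files are assumptions made for illustration.

```python
from collections import Counter


def count_matrix(files):
    """Build a dense word-count matrix from an iterable of word sequences."""
    # First pass: one Counter per file.
    counts = [Counter(words) for words in files]
    # The corpus is the sorted union of every word seen in any file.
    corpus = sorted(set().union(*counts))
    # Second pass: one row per file, one column per corpus word.
    # A Counter returns 0 for missing keys, so absent words need no special case.
    matrix = [[c[word] for word in corpus] for c in counts]
    return corpus, matrix


file_1 = ["aaa", "xyz", "cccc", "dddd", "aaa"]
file_2 = ["abc", "aaa"]
corpus, matrix = count_matrix([file_1, file_2])
print(corpus)  # ['aaa', 'abc', 'cccc', 'dddd', 'xyz']
print(matrix)  # [[2, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
```

This produces a dense list of lists, so memory grows with n × m even when most entries are zero.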

Is there a better way to solve this problem? Maybe in a single pass using collections.Counter?

1 Answer

Shlok Pandey · Answer 1 · 2019-07-27T14:01:58+0000

There’s a better way to solve this problem using sklearn.feature_extraction.DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
from collections import Counter, OrderedDict
File_1 = ('aaa', 'xyz', 'cccc', 'dddd', 'aaa')
File_2 = ('abc', 'aaa')
v = DictVectorizer()
# discover corpus and vectorize file word frequencies in a single pass
X = v.fit_transform(Counter(f) for f in (File_1, File_2))
# or, if you have a pre-defined corpus and/or would like to restrict the words you consider
# in your matrix, you can do
# Corpus = ('aaa', 'abc', 'cccc', 'dddd', 'xyz')
# v.fit([OrderedDict.fromkeys(Corpus, 1)])
# X = v.transform(Counter(f) for f in (File_1, File_2))
# X is a SciPy sparse matrix; call toarray() to get a dense numpy.ndarray
# (the older X.A shortcut has been removed from recent SciPy releases)
print(repr(X))
print(X.toarray())

Output (the exact wording of the first line depends on the SciPy version):

<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
[[2. 0. 1. 1. 1.]
 [1. 1. 0. 0. 0.]]
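
The example uses in-memory tuples, while the question starts from text files in a directory. A minimal sketch of a generator that bridges the two is shown here. The function name iter_file_counts, the "*.txt" pattern, UTF-8 encoding, and whitespace tokenization are all assumptions, not part of the original answer.

```python
from collections import Counter
from pathlib import Path


def iter_file_counts(directory, pattern="*.txt"):
    """Yield one Counter of word frequencies per matching file, in sorted path order."""
    for path in sorted(Path(directory).glob(pattern)):
        # Assumption: words are separated by whitespace; swap in another tokenizer if needed.
        words = path.read_text(encoding="utf-8").split()
        yield Counter(words)
```

Because fit_transform accepts any iterable of mappings, the generator can be passed to it directly in place of the Counter(f) expression, so each file is read only once. Rows follow sorted path order.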

Use v.vocabulary_ to access the mapping from words to column indices.

{'aaa': 0, 'abc': 1, 'cccc': 2, 'dddd': 3, 'xyz': 4}
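
As a small illustration of how that mapping can be used, the sketch below looks up a word's count in a given file and recovers the column labels in matrix order. The values are hard-coded to match File_1 and File_2, and the helper name word_count is hypothetical.

```python
# Hard-coded to match File_1 and File_2; in practice these would come from
# v.vocabulary_ and X.toarray().
vocabulary = {"aaa": 0, "abc": 1, "cccc": 2, "dddd": 3, "xyz": 4}
dense = [[2.0, 0.0, 1.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 0.0, 0.0]]


def word_count(dense, vocabulary, file_index, word):
    """Return the count of `word` in file `file_index`, or 0 if the word is not in the corpus."""
    column = vocabulary.get(word)
    return 0 if column is None else dense[file_index][column]


# Invert the mapping to recover the column labels in matrix order.
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)                                   # ['aaa', 'abc', 'cccc', 'dddd', 'xyz']
print(word_count(dense, vocabulary, 0, "aaa"))   # 2.0
```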
