in Data Science by (17.6k points)

I have a list of text files in a directory.

I'd like to create a matrix with the frequency of each word in the entire corpus in every file. (The corpus is every unique word in every file in the directory.)

Example:

File 1 - "aaa", "xyz", "cccc", "dddd", "aaa"  

File 2 - "abc", "aaa"

Corpus - "aaa", "abc", "cccc", "dddd", "xyz"  

Output matrix:

[[2, 0, 1, 1, 1],

 [1, 1, 0, 0, 0]]

My solution is to run collections.Counter over every file to get a dictionary with the count of every word, and to initialize a list of lists of size n × m (n = number of files, m = number of unique words in the corpus). Then I iterate over every file again to look up the frequency of every word, and fill each list with it.

Is there a better way to solve this problem? Maybe in a single pass using collections.Counter?
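For reference, the Counter-based approach described above might look roughly like the sketch below. It is only an illustration: the function name term_frequency_matrix is hypothetical, and it assumes each file has already been read and split into a sequence of words. Keeping the Counter objects around means each file is read only once; the second step walks the stored counts, not the files.

```python
from collections import Counter


def term_frequency_matrix(documents):
    """Return (vocabulary, matrix) for an iterable of word sequences.

    Hypothetical helper for illustration; not part of any library.
    """
    # One Counter per document, so every document is scanned exactly once.
    counts = [Counter(words) for words in documents]
    # The sorted union of all words gives a stable column order.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent words need no special case.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


file_1 = ["aaa", "xyz", "cccc", "dddd", "aaa"]
file_2 = ["abc", "aaa"]
vocabulary, matrix = term_frequency_matrix([file_1, file_2])
print(vocabulary)  # ['aaa', 'abc', 'cccc', 'dddd', 'xyz']
print(matrix)      # [[2, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
```

This produces a dense list of lists, which is fine for small corpora but grows as n × m regardless of how many entries are zero.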

1 Answer

0 votes
by (41.4k points)

There’s a better way to solve this problem using sklearn.feature_extraction.DictVectorizer.

from collections import Counter, OrderedDict

from sklearn.feature_extraction import DictVectorizer

File_1 = ('aaa', 'xyz', 'cccc', 'dddd', 'aaa')
File_2 = ('abc', 'aaa')

v = DictVectorizer()

# discover the corpus and vectorize the per-file word frequencies in a single pass
X = v.fit_transform(Counter(f) for f in (File_1, File_2))

# or, if you have a pre-defined corpus and/or would like to restrict the words
# you consider in your matrix, you can do
# Corpus = ('aaa', 'abc', 'cccc', 'dddd', 'xyz')
# v.fit([OrderedDict.fromkeys(Corpus, 1)])
# X = v.transform(Counter(f) for f in (File_1, File_2))

# X is a SciPy sparse matrix; call toarray() to get a dense numpy.ndarray
# (the older X.A shorthand has been removed in recent SciPy releases)
print(repr(X))
print(X.toarray())

Output (the exact wording of the sparse-matrix repr depends on the SciPy version):

<2x5 sparse matrix of type '<class 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>
[[2. 0. 1. 1. 1.]
 [1. 1. 0. 0. 0.]]

Use v.vocabulary_ to access the mapping from words to column indices:

{'aaa': 0, 'abc': 1, 'cccc': 2, 'dddd': 3, 'xyz': 4}
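The example above uses in-memory tuples, while the question is about text files in a directory. One possible way to bridge the two is sketched below. The helper name count_words_per_file, the "*.txt" pattern, UTF-8 encoding, and plain whitespace tokenization are all assumptions made for illustration; adjust them to your data.

```python
from collections import Counter
from pathlib import Path


def count_words_per_file(directory, pattern="*.txt"):
    """Return (filenames, counters) with one Counter per matching file.

    Hypothetical helper: assumes UTF-8 text and whitespace-separated words.
    """
    # Sort the paths so that the row order of the matrix is reproducible.
    paths = sorted(Path(directory).glob(pattern))
    # Read and split each file once, counting its words.
    counters = [Counter(p.read_text(encoding="utf-8").split()) for p in paths]
    return [p.name for p in paths], counters
```

The returned list of counters can then be passed to v.fit_transform(counters) in place of the generator over File_1 and File_2, and the list of filenames tells you which row of the matrix belongs to which file.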
