
I have a list of text files in a directory.

I'd like to create a matrix with the frequency of each word in the entire corpus in every file. (The corpus is every unique word in every file in the directory.)

Example:

File 1 - "aaa", "xyz", "cccc", "dddd", "aaa"  

File 2 - "abc", "aaa"

Corpus - "aaa", "abc", "cccc", "dddd", "xyz"  

Output matrix:

[[2, 0, 1, 1, 1],

 [1, 1, 0, 0, 0]]

My solution is to use collections.Counter over every file to get a dictionary with the count of every word, and to initialize a list of lists of size n × m (n = number of files, m = number of unique words in the corpus). Then I iterate over every file again, look up the frequency of every corpus word in that file's Counter, and fill each list with it.
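A minimal sketch of that Counter-based approach, assuming each file has already been split into a sequence of words. The function name `count_matrix` is hypothetical, and the second pass here runs over the stored Counters rather than re-reading the files:

```python
from collections import Counter


def count_matrix(docs):
    """Build a word-frequency matrix from an iterable of word sequences."""
    # Pass 1: one Counter per document.
    counters = [Counter(words) for words in docs]
    # The corpus is the sorted union of every word seen in any document.
    corpus = sorted(set().union(*counters))
    # Pass 2: a Counter returns 0 for a missing word, so no special-casing is needed.
    matrix = [[counter[word] for word in corpus] for counter in counters]
    return corpus, matrix


corpus, matrix = count_matrix([
    ("aaa", "xyz", "cccc", "dddd", "aaa"),
    ("abc", "aaa"),
])
print(corpus)
print(matrix)
```

Sorting the corpus fixes the column order, so the same inputs always yield the same matrix layout.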

Is there a better way to solve this problem? Maybe in a single pass using collections.Counter?

1 Answer


Yes. scikit-learn's sklearn.feature_extraction.DictVectorizer can discover the corpus and build the frequency matrix in a single pass over the per-file Counters.

from sklearn.feature_extraction import DictVectorizer

from collections import Counter, OrderedDict

File_1 = ('aaa', 'xyz', 'cccc', 'dddd', 'aaa')

File_2 = ('abc', 'aaa')

v = DictVectorizer()

# discover corpus and vectorize file word frequencies in a single pass

X = v.fit_transform(Counter(f) for f in (File_1, File_2))

# Alternatively, to use a pre-defined corpus or to restrict which words appear

# in the matrix, fit on the corpus first and then transform:

# Corpus = ('aaa', 'abc', 'cccc', 'dddd', 'xyz')

# v.fit([OrderedDict.fromkeys(Corpus, 1)])

# X = v.transform(Counter(f) for f in (File_1, File_2))

# X is a SciPy sparse matrix; call toarray() to get a dense numpy.ndarray

# (toarray() works across SciPy versions, unlike the older X.A shortcut)

print(repr(X))

print(repr(X.toarray()))

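Output (from an older Python 2 session; the wording of the sparse-matrix summary and the array spacing vary by Python, SciPy, and NumPy version):

<2x5 sparse matrix of type '<type 'numpy.float64'>'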

        with 6 stored elements in Compressed Sparse Row format>

array([[ 2.,  0.,  1.,  1.,  1.],

       [ 1.,  1.,  0.,  0.,  0.]])
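The example uses in-memory tuples, while the question starts from a directory of text files. A rough sketch of producing the per-file Counters from disk follows. The helper name `file_counters`, the `*.txt` pattern, UTF-8 encoding, and whitespace tokenization are all assumptions, not part of the original answer:

```python
from collections import Counter
from pathlib import Path


def file_counters(directory, pattern="*.txt"):
    """Return (file names, per-file word Counters) for the matching files."""
    # Sort the paths so the matrix rows come out in a stable order.
    paths = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    counters = []
    for path in paths:
        text = path.read_text(encoding="utf-8")
        # Assumption: words are separated by whitespace.
        counters.append(Counter(text.split()))
    return [p.name for p in paths], counters
```

The returned list of Counters can then be passed to v.fit_transform in place of the generator over File_1 and File_2, and the file names identify which row belongs to which file.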

Use v.vocabulary_ to access the mapping from words to column indices.

{'aaa': 0, 'abc': 1, 'cccc': 2, 'dddd': 3, 'xyz': 4}
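To read a row of the matrix back as words, the word-to-index mapping can be inverted. This is only a sketch: the `vocabulary` dict and `dense` list below are hard-coded stand-ins for v.vocabulary_ and the dense matrix from the single-pass example, and `row_as_counts` is a hypothetical helper:

```python
# Stand-ins (assumptions) for v.vocabulary_ and the dense matrix.
vocabulary = {"aaa": 0, "abc": 1, "cccc": 2, "dddd": 3, "xyz": 4}
dense = [
    [2.0, 0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
]

# Order the words by their column index.
columns = sorted(vocabulary, key=vocabulary.get)


def row_as_counts(row):
    """Map one matrix row back to {word: count}, skipping zero entries."""
    return {word: int(count) for word, count in zip(columns, row) if count}


for row in dense:
    print(row_as_counts(row))
```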
