Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Machine Learning by (6.8k points)

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.

As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.

This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!

Problem:

Match words to embedding vectors created by Word2Vec () -- how do I do it?

My approach:

After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0. Thus

for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
    print i
    word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dicationary by its index
    wordvector= newmod.syn0[i] #get the vector with the corresponding index
    print wordvector == newmod[word] #testing: compare result of looking up the word in the model -- this prints True
  • Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?

  • Does this even get me correct results?

*My code to create the word vectors:

model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)

model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings

model.save("newmodel")

I found this question which is similar but has not really been answered.

1 Answer

0 votes
by (6.8k points)

If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.

If you need to recover a word from a vector you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:

import gensim

documents = [['human', 'interface', 'computer'],

 ['survey', 'user', 'computer', 'system', 'response', 'time'],

 ['eps', 'user', 'interface', 'system'],

 ['system', 'human', 'system', 'eps'],

 ['user', 'response', 'time'],

 ['trees'],

 ['graph', 'trees'],

 ['graph', 'minors', 'trees'],

 ['graph', 'minors', 'survey']]

 

model = gensim.models.Word2Vec(documents, min_count=1)

print model.similar_by_vector(model["survey"], topn=1)

image

where the number represents the similarity.

However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.

Machine Learning is one of the topics where this is associated. So, studying a Machine Learning Course will be giving you a broader outreach on this.

...