in Machine Learning by (19k points)

I am curious if there is an algorithm/method that exists to generate keywords/tags from a given text, by using some weight calculations, occurrence ratio or other tools.

Additionally, I would be grateful if you could point me to any Python-based solution or library for this.

Thanks

1 Answer

0 votes
by (33.1k points)

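The simplest approach is to use as tags the words that occur most frequently in a document. When you have a larger collection of documents, TF-IDF works better: it weights a term's frequency within a document against the number of documents that contain the term, so words that are frequent in one document but rare elsewhere get the highest scores.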
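As a rough illustration of the TF-IDF idea, here is a minimal sketch in plain Python. The function names `tokenize` and `tfidf_keywords`, the sample `docs`, and the particular weighting (relative term frequency times log(N / document frequency)) are assumptions made for this example; libraries use slightly different variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_n=5):
    """Return the top_n (term, score) pairs for docs[doc_index] by TF-IDF."""
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # Score = relative term frequency * log(N / document frequency).
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy corpus, only for demonstration.
docs = [
    "the cat sat on the mat with another cat",
    "the dog sat on the log",
    "the bird sat on the branch",
]
top = tfidf_keywords(docs, 0, top_n=3)
print(top)
```

Words such as "the" that appear in every document get a score of zero here, which is why TF-IDF tends to surface more distinctive terms than raw frequency does.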

You can also use the pointwise mutual information (PMI) between a term and the document to identify keywords. This is given by

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

To extract the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
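The PMI ranking described above could be sketched as follows in plain Python. The probability estimates are an assumption of this example: P(term, doc) is the term's count in the document divided by the total number of tokens in the corpus, P(term) is its corpus-wide count over the same total, and P(doc) is the document's share of all tokens. The function name `pmi_keywords` and the sample `docs` are hypothetical.

```python
import math
import re
from collections import Counter


def pmi_keywords(docs, doc_index, top_n=5):
    """Rank the terms of docs[doc_index] by PMI(term, doc), highest first."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    total = sum(len(tokens) for tokens in tokenized)

    # Corpus-wide count of every term.
    corpus_counts = Counter(tok for tokens in tokenized for tok in tokens)

    doc_tokens = tokenized[doc_index]
    doc_counts = Counter(doc_tokens)
    p_doc = len(doc_tokens) / total

    scores = {}
    for term, count in doc_counts.items():
        p_joint = count / total               # P(term, doc)
        p_term = corpus_counts[term] / total  # P(term)
        scores[term] = math.log(p_joint / (p_term * p_doc))

    # Highest PMI first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy corpus, only for demonstration.
docs = [
    "the cat sat on the mat with another cat",
    "the dog sat on the log",
    "the bird sat on the branch",
]
ranked = pmi_keywords(docs, 0, top_n=10)
print(ranked)
```

Note that with these estimates every term that occurs only in the chosen document receives the same maximal PMI, however rare it is, so in practice it helps to discard terms below a minimum count before ranking.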

Code for multi-word tag extraction (here PMI is computed between the two words of each bigram):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# the Genesis sample corpus has to be downloaded once
nltk.download('genesis')

bigram_measures = BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# print the 5 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 5))

Hope this answer helps.
