in Machine Learning by (19k points)

I am curious whether there is an algorithm or method for generating keywords/tags from a given text, for example by using weight calculations, occurrence ratios, or other tools.

Additionally, I would be grateful if you could point me to any Python-based solution or library for this.

Thanks

1 Answer

by (33.1k points)

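The simplest approach is to extract tags from the words that occur most frequently in a document. For larger documents, or a collection of documents, the TF-IDF method is usually a better choice: it scores a word highly when it is frequent in the document but rare in the other documents, so very common words such as "the" drop out.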
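As a rough illustration of the TF-IDF idea, here is a minimal sketch using only the standard library. The function names (tokenize, tfidf_keywords), the regex tokenizer, and the three toy documents are assumptions made up for this example, and the weighting shown (relative term frequency times log(N / document frequency)) is just one common variant of TF-IDF.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def tfidf_keywords(docs, doc_index, top_n=5):
    """Return the top_n terms of docs[doc_index] ranked by a simple TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # Relative term frequency times inverse document frequency.
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_keywords(docs, 0, top_n=2))  # ['mat', 'cat']
```

Note that "the" is the most frequent word in the first document but gets a score of zero, because it appears in every document. That is the difference between raw frequency and TF-IDF.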

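You can also use the pointwise mutual information (PMI) between a term and the document to identify keywords. This is given by

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

Here P(term, doc) is the probability of the term occurring in that document, and P(term) and P(doc) are the overall probabilities of the term and of the document in the corpus.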

To extract the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
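The ranking described above could be sketched as follows. This assumes the probabilities are estimated from raw token counts over a small corpus, which is one possible choice rather than the only one. The function name pmi_keywords, the min_count parameter, whitespace tokenization, and the toy documents are all hypothetical and only serve the example.

```python
import math
from collections import Counter


def pmi_keywords(docs, doc_index, top_n=5, min_count=1):
    """Rank the terms of docs[doc_index] by PMI(term, doc), estimated from counts."""
    tokenized = [d.lower().split() for d in docs]

    # Term counts over the whole corpus and over the target document.
    corpus_counts = Counter(tok for tokens in tokenized for tok in tokens)
    n_tokens = sum(corpus_counts.values())
    doc_counts = Counter(tokenized[doc_index])
    doc_len = sum(doc_counts.values())

    scores = {}
    for term, count_in_doc in doc_counts.items():
        # PMI tends to favor very rare terms, so optionally skip them.
        if count_in_doc < min_count:
            continue
        p_joint = count_in_doc / n_tokens        # P(term, doc)
        p_term = corpus_counts[term] / n_tokens  # P(term)
        p_doc = doc_len / n_tokens               # P(doc)
        scores[term] = math.log(p_joint / (p_term * p_doc))

    # Highest PMI first; break ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print([term for term, _ in pmi_keywords(docs, 0, top_n=2)])  # ['mat', 'cat']
```

With real data you would probably raise min_count, since a term that occurs once in the whole corpus always gets the maximum possible PMI with its document.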

Code for multi-word tag extraction:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# the sample corpus must be downloaded once: nltk.download('genesis')
bigram_measures = BigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# print the 5 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 5))

Hope this answer helps.

