Tag generation from text content

Question

1 Answer

Anurag · Answer 1 · 2019-07-05T13:56:10+0000

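The simplest approach is to use the words that occur most frequently in a document as its tags. When you have a larger collection of documents, the TF-IDF method works better: it favors words that are frequent in one document but rare in the others, so very common words do not dominate.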
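To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. The helper names (tokenize, top_tfidf_terms), the simple regex tokenizer, and the unsmoothed log(N / df) weighting are assumptions made for this illustration; libraries usually apply smoothing and stop-word handling on top of this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def top_tfidf_terms(docs, doc_index, k=5):
    """Return the k terms of docs[doc_index] with the highest TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency inside the chosen document.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # Unsmoothed TF-IDF (an assumption of this sketch).
    scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(top_tfidf_terms(docs, 0, k=3))
```

A word such as "the", which appears in every document, gets a score of zero here even though it is the most frequent word, which is the reason to prefer TF-IDF over raw counts.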

You can also use the pointwise mutual information (PMI) between a term and the document to identify keywords. This is given by

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

To extract the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
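The ranking described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: probabilities are estimated from raw token counts, so P(term, doc) = count(term in doc) / N, P(term) = count(term in corpus) / N and P(doc) = len(doc) / N, where N is the total number of tokens. The names tokenize, top_pmi_terms and min_count are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def top_pmi_terms(docs, doc_index, k=5, min_count=1):
    """Return the k terms with the highest PMI(term, doc) for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]

    # Counts over the whole corpus and over the chosen document.
    corpus_counts = Counter(t for tokens in tokenized for t in tokens)
    n_total = sum(corpus_counts.values())
    doc_counts = Counter(tokenized[doc_index])
    doc_len = sum(doc_counts.values())

    scores = {}
    for term, c_td in doc_counts.items():
        if c_td < min_count:
            continue  # optional frequency filter against very rare terms
        # log[ P(term, doc) / (P(term) * P(doc)) ] with count-based estimates
        scores[term] = math.log(c_td * n_total / (corpus_counts[term] * doc_len))

    # Highest PMI first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(top_pmi_terms(docs, 0, k=5))
```

PMI tends to reward terms that are rare in the corpus, so on real data it usually helps to raise min_count so that one-off words and typos do not end up as tags.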

Code for multi-word tag extraction:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# the genesis corpus must be downloaded once before first use
nltk.download('genesis')

bigram_measures = BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# print the 5 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 5))

Hope this answer helps.

