I am curious if there is an algorithm/method that exists to generate keywords/tags from a given text, by using some weight calculations, occurrence ratio or other tools.

Additionally, I would be grateful if you could point me to any Python-based solution or library for this.


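The first approach is to extract tags from the words that occur most frequently in a document. When you have a larger collection of documents, the TF-IDF method can be used instead: it favors words that are frequent in one document but rare in the rest of the collection, so common words like "the" drop out.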
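As a rough illustration of the TF-IDF idea, here is a minimal sketch that uses only the standard library. The function names `tokenize` and `tfidf_keywords`, the toy `docs` list, and the plain `tf * log(N / df)` weighting are my own assumptions for the example; libraries typically use smoothed variants of this formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_n=5):
    """Rank the terms of docs[doc_index] by TF-IDF against the whole list."""
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    n_docs = len(docs)

    # Unsmoothed weighting (an assumption): relative frequency * log(N / df).
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat with another cat",
    "the dog chased the ball in the park",
    "the bird sang in the tree",
]
print(tfidf_keywords(docs, 0, top_n=3))
```

Here "the" appears in every document, so its weight is zero, while "cat" is frequent in the first document only and comes out on top.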

You can also use the pointwise mutual information (PMI) between each term and the document to identify keywords. This is given by

PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]

To extract the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
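To make the PMI ranking concrete, here is a small sketch under some assumptions of mine: the probabilities are estimated from raw token counts over a small collection of documents, and the names `pmi_keywords` and `docs` are hypothetical. With N the total number of tokens, P(term, doc) is the count of the term in the document divided by N, P(term) is its count in the whole collection divided by N, and P(doc) is the document length divided by N.

```python
import math
from collections import Counter


def pmi_keywords(docs, doc_index, top_n=5):
    """Return the top_n (term, PMI) pairs for docs[doc_index]."""
    tokenized = [d.lower().split() for d in docs]
    total = sum(len(tokens) for tokens in tokenized)  # N: tokens in the collection
    corpus_counts = Counter(tok for tokens in tokenized for tok in tokens)

    doc_tokens = tokenized[doc_index]
    doc_counts = Counter(doc_tokens)
    p_doc = len(doc_tokens) / total

    scores = {}
    for term, count_in_doc in doc_counts.items():
        p_joint = count_in_doc / total
        p_term = corpus_counts[term] / total
        scores[term] = math.log(p_joint / (p_term * p_doc))

    # Highest PMI first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat with another cat",
    "the dog chased the ball in the park",
    "the bird sang in the tree",
]
for term, score in pmi_keywords(docs, 0):
    print(f"{term}\t{score:.3f}")
```

Note that with this estimate every term that occurs only in one document gets the same maximal score, however rare it is, so in practice you would probably combine PMI with a minimum frequency threshold.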

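Code for multi-word tag extraction with NLTK:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# change this to read in your data ("document.txt" is a placeholder file name)
with open("document.txt") as f:
    words = f.read().lower().split()
finder = BigramCollocationFinder.from_words(words)

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)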

