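The first approach is to extract tags from the words that occur most frequently in a document. For larger collections of documents, the TF-IDF method works better: it favors words that are frequent in one document but rare across the others, so common words like "the" do not dominate.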
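As a rough illustration of the TF-IDF idea, here is a minimal sketch using only the standard library. The function name tfidf_keywords, the whitespace tokenization, and the sample sentences are assumptions made for this example, not part of any library; a real pipeline would likely use a proper tokenizer and stop-word handling.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, k=5):
    """Return the k terms of docs[doc_index] with the highest TF-IDF score.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # TF-IDF = (relative term frequency) * log(N / document frequency)
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Made-up sample data
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_keywords(docs, 0, k=3))
```

In this toy data, "the" appears in every document, so its score is zero even though it is the most frequent word in the first sentence.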
You can also use the pointwise mutual information (PMI) between a term and the document to identify keywords. This is given by
PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]
To extract the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
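The ranking described above could be sketched as follows. This is only one way to estimate the probabilities, assuming they come from raw token counts over a small collection: P(term, doc) is the count of the term in the document divided by all tokens, P(term) is the term's count across the collection divided by all tokens, and P(doc) is the document's share of all tokens. The name pmi_keywords and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def pmi_keywords(docs, doc_index, k=5):
    """Return the k terms of docs[doc_index] with the highest PMI(term, doc).

    Hypothetical helper for illustration; probabilities are estimated
    from raw token counts with a lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    corpus_counts = Counter(tok for tokens in tokenized for tok in tokens)
    doc_counts = Counter(tokenized[doc_index])

    # P(doc): share of all tokens that belong to this document
    p_doc = len(tokenized[doc_index]) / total_tokens

    scores = {}
    for term, count in doc_counts.items():
        p_joint = count / total_tokens               # P(term, doc)
        p_term = corpus_counts[term] / total_tokens  # P(term)
        scores[term] = math.log(p_joint / (p_term * p_doc))

    # Highest PMI first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Made-up sample data
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(pmi_keywords(docs, 0, k=3))
```

PMI tends to reward rare terms, so on real data it usually helps to ignore terms that occur only once or twice before ranking.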
Code for multi-word tag extraction (this ranks bigrams by the PMI between their two words, using NLTK):
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# fetch the sample corpus on first run
nltk.download('genesis')
bigram_measures = BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# print the 5 bigrams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 5))
Hope this answer helps.