
I want to split text data into n-grams. Usually, I would do something like:

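from nltk import bigrams

string = "I really like python, it's pretty awesome."

# bigrams() expects a sequence of tokens; a raw string would yield character pairs
string_bigrams = bigrams(string.split())

# the result is a lazy iterator, so materialize it before printing
print(list(string_bigrams))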

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams, or even hundred-grams?

1 Answer


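NLTK has a general ngrams function that accepts any value of n, not just 2 or 3. People use it less often for larger n because a model trained on n-grams where n is greater than 3 runs into heavy data sparsity: the longer the n-gram, the less likely it is to appear more than once in the data.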

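from nltk import ngrams

sentence = 'this is a foo-bar sentence and i want to ngramize it'

n = 6

# ngrams() takes a sequence of tokens and any window size n
sixgrams = ngrams(sentence.split(), n)

# each item is a tuple of n consecutive tokens
for grams in sixgrams:
    print(grams)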
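If you would rather not depend on NLTK, the same sliding window can be sketched in plain Python. This is only an illustration, not how NLTK implements it: word_ngrams and repeated_ngrams are hypothetical helper names, the sample sentence about the cat is made up, and tokens are assumed to be separated by whitespace.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the n-grams of whitespace-separated tokens as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()
    # zip over n shifted copies of the token list; zip stops at the shortest
    # copy, so a text with fewer than n tokens gives an empty list.
    return list(zip(*(tokens[i:] for i in range(n))))


def repeated_ngrams(text, n):
    """Count the distinct n-grams that occur more than once (hypothetical helper)."""
    counts = Counter(word_ngrams(text, n))
    return sum(1 for c in counts.values() if c > 1)


sentence = 'this is a foo-bar sentence and i want to ngramize it'

# Any n works, including values larger than the number of tokens.
for gram in word_ngrams(sentence, 4):
    print(gram)
print(word_ngrams(sentence, 100))

# Made-up sample text to show how repeats thin out as n grows.
sample = 'the cat sat on the mat and the cat sat on the rug'
for n in (1, 2, 6):
    print(n, repeated_ngrams(sample, n))
```

The second loop gives a rough feel for the sparsity point in the answer: as n grows, fewer n-grams show up more than once, so there is less evidence to estimate a model from.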
