0 votes
in Machine Learning by (33.1k points)

Having this:

from nltk import pos_tag, word_tokenize
text = word_tokenize("The quick brown fox jumps over the lazy dog")

And running:

pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.

1 Answer

0 votes
by (33.1k points)

NLTK can handle this, but keep in mind that no statistical tagger is perfect. You may need to try a few different techniques to get the best accuracy on unseen data.

NLTK has a class called PerceptronTagger, which tends to return more accurate parts of speech. In the NLTK version shown here, pos_tag builds one by default, as its source shows:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

The result is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
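
To put a number on how much the tagging improved, the two outputs can be scored against a reference tagging. This is only a minimal sketch in plain Python: the JJ tags in the reference come from the question, VBZ for 'jumps' is my assumption rather than something taken from an annotated corpus, and score is a hypothetical helper, not part of NLTK.

```python
# Tagger outputs copied from the question (old) and from this answer (new).
old_output = [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'),
              ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'),
              ('dog', 'NN')]
new_output = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
              ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
              ('dog', 'NN')]

# Assumed reference tags (hand-written for this sentence, not from a corpus).
expected = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
            ('dog', 'NN')]


def score(tagged, reference):
    """Return (number of correct tags, list of (word, got, wanted) mismatches)."""
    if len(tagged) != len(reference):
        raise ValueError("tagged and reference must have the same length")
    mismatches = [(word, got, wanted)
                  for (word, got), (_, wanted) in zip(tagged, reference)
                  if got != wanted]
    return len(reference) - len(mismatches), mismatches


for name, output in (("old", old_output), ("new", new_output)):
    correct, mismatches = score(output, expected)
    print(name, correct, "of", len(expected), "correct; wrong:", mismatches)
```

Under these assumed reference tags, the old output gets 5 of 9 tokens right and the new one 8 of 9, with 'brown' as the only remaining mismatch.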

With the perceptron tagger, 'quick' and 'lazy' are now tagged as adjectives (JJ) and 'jumps' as a verb (VBZ); only 'brown' is still tagged as a noun.
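
If a few specific words keep getting the wrong tag, one of the "different techniques" you could try is a small post-correction step on top of the tagger output. The sketch below is an assumption on my part, not an NLTK feature: OVERRIDES and apply_overrides are hypothetical names, and the input is the tagger output shown above, so it runs without NLTK installed.

```python
# Hypothetical override lexicon: lowercase word -> tag to force.
OVERRIDES = {"brown": "JJ"}


def apply_overrides(tagged, overrides=OVERRIDES):
    """Replace the tag of any word found in the override lexicon."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]


# Output of pos_tag as shown in this answer.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]

corrected = apply_overrides(tagged)
print(corrected)
```

Keep in mind that a blanket override ignores context: it would also force JJ on "Brown" used as a surname, so it only makes sense for words that are unambiguous in your data.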

I hope this answer helps.
