
I am new to NLP and am learning about the Natural Language Toolkit (NLTK). I am working on categorizing words into Persons, Places, and Organizations.

So far, I have the script working on a single line of text:

import nltk
from nltk import pos_tag, word_tokenize

ex = 'John'
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

Output:

(S (PERSON John/NNP))
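
For reference, the chunker returns an nltk.Tree, and the labelled entities can be pulled back out of it with a short loop (a minimal sketch based on the call above):

for subtree in ne_tree:
    # Named entities are subtrees labelled PERSON, GPE, ORGANIZATION, etc.;
    # plain (word, tag) tuples have no label() method
    if hasattr(subtree, 'label'):
        entity = ' '.join(token for token, pos in subtree.leaves())
        print(subtree.label(), entity)  # e.g. PERSON John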

The problem is that I am not able to apply this to an entire column of a DataFrame.

My table looks as shown below:

Order   Text
0       John
1       Chicago
2       stuff
3       question

In the above table, Order is just the index of the DataFrame. I am trying to break the text in each row into words, tag each token, and then extract the named entities.
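
For reference, a minimal sketch of this table as a pandas DataFrame (the variable name ex2 and the column name Text match the code below; the data is just the sample from the table):

import pandas as pd

# Reconstruction of the table above; 'Order' is simply the default integer index
ex2 = pd.DataFrame({'Text': ['John', 'Chicago', 'stuff', 'question']})
print(ex2)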

But when I execute the same call on the column, it gives me an error, as shown below:

ne_tree =  nltk.ne_chunk(pos_tag(word_tokenize(ex2)))

print(ne_tree)

ERROR:

TypeError                                 Traceback (most recent call last)
<ipython-input-80-5d4582e937dd> in <module>
----> 1 ne_tree =  nltk.ne_chunk(pos_tag(word_tokenize(ex2)))
      2 print(ne_tree)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
    142     :type preserve_line: bool
    143     """
--> 144     sentences = [text] if preserve_line else sent_tokenize(text, language)
    145     return [
    146         token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
    104     """
    105     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106     return tokenizer.tokenize(text)
    107
    108
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1275         Given a text, returns a list of the sentences in that text.
   1276         """
-> 1277         return list(self.sentences_from_text(text, realign_boundaries))
   1278
   1279     def debug_decisions(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1329         follows the period.
   1330         """
-> 1331         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1332
   1333     def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1329         follows the period.
   1330         """
-> 1331         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1332
   1333     def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1319         if realign_boundaries:
   1320             slices = self._realign_boundaries(text, slices)
-> 1321         for sl in slices:
   1322             yield (sl.start, sl.stop)
   1323
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1360         """
   1361         realign = 0
-> 1362         for sl1, sl2 in _pair_iter(slices):
   1363             sl1 = slice(sl1.start + realign, sl1.stop)
   1364             if not sl2:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    316     it = iter(it)
    317     try:
--> 318         prev = next(it)
    319     except StopIteration:
    320         return
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1333     def _slices_from_text(self, text):
   1334         last_break = 0
-> 1335         for match in self._lang_vars.period_context_re().finditer(text):
   1336             context = match.group() + match.group('after_tok')
   1337             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object

1 Answer

It's very simple: word_tokenize expects a string, not a whole DataFrame, so you need to apply the function to each row, as shown below:

# Tokenize, POS-tag, and NE-chunk each row of the Text column
ex2['results'] = ex2['Text'].apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))
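
Putting it all together, here is a minimal end-to-end sketch (the nltk.download calls and the sample DataFrame are assumptions; adjust them to your environment and data):

import nltk
import pandas as pd
from nltk import pos_tag, word_tokenize

# One-time downloads of the resources the pipeline needs
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sample data matching the table in the question
ex2 = pd.DataFrame({'Text': ['John', 'Chicago', 'stuff', 'question']})

# Apply the full pipeline row by row
ex2['results'] = ex2['Text'].apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))

print(ex2['results'][0])  # e.g. (S (PERSON John/NNP))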

I hope this will help you.
