+1 vote
2 views
in Python by (19.9k points)

I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".
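
To make the target output concrete, here is a minimal sketch of the masking step on its own. It only assumes token objects exposing the spaCy attributes ent_type_, ent_iob_, lemma_, is_stop and is_punct; the function name mask_and_lemmatize, the helper fake and the drop_stop parameter are hypothetical names introduced for illustration. Dropping stop words is an assumption read off the example, where "les" disappears. The demo uses stand-in tokens so it runs without a model installed.

```python
from types import SimpleNamespace

PLACEHOLDER = "~PER~"  # arbitrary marker, taken from the example in the question


def mask_and_lemmatize(tokens, label="PER", placeholder=PLACEHOLDER, drop_stop=True):
    """Join lemmas into a string, replacing entities of type `label` by `placeholder`."""
    out = []
    for tok in tokens:
        if tok.ent_type_ == label:
            # Emit one placeholder per entity: skip continuation tokens ("I").
            if tok.ent_iob_ != "I":
                out.append(placeholder)
            continue
        # Assumption: punctuation and stop words are dropped, as in the example.
        if tok.is_punct or (drop_stop and tok.is_stop):
            continue
        out.append(tok.lemma_.lower())
    return " ".join(out)


def fake(text, lemma, ent="", iob="O", stop=False, punct=False):
    """Hypothetical stand-in for a spaCy token, for demonstration only."""
    return SimpleNamespace(text=text, lemma_=lemma, ent_type_=ent,
                           ent_iob_=iob, is_stop=stop, is_punct=punct)


demo = [
    fake("Pierre", "Pierre", ent="PER", iob="B"),
    fake("aime", "aimer"),
    fake("les", "le", stop=True),
    fake("chiens", "chien"),
]
print(mask_and_lemmatize(demo))  # ~PER~ aimer chien
```

With a real pipeline, the same function should accept a Doc directly, e.g. mask_and_lemmatize(nlp(text)), since iterating over a Doc yields tokens with these attributes. The lemmas it returns are only as good as the ones the model assigned.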

The problem is that I can't find a way to do both at once. I only have these two partial options:

I can feed the pipeline with the original text: doc = nlp(text). Then the NER recognizes most people's names, but the lemmas of words starting with a capital letter are not correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.

I can feed the pipeline with the lowercased text: doc = nlp(text.lower()). Then my previous example correctly gives ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people's names are no longer recognized as entities by the NER, presumably because an initial capital is a useful cue for finding entities.

My idea would be to perform the standard pipeline (tagger, parser, NER), then lowercase, and then lemmatize only at the end.
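
One way to approximate this idea without touching the pipeline internals is to run both passes and merge them: take entities from the cased pass and lemmas from the lowercased pass, matching tokens by character offset (token.idx). This is only a sketch of a workaround, not how spaCy itself orders its steps. The function name combine_passes and the helper tok are hypothetical, and the approach assumes text.lower() leaves character offsets unchanged, which should hold for ordinary French text. The demo uses stand-in tokens carrying the lemmas quoted in the two options above.

```python
from types import SimpleNamespace


def combine_passes(cased_tokens, lowered_tokens, label="PER", placeholder="~PER~"):
    """Entities from the cased pass, lemmas from the lowercased pass, joined on offset."""
    # Map each character offset to the lemma found in the lowercased pass.
    lemma_at = {t.idx: t.lemma_ for t in lowered_tokens}
    out = []
    for t in cased_tokens:
        if t.ent_type_ == label:
            out.append(placeholder)
        else:
            # Fall back to the cased lemma if the two tokenizations differ here.
            out.append(lemma_at.get(t.idx, t.lemma_))
    return out


def tok(idx, lemma, ent=""):
    """Hypothetical stand-in for a spaCy token, for demonstration only."""
    return SimpleNamespace(idx=idx, lemma_=lemma, ent_type_=ent)


# "Pouvons-nous faire ça?" with the lemmas reported in the question.
offsets = [0, 7, 8, 13, 19, 21]
cased = [tok(i, l) for i, l in zip(offsets, ["Pouvons", "-", "se", "faire", "ça", "?"])]
lowered = [tok(i, l) for i, l in zip(offsets, ["pouvoir", "-", "se", "faire", "ça", "?"])]
print(combine_passes(cased, lowered))
```

With a loaded model this would be called as combine_passes(nlp(text), nlp(text.lower())). It costs two pipeline runs per text, so it is best treated as a stopgap.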

However, lemmatization doesn't seem to have its own pipeline component and the documentation doesn't explain how and where it is performed.

So my question is: how can I choose when lemmatization is performed, and which input it is given?

1 Answer

0 votes
by (25.1k points)
edited by

If you can, upgrade to a more recent version of spaCy. The French lemmatizer was improved considerably in 2.1, so upgrading the spacy package from 2.0.12 should resolve most of the incorrect lemmas you describe.

