I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names with token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".
The problem is that I can't find a way to do both at once. I only have these two partial options:
I can feed the pipeline with the original text: doc = nlp(text). Then the NER recognizes most people's names, but the lemmas of words starting with a capital letter are not correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.
I can feed the pipeline with the lowercased text: doc = nlp(text.lower()). Then my previous example correctly gives ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people's names are no longer recognized as entities by the NER, presumably because an initial capital letter is a useful cue for finding entities.
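To make the two options concrete, here is a minimal sketch that runs the same pipeline on the cased and on the lowercased text and collects the lemmas and the tokens tagged 'PER' for each. The helper names compare_options and demo are hypothetical, and demo assumes spaCy and the fr_core_news_sm model are installed.

```python
def compare_options(nlp, text):
    """Run the pipeline on the cased and the lowercased text.

    Returns a dict mapping each variant name to a tuple
    (lemmas, person_tokens) so the two options can be compared.
    """
    results = {}
    for label, variant in (("cased", text), ("lowercased", text.lower())):
        doc = nlp(variant)
        lemmas = [token.lemma_ for token in doc]
        persons = [token.text for token in doc if token.ent_type_ == "PER"]
        results[label] = (lemmas, persons)
    return results


def demo():
    # Assumption: spaCy and the French model are installed locally.
    import spacy

    nlp = spacy.load("fr_core_news_sm")
    for label, (lemmas, persons) in compare_options(nlp, "Pouvons-nous faire ça?").items():
        print(label, lemmas, persons)
```

Calling demo() prints one line per variant, which makes the trade-off between the two options visible on any test sentence.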
My idea would be to run the standard pipeline (tagger, parser, NER) on the original text, then lowercase the tokens, and lemmatize only at the end.
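The closest workaround I can sketch is to run the pipeline twice and merge the results: entity labels come from the cased text, lemmas from the lowercased text. This is not lemmatizing once at the end, and it assumes that lowercasing does not change tokenization, so that the two docs align token by token. The function name anonymize_and_lemmatize and the placeholder parameter are hypothetical, and stop word filtering is left out.

```python
def anonymize_and_lemmatize(nlp, text, placeholder="~PER~"):
    """Combine NER on the cased text with lemmas from the lowercased text."""
    doc_cased = nlp(text)          # used only for entity labels
    doc_lower = nlp(text.lower())  # used only for lemmas

    # Assumption: both variants are tokenized identically.
    if len(doc_cased) != len(doc_lower):
        raise ValueError("tokenization differs between cased and lowercased text")

    output = []
    for cased, lowered in zip(doc_cased, doc_lower):
        if cased.ent_type_ == "PER":
            # Person tokens are replaced, not lemmatized.
            output.append(placeholder)
        else:
            output.append(lowered.lemma_)
    return output
```

With a loaded model this would be called as anonymize_and_lemmatize(nlp, "Pierre aime les chiens"). A multi-token name yields one placeholder per token, and the pipeline cost is doubled, which is why I would prefer a way to control the lemmatization step directly.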
However, lemmatization doesn't seem to have its own pipeline component, and the documentation doesn't explain how or where it is performed.
So my question is: how can I choose when lemmatization is performed and which input it receives?