I'm developing a program that compares a short text (say, 250 characters) to a collection of similar texts (around 1000-2000 texts).
The purpose is to evaluate whether text A is similar to one or more texts in the collection and, if so, to retrieve the matching texts by ID. Each text will have a unique ID.
There are two ways I'd like the output to look:
Option 1: Text A matched Text B with 90% similarity, Text C with 70% similarity, and so on.
Option 2: Text A matched Text D with the highest similarity.
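To make the two output formats concrete, here is a rough Python sketch of the interface I have in mind. The similarity measure is only a placeholder: it uses `difflib.SequenceMatcher` from the standard library, which is a character-level comparison and not necessarily the right algorithm for this problem. The function names `similarity`, `rank_matches` and `best_match`, the 0.7 threshold and the sample texts are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Placeholder score in [0, 1]; swap in whatever algorithm fits better."""
    # autojunk=False turns off a heuristic that kicks in for sequences of
    # 200+ items, which texts of ~250 characters would otherwise trigger.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def rank_matches(query: str, collection: dict[str, str], threshold: float = 0.7):
    """Option 1: every (id, score) at or above the threshold, best first."""
    scored = [(text_id, similarity(query, text)) for text_id, text in collection.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def best_match(query: str, collection: dict[str, str]):
    """Option 2: the single (id, score) with the highest similarity."""
    return max(
        ((text_id, similarity(query, text)) for text_id, text in collection.items()),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    # Hypothetical sample data, keyed by unique ID.
    collection = {
        "B": "the quick brown fox",
        "C": "completely different words",
        "D": "the quick brown fox jumps",
    }
    query = "the quick brown fox jumps"

    for text_id, score in rank_matches(query, collection):
        print(f"Text A matched Text {text_id} with {score:.0%} similarity")

    top_id, top_score = best_match(query, collection)
    print(f"Text A matched Text {top_id} with highest similarity ({top_score:.0%})")
```

With only 1000-2000 short texts, comparing text A against every entry like this should be feasible, so my question is mainly about what to use in place of `similarity`.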
I studied some machine learning in school, but I'm not sure which algorithm best suits this problem, or whether I should consider NLP (I'm not familiar with the subject).
Does anyone have a suggestion for which algorithm to use, or where I can find the necessary literature to solve my problem?
Thanks for any contribution!