0 votes
in Machine Learning by (19k points)

I'm developing a program that compares a short text (say, 250 characters) against a collection of similar texts (around 1000-2000 texts).

The purpose is to evaluate whether text A is similar to one or more texts in the collection and, if so, to retrieve the matching texts by ID. Each text will have a unique ID.

There are two ways I'd like the output to look:

Option 1: Text A matched Text B with 90% similarity, Text C with 70% similarity, and so on.

Option 2: Text A matched Text D with highest similarity

I studied some machine learning in school, but I'm not sure which algorithm suits this problem best, or whether I should consider using NLP (I'm not familiar with the subject).

Does anyone have a suggestion for which algorithm to use, or where I can find the necessary literature to solve my problem?

Thanks for any contribution!

1 Answer

0 votes
by (33.1k points)

This does not look like a machine learning problem; you are simply looking for a text similarity measure. Compute the score between text A and every text in the collection, then sort the collection by score.

You can use one of the following metrics:

  • Hamming distance
  • Levenshtein distance and Damerau–Levenshtein distance
  • Needleman–Wunsch distance or Sellers' algorithm
  • Smith-Waterman distance
  • Gotoh distance or Smith-Waterman-Gotoh distance
  • Monge Elkan distance
  • Block distance or L1 distance or City block distance
  • Jaro–Winkler distance
  • Soundex distance metric
  • Simple matching coefficient (SMC)
  • Dice's coefficient
  • Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
  • Tversky index
  • Overlap coefficient
  • Euclidean distance or L2 distance
  • Cosine similarity
  • Variational distance
  • Hellinger distance or Bhattacharyya distance
  • Information radius (Jensen–Shannon divergence)
  • Skew divergence
  • Confusion probability
  • Tau metric, an approximation of the Kullback–Leibler divergence
  • Fellegi and Sunter metric (SFS)
  • Maximal matches
  • Lee distance
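
As a rough illustration of the "score everything, then sort" idea, here is a minimal sketch using Levenshtein distance from the list above. The function names (`levenshtein`, `similarity`, `rank_matches`), the sample texts, and the choice to normalize by the longer string's length are my own assumptions for the example, not something prescribed by the question; any other metric from the list could be dropped in for `similarity`.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_matches(query, collection, threshold=0.0):
    """Return (text_id, score) pairs, most similar first.

    `collection` is assumed to be a dict mapping each unique ID to its text.
    """
    scored = [(text_id, similarity(query, text))
              for text_id, text in collection.items()]
    scored = [pair for pair in scored if pair[1] >= threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, keyed by ID.
collection = {
    "B": "the quick brown fox jumps over the lazy dog",
    "C": "the quick brown cat sleeps under the lazy dog",
    "D": "an entirely different sentence about databases",
}
query = "the quick brown fox jumped over the lazy dog"

ranked = rank_matches(query, collection)  # Option 1: every match with its score
best_id, best_score = ranked[0]           # Option 2: the single best match

for text_id, score in ranked:
    print(f"Text A matched Text {text_id} with {score:.0%} similarity")
```

With 1000-2000 texts of about 250 characters each, a plain loop like this should be fast enough; the `threshold` argument is there in case you only want matches above some score.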

Some of these measures operate on vectors rather than raw strings, so you first have to convert each text into a vector. There are many ways to do this; the simplest are bag-of-words and TF-IDF.
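
To make the vectorization step concrete, here is a small standard-library-only sketch that builds TF-IDF weighted bag-of-words vectors and compares them with cosine similarity. The tokenizer, the smoothed IDF formula, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions chosen for the example; a library such as scikit-learn would normally do this for you, and its exact weighting may differ. For simplicity the query is included when computing document frequencies.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (term -> weight dicts) from token lists."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    # Smoothed IDF, so terms that occur in every document keep a nonzero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(tokens).items()}
            for tokens in documents]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample data, keyed by ID.
texts = {
    "B": "machine learning for text similarity",
    "C": "cooking pasta with tomato sauce",
    "D": "measuring similarity between short texts",
}
query = "text similarity with machine learning"

ids = list(texts)
vectors = tfidf_vectors([tokenize(query)] + [tokenize(texts[i]) for i in ids])
query_vec, doc_vecs = vectors[0], vectors[1:]

scores = sorted(
    ((text_id, cosine(query_vec, vec)) for text_id, vec in zip(ids, doc_vecs)),
    key=lambda pair: pair[1],
    reverse=True,
)
for text_id, score in scores:
    print(f"Text A matched Text {text_id} with {score:.0%} similarity")
```

Unlike edit distance, this compares word content and ignores word order, so it tends to suit texts that share vocabulary but are phrased differently.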

There are also many string kernels suited to measuring text similarity. In particular, a WordNet kernel can measure semantic similarity using WordNet, one of the most complete semantic databases of the English language.

I hope this answer helps you!

