Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Machine Learning by (19k points)
Say I have this text = I love apples, kiwis, oranges and bananas and the searchString = kiwis and bananas and a similarity algorithm say Jaccard index. How can I efficiently find the substring in text which has the highest similarity to searchString.

Basically I am trying to find portions of text (text has high errors, misspellings, extra symbols and spaces) which match a list of keywords I have.

1 Answer

0 votes
by (33.1k points)
Jaccard index is a "lucky" similarity algorithm because you can update its value for a new symbol without recalculating all previous stuff. So, you can view the text as a sequence of diffs for the resulting index value. After that, the problem can be reduced to https://en.wikipedia.org/wiki/Maximum_subarray_problem.

In your second paragraph, if you are doing some NLP-like research, I'd suggest cleaning your data (remove those extra symbols and spaces, whenever that's possible) before further processing. That's known as "spelling correction", and there are tons of different algorithms and libraries. To choose the appropriate one, extra information about your domain is needed.

Browse Categories

...