I am trying to score the similarity between posts from social networks, but I haven't found a good algorithm for it. Any thoughts?

I tried Levenshtein, Jaro-Winkler, and others, but those are better suited to comparing texts without taking sentiment into account. In posts, one text may say "I really love dogs" and another "I really hate dogs"; we need to classify such a pair as totally different.

Thanks

by (108k points)

First of all, there are two types of text similarity:

1. Lexical similarity: how close two sentences or texts are on the surface, i.e. in their characters or words.

2. Semantic similarity: how close two sentences or texts are in meaning.
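
To make the distinction concrete, here is a minimal sketch of a purely lexical measure applied to the two posts from the question. It assumes a naive whitespace tokenizer and word-level Jaccard overlap; neither is mentioned in the question, and the names LexicalOverlap, tokens, and jaccard are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class LexicalOverlap {

    // Lower-case the text and split it on whitespace (a deliberately naive tokenizer).
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // Jaccard similarity: size of the intersection divided by size of the union
    // of the two token sets. 1.0 means identical word sets, 0.0 means no shared word.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // The two posts share 3 of 5 distinct words.
        System.out.println(jaccard("I really love dogs", "I really hate dogs"));
    }
}
```

The two posts score 0.6 here because they share three of their five distinct words, even though their meanings are opposite. A lexical score alone cannot tell "love" from "hate".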

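There are many algorithms, with implementations available in Java, that can measure the similarity between two texts. The ones below all work on the surface form of the text (characters or terms) rather than on its meaning: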

• Levenshtein: Measures the difference between two sequences/strings as the minimum number of single-character changes (insertions, deletions, or substitutions) required to turn one string into the other. It is best suited to short strings; typical uses are spell checkers, optical character recognition, etc.

• Damerau-Levenshtein: Similar to the above algorithm. It compares two strings by counting the insertions, deletions, and substitutions of single characters, plus the transpositions of two adjacent characters, needed to turn one string into the other. It is aimed at spell checkers and is also used for DNA sequences.

• Hamming: Calculates the distance between two strings, on the condition that both strings are of equal length. It counts the positions at which the two strings differ, which is the minimum number of substitutions needed to make them equal. It is used in telecommunication (where it is also known as signal distance) and in systematics as a measure of genetic distance.

• Jaro-Winkler: Designed mainly for record linkage, i.e. for matching short strings. It calculates a normalized similarity score between two strings, based on the number of matching characters and the number of transpositions, with extra weight given to a shared prefix.

• N-Gram: A model that estimates the probability of the next term from the preceding terms (an n-gram model looks at the previous n - 1 terms). It is used in speech recognition (for example over phonemes), language recognition, etc.

• Markov Chain: An order-n Markov chain model calculates the probability of the next letter occurring in a sequence based on the previous n characters. Markov chains are used in a multitude of areas, including data compression, entropy encoding, and pattern recognition.
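
As an illustration of the first and third entries in the list, here is a minimal sketch of the Levenshtein and Hamming distances in plain Java. It is not the code from any particular library; the class name EditDistances and its method names are hypothetical, and the Levenshtein part uses the standard two-row dynamic programming formulation with unit cost for every edit.

```java
final class EditDistances {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hamming distance: number of positions at which two equal-length strings differ.
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs strings of equal length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
        System.out.println(hamming("karolin", "kathrin"));
        System.out.println(levenshtein("I really love dogs", "I really hate dogs"));
    }
}
```

All three calls in main print 3. For the two posts from the question, that is only 3 edits across 18 characters, so these measures rate the posts as nearly identical. This is why edit distances on their own do not separate "love" from "hate".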

For programs implementing all of the above algorithms, refer to the following link:

https://blogs.ucl.ac.uk/chime/2010/06/28/java-example-code-of-common-similarity-algorithms-used-in-data-mining/

You can also refer to this link for text similarities that estimate the degree of similarity between two texts:
