I have been looking through the NLP tag on SO for the past couple of hours and am fairly confident I did not miss anything, but if I did, please point me to the relevant question.
In the meantime, I will describe what I am trying to do. A common theme I have observed across many posts is that semantic similarity is difficult. For instance, from this post, the accepted answer suggests the following:
First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. ....
Consider these examples:
1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-5 are similar to 1? 2 is the exact opposite of 1, yet it is still about Pete and Rob (not) finding a dog.
My high-level requirement is to use k-means clustering to categorize text based on semantic similarity, so all I need to know is whether two sentences are an approximate match. For instance, in the above example, I am OK with classifying 1, 2, 4, and 5 into one category and 3 into another (of course, 3 would be backed up by more sentences similar to it). Think of it as "find related articles", except that they do not have to be 100% related.
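To make that concrete, here is a minimal sketch of the pipeline I have in mind, assuming Python with scikit-learn; the TfidfVectorizer is only a stand-in for whatever sentence representation turns out to be appropriate, and k=2 simply reflects the two categories in the toy example:

```python
# Minimal sketch: vectorize the sentences, then cluster with k-means.
# TfidfVectorizer is a placeholder for the "fingerprint" I am asking about.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# Each row of X is one sentence's vector representation.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# k=2 because the toy example has two intended categories.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # cluster ids are arbitrary; the hope is sentence 3 lands alone
```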
I am thinking I ultimately need to construct a vector representation of each sentence, sort of like its fingerprint, but exactly what this vector should contain is still an open question for me. Should it be built from n-grams, something from WordNet, just the individual stemmed words, or something else altogether?
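To illustrate what I mean by those options, here is a rough sketch of how each candidate ingredient could be extracted from a single sentence, assuming NLTK with the punkt and wordnet data downloaded; this is not a claim about which representation is the right one:

```python
# Sketch of three candidate ingredients for a sentence "fingerprint".
from nltk import word_tokenize, ngrams
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet

sentence = "Pete and Rob have found a dog near the station."
tokens = [t.lower() for t in word_tokenize(sentence) if t.isalpha()]

# Option 1: individual stemmed words.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Option 2: word n-grams (bigrams here).
bigrams = list(ngrams(tokens, 2))

# Option 3: WordNet synsets, to generalize e.g. "dog" to its concept.
synsets = {t: wordnet.synsets(t)[:1] for t in tokens}

print(stems)
print(bigrams)
print(synsets["dog"])  # e.g. [Synset('dog.n.01')]
```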
This thread did a fantastic job of enumerating all the related techniques, but unfortunately stopped just when it got to what I wanted. Any suggestions on the current state of the art in this area?