0 votes
in Machine Learning by (4.2k points)
I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried clustering the stream with an online clustering algorithm using tf/idf and cosine similarity, but the results were quite bad.

The main disadvantage of tf/idf is that it clusters documents that share keywords, so it is only good at identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.

Now, using tf/idf, the clustering algorithm will fail miserably because the sentences share only one keyword, even though they both talk about the same topic.

My question: are there better techniques to cluster documents?
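To make the problem concrete, here is a minimal sketch in plain Python that scores the three example sentences with tf/idf and cosine similarity. The helper names (`tokenize`, `tfidf_vectors`, `cosine`) and the smoothed idf formula are assumptions chosen for illustration, not the exact setup used on the Twitter stream. A smoothed idf is used because plain log(N/df) would give a weight of zero to a word that appears in every sentence.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    # Build one sparse {term: weight} vector per text.
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed idf (an assumption for this sketch).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


sentences = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
v1, v2, v3 = tfidf_vectors(sentences)

near_duplicate_sim = cosine(v1, v2)  # sentences sharing many keywords
same_topic_sim = cosine(v1, v3)      # same topic, only one shared keyword

print(f"sentence 1 vs 2: {near_duplicate_sim:.3f}")
print(f"sentence 1 vs 3: {same_topic_sim:.3f}")
```

The first pair scores far higher than the second, so any threshold that merges the near-duplicates will tend to leave the third sentence in a cluster of its own.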

1 Answer

0 votes
by (6.8k points)

TF-IDF is currently one of the best-known search methods. What you need is some preprocessing from Natural Language Processing (NLP). There are many resources that can help you with English (for example, the Python library 'nltk').

You must apply the NLP analysis both to your queries (questions) and to your documents before indexing.

The point is: while TF-IDF (or tf×idf^2, as in Lucene) is good, you should apply it to resources annotated with meta-linguistic information. That can be arduous and requires in-depth knowledge of your core program, of grammar analysis (syntax), and of the domain of the documents.
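As a rough illustration of running the same analysis on both sides, here is a small sketch in plain Python. The stopword list, the suffix rules, and the function names `crude_stem` and `preprocess` are all hypothetical stand-ins; a real pipeline would more likely rely on a proper tokenizer and stemmer or lemmatizer from a library such as nltk.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "are", "i", "of", "to", "and"}
# Naive suffix rules, tried in order; a real stemmer is far more careful.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # One shared pipeline: lowercase, tokenize, drop stopwords, stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


# The same function is applied to the document and to the query,
# so both end up in the same normalized vocabulary before indexing.
doc_terms = preprocess("Visiting the WEBSITES!")
query_terms = preprocess("visit website")
print(doc_terms, query_terms)
```

Because both inputs pass through one function, surface differences such as case, punctuation, and simple inflection no longer prevent a match.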

Cosine similarity on latent semantic analysis (LSA/LSI) vectors works much better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data.

Topic models like LDA might work even better.
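A minimal sketch of the LSA idea, assuming NumPy is available: build a document-term matrix, take a truncated SVD, and compare documents in the reduced space. The toy corpus, the use of raw term counts instead of tf-idf weights, and the choice of k = 2 dimensions are all assumptions made to keep the example small; the two off-topic pet documents are invented purely to give the decomposition a second topic.

```python
import numpy as np

# Toy corpus, already lowercased and stripped of stopwords (an assumption).
docs = [
    "website stackoverflow nice place",
    "stackoverflow website",
    "visit stackoverflow regularly",
    "cat dog",
    "dog cat pet",
]

# Document-term count matrix: one row per document, one column per term.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for row, d in enumerate(docs):
    for w in d.split():
        X[row, index[w]] += 1.0


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Truncated SVD: keep the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * s[:k]  # document coordinates in the latent space

raw_sim = cosine(X[0], X[2])                    # raw counts, one shared word
lsa_sim = cosine(doc_vecs[0], doc_vecs[2])      # same pair in latent space
lsa_cross = cosine(doc_vecs[0], doc_vecs[3])    # unrelated pair in latent space

print(f"raw: {raw_sim:.3f}  lsa: {lsa_sim:.3f}  cross-topic lsa: {lsa_cross:.3f}")
```

In this toy setting the two Stackoverflow sentences that share a single keyword end up nearly parallel in the latent space, while the unrelated pair stays near zero. On real tweets the separation will be much less clean, and k usually needs tuning.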

TF-IDF is a standard topic in machine learning courses, so understanding it will open a lot of doors for a machine learning newcomer.
