
I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. Both give a binary string and an integer for each unique token. For example:

0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2

What do the binary string and the integer mean?

According to the first link, the binary string is known as a bit string; see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/

But how do I tell from the output that "dog", "mouse" and "cat" are in one cluster and that "the" and "chased" are not in that cluster?

1 Answer


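The algorithm gives you a binary tree, and each bit string is the path from the root of that tree to the token's cluster. To get clusters you need to truncate the tree at some level: take the first L characters of each bit string, and tokens that share the same prefix fall into the same cluster. The integer is the number of times the token occurred in the input corpus.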

For example, cutting at the second character leaves "the" alone in cluster 0 (its bit string is only one character long) and gives you two more clusters:

10        chased
11        dog
11        mouse
11        cat

At the third character, cluster 11 splits further and you get:

110        dog
111        mouse
111        cat
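
The prefix truncation described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the rows are the sample from the question pasted into a string (not a real output file), the columns are assumed to be whitespace-separated in the order bit string, token, integer, and the name clusters_at_depth is made up for this example and is not part of either linked tool.

```python
from collections import defaultdict

# Sample rows copied from the question: bit string, token, integer.
# (Assumption: columns are whitespace-separated, in this order.)
PATHS = """\
0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2
"""


def clusters_at_depth(paths_text, depth):
    """Group tokens by the first `depth` characters of their bit string.

    Hypothetical helper for illustration only.
    """
    clusters = defaultdict(list)
    for line in paths_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        bits, token, _count = line.split()
        # A bit string shorter than `depth` is kept whole by the slice.
        clusters[bits[:depth]].append(token)
    return dict(clusters)


for depth in (2, 3):
    print(depth, clusters_at_depth(PATHS, depth))
```

A larger depth gives more, smaller clusters; a smaller depth merges them back together along the tree.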

The cutting strategy is a different subject though.
