2 views

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a specified radius). That means that there probably has to be many clusters of size 1.

But! I need the running time to be well below n^2. n log n or so should be fine. The reason I'm doing this clustering is to avoid computing a distance matrix of all n points (which takes n^2 time or many hours), instead, I want to just compute distances between clusters.

I tried the pycluster k-means algorithm but quickly realized it's way too slow. I've also tried the following greedy approach:

Slice space into 20 pieces in each dimension. (so there are 20^5 total pieces). I will store clusters in these grid boxes, according to their centroids.

For each point, retrieve the grid boxes that are within r (maximum bounding sphere radius). If there is a near enough cluster, add it to that cluster, otherwise make a new cluster.

However, this seems to give me more clusters than I want. I also implemented approaches similar to this twice, and they give very different answers.

Are there any standard approaches to clustering in faster than n^2 time? Probabilistic algorithms are ok.

by (33.1k points)

There are some methods like approximate nearest neighbor (ANN) algorithm or locality sensitive hashing (LSH). These algorithms are not clustering problem, but they can be used to get similarity between points. You can also define the similarity difference between two points.

LSH provides a hash function, h, such that, for two points x and y, and distance metric d,

d(x,y) <= R1  => P(h(x) = h(y)) >= P1

d(x,y) >= R2  => P(h(x) = h(y)) <= P2

where R1 < R2 and P1 > P2. because it is probabilistic. You can post-process the retrieved data to arrive at true clusters.

Hope this answer helps.