I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be bounding spheres with a specified radius). That means that there probably has to be many clusters of size 1.
But! I need the running time to be well below n^2. n log n or so should be fine. The reason I'm doing this clustering is to avoid computing a distance matrix of all n points (which takes n^2 time or many hours), instead, I want to just compute distances between clusters.
I tried the pycluster k-means algorithm but quickly realized it's way too slow. I've also tried the following greedy approach:
Slice space into 20 pieces in each dimension. (so there are 20^5 total pieces). I will store clusters in these grid boxes, according to their centroids.
For each point, retrieve the grid boxes that are within r (maximum bounding sphere radius). If there is a near enough cluster, add it to that cluster, otherwise make a new cluster.
However, this seems to give me more clusters than I want. I also implemented approaches similar to this twice, and they give very different answers.
Are there any standard approaches to clustering in faster than n^2 time? Probabilistic algorithms are ok.