Determining the best k for a k nearest neighbour

Question

asked Jul 15, 2019 in AI and Deep Learning by ashely (50.2k points)

I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way).

The analysis itself will form part of the data being fed into a visualization, rather than the inputs into another process (e.g. Radial Basis Function Networks).

To this end, I'd like to find a set of clusters that primarily "looks right", rather than elucidating some hidden patterns.

My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm would be problematic.

The problem I'm coming to is this:

How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?

Questions:

Assuming that this isn't NP-complete, what is the time complexity for finding a good k. (probably reported in a number of times to run the k-means algorithm).
is k-means a good starting point for this type of problem? If so, what other approaches would you recommend? A specific example, backed by an anecdote/experience would be maxi-box.
what short cuts/approximations would you recommend increasing the performance.

1 Answer

vinita · Answer 1 · 2019-07-15T12:27:44+0000

There is a variation of the K-Means algorithm called X-Means which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to solving the problem of scalability by using KD-trees.

Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy to use GUI tool.

You can follow the following steps for the solution:
Start with k=2.
For a number of tries:
Run the k-means algorithm to find k clusters.
Find the mean square distance from the origin to the cluster centroids.
Repeat the 2-3, to find a standard deviation of the distances. This is a proxy for the stability of the clusters.
If the stability of clusters for k < stability of clusters for k - 1 then return k - 1
Increment k by 1.

The main idea behind this algorithm is that the number of sets of k clusters is small for "good" values of k.

If you wish to know more about K-means Clustering Algorithm then visit this K-means Clustering Algorithm Tutorial.

Determining the best k for a k nearest neighbour

1 Answer

Related questions

Browse Categories

Browse By Domains

Popular Courses

Popular Tutorials

Popular Resources