# K-means without knowing the number of clusters?


I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters.

I remember reading somewhere that the way an algorithm generally does this is such that the inter-cluster distance is maximized and intra-cluster distance is minimized but I don't remember where I saw that. It would be great if someone can point me to any resources that discuss this. I am using SciPy for k-means currently but any related library would be fine as well.

If there are alternate ways of achieving the same or a better algorithm, please let me know.


For your problem, since your dataset has a large number of features, you can first apply Principal Component Analysis (PCA). It is a dimensionality-reduction technique: it projects the data onto a smaller set of uncorrelated components, ordered by how much of the variance each one explains. Once you have reduced the data to a manageable number of components, you can run k-means on the reduced representation, which typically makes distance-based clustering more reliable.
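A minimal sketch of that pipeline, assuming synthetic placeholder data and a 95% explained-variance threshold (both are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # placeholder for your real 50-dimensional data

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Cluster in the reduced space; k=3 is just an example value here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_reduced)
```

Passing a float between 0 and 1 as `n_components` tells scikit-learn's PCA to pick the smallest number of components whose cumulative explained variance reaches that fraction.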

Another approach is a cross-validation-style check: run k-means on a subset of the data for a candidate k, then see how well the resulting clusters describe the held-out portion. A value of k whose clusters generalize to the rest of the data is a reasonable choice.
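A closely related and common technique (not mentioned in the answer above, but a standard way to compare candidate k values) is the silhouette score, which rewards exactly the property the question describes: large inter-cluster distance and small intra-cluster distance. A sketch on synthetic, well-separated data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs in 5 dimensions (placeholder data).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 5)) for c in (0.0, 5.0, 10.0)])

# Score each candidate k; higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On this toy data the silhouette peaks at k=3, matching the three generated blobs; on real data the peak is a heuristic guide rather than a guarantee.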
K-means halts creating and optimizing clusters when either:

• The centroids have stabilized, i.e. their values no longer change between iterations, so the algorithm has converged.

• The defined maximum number of iterations has been reached.
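Both stopping conditions are exposed in scikit-learn's `KMeans`: `max_iter` caps the iteration count and `tol` sets the centroid-shift threshold for declaring convergence. A small sketch with synthetic two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of 2-D points (placeholder data).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 8.0)])

# max_iter bounds the iterations; tol is the centroid-movement
# threshold below which the centroids count as "stabilized".
km = KMeans(n_clusters=2, max_iter=300, tol=1e-4, n_init=10, random_state=0).fit(X)

# n_iter_ reports how many iterations the best initialization actually ran.
iterations_used = km.n_iter_
```

On easy data like this, convergence typically happens in far fewer than 300 iterations.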

For example (the original snippet left `X` undefined; a small sample array is added here so it runs as-is):

```python
import numpy as np
from sklearn.cluster import KMeans

# Example data: two visually obvious groups of points (replace with your own array).
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmean = KMeans(n_clusters=2, n_init=10, random_state=0)
kmean.fit(X)

print(kmean.labels_)           # cluster assignment for each point
print(kmean.cluster_centers_)  # the two fitted centroids
```