+1 vote
in Python by (220 points)
Can I specify my own distance function using scikit-learn K-Means Clustering?

3 Answers

+2 votes
by (10.9k points)

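@Rony, yes, it is possible to run K-means with your own distance function, although not with scikit-learn's KMeans itself, which always uses Euclidean distance and does not accept a custom one. The example further down uses the pyclustering library instead.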

K-means clustering is one of the most widely used unsupervised machine learning algorithms. It groups data instances into clusters based on how similar they are to each other. The algorithm starts by randomly choosing a centroid for each cluster, and then iteratively repeats three steps:

1. Find the Euclidean distance between each data instance and the centroids of all the clusters.

2. Assign each data instance to the cluster whose centroid is nearest.

3. Calculate new centroid values based on the mean values of the coordinates of all the data instances from the corresponding cluster.
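The three steps above can be sketched in plain Python with the distance passed in as a parameter. This is only an illustration, not the scikit-learn or pyclustering implementation; the names kmeans_custom, euclidean and manhattan are made up for this sketch. Note that step 3 still uses the mean, which is only guaranteed to minimize the Euclidean distance, so with other distances the update is a heuristic.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences (an example custom distance)
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans_custom(points, centers, distance=euclidean, max_iter=100):
    # Hypothetical helper: K-means loop with a pluggable distance function
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Steps 1 and 2: measure the distance to every centroid, pick the nearest
        labels = [
            min(range(len(centers)), key=lambda k: distance(p, centers[k]))
            for p in points
        ]
        # Step 3: recompute each centroid as the coordinate-wise mean of its members
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep the centroid of an empty cluster
        if new_centers == centers:  # stop once the centroids no longer move
            break
        centers = new_centers
    return labels, centers

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels, centers = kmeans_custom(points, [[0.0, 0.0], [10.0, 10.0]], distance=manhattan)
print(labels, centers)
```

Any function that takes two points and returns a number can be passed as distance, which is the same idea the library examples in this thread rely on.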

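The following example shows how to specify your own distance function with the pyclustering library:

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

# Toy user-defined function, for illustration only (not a true distance);
# replace it with any callable that takes two points and returns a number
myfunc = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=myfunc)

# Placeholder 2-D data; substitute your own list of points
sample = [[2.5, 3.0], [3.1, 3.6], [6.0, 6.5], [5.7, 7.0]]
# Initial centroids, one per cluster
start_centers = [[2.9, 3.4], [5.9, 6.7]]

kinstance = kmeans(sample, start_centers, metric=metric)
# Run the clustering before asking for the result
kinstance.process()
# Each cluster is a list of indices into sample
clstr = kinstance.get_clusters()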

0 votes
by (35.4k points)

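You can use nltk instead, whose K-means clusterer accepts a distance function:

from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance

NUM_CLUSTERS = <choose a value>

# nltk expects dense vectors, so convert the sparse matrix first
data = <sparse matrix that you would normally give to scikit>.toarray()

# repeats=25 reruns the clustering with different random initial means
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)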

assigned_clusters = kclusterer.cluster(data, assign_clusters=True)
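For reference, the cosine distance used above is one minus the cosine of the angle between two vectors. Below is a minimal pure-Python sketch of that formula; the name cosine_distance_sketch is made up here and this is not nltk's source code. It assumes both vectors are non-zero and have the same length.

```python
import math

def cosine_distance_sketch(u, v):
    # 1 - (u . v) / (|u| * |v|); assumes non-zero vectors of equal length
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Perpendicular vectors are maximally dissimilar among non-negative vectors
print(cosine_distance_sketch([1.0, 0.0], [0.0, 1.0]))
# Vectors pointing the same way have a distance of (approximately) zero
print(cosine_distance_sketch([1.0, 2.0], [2.0, 4.0]))
```

Because only the direction of the vectors matters, this distance is a common choice for text data such as TF-IDF vectors, where document length should not dominate.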


0 votes
by (140 points)
Yes, K-means can in principle be run with your own distance, but scikit-learn's KMeans does not take a distance function as a parameter.

K-means clustering is a technique that partitions a dataset into homogeneous clusters: observations within a cluster are similar to each other but different from those in other clusters. The resulting clusters are mutually exclusive, i.e. non-overlapping.

The technique partitions the observations based on their distance from the cluster centroids. This distance is usually the Euclidean distance; I have never used my own distance with scikit-learn K-Means.