+2 votes

@Rony, Yes, it is possible to specify your own distance function.

K-means clustering is one of the most widely used unsupervised machine learning algorithms. It groups data into clusters based on the similarity between data instances. The algorithm starts by randomly choosing a centroid for each cluster and then iteratively performs three steps:

1. Compute the Euclidean distance between each data instance and the centroids of all the clusters.

2. Assign each data instance to the cluster whose centroid is nearest.

3. Recompute each centroid as the mean of the coordinates of all the data instances assigned to that cluster.
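The three steps above can be sketched in plain NumPy (the data points and starting centroids below are made up for illustration):

```python
import numpy as np

def kmeans_sketch(data, centroids, n_iters=10):
    """Minimal sketch of the three k-means steps described above."""
    for _ in range(n_iters):
        # Step 1: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each point to the cluster of the nearest centroid
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        for k in range(centroids.shape[0]):
            members = data[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy data: two obvious groups
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans_sketch(data, np.array([[0.0, 0.0], [10.0, 10.0]]))
```

The first two points end up in one cluster and the last two in the other.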

Use this example to understand how to specify your own distance function (note that the toy distance function and sample data here are just for illustration):

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

# A toy user-defined "distance" (not a real metric; just to show the API)
myfunc = lambda point1, point2: point1[0] + point2[0] + 2

metric = distance_metric(type_metric.USER_DEFINED, func=myfunc)

# Example data and initial centroids
sample = [[1.0, 1.5], [2.1, 2.4], [6.3, 7.1], [6.8, 7.4]]
start_centers = [[2.9, 3.4], [5.9, 6.7]]

kinstance = kmeans(sample, start_centers, metric=metric)
kinstance.process()
clstr = kinstance.get_clusters()

0 votes

You can just use nltk instead, where you can do this:

from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance

NUM_CLUSTERS = <choose a value>

data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)

assigned_clusters = kclusterer.cluster(data, assign_clusters=True)


0 votes

K-means clustering can in principle use other distance functions, but scikit-learn's KMeans itself only supports Euclidean distance.

K-Means clustering is a technique that partitions the dataset into homogeneous clusters: observations within a cluster are similar to each other and different from those in other clusters, and the resulting clusters are mutually exclusive (non-overlapping).

The technique partitions observations based on their distance from the cluster centroids. This distance is usually the Euclidean distance; I have never used my own distance function with scikit-learn's K-Means.
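For completeness, a minimal scikit-learn example (the toy data here is made up); note that KMeans exposes no parameter for swapping out the Euclidean metric:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

# KMeans has no `metric`/`distance` argument; it always minimizes
# squared Euclidean distance to the centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

The two small points land in one cluster and the two large points in the other.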
