Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Machine Learning by (19k points)

I am having trouble fully understanding the k-means++ algorithm. I am interested exactly how the first k centroids are picked (the rest is like in the original k-means).

  1. Is the probability function used based on distance or Gaussian?

  2. In the same time the most long distant point (from the other centroids) is picked for a new centroid.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also, a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

1 Answer

0 votes
by (33.1k points)

In unsupervised learning, the cluster centers are initially chosen at random from the set of input data, where the probability of choosing vector x is high.

Code for Kmeans++:

def initialize(X, K):

    C = [X[0]]

    for k in range(1, K):

        D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])

        probs = D2/D2.sum()

        cumprobs = probs.cumsum()

        r = scipy.rand()

        for j,p in enumerate(cumprobs):

            if r < p:

                i = j

                break

        C.append(X[i])

    return C

The output of the cumsum function gives boundaries to partition the interval [0,1]. These partitions have a length equal to the probability of the corresponding point being chosen as a center. Since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals. The for loop here checks to see which partition r is in.

For Example:

probs = [0.1, 0.2, 0.3, 0.4]

cumprobs = [0.1, 0.3, 0.6, 1.0]

if r < cumprobs[0]:

    # this event has probability 0.1

    i = 0

elif r < cumprobs[1]:

    # this event has probability 0.2

    i = 1

elif r < cumprobs[2]:

    # this event has probability 0.3

    i = 2

elif r < cumprobs[3]:

    # this event has probability 0.4

    i = 3

Hope this answer helps.

...