2 views

I am having trouble fully understanding the k-means++ algorithm. I am interested exactly how the first k centroids are picked (the rest is like in the original k-means).

1. Is the probability function used based on distance or Gaussian?

2. In the same time the most long distant point (from the other centroids) is picked for a new centroid.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also, a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

by (33.1k points)

In unsupervised learning, the cluster centers are initially chosen at random from the set of input data, where the probability of choosing vector x is high.

Code for Kmeans++:

def initialize(X, K):

C = [X]

for k in range(1, K):

D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])

probs = D2/D2.sum()

cumprobs = probs.cumsum()

r = scipy.rand()

for j,p in enumerate(cumprobs):

if r < p:

i = j

break

C.append(X[i])

return C

The output of the cumsum function gives boundaries to partition the interval [0,1]. These partitions have a length equal to the probability of the corresponding point being chosen as a center. Since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals. The for loop here checks to see which partition r is in.

For Example:

probs = [0.1, 0.2, 0.3, 0.4]

cumprobs = [0.1, 0.3, 0.6, 1.0]

if r < cumprobs:

# this event has probability 0.1

i = 0

elif r < cumprobs:

# this event has probability 0.2

i = 1

elif r < cumprobs:

# this event has probability 0.3

i = 2

elif r < cumprobs:

# this event has probability 0.4

i = 3