# How exactly does k-means++ work?

0 votes
1 view

I am having trouble fully understanding the k-means++ algorithm. I am interested exactly how the first k centroids are picked (the rest is like in the original k-means).

1. Is the probability function used based on distance or Gaussian?

2. In the same time the most long distant point (from the other centroids) is picked for a new centroid.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also, a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

## 1 Answer

0 votes
by (33.2k points)

In unsupervised learning, the cluster centers are initially chosen at random from the set of input data, where the probability of choosing vector x is high.

Code for Kmeans++:

def initialize(X, K):

C = [X]

for k in range(1, K):

D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])

probs = D2/D2.sum()

cumprobs = probs.cumsum()

r = scipy.rand()

for j,p in enumerate(cumprobs):

if r < p:

i = j

break

C.append(X[i])

return C

The output of the cumsum function gives boundaries to partition the interval [0,1]. These partitions have a length equal to the probability of the corresponding point being chosen as a center. Since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals. The for loop here checks to see which partition r is in.

For Example:

probs = [0.1, 0.2, 0.3, 0.4]

cumprobs = [0.1, 0.3, 0.6, 1.0]

if r < cumprobs:

# this event has probability 0.1

i = 0

elif r < cumprobs:

# this event has probability 0.2

i = 1

elif r < cumprobs:

# this event has probability 0.3

i = 2

elif r < cumprobs:

# this event has probability 0.4

i = 3

Hope this answer helps.

0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer