In unsupervised learning, the cluster centers are initially chosen at random from the set of input data, where the probability of choosing vector x is high.
Code for Kmeans++:
def initialize(X, K):
C = [X[0]]
for k in range(1, K):
D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
probs = D2/D2.sum()
cumprobs = probs.cumsum()
r = scipy.rand()
for j,p in enumerate(cumprobs):
if r < p:
i = j
break
C.append(X[i])
return C
The output of the cumsum function gives boundaries to partition the interval [0,1]. These partitions have a length equal to the probability of the corresponding point being chosen as a center. Since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals. The for loop here checks to see which partition r is in.
For Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
# this event has probability 0.1
i = 0
elif r < cumprobs[1]:
# this event has probability 0.2
i = 1
elif r < cumprobs[2]:
# this event has probability 0.3
i = 2
elif r < cumprobs[3]:
# this event has probability 0.4
i = 3
Hope this answer helps.