2 views

I'm trying to calculate the  Davies-Bouldin Index  in Python.

Here are the steps the code below tries to reproduce.

5 Steps:

For each cluster, compute euclidean distances between each point to the centroid

For each cluster, compute the average of these distances

For each pair of clusters, compute the euclidean distance between their centroids

Then,

For each pair of clusters, make the sum of the average distances to their respective centroid (computed at step 2) and divide it by the distance separating them (computed at step 3).

Finally,

Compute the mean of all these divisions (= all indexes) to get the Davies-Bouldin index of the whole clustering

Code

def daviesbouldin(X, labels, centroids):

import numpy as np

from scipy.spatial.distance import pdist, euclidean

nbre_of_clusters = len(centroids) #Get the number of clusters

distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster

distances_means = [] #Store the mean of these distances

DB_indexes = [] #Store Davies_Boulin index of each pair of cluster

second_cluster_idx = [] #Store index of the second cluster of each pair

first_cluster_idx = 0 #Set index of first cluster of each pair to 0

# Step 1: Compute euclidean distances between each point of a cluster to their centroid

for cluster in range(nbre_of_clusters):

for point in range(X[labels == cluster].shape):

distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))

# Step 2: Compute the mean of these distances

for e in distances:

distances_means.append(np.mean(e))

# Step 3: Compute euclidean distances between each pair of centroid

ctrds_distance = pdist(centroids)

# Tricky step 4: Compute Davies-Bouldin index of each pair of cluster

for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):

second_cluster_idx.append(e)

if second_cluster_idx[i-1] == nbre_of_clusters - 1:

first_cluster_idx += 1

DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])

# Step 5: Compute the mean of all DB_indexes

print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))

In arguments:

1.X is the data

2.labels, are the labels computed by a clustering algorithm (i.e: kmeans)

3.centroids are the coordinates of each cluster's centroid (i.e: cluster_centers_)

Also, note that I'm using Python 3

QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?

QUESTION2: Is my implementation of step 4 correct?

QUESTION3: Do I need to normalise intra and inter cluster distances ?

Further explanations on Step 4

Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.

At the first iteration:

sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 2 (index 1 of distances_means)

divides this sum by the distance between the 2 clusters (index 0 of ctrds_distance)

At the second iteration:

sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 3 (index 2 of distances_means)

divides this sum by the distance between the 2 clusters (index 1 of ctrds_distance)

and so on...

With the example of 10 clusters, the full iteration process should look like this:

intra-cluster distance intra-cluster distance       distance between their

of cluster:             of cluster:           centroids(storage num):

0           +             1            /             0

0           +             2            /             1

0           +             3            /             2

0           +             4            /             3

0           +             5            /             4

0           +             6            /             5

0           +             7            /             6

0           +             8            /             7

0           +             9            /             8

1           +             2            /             9

1           +             3            /             10

1           +             4            /             11

1           +             5            /             12

1           +             6            /             13

1           +             7            /             14

1           +             8            /             15

1           +             9            /             16

2           +             3            /             17

2           +             4            /             18

2           +             5            /             19

2           +             6            /             20

2           +             7            /             21

2           +             8            /             22

2           +             9            /             23

3           +             4            /             24

3           +             5            /             25

3           +             6            /             26

3           +             7            /             27

3           +             8            /             28

3           +             9            /             29

4           +             5            /             30

4           +             6            /             31

4           +             7            /             32

4           +             8            /             33

4           +             9            /             34

5           +             6            /             35

5           +             7            /             36

5           +             8            /             37

5           +             9            /             38

6           +             7            /             39

6           +             8            /             40

6           +             9            /             41

7           +             8            /             42

7           +             9            /             43

8           +             9            /             44

The problem here is I'm not quite sure that the index of distances_means matches the index of ctrds_distance.

In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.

In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.

Any help is more than welcomed !

by (41.4k points)

The below code is much faster and shorter version of Davies-Bouldin index naive implementation.

def DaviesBouldin(X, labels):

n_cluster = len(np.bincount(labels))

cluster_k = [X[labels == k] for k in range(n_cluster)]

centroids = [np.mean(k, axis = 0) for k in cluster_k]

variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]

db = []

for i in range(n_cluster):

for j in range(n_cluster):

if j != i:

db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

return(np.max(db) / n_cluster)

So, here are the mistakes encountered.

1. In step 4, the counter on the first draft was correct but irrelevant.

2.There is no need to normalise intra and inter cluster distances.

3.There was a mistake while calculating Euclidean distances.

If you wish to learn more about how to use python for data science, then go through data science python programming course by Intellipaat for more insights.