0 votes
in AI and Deep Learning by (50.2k points)

I recently started studying clustering in data mining, and so far I have covered sequential clustering, hierarchical clustering, and k-means.

I also read a statement that distinguishes k-means from the other two clustering techniques: it says k-means is not very good at dealing with nominal attributes, but the text didn't explain why. So far, the only difference I can see is that with k-means we know in advance that we need exactly K clusters, whereas for the other two methods we don't know how many clusters we will need.

So could anybody give me some idea of why such a statement exists, i.e., why k-means has this problem with nominal attributes, and whether there is a way to overcome it?

Thanks in advance.

1 Answer

0 votes
by (108k points)

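K-Means is designed for numerical data: it computes cluster means and Euclidean distances, and neither is meaningful for nominal categories, which have no order and no arithmetic average. K-Modes adapts the algorithm to categorical data. It uses a simple matching dissimilarity measure (the number of attributes on which two instances differ), replaces the cluster means with modes, and updates those modes with a frequency-based method during clustering to minimize the clustering cost function.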
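To make the two K-Modes ingredients concrete, here is a minimal sketch in plain Python. The function names and the toy records are made up for illustration; this is not a full K-Modes implementation, only the dissimilarity measure and the mode update for a single cluster.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Return the most frequent category of each attribute in a cluster."""
    # zip(*records) turns the list of records into one tuple per attribute.
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical cluster of three records with attributes (colour, size, shape).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]

mode = cluster_mode(cluster)
distance = matching_dissimilarity(("blue", "large", "square"), mode)

print(mode)      # the cluster "centre" is a record of categories, not numbers
print(distance)  # number of mismatching attributes
```

The point of the sketch is that nothing here requires adding or averaging category labels, which is exactly the step that breaks down in plain K-Means.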

If you want to stick with a K-Means variant, check out the K-Prototypes algorithm, which combines K-Means and K-Modes to cluster instances described by mixed numeric and categorical attributes.
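As a rough illustration of how the two measures can be combined for mixed data, the sketch below adds a squared Euclidean distance on the numeric attributes to a weighted mismatch count on the categorical ones. The function name, the sample values, and the weight `gamma` chosen here are assumptions for the example; in practice the weight has to be tuned to the data.

```python
def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus a weighted mismatch count."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = sum(x != y for x, y in zip(cat_a, cat_b))
    # gamma balances the influence of categorical against numeric attributes.
    return numeric_part + gamma * categorical_part

# Two hypothetical records: two numeric attributes and two categorical ones.
d = mixed_dissimilarity(
    [1.0, 2.0], ["red", "small"],
    [2.0, 4.0], ["red", "large"],
    gamma=0.5,
)
print(d)
```

A larger `gamma` makes category mismatches count for more relative to numeric differences, so it is worth standardizing the numeric attributes before picking a value.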

If a categorical variable is ordinal, you may try using it in the clustering algorithm as if it were numerical. If the variable is nominal, you have to create a binary variable for each category (1 if the category is present, 0 if it is absent). You may also have to standardize all the variables (to mean = 0 and variance = 1) before running the cluster analysis, so that no single variable dominates the distance calculation.
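The encoding and standardization steps can be sketched with the standard library alone. The helper names and the sample columns below are invented for the example, and the standardization uses the population standard deviation; libraries such as scikit-learn offer equivalent, more robust tools.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Create one binary column per category (1 = present, 0 = absent)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

def standardize(column):
    """Rescale a numeric column to mean 0 and variance 1."""
    mu, sigma = mean(column), pstdev(column)
    # Assumes the column is not constant; sigma == 0 would divide by zero.
    return [(x - mu) / sigma for x in column]

# Hypothetical nominal and numeric columns for four instances.
colors = ["red", "blue", "green", "red"]
ages = [20.0, 30.0, 40.0, 50.0]

encoded, categories = one_hot(colors)
scaled = standardize(ages)

print(categories)  # column order of the binary variables
print(encoded)     # one row of 0/1 flags per instance
print(scaled)      # standardized numeric column
```

The binary columns and the standardized numeric columns can then be concatenated row by row and passed to K-Means.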

