in Data Science by (17.6k points)

dataset is a pandas DataFrame, and I am clustering it with sklearn.cluster.KMeans:

 from sklearn.cluster import KMeans

 km = KMeans(n_clusters=n_Clusters)
 km.fit(dataset)
 prediction = km.predict(dataset)

This is how I decide which entity belongs to which cluster:

 # Map each index label to the cluster id predicted for that row.
 cluster_fit_dict = {}
 for i in range(len(prediction)):
     cluster_fit_dict[dataset.index[i]] = prediction[i]

This is what dataset looks like:

 A 1 2 3 4 5 6
 B 2 3 4 5 6 7
 C 1 4 2 7 8 1
 ...

where A, B, C are the index labels.

Is this the correct way of using k-means?

1 Answer

by (41.4k points)

To check whether your DataFrame dataset has suitable content, you can explicitly convert it to a NumPy array and inspect it:

 dataset_array = dataset.values
 print(dataset_array.dtype)
 print(dataset_array)

If the array has a homogeneous numerical dtype (typically numpy.float64), then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler, because k-means relies on distances and columns with larger ranges would otherwise dominate the clustering.
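As a rough sketch of that normalization step, the snippet below rescales the columns before clustering. The toy values mirror the three rows shown in the question; n_clusters=2, n_init=10 and random_state=0 are arbitrary choices made for illustration, not values taken from the question.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data shaped like the example in the question (illustrative only).
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
)

# Rescale every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(dataset.values)

# Assumed settings for this sketch; pick n_clusters to suit your data.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(scaled)

# Map each index label to its cluster id.
cluster_fit_dict = dict(zip(dataset.index, labels.tolist()))
print(cluster_fit_dict)
```

Here fit_predict does the same job as calling fit and then predict on the same data, and the dictionary is built the same way as in the question, only without the explicit loop.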

If your DataFrame is heterogeneously typed, the dtype of the corresponding NumPy array will be object, which is not suitable for scikit-learn. In that case you need to extract a numerical representation for all the relevant features (for instance, by creating dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
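One possible way to do that with pandas is sketched below. The frame and its column names (sample_id, color, height) are invented for this example and do not come from the question.

```python
import pandas as pd

# Hypothetical mixed-type frame; all column names are made up for illustration.
raw = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "color": ["red", "blue", "red"],
        "height": [1.0, 2.5, 3.0],
    },
    index=["A", "B", "C"],
)

# Mixed string and float columns give an object array.
print(raw.values.dtype)

# Drop the identifier column, then expand the categorical column into dummies.
features = raw.drop(columns=["sample_id"])
features = pd.get_dummies(features, columns=["color"], dtype=float)

# The result is purely numerical and can be passed to scikit-learn.
features_array = features.values
print(features_array.dtype)
```

The index labels A, B, C are kept on features, so cluster ids can still be mapped back to them after fitting.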
