2 views

I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means.

Here is what I have understood, please let me know if my concept is wrong:

GMM is like KNN, in the sense that clustering is achieved in both cases. But in GMM each cluster has it's own independent mean and covariance. Furthermore, k-means performs hard assignments of data points to clusters whereas in GMM we get a collection of independent Gaussian distributions, and for each data point, we have a probability that it belongs to one of the distributions.

To understand it better I have used MatLab to code it and achieve the desired clustering. I have used SIFT features for the purpose of feature extraction. And have used k-means clustering to initialize the values. (This is from the VLFeat documentation)

%images is a 459 x 1 cell array where each cell contains the training image

[locations, all_feats] = vl_dsift(single(images{1}), 'fast', 'step', 50); %all_feats will be 128 x no. of keypoints detected

for i=2:(size(images,1))

[locations, feats] = vl_dsift(single(images{i}), 'fast', 'step', 50);

all_feats = cat(2, all_feats, feats); %cat column wise all features

end

numClusters = 50; %Just a random selection.

% Run KMeans to pre-cluster the data

[initMeans, assignments] = vl_kmeans(single(all_feats), numClusters, ...

'Algorithm','Lloyd', ...

'MaxNumIterations',5);

initMeans = double(initMeans); %GMM needs it to be double

% Find the initial means, covariances and priors

for i=1:numClusters

data_k = all_feats(:,assignments==i);

initPriors(i) = size(data_k,2) / numClusters;

if size(data_k,1) == 0 || size(data_k,2) == 0

initCovariances(:,i) = diag(cov(data'));

else

initCovariances(:,i) = double(diag(cov(double((data_k')))));

end

end

% Run EM starting from the given parameters

[means,covariances,priors,ll,posteriors] = vl_gmm(double(all_feats), numClusters, ...

'initialization','custom', ...

'InitMeans',initMeans, ...

'InitCovariances',initCovariances, ...

'InitPriors',initPriors);

Based on the above I have means, covariances, and priors. My main question is, What now? I am kind of lost now.

Also the means, covariances vectors are each of the size 128 x 50. I was expecting them to be 1 x 50 since each column is a cluster, won't each cluster have only one mean and covariance? (I know 128 are the SIFT features but I was expecting means and covariances).

In k-means I used the MatLab command knnsearch(X,Y) which basically finds the nearest neighbour in X for each point in Y.

So how to achieve this in GMM, I know its a collection of probabilities, and of course the nearest match from that probability will be our winning cluster. And this is where I am confused. All tutorials online have taught how to achieve the means, covariances values, but do not say much in how to actually use them in terms of clustering.

Thank you

by (33.1k points)

A Gaussian Mixture is a function that is comprised of many Gaussians, each identified by k ∈ {1,…, K}, where K is the number of clusters of our dataset. Each Gaussian k in the mixture is comprised of the following parameters:

• A mean μ that defines its center.

• A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.

• A mixing probability π that defines how big or small the Gaussian function will be.

Code:

% project it down to 2 dimensions for the sake of visualization

[~,data] = pca(meas,'NumComponents',2);

mn = min(data); mx = max(data);

D = size(data,2);    % data dimension

% inital kmeans step used to initialize EM

K = 3;               % number of mixtures/clusters

cInd = kmeans(data, K, 'EmptyAction','singleton');

% fit a GMM model

gmm = fitgmdist(data, K, 'Options',statset('MaxIter',1000), ...

'CovType','full', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);

% means, covariances, and mixing-weights

mu = gmm.mu;

sigma = gmm.Sigma;

p = gmm.PComponents;

% cluster and posterior probablity of each instance

% note that: [~,clustIdx] = max(p,[],2)

[clustInd,~,p] = cluster(gmm, data);

tabulate(clustInd)

% plot data, clustering of the entire domain, and the GMM contours

clrLite = [1 0.6 0.6 ; 0.6 1 0.6 ; 0.6 0.6 1];

clrDark = [0.7 0 0 ; 0 0.7 0 ; 0 0 0.7];

[X,Y] = meshgrid(linspace(mn(1),mx(1),50), linspace(mn(2),mx(2),50));

C = cluster(gmm, [X(:) Y(:)]);

image(X(:), Y(:), reshape(C,size(X))), hold on

gscatter(data(:,1), data(:,2), species, clrDark)

h = ezcontour(@(x,y)pdf(gmm,[x y]), [mn(1) mx(1) mn(2) mx(2)]);

set(h, 'LineColor','k', 'LineStyle',':')

hold off, axis xy, colormap(clrLite)

title('2D data and fitted GMM'), xlabel('PC1'), ylabel('PC2')