0 votes
1 view
in Machine Learning by (19k points)

I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means.

Here is what I have understood, please let me know if my concept is wrong:

GMM is like KNN, in the sense that clustering is achieved in both cases. But in GMM each cluster has it's own independent mean and covariance. Furthermore, k-means performs hard assignments of data points to clusters whereas in GMM we get a collection of independent Gaussian distributions, and for each data point, we have a probability that it belongs to one of the distributions.

To understand it better I have used MatLab to code it and achieve the desired clustering. I have used SIFT features for the purpose of feature extraction. And have used k-means clustering to initialize the values. (This is from the VLFeat documentation)

%images is a 459 x 1 cell array where each cell contains the training image

[locations, all_feats] = vl_dsift(single(images{1}), 'fast', 'step', 50); %all_feats will be 128 x no. of keypoints detected

for i=2:(size(images,1))

    [locations, feats] = vl_dsift(single(images{i}), 'fast', 'step', 50);

    all_feats = cat(2, all_feats, feats); %cat column wise all features

end

numClusters = 50; %Just a random selection.

% Run KMeans to pre-cluster the data

[initMeans, assignments] = vl_kmeans(single(all_feats), numClusters, ...

    'Algorithm','Lloyd', ...

    'MaxNumIterations',5);

initMeans = double(initMeans); %GMM needs it to be double

% Find the initial means, covariances and priors

for i=1:numClusters

    data_k = all_feats(:,assignments==i);

    initPriors(i) = size(data_k,2) / numClusters;

    if size(data_k,1) == 0 || size(data_k,2) == 0

        initCovariances(:,i) = diag(cov(data'));

    else

        initCovariances(:,i) = double(diag(cov(double((data_k')))));

    end

end

% Run EM starting from the given parameters

[means,covariances,priors,ll,posteriors] = vl_gmm(double(all_feats), numClusters, ...

    'initialization','custom', ...

    'InitMeans',initMeans, ...

    'InitCovariances',initCovariances, ...

    'InitPriors',initPriors);

Based on the above I have means, covariances, and priors. My main question is, What now? I am kind of lost now.

Also the means, covariances vectors are each of the size 128 x 50. I was expecting them to be 1 x 50 since each column is a cluster, won't each cluster have only one mean and covariance? (I know 128 are the SIFT features but I was expecting means and covariances).

In k-means I used the MatLab command knnsearch(X,Y) which basically finds the nearest neighbour in X for each point in Y.

So how to achieve this in GMM, I know its a collection of probabilities, and of course the nearest match from that probability will be our winning cluster. And this is where I am confused. All tutorials online have taught how to achieve the means, covariances values, but do not say much in how to actually use them in terms of clustering.

Thank you

1 Answer

0 votes
by (33.2k points)

A Gaussian Mixture is a function that is comprised of many Gaussians, each identified by k ∈ {1,…, K}, where K is the number of clusters of our dataset. Each Gaussian k in the mixture is comprised of the following parameters:

  • A mean μ that defines its center.

  • A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.

  • A mixing probability π that defines how big or small the Gaussian function will be.

Code:

% load Fisher Iris dataset

load fisheriris

% project it down to 2 dimensions for the sake of visualization

[~,data] = pca(meas,'NumComponents',2);

mn = min(data); mx = max(data);

D = size(data,2);    % data dimension    

% inital kmeans step used to initialize EM

K = 3;               % number of mixtures/clusters

cInd = kmeans(data, K, 'EmptyAction','singleton');

% fit a GMM model

gmm = fitgmdist(data, K, 'Options',statset('MaxIter',1000), ...

    'CovType','full', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);

% means, covariances, and mixing-weights

mu = gmm.mu;

sigma = gmm.Sigma;

p = gmm.PComponents;

% cluster and posterior probablity of each instance

% note that: [~,clustIdx] = max(p,[],2)

[clustInd,~,p] = cluster(gmm, data);

tabulate(clustInd)

% plot data, clustering of the entire domain, and the GMM contours

clrLite = [1 0.6 0.6 ; 0.6 1 0.6 ; 0.6 0.6 1];

clrDark = [0.7 0 0 ; 0 0.7 0 ; 0 0 0.7];

[X,Y] = meshgrid(linspace(mn(1),mx(1),50), linspace(mn(2),mx(2),50));

C = cluster(gmm, [X(:) Y(:)]);

image(X(:), Y(:), reshape(C,size(X))), hold on

gscatter(data(:,1), data(:,2), species, clrDark)

h = ezcontour(@(x,y)pdf(gmm,[x y]), [mn(1) mx(1) mn(2) mx(2)]);

set(h, 'LineColor','k', 'LineStyle',':')

hold off, axis xy, colormap(clrLite)

title('2D data and fitted GMM'), xlabel('PC1'), ylabel('PC2')

image

Hope this answer helps.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...