sobota 12. listopadu 2011

How to initialize the centers of models in GMM

Today I have been dealing with a big problem which is the initialization of the centers of new clusters in GMM. Other topic is how to find the optimal number of clusters.

Many approaches to these problems have been proposed, but any of them is the best one. As the others also I have been thinking for a while what could be the best way to find the right number of clusters. I have developed some algorithm based on the likelihood function, which I'd like to deal with you. I know it isn't optimal, but it could be a direction if not the way.

On this address you can see the video with one of my first trials.
http://www.youtube.com/watch?v=ZBKbTAmVT2k

The more difficult example:
http://www.youtube.com/watch?v=UuHHpJjyPIE


What can be seen on this video are few problems which we have when programming some automatic algorithm for GMM. It tends to overfit the data - which means that more clusters are found that would we like to. Another problem is that when the data are not gaussian and we have only few of them then it is obvious that the algorithm wouldn't work as well as we would liked to.
So there are few tasks for every researcher - find out some tradeoff between the log-likelihood which is decreasing when the models are added and the number of models - we prefer less models.

The algorithm is:
1.initialize first cluster in the mean of data
2.run EM algorithm (GMM) to find optimal mean of the model/s and covariance matrices
3.for each datapoint i look for the highest likelihood among models and save it to the vector lm(i)
4.make the histogram of lm (there is the problem how to find the best partitioning of the histogram)
5.find local maxima in histogram
6.exclude the maxima which are relevant to the already placed models
7.from the resting maxima find the one with the biggest number of datapoints
8.from these datapoints find the one which has the densiest neigbrhood and set this point as the center for the new cluster
9.run 2-8 until the dependency between new cluster and some other cluster is higher than some threshold

Further reading:
Initialization of GMM models:
Very helpful was the article from Lee&Lee&Lee: The Estimating optimal number of GMM based on incremental k-means for speaker identifications

Other articles about the optimal number of clusters and initialization centers of models:
Selecting the optimal number of components for GMM
An algorithm for estimating number of components of GMM based on penalized distance
Number of components and initialization in Gaussian Mixture Model

Žádné komentáře:

Okomentovat