kmeans++
The kmeans algorithm is one of the most popularly used algorithms in clustering. However, one of the main problem of the algorithm is the initialization of the seed points. Random initializations is generally used and it can sometimes lead to a clustering arbitrarily worse than the optimal clustering. Further it can cause much slower convergence to the final result. The kmeans++ algorithm comes to our rescue in such cases. The kmeans++ paper was published in 2007 and so far has more than 850 citations. It provides a very cute method to initialize kmeans. The idea is as follows:
Instead of picking up random cluster centres, how about picking the farthest point from the chosen cluster centres ? The problem with this approach is that it will result in picking up outliers as cluster centres. Instead, we want to pick cluster centre from the next dense group of points. So here are the two steps for the algorithm -
Useful Links:
Instead of picking up random cluster centres, how about picking the farthest point from the chosen cluster centres ? The problem with this approach is that it will result in picking up outliers as cluster centres. Instead, we want to pick cluster centre from the next dense group of points. So here are the two steps for the algorithm -
- Pick the first cluster center at random.
- Choose the subsequent cluster centres from the remaining data points with a probability proportional to the squared distance from the points closest existing cluster centre.
[sz dim] = size(X); C(1,:) = X(randi(sz),:); L = ones(sz, 1); for i=2:k D = sqrt(sum((X - C(L,:)).^2, 2)); PDF = cumsum(D); PDF = PDF/PDF(end); C(i,:) = X(find(rand < PDF,1),:); for j=1:i temp(j,:) = sum((X - repmat(C(j,:), sz, 1)).^2, 2); end [val, L] = min(temp); end
Useful Links:
Comments
Post a Comment