kmeans++

The kmeans algorithm is one of the most widely used clustering algorithms. However, one of its main problems is the initialization of the seed points. Random initialization is generally used, and it can sometimes lead to a clustering arbitrarily worse than the optimal clustering. Further, it can cause much slower convergence to the final result. The kmeans++ algorithm comes to our rescue in such cases. The kmeans++ paper was published in 2007 and so far has more than 850 citations. It provides a very cute method to initialize kmeans. The idea is as follows:
Instead of picking random cluster centres, how about picking the point farthest from the already chosen cluster centres? The problem with this approach is that it tends to pick outliers as cluster centres. Instead, we want to pick the next cluster centre from the next dense group of points. So here are the two steps of the algorithm -
  1. Pick the first cluster center at random.
  2. Choose each subsequent cluster centre from the remaining data points with probability proportional to the point's squared distance from its closest existing cluster centre.
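To make the weighted draw in step 2 concrete, here is a minimal Python sketch (with made-up squared distances, purely for illustration) of sampling an index with probability proportional to squared distance, using a cumulative sum:

```python
import random

def sample_proportional(d_sq):
    """Draw an index with probability proportional to the given squared distances."""
    cdf = []
    total = 0.0
    for d in d_sq:
        total += d
        cdf.append(total)          # running cumulative sum of the weights
    r = random.random() * total    # uniform draw on [0, total)
    for i, c in enumerate(cdf):
        if r < c:                  # first index whose cumulative weight exceeds r
            return i
    return len(d_sq) - 1

# e.g. with squared distances [1, 4, 9, 2], index 2 is picked
# with probability 9/16
```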
The authors give theoretical guarantees that the algorithm performs well, and they show empirically that it results in much lower computational time and a considerable improvement in the final error of kmeans. The MATLAB code is as follows:
 
[sz, dim] = size(X);
C = zeros(k, dim);
C(1,:) = X(randi(sz),:);            % pick the first centre uniformly at random
L = ones(sz, 1);                    % L(n) = index of the centre closest to X(n,:)
for i = 2:k
    D = sum((X - C(L,:)).^2, 2);    % squared distance to each point's closest centre
    CDF = cumsum(D);
    CDF = CDF/CDF(end);             % cumulative distribution over the points
    C(i,:) = X(find(rand < CDF, 1), :);  % sample the next centre, prob. proportional to D
    dist = zeros(i, sz);
    for j = 1:i
        dist(j,:) = sum((X - repmat(C(j,:), sz, 1)).^2, 2);
    end
    [~, L] = min(dist);             % reassign each point to its nearest centre
end
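The same two-step initialization can also be sketched in plain Python; this is an illustrative version (function name and list-of-lists point representation are my own, not from the paper) of the seeding only, not the full kmeans loop:

```python
import random

def kmeanspp_init(X, k):
    """kmeans++ seeding: first centre uniform at random, the rest D^2-weighted.
    X is a list of points (each a list of floats); returns k chosen centres."""
    centres = [random.choice(X)]
    for _ in range(1, k):
        # squared distance from each point to its closest existing centre
        d_sq = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centres)
                for x in X]
        # sample the next centre with probability proportional to d_sq
        r = random.random() * sum(d_sq)
        acc = 0.0
        for x, d in zip(X, d_sq):
            acc += d
            if r < acc:
                centres.append(x)
                break
        else:
            centres.append(X[-1])   # degenerate case: all remaining distances are zero
    return centres
```

Points already chosen as centres have zero squared distance, so they are never picked again as long as any other point has positive mass.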

Useful Links:
  1. Kmeans++ paper
  2. Superb video lecture by the author
