Can Güney Aksakalli

k-Means Clustering Algorithms

k-means clustering is a method to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean:

$$\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where $S = \{S_1, \dots, S_k\}$ is the partition and $\mu_i$ is the mean of the observations in $S_i$.

The objective function minimizes the sum of squared distances of each observation to its cluster mean.

Since the objective function has numerous local minima, no algorithm known today can guarantee an optimal solution.
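For concreteness, the objective can be evaluated for a given partition with a couple of lines of NumPy. The helper below and its argument names (X for the data matrix, labels for the cluster index of each observation, means for the cluster means) are my own illustration, not anything from the original post:

```python
import numpy as np

def kmeans_objective(X, labels, means):
    """Sum of squared distances from each observation to its cluster mean."""
    return float(np.sum((X - means[labels]) ** 2))
```

This is the quantity that each of the three algorithms below tries to reduce, in different ways.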

Lloyd’s Algorithm

It is the most widely known algorithm for k-means clustering and is sometimes mistaken for the method itself. However, it is:

  • Sensitive to the initialization of the means
  • Unable to guarantee anything about the quality of the result, and it is hard to decide on the number of iterations
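A minimal NumPy sketch of Lloyd's two-step loop, assuming Euclidean distance and initial means drawn at random from the data (the function name and defaults are mine):

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning every point to its
    nearest mean and moving each mean to the centroid of its points."""
    rng = np.random.default_rng(seed)
    # Initialize the means with k distinct observations chosen at random.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: index of the nearest mean for every observation.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no assignment changed, so the means will not move either
        labels = new_labels
        # Update step: recompute each mean from the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return means, labels
```

Both the quality of the result and the number of iterations needed depend on the randomly chosen initial means, which is exactly the sensitivity noted above.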

Hartigan’s Algorithm

  • Converges quickly
  • Sensitive to the initial random class assignment of points.
    • First animation: Hartigan's algorithm for k-means clustering
    • Second animation (different initialization): Hartigan's Algorithm 2
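A rough NumPy sketch of Hartigan's point-by-point updates, assuming a random initial class assignment and the standard change-in-objective test for moving a single point between clusters (the names and details are my own, not from the post):

```python
import numpy as np

def hartigan_kmeans(X, k, seed=0):
    """Hartigan's algorithm: repeatedly move single points to the cluster
    that lowers the total within-cluster sum of squares the most, updating
    the two affected means incrementally."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = rng.permutation(n) % k          # random, roughly balanced start
    means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sizes = np.bincount(labels, minlength=k)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            src, x = labels[i], X[i]
            if sizes[src] <= 1:
                continue                     # never empty a cluster
            # Decrease in the objective from removing x from its cluster ...
            remove_gain = sizes[src] / (sizes[src] - 1) * np.sum((x - means[src]) ** 2)
            # ... versus the increase from inserting it into each other cluster.
            insert_cost = sizes / (sizes + 1) * np.sum((x - means) ** 2, axis=1)
            insert_cost[src] = np.inf        # moving to its own cluster is not a move
            dst = int(np.argmin(insert_cost))
            if insert_cost[dst] < remove_gain:
                # Update only the two affected means and cluster sizes.
                means[src] = (sizes[src] * means[src] - x) / (sizes[src] - 1)
                means[dst] = (sizes[dst] * means[dst] + x) / (sizes[dst] + 1)
                sizes[src] -= 1
                sizes[dst] += 1
                labels[i] = dst
                changed = True
    return means, labels
```

Each accepted move strictly decreases the objective, so the loop terminates; the final partition, however, still depends on the random initial assignment.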

MacQueen’s Algorithm

  • Convenient for data streams
  • Sensitive to the order of the stream and to the initialization of the means
    • First animation: MacQueen's algorithm for k-means clustering
    • Second animation (shuffled data): MacQueen's Algorithm 2
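A minimal sketch of MacQueen's online update, assuming the first k points of the stream seed the means and every later point nudges only its nearest mean; this is a sketch under those assumptions, not the post's exact implementation:

```python
import numpy as np

def macqueen_kmeans(stream, k):
    """MacQueen's algorithm: seed the means with the first k points, then
    assign each incoming point to its nearest mean and update that mean
    as a running average."""
    means, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if len(means) < k:
            means.append(x.copy())           # seed means with the first k points
            counts.append(1)
            continue
        # Nearest mean for the incoming point.
        j = int(np.argmin([np.sum((m - x) ** 2) for m in means]))
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]   # incremental running-average update
    return np.array(means)
```

Because each point is folded into a running average exactly once, shuffling the stream generally changes the result, which is the order sensitivity noted above.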

[code] [presentation]
