Can Güney Aksakalli

Can Güney Aksakalli

Notes from my journey

k-Means Clustering Algorithms

k-means clustering is a method to partition \(n\) observation into \(k\) clusters where \(k \ll n\). Each observation \(x\) belongs to a cluster \(C_i\) with the nearest mean \(\mu_i\) compared to other clusters \(\mu_l\):

\[C_i = \{ x \in X | \| x-\mu_i \|^2 \leq \| x-\mu_l \|^2 \}\]

The objective function is to minimize the sum of squared distances of each observation to its cluster mean.

\[\operatorname*{argmin}_{\mu_i,...,\mu_k} E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j-\mu_i \|^2\]

Since \(E(k)\) has numerous local minima, there is no algorithm known today to guarantee an optimal solution.

Lloyd’s algorithm

It is most widely known algorithm to solve k-means clustering and sometimes it is mistaken for the method itself. However, it is:

  • Sensitive to the initialization of the means
  • Promise nothing about quality, hard to decide about number of iterations

Hartigan’s Algorithm

  • Converges quickly
  • Sensitive to initial random class assignment of points.
    • First animation Hartigan's algorithm for k-means clustering
    • Second animation with different initialization Hartigan's Algorithm 2

MacQueen’s Algorithm

  • Convenient for streams
  • Sensitive to the order of stream and the initialization of the means
    • First animation MacQueen's algorithm for k-means clustering
    • Second animation with shuffled data MacQueen's Algorithm 2

[code] [presentation]


C. Bauckhage. “NumPy/SciPy Recipes for Data Science: Computing Nearest Neighbors.”