k-Means Clustering Algorithms
k-means clustering is a method to partition \(n\) observations into \(k\) clusters, where \(k \ll n\). Each observation \(x\) belongs to the cluster \(C_i\) whose mean \(\mu_i\) is at least as close as any other cluster mean \(\mu_l\):
\[C_i = \{\, x \in X \mid \| x-\mu_i \|^2 \leq \| x-\mu_l \|^2 \ \ \forall\, l \in \{1,\dots,k\} \,\}\]
The objective is to minimize the sum of squared distances of the observations to their cluster means:
\[\operatorname*{argmin}_{\mu_1,\dots,\mu_k} E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j-\mu_i \|^2\]
Since \(E(k)\) has numerous local minima and minimizing it is NP-hard in general, no known algorithm guarantees a globally optimal solution; the algorithms below converge only to local minima.
Lloyd’s Algorithm
It is the most widely known algorithm for k-means clustering, and it is sometimes mistaken for the method itself. It alternates between assigning every observation to its nearest mean and recomputing each mean as the centroid of its cluster; a sketch is given after the list below. However, it:
- Is sensitive to the initialization of the means
- Promises nothing about the quality of the solution, and the number of iterations to run is hard to decide
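To make the two-step iteration concrete, here is a minimal NumPy sketch. The function name `lloyd_kmeans` and the initialization by sampling \(k\) observations are illustrative assumptions, not code from this post:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between an assignment step and an
    update step until the means stop moving (a local minimum of E(k))."""
    rng = np.random.default_rng(seed)
    # illustrative initialization: sample k distinct observations as means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: label each observation with its nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each mean becomes the centroid of its cluster;
        # an emptied cluster simply keeps its previous mean
        new_means = np.array([X[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```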
Hartigan’s Algorithm
- Converges quickly
- Sensitive to the initial random assignment of points to clusters (a sketch follows the animations below)
- First animation
- Second animation with different initialization
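Hartigan’s method works point by point: starting from a random assignment, it repeatedly moves a single observation to the cluster that lowers \(E(k)\) the most, where inserting \(x\) into cluster \(l\) with \(n_l\) points raises the objective by \(\frac{n_l}{n_l+1}\|x-\mu_l\|^2\). The sketch below is an illustrative implementation of that rule; the function name and the balanced random initialization are assumptions:

```python
import numpy as np

def hartigan_kmeans(X, k, max_sweeps=100, seed=0):
    """Hartigan's algorithm: repeatedly move single points to the cluster
    whose insertion cost n_l/(n_l+1) * ||x - mu_l||^2 is smallest."""
    rng = np.random.default_rng(seed)
    # illustrative initialization: a balanced random assignment
    labels = rng.permutation(np.arange(len(X)) % k)
    means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    counts = np.bincount(labels, minlength=k).astype(float)
    for _ in range(max_sweeps):
        changed = False
        for j, x in enumerate(X):
            i = labels[j]
            if counts[i] == 1:
                continue  # never empty a cluster entirely
            # take x out of its cluster, updating the mean incrementally
            means[i] = (counts[i] * means[i] - x) / (counts[i] - 1)
            counts[i] -= 1
            # insertion cost of x for every cluster l
            gain = counts / (counts + 1) * ((x - means) ** 2).sum(axis=1)
            best = int(gain.argmin())
            # put x into the cheapest cluster (possibly its old one)
            means[best] = (counts[best] * means[best] + x) / (counts[best] + 1)
            counts[best] += 1
            labels[j] = best
            if best != i:
                changed = True
        if not changed:
            break  # no point wants to move: a local minimum of E(k)
    return means, labels
```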
MacQueen’s Algorithm
- Convenient for data streams, since the means are updated incrementally as each point arrives (a sketch follows the animations below)
- Sensitive to the order of the stream and to the initialization of the means
- First animation
- Second animation with shuffled data
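MacQueen’s variant is single-pass: the first \(k\) points seed the means, and every later point is assigned to its nearest mean, which is then nudged toward it by \(\mu \leftarrow \mu + (x-\mu)/n\). A minimal sketch, assuming the stream is any iterable of vectors (`macqueen_kmeans` is an illustrative name):

```python
import numpy as np

def macqueen_kmeans(stream, k):
    """MacQueen's algorithm: a single pass in which the first k points
    seed the means and every later point nudges its nearest mean."""
    means, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if len(means) < k:
            means.append(x.copy())   # first k points initialize the means
            counts.append(1)
            continue
        # assign x to the nearest mean
        i = int(np.argmin([((x - m) ** 2).sum() for m in means]))
        counts[i] += 1
        # incremental update: mu <- mu + (x - mu) / n
        means[i] += (x - means[i]) / counts[i]
    return np.array(means)
```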
[code] [presentation]