k-Means Clustering Algorithms
k-means clustering is a method to partition \(n\) observations into \(k\) clusters, where \(k \ll n\). Each observation \(x\) belongs to the cluster \(C_i\) whose mean \(\mu_i\) is at least as close as any other cluster mean \(\mu_l\):
\[C_i = \{\, x \in X \mid \| x-\mu_i \|^2 \leq \| x-\mu_l \|^2 \ \ \forall\, l \in \{1,\dots,k\} \,\}\]
The objective is to minimize the sum of squared distances of the observations to their cluster means:
\[\operatorname*{argmin}_{\mu_1,\dots,\mu_k} E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j-\mu_i \|^2\]
Since \(E(k)\) has numerous local minima and minimizing it is NP-hard in general, no known algorithm guarantees a globally optimal solution; the algorithms below converge only to local minima.
Lloyd’s Algorithm
It is the most widely known algorithm for k-means clustering, and it is sometimes mistaken for the method itself. It alternates between assigning every observation to its nearest mean and recomputing each mean as the centroid of its cluster; a sketch is given after the list below. However, it:
- Is sensitive to the initialization of the means
- Promises nothing about the quality of the solution, and the number of iterations to run is hard to decide
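To make the two-step iteration concrete, here is a minimal NumPy sketch. The function name `lloyd_kmeans` and the initialization by sampling \(k\) observations are illustrative assumptions, not code from this post:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between an assignment step and an
    update step until the means stop moving (a local minimum of E(k))."""
    rng = np.random.default_rng(seed)
    # illustrative initialization: sample k distinct observations as means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: label each observation with its nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each mean becomes the centroid of its cluster;
        # an emptied cluster simply keeps its previous mean
        new_means = np.array([X[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```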
Hartigan’s Algorithm
- Converges quickly
- Sensitive to the initial random assignment of points to clusters (a sketch follows the animations below)
- First animation
- Second animation with different initialization
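Hartigan’s method works point by point: starting from a random assignment, it repeatedly moves a single observation to the cluster that lowers \(E(k)\) the most, where inserting \(x\) into cluster \(l\) with \(n_l\) points raises the objective by \(\frac{n_l}{n_l+1}\|x-\mu_l\|^2\). The sketch below is an illustrative implementation of that rule; the function name and the balanced random initialization are assumptions:

```python
import numpy as np

def hartigan_kmeans(X, k, max_sweeps=100, seed=0):
    """Hartigan's algorithm: repeatedly move single points to the cluster
    whose insertion cost n_l/(n_l+1) * ||x - mu_l||^2 is smallest."""
    rng = np.random.default_rng(seed)
    # illustrative initialization: a balanced random assignment
    labels = rng.permutation(np.arange(len(X)) % k)
    means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    counts = np.bincount(labels, minlength=k).astype(float)
    for _ in range(max_sweeps):
        changed = False
        for j, x in enumerate(X):
            i = labels[j]
            if counts[i] == 1:
                continue  # never empty a cluster entirely
            # take x out of its cluster, updating the mean incrementally
            means[i] = (counts[i] * means[i] - x) / (counts[i] - 1)
            counts[i] -= 1
            # insertion cost of x for every cluster l
            gain = counts / (counts + 1) * ((x - means) ** 2).sum(axis=1)
            best = int(gain.argmin())
            # put x into the cheapest cluster (possibly its old one)
            means[best] = (counts[best] * means[best] + x) / (counts[best] + 1)
            counts[best] += 1
            labels[j] = best
            if best != i:
                changed = True
        if not changed:
            break  # no point wants to move: a local minimum of E(k)
    return means, labels
```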
MacQueen’s Algorithm
- Convenient for data streams, since the means are updated incrementally as each point arrives (a sketch follows the animations below)
- Sensitive to the order of the stream and to the initialization of the means
- First animation
- Second animation with shuffled data
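MacQueen’s variant is single-pass: the first \(k\) points seed the means, and every later point is assigned to its nearest mean, which is then nudged toward it by \(\mu \leftarrow \mu + (x-\mu)/n\). A minimal sketch, assuming the stream is any iterable of vectors (`macqueen_kmeans` is an illustrative name):

```python
import numpy as np

def macqueen_kmeans(stream, k):
    """MacQueen's algorithm: a single pass in which the first k points
    seed the means and every later point nudges its nearest mean."""
    means, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if len(means) < k:
            means.append(x.copy())   # first k points initialize the means
            counts.append(1)
            continue
        # assign x to the nearest mean
        i = int(np.argmin([((x - m) ** 2).sum() for m in means]))
        counts[i] += 1
        # incremental update: mu <- mu + (x - mu) / n
        means[i] += (x - means[i]) / counts[i]
    return np.array(means)
```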
[code] [presentation]