# k-Means Clustering Algorithms

*k*-means clustering is a method to partition \(n\) observation into \(k\) clusters where \(k \ll n\).
Each observation \(x\) belongs to a cluster \(C_i\) with the nearest mean \(\mu_i\) compared to other clusters \(\mu_l\):

The objective function is to minimize the sum of squared distances of each observation to its cluster mean.

\[\operatorname*{argmin}_{\mu_i,...,\mu_k} E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j-\mu_i \|^2\]Since \(E(k)\) has numerous local minima, there is no algorithm known today to guarantee an optimal solution.

## Lloyd’s algorithm

It is most widely known algorithm to solve *k*-means clustering
and sometimes it is mistaken for the method itself. However, it is:

- Sensitive to the initialization of the means
- Promise nothing about quality, hard to decide about number of iterations

## Hartigan’s Algorithm

- Converges quickly
- Sensitive to initial random class assignment of points.
- First animation
- Second animation with different initialization

## MacQueen’s Algorithm

- Convenient for streams
- Sensitive to the order of stream and the initialization of the means
- First animation
- Second animation with shuffled data

[code] [presentation]

**References**

C. Bauckhage. *“NumPy/SciPy Recipes for Data Science: Computing Nearest Neighbors.”*