 #### Can Güney Aksakalli

Notes from my journey

k-means clustering is a method to partition $$n$$ observation into $$k$$ clusters where $$k \ll n$$. Each observation $$x$$ belongs to a cluster $$C_i$$ with the nearest mean $$\mu_i$$ compared to other clusters $$\mu_l$$:

$C_i = \{ x \in X | \| x-\mu_i \|^2 \leq \| x-\mu_l \|^2 \}$

The objective function is to minimize the sum of squared distances of each observation to its cluster mean.

$\operatorname*{argmin}_{\mu_i,...,\mu_k} E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j-\mu_i \|^2$

Since $$E(k)$$ has numerous local minima, there is no algorithm known today to guarantee an optimal solution.

## Lloyd’s algorithm

It is most widely known algorithm to solve k-means clustering and sometimes it is mistaken for the method itself. However, it is:

• Sensitive to the initialization of the means
• Promise nothing about quality, hard to decide about number of iterations

## Hartigan’s Algorithm

• Converges quickly
• Sensitive to initial random class assignment of points.
• First animation • Second animation with different initialization ## MacQueen’s Algorithm

• Convenient for streams
• Sensitive to the order of stream and the initialization of the means
• First animation • Second animation with shuffled data [code] [presentation]

References

C. Bauckhage. “NumPy/SciPy Recipes for Data Science: Computing Nearest Neighbors.”