# K-Means Clustering

Given a set of input points $$x_i$$, K-means clustering finds a set $$C$$ of cluster centers that minimizes the inertia, defined as

$$ \sum_{i=1}^{n} \min_{\mu_j \in C} \left(||x_i - \mu_j||^2\right) $$

![kmeans](kmeans.png)

The **K-means** algorithm, also known as **Lloyd's algorithm**, is an iterative approach in which each iteration assigns every input point to its closest cluster center, then recomputes each center as the mean of the points assigned to it. It runs full-batch: every iteration passes over the full set of inputs to compute a new set of cluster centers. Convergence is typically fast in terms of the number of iterations, but the algorithm can get stuck in local minima.

**[Mini-batch K-means](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf)** is a stochastic variant of Lloyd's algorithm that updates the cluster centers using a mini-batch of data at each step. It converges much faster than the full-batch version, especially on large data sets, and the quality of its clusters is typically close to that of the full-batch version.

**[K-Means++](http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf)** is a method for choosing the initial cluster centers that is expensive but typically less prone to getting stuck in bad local minima.

**[k-MC2](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12147/11759)** is a very fast seeding method that produces centers of quality comparable to K-Means++ seeding. k-MC2 works particularly well when combined with mini-batch K-means.

We provide distributed implementations of both the full-batch and mini-batch K-means algorithms. K-Means++, k-MC2, and random initialization are supported. The user can also choose between **Cosine** and **Squared Euclidean** distance metrics.
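
To make the assignment/update loop concrete, here is a minimal NumPy sketch of full-batch K-means (Lloyd's algorithm) with K-Means++ seeding, using the squared Euclidean distance metric. This is only an illustration of the algorithms described above, not the distributed implementation itself; the function names `kmeans_pp_init` and `lloyd` are hypothetical.

```python
import numpy as np


def kmeans_pp_init(x, k, rng):
    """Choose k initial centers with K-Means++ seeding (hypothetical helper)."""
    n = x.shape[0]
    centers = [x[rng.integers(n)]]  # first center: chosen uniformly at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen center.
        d2 = np.min(
            ((x[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1)
        # Sample the next center with probability proportional to that distance.
        centers.append(x[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)


def lloyd(x, k, iters=100, seed=0):
    """Full-batch K-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(x, k, rng)
    for _ in range(iters):
        # Assignment step: index of the closest center for every point.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            x[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):  # assignments have stabilized
            break
        centers = new_centers
    # Inertia: sum of squared distances to the nearest final center.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return centers, d2.argmin(axis=1), d2.min(axis=1).sum()
```

For example, `lloyd(x, k=2)` on two well-separated Gaussian blobs recovers one center per blob; rerunning with different seeds and keeping the lowest-inertia result is a common way to mitigate the local-minima issue noted above.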
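The mini-batch variant replaces the full mean-update with a small gradient-like step per batch. The sketch below follows the per-center learning-rate update from the Sculley paper linked above (each center moves toward a batch point with step size 1 over the number of points that center has absorbed so far); the function name `minibatch_step` and its in-place API are assumptions, not the distributed implementation's interface.

```python
def minibatch_step(x_batch, centers, counts):
    """One mini-batch K-means update; mutates centers and counts in place.

    counts must start as np.zeros(k) and persist across calls.
    """
    # Assign each batch point to its closest center (squared Euclidean).
    d2 = ((x_batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    for xi, j in zip(x_batch, assign):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate decays over time
        centers[j] = (1.0 - eta) * centers[j] + eta * xi
    return centers, counts
```

Calling `minibatch_step` on successive random batches drawn from the data set converges in far fewer passes over the data than the full-batch loop, which is the speedup the section above describes.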
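As a final note on the distance metrics: with the **Cosine** metric, a common approach (assumed here, not a statement about this library's internals) is to L2-normalize points and centers so that minimizing cosine distance reduces to maximizing a dot product, as in this small sketch:

```python
def cosine_assign(x, centers):
    """Assign points to centers under cosine distance (illustrative only)."""
    # Cosine distance = 1 - cosine similarity, so after row-normalization
    # the nearest-center search becomes an argmax over dot products.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return (xn @ cn.T).argmax(axis=1)
```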