K-means

The subject of this weeks post is probably one of the most polarizing algorithms in the data world: It seems that most experts either swear by K-means or absolutely hate it. The difference of opinion boils down to one of the fundamental questions in data analysis: Is it more important for algorithms to be based on a firm theoretical foundation, to work in a wide range of settings and to have provable accuracy guarantees, or is it more important for algorithms to be fast, simple and to work pretty well in the situations where you need them? K-means definitely fits the latter of these descriptions, so as a mathematician I was initially one of the haters. However, I’ve recently come to terms with K-means and, as I’ll describe below, my current view is that K-means can actually be very effective when used in certain ways that it wasn’t necessarily designed for.

The basic idea behind the K-means algorithm is essentially a very restricted mixture model: We select a number K and create a mixture model with K simple models, each of which is a Gaussian blob. However, unlike in general mixture models where we allow each Gaussian blob to be a different size and to be stretched in different directions, as on the left in the Figure below, K-means requires that each blob be a fixed size and completely symmetrical (so no stretching), as on the right. You can imagine the sorts of complaints that this would face: Most data sets aren’t made up of symmetric Gaussian blobs of the same size. In fact, unless you miraculously scale your data perfectly, you’re pretty much guaranteed that it won’t be. However, if the data is made up of well separated clumps, even if those clumps aren’t perfectly round and all the same size, K-means will often (though not always) be able to find them. (These “clumps” are often called clusters, but I’m going to leave a general discussion of clusters for a later post.)

So when you use K-means as a substitute for a mixture, you’re taking a risk that the final answer will be pretty lousy. The trade-off for taking this risk is that the algorithm is much much faster than training a general Gaussian mixture model. K-means still follows the two-step Expectation Maximization outline of a mixture model, but each step has been optimized to the point that you can barely tell where it came from.

First, note that since each of the Gaussian blobs in our mixture model is the same exact size and shape, the only information we need to record about each blob is its center. So our “model” will consist of K points spread throughout the data space. Recall that the first step in the Expectation Maximization algorithm is to assign each data point to one of the simple models (i.e. one of the Gaussian blobs) based on which one assigns it the highest probability. The probability that one of the Gaussian blobs assigns to a given point goes down as we get farther away, and all the blobs are identical. Thus each data point will simply be assigned to whichever center point is closest to it. That’s step one.

To understand what’s happening here, consider the data points shown in black in the figure below. We have a smaller number of blob centers, shown in blue. When we assign each data point to its closest blob center, this should remind you of the nearest neighbor algorithm. In the post on nearest neighbors, we noted that the set of all points that were closer to a given data point than to any other data point form what’s called a Voronoi cell around each data point. For K-means, we’ll get something very similar, except that the Voronoi cells are around the blob centers rather than data points. The Voronoi cells are outlined in blue on the left of the Figure. Each data point will be in the Voronoi cell around one of the blob centers, and we will assign it to this blob. For step two, we need to adjust each of the Gaussian blobs to maximize the probability of all the points assigned to it. In general mixture models, this involves repeatedly adjusting the parameters using the notion of a gradient. However, if we’re not allowed to stretch the Gaussian blobs then we can immediately figure out exactly where to put them: We’ll want the point where we get the smallest possible number when we add up the distances squared from the center point to each of the data points assigned to it, i.e. all the data points in a given Voronoi cell.

In other words, if we take all the data points in one of the Voronoi cells and think of these data points as heavy balls connected by bars that hold them in place, we will want to move the center point of the Gaussian blob to the center of gravity of this structure. This center of gravity is called the centroid. As it turns out, the center of gravity is straightforward to calculate – you just take the average of the data values in each dimension. So for step two we move each Gaussian blob so that it is centered at the centroid of the points that were assigned to it in the first step. These centroids are shown in red in the center Figure.

Then we repeat – We go back through the data points and reassign them based on the new center points of the Gaussian blobs. (In other words, we form new Voronoi cells around the red points.) Then we calculate the centroids of each of the new subsets of data points (the points in each new Voronoi cell), then reassign the data points and so on.

After we repeat these two steps enough times, we’ll get to a point where step one assigns the data points to centers in exactly the same ways as the previous time that we ran it. So there won’t be a need to calculate the centroids again (since we’ll just get the ones we already have) and the K-means algorithm will terminate. In the Figure above, these would be something like the blue points shown on the right. It’s also common to run through the two steps a fixed number of times, rather than running it all the way to the point where the centroids stop moving, hoping that this will give you a reasonably good approximation of the final answer.

If the data happens to be separated into a certain number of well defined, tightly packed clumps/clusters and you happen to select K to be equal to this number then there’s a reasonable chance that the blob centers that the algorithm ends with will be the centers of these clumps/clusters. As always, we usually won’t be able to directly “see” whether or not this is the case, since most data has more than three dimensions. However, there are a number of measures that have been devised to measure how good K-means has done.

The most common one is to compare the distances between the final center points to the average distance from each data point to the center that it is assigned to. If the first of these numbers is much higher than the second then that means that K-means has found well separated clusters. Since the quality of the final answer depends heavily on picking the right value of K, this is often used to decide which K to use: Simply run K-means with different values of K and choose the one in which the difference between the two numbers is as large as possible.

Of course, this doesn’t solve the problem of what to do when our data set is not composed of a small number of well separated clumps. One of the main criticisms of K-means is that many (perhaps most) interesting data sets tend to have more complicated structures, so it would be naive to try using K-means on them. But like I said at the beginning of this post, I’ve now come to like K-means, nonetheless, by looking at it from a different perspective. In particular, I want to suggest using K-means for an off-label usage very similar to a technique from signal processing called vector quantization. This use of K-means has been suggested by others in the past, but doesn’t seem to be widespread. (It seems to be better known within the signal processing community, since that’s where the idea of vector quantization is well understood, but not as much within machine learning and data mining.)

Notice that the data set shown in the Figure above is not made up of a small number of round Gaussian blobs. In fact, it isn’t even made up of stretched Gaussian blobs, since one of clusters/clumps of points forms a curved shape. But you can see that the final blob centers found by K-means still reflect the structure of the data set, just in a more subtle way: We end up with three blob centers within the round cluster in the lower half of the data set and the rest of the blob centers forming an arc that follows the curved cluster.

To understand why this happens, note that each step in the K-means algorithm reduces the average distance squared from the data points to the blob centers. In other words, we can think of K-means as trying to place the blob centers so that each data point is as close as possible to some blob center. This has the effect of spreading the blob centers evenly around the data set, so the “shape” of the set of blob centers will be very similar to the overall shape of the data set.

So the blob centers found by K-means are very similar to a random sample of the original data set. However, there are two important differences: First, unlike a random sample, the blob centers are not in the original data set. One slight modification would be to take a sample of the original data set by choosing the data point closest to each blob center. Second, the blob centers are not random (though they do depend on the initial choice of centers, which are often chosen randomly.)

The blob centers have two important advantages over the original data: First, there are fewer of them, allowing one to run more sophisticated algorithms in a reasonable amount of time. Since K-means on its own is very fast, this would be a good trade-off. Second, the set of center points are less random than the original data set. In data analysis it’s common to think of randomness as being a good thing: We want random samples of our data to make sure it’s evenly distributed, and of course the field of Statistics is devoted to understanding all the facets of randomness. However, from geometric perspective the problem with random samples is that by their nature, they contain large gaps that can be mistaken for geometric features. This is a fundamental principle of randomness, and in fact you should be suspicious of any “random sample” that does not have both many large gaps between points and many points that are very close together.

One idea that has come out of numerical analysis (a field within mathematics) is that a homogeneous sample is often better than a random sample for large scale simulations. In other words, it’s often good to have data points that are “evenly” spread out within the probability distribution, but with more concentration in the darker parts of the density function. If we choose the number K of blob centers to be relatively large (though still much smaller than the number of data points) then the final set of points that K-means gives us should have this property: It will be relatively evenly spread out, but still have the rough shape of the original data set. So the set of blob centers can be used for the types of more computationally expensive forms of exploratory data analysis (which we’ll discuss in upcoming posts) whose goal is to understand the overall structure of the data well enough to decide what types of models and what types of kernels to use.

Another possible use for the blob centers is to use them as the anchor points for a Gaussian kernel. Recall from last week’s post that a Gaussian kernel is defined by choosing a small number of points in the data space. We want these points to be fairly evenly spread out, but close to the original data set. If we use all the original data points for make the Gaussian kernel then we are more or less guaranteed that our model will suffer from over-fitting. Thus the blob centers will often be a much better option for defining a Gaussian kernel.

So, that’s my take in K-means: If you think of it as a clustering algorithm, or as a substitute for a Gaussian mixture model, then you may be asking for trouble. However, I think that K-means has a great deal of unexplored potential as a quick way to reduce the data set to a relatively small number of representatives for us in exploratory data analysis or forming a Gaussian kernel.