We’ve now seen a number of different clustering algorithms, each of which will divide a data set into a number of subsets. This week, I want to ask the question: How do we know if answer that a clustering algorithm gives us is any good? As always, since most data sets will be in higher dimensional data spaces, we usually won’t be able to just look at the data set to see how separated the clusters are. For supervised learning, we can assess the quality of a model by splitting the data into a training set and an evaluation set. This will tell us, for example, if we are guilty of overfitting. For clustering, however, we often won’t even know if there should be observable clusters, let alone how many there should be. Recall that the goal of clustering is to find sets of data points that are much more densely packed than the data set as a whole. So we need a way to decide whether a given subset really is both internally dense and separated from the rest of the data set.
If we use a modeling method like K-means or Gaussian mixtures to define our clusters, this is relatively straightforward: The model assumes that each cluster of data points is sampled from a Gaussian blob and the model consists of the centroids of these blobs. If we’re using a Gaussian mixture, the model will also include the variance and covariance values for each blob. For K-means, we can calculate the variance by adding up the squares of the distances from a given centroid to each data point associated with it, then taking the square root.
The variance tells us, roughly speaking, the radius of each Gaussian blob. This is indicated by the length of the green line on the left in the Figure below. If we have two overlapping Gaussian blobs, i.e. blobs whose variance/radius values are much larger than the distance between their centroids, as on the right in the Figure, then when we sample points from them, it will look very similar to what we would get from a single elongated Gaussian blob. So, if the K-means algorithm gives us two centroids whose variances are much larger than the distances between them, then we would surmise, even without being able to directly look at/visualize the data, that the two Gaussian blobs should really be replaced by a single cluster.
So, this gives us a very simple quality measure for the clusters discovered by K-means: For each of the resulting centroids, we divide the variance of the points associated to it by the distance to the closest other centroid. If this number is very small, it indicates that the cluster is tightly packed relative to the rest of the data set, like we wanted. We can determine the quality of the overall set of clusters by either taking the average of these values over all the centroids, or taking the maximum value. Since the results of both K-means and Gaussian mixtures depend on both the number K of clusters and on the initial starting points, this value is often used to compare different sets of clusters and different numbers of clusters returned by the algorithm.
For example, one might run K-means 100 times for each value of K from 1 to 10, using randomly chosen initial centroids each time. That produces 1000 different possible sets of clusters. Since we only want a single set of clusters, we can measure the quality of each set of clusters as above and choose the one with the lowest value. Note that this gives us a fairly sound way of choosing the value of K, which we otherwise would have had to either guess, or choose based on domain knowledge, etc.
But if we’re interested in more general types of clusters than just Gaussian blobs, such as the ones shown on the left, things aren’t so simple. Since clusters may be elongated and curved, the centroid/center of mass of cluster may not be a good representative of the clusters. In fact, it’s even possible that the centroid of one cluster will be contained in another cluster, as with the cluster circled in blue. This example may seem far fetched, but most real world data sets aren’t made up of Gaussian blobs, so only looking at centroids and variances isn’t enough.
In other words, we need measures that do not assume the data fits a specific model. For example we might try to measure the density inside a set of clusters compared to the density between clusters. This is more or less what the DBSCAN algorithm does, except that as we noted in the previous post, DBSCAN requires setting a hard cutoff for the density, then calculates the clusters accordingly. What we’re looking for is a way of comparing the inside and outside densities for a given set of clusters.
The key to this problem is finding a meaningful baseline. For the variance/distance measure for the Gaussian blobs above, we are comparing two distances, so the number that we come up with will be comparable between data sets. If one data set has a lower value than the other, then we could legitimately say that it has more tightly packed clusters. In particular, if we scale the data set (i.e. multiply every vector by the same number) then this ratio won’t change. So, we want a similar measure for general clusters.
The normalized mean cut from last week gives us a good way of comparing clusters within a data set, but it does not give us an objective baseline to compare clusters between data sets. Essentially, this is because we’re comparing area (the cross section of the bottleneck) to volume (the number of vertices or edges on either side of the cut). For example, in the Figure below, each of the two graphs can be cut along three edges to form two clusters of seven vertices. So the best (normalized) cuts in both graphs have the same areas and volumes. However, while this picks out an obvious bottleneck in the graph on the left, the graph on the right doesn’t have a bottleneck at all. If we were to run spectral clustering on these graphs, and calculate the normalized mean cut values, we would get the same number for both of them, so it wouldn’t tell us whether or not there is a bottleneck.
One alternative measure that is much closer to the idea of comparing densities inside and outside a subset of a graph is what’s called the modularity. This idea is popular in network analysis, though it doesn’t seem to have filtered into data mining/machine learning quite as much. The idea is to compare the number of edges inside a cluster to the number of edges we would expect to find between similar vertices in a randomly constructed graph that has that same overall number of edges as our original graph. If there are a lot more edges in our cluster, that indicates the cluster is more densely packed than the rest of the graph. But the question is, how many edges would we expect to find between a similar collection of vertices in a similar graph with the same number of edges?
We can think of the expected number of edges as an expected value in the sense of probability: We consider all possible outcomes and average them, weighted by how likely each outcome is. Here, the possible outcomes are the possible graphs that we can get by connecting the vertices of our given graph in different ways. We’ll want to take into account the number of edges that initially go into each vertex, i.e. its valence, since this will make a big different in our expectation value.
So, given our initial graph, we’ll record the number of edges that go into each vertex. Then we’ll look at all the different ways of connecting the vertices by edges, so that each vertex ends up with the same number of edges going into it as it had originall. For each of these graphs, we’ll count the number of edges between the vertices of our original cluster. Then we’ll take the average of all these numbers and compare it to the actual number of edges in our cluster.
In practice, we don’t actually have to construct every possible graph in order to calculate the expected number of edges; there’s a relatively simple formula that will tell you the number we would get if we did form all possible graphs. This leads to a fairly simple formula for the modularity, which you can read about on its wikipedia page.
The important part is that modularity gives us a good measure of how well separated a given cluster is from the rest of the data set. It makes a very good analogue of the distance vs. variance measure for the clusters found by K-means, except that modularity will work on general clusters, rather than just Gaussian blobs. One thing to note is that modularity is measuring something closely related to the relative density of each cluster, rather than the bottlenecks that spectral clustering was looking for. Next week, I’ll introduce a way of measuring clusters (and an accompanying clustering algorithm) that gets much closer to the idea of measuring bottlenecks.