In the last few weeks, we’ve looked at a number of different algorithms for dividing a data set into clusters, and a few ways of measuring how good the clusters found by such an algorithm are. This week, I’m going to introduce a new approach inspired by closely related problems in abstract geometry and topology. (So, this falls under Topological Data Analysis.) As a disclaimer, I should point out that while most of the material that I’ve covered so far on this blog is fairly mainstream and well established, today’s post will cover techniques that have come out of my own (recent) research, so they’re not mainstream (yet). In particular, I’m going to introduce a measure of cluster separation that gets closer to the idea of measuring bottlenecks, and an algorithm (called TILO/PRC) for finding clusters that score well under this metric.
When I introduced the notion of a bottleneck a few weeks back, I suggested the image of a highway that shrinks from many lanes down to just a few, forcing traffic to slow down. But we can also think of it in the more literal sense, as a part of a bottle that’s relatively thin compared to the rest of the bottle. An old-fashioned coke bottle, like the ones shown on the right, has a bottleneck near the top (i.e. at the neck of the bottle) as well as one near the bottom, where it gets slightly thinner just above the waist. If we wanted to describe the shape of the bottle to someone who had never seen one before, we might say that it looks like two blobs (one at the base and one in the middle) squished together, with a third very small blob added on at the top. Of course, this is not the only way to describe the shape, but it demonstrates how the two bottlenecks define a natural decomposition of the coke bottle.
We can think of a clustering algorithm as trying to do the same thing with a data set, i.e. to find the separations where it naturally decomposes into a small number of blobs. As always, the problem is that we can’t actually see the data set. So here’s the question: How could you find the two bottlenecks on a (high dimensional) coke bottle if you couldn’t actually see it?
As it turns out, a very similar problem comes up in a number of areas of pure mathematics, particularly in geometry and topology. As I mentioned a few posts back, the notion of normalized min cut in computer science is closely related to the Cheeger constant in geometry. In knot theory, one often wants to decide if a given knot is made up of a number of simpler knots, like the one on the right, through a construction called a connect sum, which creates a bottleneck very similar to ones in the coke bottle. This problem led to a topological technique called thin position for finding these decompositions. As it turns out, thin position translates very nicely into the realm of data analysis. Here’s the idea:
Let’s say that we can’t see our (high-dimensional) coke bottle, but we have a (high-dimensional) rubber band that we can slip over the coke bottle and measure how tightly it is stretched. How far the rubber band is stretched tells us how large the corresponding cross section of the bottle is. We want to pull the rubber band onto the coke bottle, then across from one side to the other. Two ways to do this are shown in the first Figure above: We can stretch the rubber band the long way across the bottle, as on the left, then move it from left to right. Or we can stretch it the narrow way over the top of the bottle, then down to the bottom, as on the right. Since the rubber band wants to stay as small as possible, the second of these will be easier, and is probably the way most people would do it.
So let’s say we pulled the rubber band across the bottle in the way that stretches it as little as possible, and kept track of how far it was stretched at each point in time. Then the bottleneck near the bottom would correspond to a moment where the rubber band relaxes (gets smaller) temporarily, then stretches out again. So by reading the tension on the rubber band, we can identify the bottlenecks without being able to actually see the bottle: They’re the places where the cross sections get smaller, then larger again. (Each one of these times is called a local minimum.)
We can do the same thing with a data set, or rather with a similarity graph derived from a data set: Instead of pulling a rubber band across the data set, we will number the vertices from 1 to N. You can think of these as instructions for which vertex to pull the rubber band over first, then second and so on. The amount of tension on the rubber band, i.e. the cross section after we slide it over the first K vertices, is the number of edges that go from any of the first K vertices to any of the remaining vertices. In other words, if we slid the rubber band over the graph so that some of the vertices were on one side and some were on the other side, this is the number of edges that have to go through the rubber band. This number is called the width at the given position. (The term thin position comes from the fact that we want to make this width as small as possible.)
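To make the width computation concrete, here’s a minimal sketch in Python. The edge-list representation and the function name are my own choices for illustration, not anything from the paper:

```python
# Minimal sketch: the width of a vertex ordering at position k is the number
# of edges running from the first k vertices to the remaining ones.

def width(edges, order, k):
    """Count edges crossing from the first k vertices to the rest."""
    first = set(order[:k])
    return sum(1 for u, v in edges if (u in first) != (v in first))

# Toy graph: two triangles joined by a single edge, i.e. one clear bottleneck.
edges = [(0, 1), (1, 2), (0, 2),   # first triangle
         (3, 4), (4, 5), (3, 5),   # second triangle
         (2, 3)]                   # the bridge (the bottleneck)
order = [0, 1, 2, 3, 4, 5]
print([width(edges, order, k) for k in range(1, 6)])  # → [2, 2, 1, 2, 2]
```

The dip to 1 in the middle of the width sequence is exactly the bridge edge: sliding the rubber band over the first three vertices cuts only the one edge joining the two triangles.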
If we look at all the widths together, we’ll get an idea of how efficient the given numbering of the graph is. If they’re really high, it means that we chose a bad numbering, the equivalent of pulling the rubber band the long way, like on the left in the first Figure. If we can find a numbering where the widths are lower, then it means that we’ve found a more efficient way, like going from top to bottom on the coke bottle as on the right in the Figure.
A simple example with two different numberings is shown in the Figure below. On each side, the top picture shows the numberings on the vertices and the lower pictures show what the graph would look like if we laid it out so that the vertices appear in this order along a horizontal line. The widths, which correspond to the number of edges crossed by the vertical red lines, are labeled below. As you can see, the ordering on the left has a higher value for the third width than the one on the right (4 instead of 1) but the rest are the same. So we would pick the numbering on the right rather than the one on the left.
In fact, the third width on the right is lower than the widths just before and after it, i.e. a local minimum. So this is like the tension on the rubber band going down, then back up again. Moreover, you can see that the middle position on the right separates the two triangles, which form natural clusters, from each other. As it turns out, this same phenomenon occurs in general: If you find an efficient numbering of the vertices of a graph, then the places where the width decreases then increases will form clusters that satisfy certain very strong theoretical properties that are independent of the ordering. (I call them pinch clusters.)
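Once the widths are written down, spotting those dips is mechanical. Here’s a hypothetical little helper (my own, purely for illustration) that picks out the local minima of a width sequence:

```python
def local_minima(widths):
    """Indices where the width decreases and then increases again."""
    return [i for i in range(1, len(widths) - 1)
            if widths[i - 1] > widths[i] < widths[i + 1]]

# Width sequence for two triangles joined by one edge:
print(local_minima([2, 2, 1, 2, 2]))  # → [2]
```

Each index returned is a place where we could cut the ordering to separate the graph into candidate clusters.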
Now, as with the normalized min cut and many of the other cluster measures, we still have the problem that we can’t try every single possible numbering of the vertices, since the number of orderings increases too quickly as the number of vertices increases, and there isn’t enough computing power in the world to do this for all but the smallest examples. Instead, we need a way to start with any numbering and try to improve it little by little. For example, if we started out by stretching the rubber band the long way as on the left, we might notice that we could do a little better by moving diagonally instead of horizontally. Then we could slowly rotate the path until we got to the ideal up-down one.
The TILO algorithm does something along these lines, using an idea that comes straight out of the knot theory setting. (TILO stands for Topologically Intrinsic Lexicographic Ordering. I probably need more practice making up acronyms – that was the best I could come up with.) The algorithm starts with a possibly random numbering of the vertices, then looks for simple ways to shift vertices forward and backward in the ordering in order to improve the overall width. The surprising thing is that if you want to find the smallest bottlenecks, you actually need to focus on the places where the width is fattest, and look for ways to reduce those. You can find the details of exactly how the algorithm determines which shifts to make in a paper I wrote in 2012.
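To give a feel for this, here’s a deliberately simplified hill-climbing sketch in the spirit of TILO. This is NOT the actual algorithm from the paper (which chooses its shifts much more cleverly); it just illustrates the idea of comparing width sequences lexicographically with the fattest widths first, so that reducing the fattest positions is what counts as progress:

```python
# Simplified local search (not the real TILO): try moving one vertex at a
# time to a new position, and keep the move if the widths, sorted from
# fattest to thinnest, improve lexicographically.

def width_sequence(edges, order):
    widths, first = [], set()
    for v in order[:-1]:
        first.add(v)
        widths.append(sum(1 for u, w in edges if (u in first) != (w in first)))
    return widths

def score(edges, order):
    # Fattest widths first, so lexicographic comparison thins the fat spots.
    return sorted(width_sequence(edges, order), reverse=True)

def improve_ordering(edges, order):
    order, best = list(order), score(edges, order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            for j in range(len(order)):
                cand = list(order)
                cand.insert(j, cand.pop(i))
                if score(edges, cand) < best:
                    order, best = cand, score(edges, cand)
                    improved = True
    return order

# Toy example: two triangles joined by one edge, from a bad starting ordering.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
start = [0, 3, 1, 4, 2, 5]
better = improve_ordering(edges, start)
```

The search terminates because each accepted move strictly lowers the sorted width sequence, and there are only finitely many orderings; the real algorithm avoids this brute-force scan over all single-vertex moves.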
You can also find source code, binaries and a paper describing some experiments with specific data sets on my collaborator Doug Heisterkamp’s page. What we’ve found is that in many cases, this algorithm does better than the standard clustering algorithms. As far as run time goes, it seems to be comparable to spectral clustering – the actual run time depends on the number of shifts you have to make. We don’t have a good bound on this number, but in practice it seems to be proportional to the number of data points squared, and I would conjecture that this could be proved theoretically.
But one benefit that TILO has over many other clustering algorithms is that it comes with a built-in measure of how well separated the resulting clusters are. Namely, if you compare the local minimum width (where the widths before and after are both higher) to the maximal widths on either side, you get a rough idea of how narrow the bottleneck between two clusters is compared to the rest of the graph. We call the width between two clusters divided by the smaller of the maximal widths in each cluster the pinch ratio. A smaller ratio suggests that the clusters are better separated, and because this ratio compares areas to areas (unlike the normalized min cut, which compares areas to volumes), the ratio can be compared between different data sets/graphs. The PRC part of the algorithm (which stands for Pinch Ratio Clustering) picks out the local minima with the lowest pinch ratio, and cuts at these places to find the final clusters.
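As a rough sketch of that ratio (my own simplified reading of the definition above, with names invented for illustration):

```python
def pinch_ratio(widths, i):
    """Pinch ratio at a local minimum i of a width sequence: the width there
    divided by the smaller of the maximum widths on either side of it."""
    return widths[i] / min(max(widths[:i + 1]), max(widths[i:]))

# Two triangles joined by one edge: bottleneck width 1 against maxima of 2.
print(pinch_ratio([2, 2, 1, 2, 2], 2))  # → 0.5
```

A ratio near 0 means the bottleneck is tiny compared to the bulk of each cluster; a ratio near 1 means the “bottleneck” is barely thinner than its surroundings, so the split is weak.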
As I said at the beginning, this algorithm is still relatively new, and is only the beginning of an ongoing research project. But I think it demonstrates a great deal of potential for how ideas from abstract geometry and topology can be used to not only explain existing methods in data mining, but also to inspire powerful new approaches and algorithms.