In last week’s post, I described the DBSCAN clustering algorithm, which uses the notion of density to determine which data points in a data set form tightly packed groups called clusters. This algorithm relies on two parameters – a distance value d and a density value K – and we noted that we can think of these parameters as determining what “scale” we want to look at the data at. For larger values of d, we can think of DBSCAN as looking at the data set from far away so that points that are relatively close to each other all blur together. For small values of d, we can think of the algorithm as looking closely at the data so that all the gaps between data points look really big (relative to d). In this situation, seeing more detail can be a problem, since we don’t want to miss the forest for the trees. In order for DBSCAN to give us a meaningful collection of clusters, we need to choose these parameters carefully. Ideally, we would want to look at any given data set across a range of scales and this is a common theme in the field known as topological data analysis. In this post, I’ll describe an algorithm call Mapper that comes from this perspective, and which is behind the software developed by the topological data analysis startup Ayasdi.
We’ll start off by returning to the basic clustering algorithm I introduced last week: Recall that this algorithm works by starting with a parameter d and forming a graph with edges between any two data points whose distance apart is less than d. The clusters returned by the algorithm are the connected components of this graph. These clusters can change drastically if we give it a different value. But if there are clusters that show up for many different values, this would tell us that whatever structure was defining these clusters could be seen at a lot of different scales, and was thus significant. Of course, we can’t try every single value of d (since there are infinitely many of them) but the basic clustering algorithm is fast enough that we can pick a (finite) number of different values and try them all.
The question is, can we organize the results in a way so that we can understand how the different clusters are related? To see how to do this, we have to consider what happens when we increase or decrease d. Recall that the basic clustering algorithm implicitly creates a distribution representing the data set, by placing a ball of radius d around each data point, as in the Figure below. (This is shown in 2D. As always, this is really happening with higher-dimensional balls in a higher dimensional space.) The clusters are then defined by the connected pieces of this distribution. If we increase d then these balls will get larger. Balls that already overlap will continue to overlap, but some balls that were separated might bump into each other. If this happens then two clusters defined by the smaller value of d will become a single cluster for the larger value. In the Figure, this happens twice: First when the two single points on the left join together from step one to step two. Second, when the two large clusters combine between step three and step four.
In other words, as we turn the dial that controls d, the balls increase in size, so the clouds of foam that they create gets bigger and bigger. When two clouds bump into each other, they merge together. So as d increases, each cluster gets bigger and the number of clusters gets smaller. In order to organize these changes (without being able to actually see what’s happening in higher dimensions) we use what’s called a dendrogram.
First, we pick the values of d that define the different scales that we want look at. For simplicity, lets say we’ve picked the values 1, 2, 3, 4 and 5. We’ll mark these values along the y-axis. Next, we’ll calculate the clusters at the first value of d by drawing edges between any two data points that are distance at most 1 apart. For each of the resulting clusters, we’ll draw a point in our diagram at height 1, but we’ll space them out left to right so we can see them all. Then we’ll do the same thing for d = 2, drawing the new points at height 2 and so on.
For small values of d, there will be a lot of points along the horizontal line, while for larger values of d, there will be only a few. In fact, if we pick d to be large enough so that all the data points are connected, there will be just one. As noted above, each cluster for a given value of d is contained in a cluster for the next highest value; It may be the entire cluster at the next step, or it may have merged with some other clusters. Either way, we will connect each point corresponding to a cluster to each point at the next level corresponding to the cluster that contains it. So each point at height 1 will be connected to some point at height 2. Each point at height 2 will be connected to exactly one point at height 3, but may be connected to more than one point at height 1. The dendrogram that comes from the data set and the values of d in the first Figure is shown on the right. The three points on the bottom horizontal line correspond to the three clusters on the left of the first Figure and so on.
This tree captures the basic information about how the clusters merge as we increase d. In particular, if we have a very long line of edges without any branches, as on the right in the dendrogram above, then this corresponds to a cluster that doesn’t change as we increase d, so as mentioned above, this probably means this cluster defines a significant feature of the data’s geometric structure.
However, as I mentioned last time, the basic clustering algorithm isn’t used very much in practice because it only works well on very homogeneous data (where the density is relatively consistent). The DBSCAN algorithm is a generalization of the basic algorithm that is much more common. But it relies on two parameters instead of one: a distance d and a density K. It would be nice if we could make something like a dendrogram for DBSCAN, that would allow us to understand how clusters change when we vary two parameters instead of just one.
This is the idea behind the mapper algorithm. Mapper is not as commonly used as most of the algorithms I’ve discussed in previous posts, so it hasn’t been implemented in the standard data analysis libraries like R, SciKit and ELKI. But there is a very nice open source implementation in python written by Daniel Müllner and Aravindakshan Babu. As mentioned above, this is also the algorithm behind the (proprietary) software developed by the topological data analysis startup Ayasdi.
In order to understand how mapper works, first note that each edge in the dendrogram we constructed above corresponds to a particular range of lengths in our original data set: Each edge from height 1 to height 2 corresponds to a number of edges that were added when we changed d from 1 to 2, so all these edges have length between 1 and 2. We can therefore construct something very similar to the original dendrogram as follows: First attach all edges of length at most 1 between the data points, calculate the clusters and draw the points on the diagram along the horizontal line at height 1. Now throw away all those edges between the data points and draw in the edges of length between 1 and 2. These will give you new clusters that are much smaller than the old height-two clusters. In fact, there may be a lot more of them as well. But nevertheless, we’ll draw those along the horizontal line at height 1. Then, draw a line from each point at height 2 to each point at height 1 whenever the corresponding clusters share a data point.
If we choose the values of d too close together, the resulting graph may look quite ugly, with too many points at each horizontal level. In practice, though, it seems that in most cases one can choose them so that you get back the original histogram. But the real fun starts when we add a second parameter such as density. There are a number of different ways of calculating the “density” of a data point (or rather of the data set around that point). To keep things simple, you can think of the density as being one divided by the distance to the 10th closest data point (where 10 was chosen arbitrarily). So the farther away the 10 closest neighbors are, the lower the density.
We’ll also choose a number of values for the density parameter, say 1, 1/2, 1/3, 1/4 and 1/5. We can then form a graph much like the second form of the dendrogram above, but using both parameters: First draw an edge between any two points with density at least 1 that are distance less than 1 apart, and draw a point for each cluster. This time, we won’t insist on putting them on a particular horizontal line. Then erase the edges that we just drew between data points and draw in edges between any two vertices with density at least one that are between distance 1 and 2 apart. Draw points for the resulting components and connect them to the points corresponding to the previous round of components whenever two have a data point in common. We’ll then erase the second round of edges from the data points and repeat the process with edges between points density between 1 and 1/2 that are distance less than one apart. (Note how this third round is different from the second round.)
This is getting a bit complicated, but hopefully you see the pattern: We form the components for all possible pairs of ranges for the two parameters and then connect the ones that both have points in common AND have overlapping parameter ranges. The result is a graph that looks kind of like a dendrogram, though it may no longer be a tree. Because this graph takes into account both the distances between points and the densities of points, this graph will usually capture much more information about the data set. In particular, it is more likely to see the less dense part of the data set that both DBSCAN and the basic clustering algorithm miss. Note that mapper doesn’t give you particular clusters the way that DBSCAN doesn, but it does give you a nice visualization of the data that includes cluster-like information.