Spectral clustering

In the last few posts, we’ve been studying clustering, i.e. algorithms that try to cut a given data set into a number of smaller, more tightly packed subsets, each of which might represent a different phenomenon or a different type of data point. The algorithms in the last two posts focused on using the “density” of a data set to construct a graph in which each connected component was a single cluster. This method is very effective for data sets in which the clusters are sufficiently separated, and relies on picking one or more parameters that determine the “scale” at which to examine the data. However, there are cases where it is not possible to arrange things so that clusters of the data set naturally produce separate components of a graph. So this week we’ll look at ways of finding clusters in a connected graph. The main algorithm that we’ll look at, spectral clustering, comes from a somewhat surprising connection to physics, and while it may sound intimidating at first, we’ll see that it’s really not so bad.

But first, let's take a minute to think about what a cluster would look like in a graph. Remember that we want to think of the graph as representing a probability distribution made up of a number of different connected shapes, as in the Figure below. Here there are clearly at least three different clusters, though the lower two overlap slightly. We get the data set in the middle by randomly picking points, such that the points in the darker parts of the probability distribution are more likely to get picked. Then we form the graph on the right by connecting nearby points by edges, without knowing what the shapes that made up the distribution were. If these shapes overlap slightly, as the bottom two clusters do, then we will be forced to connect points from the different clusters and the graph that we get will be connected.

[Figure: clusters — a probability distribution made of three overlapping shapes (left), points sampled from it (middle), and the graph formed by connecting nearby points (right)]

However, as in the picture, the places where the clusters overlap will often be thinner than the middle of the clusters. In other words, these spots are bottlenecks: If we were doing the morning commute between vertices in the graph, we would be able to tell roughly where the transition between clusters is, because that's where traffic would be the slowest. That's where lots of lanes would merge into just a few.

This might be more intuitive in terms of social networks. Recall that a social network defines a graph in which each vertex is a person and edges connect any two people who “know each other” (i.e. are Facebook friends, Twitter followers, etc.) The clusters in a social network are communities: Any two people within a community are very likely to know each other, so there will be lots of edges within each community. Most people in a community will also know people outside of the community, so there will be edges between different communities, but not as many of them. (This is complicated by the fact that people can belong to multiple communities, but let's not worry about that for now…)

In a social network, the traffic is news, opinions, information, cat pictures, etc. Information spreads quickly within each community because there are lots of connections. It also flows between communities, but much more slowly, since there aren’t as many people who connect different communities and they can only transmit a limited amount of information. In other words, the bottlenecks in social networks are what reduce communication between different communities/clusters.

So in order to look for clusters in connected graphs, we need to define exactly what we mean by a bottleneck. In general, given any two vertices in a graph, there will be a lot of different paths that the traffic between them can take. So it’s difficult to find a single edge with a lot of traffic. But we can find sets of edges that, when removed, will separate the graph into two pieces/components. Any traffic between those components will have to go across one of the edges, though we don’t know which one.

To make things simple, let's pretend that there is an equal amount of traffic from each vertex in the graph to every other vertex. If we choose edges that only cut off a few vertices from the rest of the graph, such as the purple edges on the left of the Figure below, then there won’t be much traffic that wants to get across the edges, and they won’t make a bottleneck. (Think of the bridge to nowhere in Alaska.) On the other hand, if there are a lot of vertices on both sides of the edges, as on the right of the Figure below, then the edges will have a large amount of traffic. (Think of the bridges that divide Manhattan from the rest of New York.)

[Figure: cheeger — left: edges (purple) that cut off only a few vertices; right: edges that split the graph into two large pieces]

So in order to find a bottleneck in the graph, we want to find a relatively small number of edges that divide the graph into two pieces, each of which has a lot of vertices. There are a few different names for such a set of edges, each with a slightly different meaning and used by different people. If E is the number of edges in the cut and N and M are the numbers of vertices on the two sides, then the min cut max flow (a name borrowed, somewhat loosely, from optimization theory) is the set of edges that minimizes E/(NM). Note that NM is the number of pairs of vertices on opposite sides of the cut. The normalized mean cut (which comes from computer science) is the set of edges that minimizes E/N + E/M (though sometimes N and M are taken to be the numbers of edges on either side, rather than vertices). In mathematics/geometry, the Cheeger constant is the minimum value of E/N, where N is the number of vertices on the smaller side (so N ≤ M). All of these terms are sometimes used interchangeably, or with slightly different meanings.
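To make these scores concrete, here is a minimal sketch (my own toy example, not from the post) that checks every way of splitting a small made-up graph — two triangles joined by a single edge — and computes all three quantities for each split. Note that it works by trying every possible partition, which is exactly the kind of exhaustive search that becomes hopeless on real data sets.

```python
from itertools import combinations

# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by the
# single "bottleneck" edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
vertices = set(v for e in edges for v in e)

def cut_scores(side_a):
    """Return (E, N, M): the number of cut edges and the sizes of the two sides."""
    E = sum(1 for u, v in edges if (u in side_a) != (v in side_a))
    return E, len(side_a), len(vertices) - len(side_a)

best = None
# Try every split of the vertices into two non-empty sides (exponentially many).
for size in range(1, len(vertices) // 2 + 1):
    for side in combinations(sorted(vertices), size):
        E, N, M = cut_scores(set(side))
        ratio = E / (N * M)       # the post's "min cut max flow" score
        ncut = E / N + E / M      # the "normalized mean cut" score
        cheeger = E / min(N, M)   # the Cheeger-style score
        if best is None or ratio < best[0]:
            best = (ratio, ncut, cheeger, set(side))

print(best)   # best split: {0, 1, 2} vs {3, 4, 5}, cut by the one bottleneck edge
```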

In practice, the small differences between these terms don't matter much, because there is no known computationally efficient way to compute any of them exactly. Essentially, the only known ways to find the set of edges that achieves the min cut max flow/normalized mean cut/Cheeger constant are not much better than trying every possible combination, which makes the exact problem intractable. (In other words, it's impractical to run it on any data set bigger than perhaps a few hundred points.)

Instead, the usual approach is to approximate them, i.e. to find a set of edges that has a pretty good value for E/NM or E/N + E/M. Since we only need to get a rough idea of where the clusters are, this will be good enough and is computationally more efficient. This is where the physics comes in.

When you play a stringed instrument like a guitar, the sound that you hear is caused by each string vibrating in a very specific manner determined by its weight, length, tension, etc. The pattern of vibration is very complicated, but it’s actually made up of a small number of very simple patterns, namely sine waves with different wavelengths, as on the left in the Figure below. The longest one has a single hump in the middle. The next longest has two humps (one above and one below), then there’s one with three humps and so on. The frequency at which the humps move up and down depends on how many there are, and thus how short they are. The wave with one hump has the lowest frequency. The one with two humps has twice that frequency and so on.

[Figure: waves — left: the sine-wave vibration patterns of a string; right: a drumhead shape with a bottleneck and its lowest-frequency vibration pattern]

Something very similar happens when you hit the head of a drum: The drumhead vibrates, with different points moving up and down in some complicated pattern. Again, this pattern is made up of a number of relatively simple patterns, each moving at its own frequency, that only look complicated when you combine them. The patterns in these waves are determined by the shape of the drum. Most drumheads are round, but there are also square drums out there, and in theory you could make a drum in any shape that you want. Most importantly, you could even build drumheads in shapes that have bottlenecks, as on the right in the Figure above.

We wouldn’t have to build a physical drumhead to see what patterns of vibration we would get from it. There’s a differential equation from physics, involving an operator called the Laplacian, that tells us how a drumhead of any given shape will vibrate. The idea is pretty simple: We can tell which way a point will be pulled by looking at the nearby points and whether they are above or below it. If most of the nearby points are above, then the point will accelerate upwards. If they’re below, then it will accelerate down. The differential equation allows us to analyze all the points in the drumhead like this at the same time, to figure out the possible patterns.
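For readers who want to see the formulas (the post itself keeps them in the background, so treat this as a side note): the height u(x, t) of the drumhead obeys the wave equation, and the simple vibration patterns are the eigenfunctions of the Laplacian on the drum's shape, each oscillating at a frequency set by its eigenvalue.

```latex
\frac{\partial^2 u}{\partial t^2} = c^2\, \Delta u,
\qquad
\Delta \phi_k = -\lambda_k\, \phi_k,
\qquad
u(x,t) = \sum_k a_k\, \phi_k(x)\, \cos\!\big(c\sqrt{\lambda_k}\, t + \theta_k\big).
```

Here c is the wave speed, and each pattern φ_k vibrates at a frequency proportional to √λ_k, so small eigenvalues correspond to slow, large-scale motions.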

But here’s the key: If we do this for a drumhead like the one in the Figure with two blobs connected by a bottleneck, then for the lowest frequency wave, the points in one of the blobs will go up whenever the points in the other blob go down (as indicated in the Figure). In between them, there will be an arc of points that don’t move at all, right at the bottleneck. In other words, this arc of stationary points will be very close to the min cut max flow arc.

The idea behind spectral clustering is that we can do something very similar with a graph: Pretend that the graph is a rigid object made up of balls connected by springs. If we hit it with a mallet, it will begin to vibrate in a pattern that is determined by an equation very similar to the differential equation for the drumhead. In this case, we can figure out whether each vertex will be pulled up or down by looking at the vertices that it is connected to by edges.

We end up with an equation defined by a matrix whose rows and columns correspond to the vertices of the graph (this matrix is known as the graph Laplacian). It has a lot in common with the covariance matrix that we saw in the post on principal component analysis (PCA), and in particular the eigenvectors of this matrix will tell us the simple patterns that make up the vibration of the graph.
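Concretely, for an unweighted graph the Laplacian is usually taken to be L = D − A, where A is the adjacency matrix and D is the diagonal matrix of vertex degrees. Here is a minimal sketch building it for the toy two-triangle graph from the earlier example (again, my own illustration, not code from the post):

```python
import numpy as np

# Toy graph from the earlier sketch: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

A = np.zeros((n, n))                 # adjacency matrix: A[i, j] = 1 if i and j share an edge
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

D = np.diag(A.sum(axis=1))           # degree matrix
L = D - A                            # the (unnormalized) graph Laplacian

# Applying L to a vector of vertex "heights" x gives, for each vertex,
# (its degree) * (its height) minus the sum of its neighbors' heights --
# exactly the "compare each vertex to its neighbors" rule described above.
```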

Like the drumhead, if the graph is made up of a number of clusters connected by bottlenecks, then in the lowest frequency vibration the vertices will be divided into vertices that are moving up and vertices that are moving down. And it turns out that the edges between these two sets will be very close to the min cut max flow/normalized mean cut/Cheeger constant. As we discussed above, they won’t be exactly the best, but they will often be close enough to give us a rough idea of the clusters. So, the spectral clustering algorithm consists of forming the matrix that defines the vibration, then calculating the eigenvector with the smallest non-zero eigenvalue (the lowest frequency vibration pattern), and returning the two clusters defined by the upward moving vertices and the downward moving vertices. The term spectral refers to the fact that the whole list of eigenvectors (or rather their eigenvalues) can be thought of as the spectrum of frequencies that the graph vibrates at.
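Putting the pieces together, here is a minimal end-to-end sketch of this two-way split on the same toy graph (my own illustration under assumed names, not the post's code): build the Laplacian, take the eigenvector with the smallest non-zero eigenvalue, and divide the vertices by its sign.

```python
import numpy as np

# Toy graph again: two triangles joined by a single bottleneck edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian

# eigh returns eigenvalues in ascending order; the first is (numerically) zero,
# so the second column is the lowest-frequency vibration pattern.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]

cluster_a = np.where(fiedler >= 0)[0]   # vertices "moving up"
cluster_b = np.where(fiedler < 0)[0]    # vertices "moving down"
print(cluster_a, cluster_b)             # the two triangles end up on opposite sides
```

For a large graph one would store the matrix in sparse form and use a sparse eigensolver (for example scipy.sparse.linalg.eigsh) to compute only the few smallest eigenvectors, though, as discussed below, this step is still the expensive part.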

The first thing to note is that unlike DBSCAN and mapper, spectral clustering does not explicitly rely on any parameters that determine the scale. However, these parameters are really just hidden in the step where we construct the graph, when we decide the rule for which vertices to connect. But in practice, this approach tends to do well over a wide range of parameters, so it is not as important that we pick the parameters just right.
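To see where that hidden scale parameter lives, here is a sketch of the graph-construction step on made-up data: two blobs of 2-D points, connected whenever two points are within a distance eps of each other (both the data and the value of eps are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two blobs of 2-D points.
points = np.vstack([rng.normal((0.0, 0.0), 0.5, size=(50, 2)),
                    rng.normal((3.0, 0.0), 0.5, size=(50, 2))])

eps = 0.8   # the hidden "scale" parameter: how close counts as "nearby"

# Pairwise distances, then connect points closer than eps (excluding self-loops).
diffs = points[:, None, :] - points[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)
A = ((dists < eps) & (dists > 0)).astype(float)   # adjacency matrix of the graph
```

A k-nearest-neighbor rule is another common choice; either way, the scale decision has moved into this step rather than disappeared.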

On the other hand, there are some drawbacks to spectral clustering compared to DBSCAN and other methods. The main drawback, and probably the main reason that it isn’t more widely used, is that finding the eigenvectors of a large matrix (which is the core of the spectral clustering algorithm) is slow and memory intensive. This makes it hard to scale the algorithm to large data sets.

The drawback that is more troubling from a theoretical perspective is that the spectral algorithm doesn’t tell you if the clusters that it finds are actually any good. With DBSCAN, we know that there is a real separation between the clusters that it finds. If the data set didn’t have clusters, then DBSCAN would either return a single cluster containing all the data points, or a large number of very small clusters. Spectral clustering, on the other hand, will always give you two roughly equal clusters. (There are ways to make it find more than two clusters, but that’s for a future post.) For example, if the data set is a Gaussian blob, spectral clustering will find a relatively efficient way to cut it in half, but it won’t tell you that the cut isn’t actually a bottleneck. So, depending on how you plan to use the clusters, you may need to do further analysis to see if you’ve found legitimate clusters.
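As a quick illustration of this caveat, here is a small sketch using scikit-learn's SpectralClustering (one common implementation of the ideas above, chosen by me for illustration): even on a single Gaussian blob with no real bottleneck, asking for two clusters returns two roughly equal halves.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# A single Gaussian blob -- no genuine clusters here.
X, _ = make_blobs(n_samples=300, centers=1, random_state=0)

labels = SpectralClustering(n_clusters=2, random_state=0).fit_predict(X)
print(np.bincount(labels))   # roughly equal halves: the blob gets cut in two anyway
```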


9 Responses to Spectral clustering

  1. rrenaud says:

    I don’t know about these graph problems specifically, but this paragraph is a little bit worrisome/erroneous.

    “In practice, the small differences between these terms doesn’t matter because there is no computationally efficient way to compute any of them. Essentially, the only way to find the set of edges that achieves the min cut max flow/normalized min cut/cheeger constant is to try every possible combination, which makes the problem NP complete. (In other words, it’s impractical to run it on any data set bigger than perhaps a few hundred points.)”

    Problems are NP-complete because if you can solve them efficiently, then you can solve any other NP-complete problem and you can verify that a given solution is feasible with only a polynomial sized certificate. A problem merely being thought to be difficult or require exponential searches is not enough to make it NP-complete.

    • That’s a good point. I don’t remember which versions of the problem have been shown to be NP complete, but my intuition is that they should all be. But since I don’t know the exact results, I’ll rewrite that section to remove the claim of NP completeness.

  2. rouli says:

    Aren’t there algorithms for finding the min-cut (and surely the max flow) in polynomial time?
    http://en.wikipedia.org/wiki/Minimum_cut

• The min cut max flow problem above is different from the min cut in the Wikipedia article, which doesn’t require that the resulting components are balanced. It’s also different from the setting of the min cut max flow theorem, which looks at the flow between exactly one source and one sink. The problem above requires, in some sense, looking at all possible pairs of sources and sinks. Perhaps calling it normalized mean cut is better, since it avoids this confusion. Right now, there are only polynomial time algorithms for approximating the normalized mean cut, using either the spectral techniques described above, or using integer linear programming. But neither method is guaranteed to give you the exact best cut.

      • rouli says:

        thanks.
        I guess I’m still missing a few details – wouldn’t “looking at all possible pairs of sources and sinks” just add a V^2 term to the complexity computation?

      • That’s a good point. But looking at all pairs of points would give you a different cut set for each source/sink pair. We’re looking for a single cut that minimizes the number of edges relative to the number of pairs of vertices on either side. It’s also common to try to normalize based on the number of uncut edges on either side, which has no relation to the max flow min cut theorem.

