A few weeks ago, I mentioned the idea of a clustering algorithm, but here’s a recap of the idea: Often, a single data set will be made up of different groups of data points, each of which corresponds to a different type of point or a different phenomenon that generated the points. For example, in the classic iris data set, the coordinates of each data point are measurements taken from an iris flower. There are 150 data points, with 50 from each of three species. As one might expect, these data points form three (mostly) distinct groups, called clusters. For a general data set, if we know how many clusters there are and that each cluster is a simple shape like a Gaussian blob, we could determine the structure of the data set using something like K-means or a mixture model. However, in many cases the clusters that make up a data set do not have a simple structure, or we may not know how many there are. In these situations, we need a more flexible algorithm. (Note that K-means is often thought of as a clustering algorithm, but note I’m going to, since it assumes a particular structure for each cluster.)
In particular, most clustering algorithms will not attempt to determine the overall shape (i.e. the extrinsic structure) of each cluster – they will simply try to decide which points fall into which clusters. For this reason, clustering algorithms tend to focus on the intrinsic structure of a given data set, which as we saw last week can be encoded by a graph.
We’ll start out by looking at the simplest possible clustering algorithm. Like the nearest neighbors algorithm, which I used to introduce the idea of classification algorithms, this algorithm is generally too simple to use in practice. But it will be good for a warm up.
Recall that for a given data set in vector form, we can form a graph that represents its intrinsic structure by adding a vertex for each data point and adding edges between nearby data points, either connecting any two edges whose distance is below some threshold or connecting each data point to its K closest neighbors for some value K. If we do this to a data set that is made up of a small number of well separated clusters then we would expect each data point to be closest only to other data points in its own cluster. If this is the case then the vertices of each cluster will form their own separate graph (called a connected component) with no edges between the different clusters.
So our first naive approach to clustering will be to simply form a graph from the data set, then determine the connected components of the graph. If there happens to be a relatively small component that is disconnected from the rest of the graph then we’ll decide that it’s a cluster. (Incidentally, I don’t know if there is a name for this algorithm – As I said, it doesn’t seem to be used in practice.)
There are a number of immediate issues with this approach. The first is that it relies heavily on choosing the right method and the right parameters for building the graph. When we choose the cutoff for how far apart the vertices that we connect should be or how many closest neighbors to connect, we are essentially deciding what scale we want to look at the data at. You can think of forming the graph as applying a blur to the data set and connecting vertices whenever the resulting clouds overlap. The larger our distance threshold or our value K, the more we are blurring the points. So this is very similar to an issue that we discussed in the post on probability distributions. If we blur things too much then distinct clusters will end up overlapping and forming a single connected component. If we don’t blur things enough then the gaps between points in a single component may be big enough to separate the resulting graph, making one component appear to be many clusters.
It’s also possible that there will not be a single parameter value that separates the correct clusters without creating unwanted separations within individual clusters. This is the case, for example, with the Iris data set, in which the clouds of data points corresponding to two of the species are close enough that they overlap slightly.
A second problem with this algorithm is that it is possible for a few stray data points, often caused by noisy data, to create unwanted connection between clusters. This is particularly a problem if we connect each vertex to its K closest neighbors – A stray data point that is half way between two clusters will form edges to vertices in each cluster, making them into a single connected component. This is what happens in the (oversimplified) data set shown on the left in the Figure below, which consists of two clusters of four points (one cluster on the left and one on the right) connected by a long chain of data points. The resulting graph would have a single connected component.
The DBSCAN algorithm is a modification of the basic clustering algorithm described above, designed to avoid these issues. (The acronym stands for Density Based Spatial Clustering of Applications with Noise.) The basic premise behind the algorithm is that we expect each cluster to be defined by a probability distribution that is darker in the center and lighter near the edges. In other words, the probability of picking a point near the central core of the distribution should be much higher. (Note that the central core may be a complicated shape.) If we randomly pick a large number of points from the distribution, we’ll get points from both the central core and the outer edges, but we should be able to tell the difference between them because the outer edge points will be farther apart from each other. In other words, we can tell the difference by how dense the points are.
If two clusters overlap slightly, the overlap will be within the outer edges of the distributions, rather than the central cores. (If they overlap a lot then the central cores will overlap, but this situation is much harder to deal with, so lets not worry about it for now.) So the idea behind DBSCAN is to try and ignore the data points that are more spread out and focus on the dense parts, which should be the central cores of the clusters.
We still have to choose two parameters, which means that we have to make a judgement call about the “scale” that we want to work at. (The problem of picking the “correct” scale is more or less unavoidable – Next week, I’ll discuss methods that rely less on having the correct scale, but they will still be affected somewhat by parameters.) In fact, we will pick two numbers: The first number is a distance d, which determines how large of an area around each data point we want to look at. (This parameter is often called , but I’ll stick with Roman letters to minimize Greek-symbol anxiety.) The second number is a density value K.
We will say that a data point is dense if there are at least K other data points that are distance at most distance d away from it. This is roughly the property that we would expect points along the outer edges of a density distribution to have. The DBSCAN algorithm works by connecting all pairs of data points such that the the two data points are distance at most d away and one of them is dense. Then, as in the basic clustering algorithm that we started with, we take each connected component of the resulting graph to be a cluster.
To understand how this compares to the basic clustering algorithm, consider the example in the Figure above. The blue disk around each data point represents all points in the data space that are distance less than d away from that point. As noted above, if we construct a graph by connecting each data point to all other data points within distance d, i.e. within each blue disk, we will get a single connected graph. You can think of this as defining a distribution whose density function is made up of all these disks, i.e. the blue density cloud shown on the left of the Figure. This distribution appears to be made up of a single piece and, sure enough, the basic clustering algorithm will tell us that there is a single component.
If we run DBSCAN with K = 3 then only the data points in the two clusters will be dense. (Each of the data points along the path only has two other data points inside its blue disk.) The resulting graph will have two components, one connecting the four points of the cluster on the left and one connecting the four points of the cluster on the right. (Technically, it will have four components, since each of the two data points in the middle forms its own connected component with no edges, but DBSCAN will not mark these as clusters.)
Again, we can think of this as forming a density distribution made up of small disks around the points. But the distribution defined by DBSCAN leaves out the disks around the four points in the middle that aren’t dense, as on the right in the Figure. In this case, the resulting distribution has two distinct pieces and DBSCAN finds two distinct clusters.
Remember that we are assuming/pretending that there is an underlying probability distribution that produced this data set. Neither of the distributions suggested by either the basic clustering algorithm or DBSCAN will give us exactly the original distribution. In particular, the original distribution will probably have lots of grey-scale values, while the distributions produced by these two algorithms are black and white. The difference between the two algorithms is how they decide what to do with the apparent grey areas: The basic clustering algorithm makes all the grey areas black, by including points in the outer areas of the probability cloud in the distribution that it constructs.
DBSCAN, on the other hand, turns the grey scale areas to white, leaving the outer fringes of the distributions out of the distribution that it constructs. This means that some points may be left out all together, such as the points in the middle above. (These left out points are often called outliers and there’s a whole family of algorithms for the process called outlier detection.) So DBSCAN may leave some points out of the clusters that should ideally be in there. But that’s OK, since the goal of a clustering algorithm is to determine roughly where the clusters are, even if it doesn’t put every single point in the correct cluster. It’s much better to leave some points out than to combine multiple clusters that should be distinct.
DBSCAN seems to be a very popular clustering algorithm in practice (second only to K-means, which as I mentioned above shouldn’t be considered a clustering algorithm at all.) That’s because it is very fast, but still flexible enough that it tends to do a good job of finding complex clusters. There are limitations, of course: As noted above, the results rely heavily on the choice of parameters d and K. This can be fixed to a certain extent by running it multiple times with different parameters. The perhaps more important limitation is that it does not work as well on data sets where two clusters overlap more than just along their periphery, or where different clusters have different internal densities. (In the later case, DBSCAN might decide that a less-dense cluster is actually just a bunch of outliers.) In the next few posts, I’ll describe a number of other clustering algorithms that attempt to overcome these problems.