So far on this blog, we’ve seen two very different approaches to constructing models that predict data distributions. With regression, we replaced the original data points with an equation defining a relatively simple shape that approximates the data, then used this to predict one dimension/feature of new points based on the others. With K Nearest Neighbors, we used the data points directly to define a fairly complicated distribution that divided the data space into two (or more) classes, so that we could predict the classes of new data points. Today, we’ll combine elements of both. We’re going to stick with classification, splitting the data space into two classes, but the goal will be to replace the original data with a simplified (linear) model.
In linear regression, the line/plane/hyperplane that we found was calculated to be as closes as possible to all the data points. That way, a distribution concentrated near the line will be a reasonable summary of the data. But for classification, we will want the line/plane/hyperplane to represent the border between the two differently colored regions (i.e. the two classes). That is, we want to have all the blue points on one side of the lineplane/hyperplane and all the green points on the other side. This may not always be possible, in which case we’ll want to choose a line/plane/hyperplane that minimizes the number of points that are on the wrong side.
Let’s first consider the case when there is a line/plane/hyperplane that separates the green and blue points. In this case, we’ll say that the two classes are linearly separable. Generally, if there’s one such line then there will be a number of different ones, and we will still have to make a choice. Since we expect there to be noise, i.e. a discrepancy between the initial data and the actual underlying distribution, we will want a line that/plane/hyperplane that gives us the biggest possible buffer against error. This is what a Support Vector Machine (SVM) tries to do.
In the case of two-dimensional data, we can think of this as replacing the the line with a thick rectangular error band, that the data points are not allowed into. This is shown as a tan bar around the line in the Figure below. For the line on the left, the error band is relatively narrow because it bumps up against the green point on the right. You can imagine the error bar trying to expand, but bumping up against the data points. So, it pushes back on the line, moving the line up until it hits a blue point, then making the line rotate until it’s wedged between three points, as on the right.
In practive, the SVM algorithm doesn’t really move the line around. There’s a relatively simple equation that picks out the line with the biggest error bar around it. (See, for example, the wikipedia page for details.)
As suggested by the picture, this maximum distance condition is completely determined by only the few data points that are closest to the border between the two classes. These are the “support vectors” that the algorithm is named after: “support” because we can think of the error band as being held in place by these data points and “vector” because computer scientists like to think of data points as vectors.
For three-dimensional data, the error band becomes a three-dimensional error region that comes from thickening the two-dimensional plane. For higher dimensional data, we get a higher dimensional error region around a hyperplane, though as always we will think about these by analogy to the lower dimensional setting. Once we’ve found the line/plane/hyperplane with the thickest error region, we can classify new data points by checking whether each new point is on the blue side or the green side. Notice that this is a very efficient summary of the difference between the two data sets: Once we’ve found the line/plane/hyperplane, we can forget the original data points, and we can very quickly classify new points. The trade-off is that we could potentially lose some important structural details. We can minimize this issue by replacing the plane with a more complicated shape using an idea called a kernel, but this will have to wait for a future post.
We also need to consider the case when the classes are not linearly separable, so every line/plane/hyperplane will have some blue points on the green side and vice versa. An example of this is shown in the Figure below. In this case, rather than maximizing the distance from all the points to the line, we will minimize the distance from the points on the wrong side to the line. In other words, we require that the error band is thick enough so that it contains all the points on the wrong side, as on the left in the Figure below. But rather than having the band expand and push against the points, we have it shrink and pull against the points inside it. Since the data points are fixed, this will instead move the line to a position that minimizes the thickness of the band (or the higher dimensional region in the case of higher dimensional data.)
As it turns out, we don’t need a new algorithm for this case – the same algorithm that works in the linearly separable case will work in the non-linearly-separable case as well. (If you look at the equation of a line/plane/hyperplane, we need essentially the same equation, except that we maximize the negative of the distances of the misclassified points.)
Like any data algorithm, SVM has its limitations. As we noted above, this version of SVM is limited to very simple models, namely lines/planes/hyperplanes. But as we’ll see in upcoming posts, this can be fixed by using what’s called a kernel.
The second limitation of SVM is that, because the hyperplane is defined by a small number of data points, it can be thrown off by a small amount of noise or a few bad data points. In other words, if you have two classes of data points that are mostly very well separated, except for a few data points that are much closer to the opposite class, then the final line/plane/hyperplane will be mostly determined by this small number of points, since that’s what the error region will run into first. Moreover, since those points are very far from the other data points in their class, there’s a good chance that they’re caused by either mistakes in the data, or by outliers that don’t represent the general distribution. There are modified versions of SVM that try to avoid this by using a different criteria to select the hyperplane based on all the data points, though as far as I can tell, there doesn’t seem to be a clear front runner among them.