Over the last few weeks, I’ve introduced two classification methods – Support Vector Machines (SVM) and Logistic Regression – that attempt to find a line, plane or hyperplane (depending on the dimension) that separates two classes of data points. This has the advantage over more flexible methods like K Nearest Neighbors, that once the line/plane/hyperplane is found, new data points can be very quickly classified. This simple model is also less susceptible to overfitting. Both SVM and logistic regression work well when such a line/plane/hyperplane exists (when the two classes are linearly separable), but for data that forms a curved shape, there won’t be a line/plane/hyperplane that completely separates the two classes. We saw a similar problem with linear regression and were able to address it by replacing the line/plane/hyperplane with a curve, or a higher dimensional curved shape, which required adding parameters to the model. We could probably do something similar for SVM and logistic regression. However, rather than adapting each algorithm independently in this way, the more common approach is to use a general tool called a kernel to generalize any linear algorithm to use curved shapes.
As usual, we will start by considering two-dimensional data so that we can visualize what’s going on. Our two-dimensional data lives in a two-dimensional plane, which we can think of as a piece of paper. A kernel is a way of placing this two dimensional plane into a higher dimensional space, so that it is curved in the higher dimensional space. (In other words, a kernel is a function from the low dimensional space into a higher dimensional space.) So, for example, we might place the plane into three dimensional space so that it curves up along one axis, as in the Figure below. Here, the cross sections of the plane in three-dimensional space are parabolas.
The goal of the kernel is to make it so that two classes of data points that can only be separated by a curved line in the two-dimensional space can be separated by a flat plane in the three-dimensional space. This may seem strange that in order to make data points that follow a curve into data points that follow a line, we make the plane even more curved. But in fact, by adding more dimensions, what we’re doing is adding more flexibility for the lines/planes/hyperplanes to move around in. For example, in the data set shown below on the left, there is clearly no line that separates the two blue points from the two green points. However, when we place this data set in three dimensions using the kernel from the first Figure, the two blue points are lifted up, as on the right, so that there is now another plane (drawn in red) that separates blue from green, shown in the bottom right. This plane intersects the two-dimensional space in a curve shown in the bottom left figure.
In general, what the kernel does is to make it so that any plane in the three-dimensional space intersects the two-dimensional plane that contains our data in a curved line rather than a straight line. If we use a kernel that places the plane in an even higher dimensional space, then the hyperplanes in these higher dimensional spaces will intersect the two-dimensional space in potentially more complicated curves, giving us more flexibility to choose the curve that separates the data.
We can think about this in terms of adjusting/tuning parameters of the separating curve, as we did in the post on general regression. With a line/plane/hyperplane, the number of parameters that you’re allowed to adjust is equal to the number of dimensions plus one. By placing the space containing the data set into a higher dimensional space, we effectively give ourselves more parameters to choose a hyperplane in the higher dimensional space. Because of how this hyperplane intersects the original space, these extra parameters end up determining how the separating shape in the original space containing the data points can curve around the data.
So, the general procedure is to place the labeled data points into a higher dimensional space using a kernel, then run SVM or logistic regression (or any other classification algorithm) on the data in this higher dimensional space. This will (hopefully) give us a hyperplane that separates the two classes in the higher dimensional space. The intersection of this hyperplane with the curved copy of the original data space will determine a curved shape in the original data space that still separates the two classes of points. In fact, we could also run this process with linear regression, and the results would be effectively the same as using a more general form of regression, as in my earlier post.
I think it’s worth going into a bit more technical detail at this point, but if you don’t like equations, feel free to skip to the next paragraph. One of the simplest examples of a kernel is a function that takes each two-dimensional point to the five dimensional point . A hyperplane in the five-dimensional space is determined by an equation of the form , where and are parameters that we allow SVM or logistic regression to choose for us. Note that the , and parts are meant to indicate the third, fourth and fifth coordinates of the five dimensional space. However, if we think of them as products of s, s and s, then they define a curve in the plane, which happens to be the intersection of the hyperplane with the two-dimensional space that contains the data. Thus we can interpret the results of SVM or logistic regression on the higher dimensional space as a curved separating shape in the lower dimensional space. There are many other types of kernels, using more interesting functions, but the basic principle is the same.
Note that in practice, one generally doesn’t need to explicitly calculate the coordinates of the data points in the higher dimensional space when using a kernel. Instead, it’s often possible to make the calculations needed for a given classification algorithm, such as distances and angles in the higher dimensional space, but without calculating the new coordinates. So if we wanted to get really abstract, we could think of a kernel as a new definition of distances and angles, with or without explicitly placing the data into a higher dimensional space. We’ll explore this idea further in later posts.
For now, keep in mind that a kernel is not a silver bullet that will allow you to solve every classification problem without any effort. The more dimensions we put in our kernel, the more flexibility we have, but picking a kernel with more parameters/dimensions will be more computationally expensive. So it’s better to pick the simplest kernel that you can get away with, and this requires having a really good understanding of your data before you even get to the modeling step.
Another reason to try and pick the simplest possible kernel is that, as with general regression, the more flexibility you allow the model, the greater the risk of over fitting. We can always find a kernel in which the two classes of data points are linearly separated, but this may not be the best one to use. Remember that the model that we want is the one that is best at predicting new data points, but not necessarily the best at predicting existing data. As always, one can get a rough idea if a model is over-fit by splitting the initial data into a training and evaluation set. However, in specific situations there may be better ways. In general, the best way to make sure you have a good kernel and a good model is to understand your objective, understand your data, and understand the tools at your disposal.