In the last post, I discussed the statistical tool called linear regression for different dimensions/numbers of variables and described how it boils down to looking for a distribution concentrated near a hyperplane of dimension one less than the total number of variables (co-dimension one). For two variables this hyperplane is just a line, which is what you may usually think of regression as. In this post, I’ll discuss a more flexible version of regression, in which we allow the line or hyperplane to be curved.
First, we need to look at regression from a slightly different perspective. When we originally fit a line to our data set, we treated the x and y coordinates interchangeably. We can also think of a line as a function that takes a value x and outputs a value y = cx + b for some parameters c and b that are determined by the regression algorithm. This makes it explicit that we are using the value of x to predict the value of y. Similarly, if we have a larger number of variables x_1, ..., x_n, y, we can describe a hyperplane by a function y = c_1 x_1 + ... + c_n x_n + b, where c_1, ..., c_n and b are parameters that are calculated by the regression algorithm.
From this perspective, if we want to make regression more flexible, there seems to be a clear solution: we can replace the linear function that defines a hyperplane with a more complicated function that has more parameters and thus can be fit more closely to the data. In two dimensions, for example, we could replace the line y = cx + b with the parabola y = dx^2 + cx + b. We now have three parameters (d, c and b) and the resulting graph will be allowed to curve up or down, as in the middle picture below.
We can again define a probability distribution concentrated along the parabola and use it to calculate the parameters that maximize the probability of a given data set, like we did with linear regression. But now we have three “dials” to tune instead of two, so we will be able to get a value at least as good as with just a line, since a line is the special case d = 0. Adjusting each parameter changes the parabola in some way (shifting it up and down or left and right, or making it steeper or shallower), and the distribution will move with it. Rather than defining the distribution based on the distance from each data point to the curve, we’ll define it using the difference in the y-value between the data point and the point on the curve with the same x-value. This turns out to be easier to work with when one actually writes the algorithm, and only changes the distribution slightly. (See the comments on the linear regression post for a discussion of the difference.) The standard approach is to use a Gaussian function of this difference. The logarithm of a Gaussian is, up to constants, the negative of the squared difference, so maximizing the probability of the data set is the same as minimizing the sum of the squared differences, which gives us least squares regression. There are other distributions one can use as well.
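To make the “dials” concrete, here is a minimal sketch in Python of fitting both a line and a parabola by least squares. It assumes NumPy is available and uses its np.polyfit routine. The data set, the noise level and the helper names (fit_polynomial, sum_squared_differences) are made up for illustration and are not from the original posts.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit; returns coefficients from the highest power down."""
    return np.polyfit(x, y, degree)

def sum_squared_differences(coeffs, x, y):
    """Sum of squared vertical differences between the data and the curve."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

# Made-up data: a parabola plus Gaussian noise in the y-direction.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 30)
y = 1.5 * x**2 - 0.5 * x + 2.0 + rng.normal(scale=0.3, size=x.size)

line = fit_polynomial(x, y, 1)      # two dials: (c, b)
parabola = fit_polynomial(x, y, 2)  # three dials: (d, c, b)

print("line:    ", line, sum_squared_differences(line, x, y))
print("parabola:", parabola, sum_squared_differences(parabola, x, y))
```

Because the line is a parabola with d = 0, the parabola's sum of squared differences can never be larger than the line's on the same data.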
A parabola gives us more flexibility than a line, but it is still relatively rigid: it can only curve up or curve down. If, for example, our data follows an S-shaped curve, we will need at least a cubic curve (y = ex^3 + dx^2 + cx + b) to describe it. As we try to model more complicated shapes, we need to add more terms and more parameters. We could also use functions besides polynomials. But this flexibility also means that we have to decide how much of the apparent complexity in the data set is the actual structure of the data and how much is noise. For example, in the picture on the right side of the figure, the curve fits the data perfectly, but is much more complicated than is probably useful.
In particular, given any set of data points with distinct x-values, it is possible to find a polynomial function that passes exactly through each point, as in the picture on the right (n points need a polynomial of degree at most n - 1). This function will almost always be far more complicated than is needed to describe the data, and it’s an example of what’s called overfitting: the distribution suggested by such a function does an excellent job of describing the existing data, but will do a lousy job of predicting new data. This is an issue that I touched on in the post on distributions. We will generally want to choose a polynomial somewhere in between: a function that has enough flexibility to capture the structure of the data, but is restricted enough that it does not overfit.
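One way to see that such a polynomial always exists is Lagrange interpolation, which builds it directly from the data points. The sketch below is my own illustration in plain Python; the function name lagrange_interpolate and the six data points are made up, and this is a different technique from the regression fit itself.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the polynomial of degree < len(xs) through all the points.

    Requires the values in xs to be distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # This factor is 1 at x = xi and 0 at x = xj.
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up data that roughly follows the line y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 4.9]

# The curve reproduces every data point...
for xi, yi in zip(xs, ys):
    print(xi, yi, lagrange_interpolate(xs, ys, xi))

# ...but between and beyond the points it is free to wander away from y = x.
for x_new in [0.5, 4.5, 6.0]:
    print(x_new, lagrange_interpolate(xs, ys, x_new))
```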
But before we get into the problem of finding this happy medium, we should consider general regression functions for higher dimensional data sets. If we have three variables x, y, z, we can replace the two-dimensional plane with a degree-two polynomial such as z = ax^2 + bxy + cy^2 + dx + ey + f. This defines a two-dimensional shape called a paraboloid that curves in the x and y directions. This function has six parameters and we can “tune” them by defining a probability distribution based on the difference in the z-value between a data point and the paraboloid. As always, we choose the parameters that maximize the probability of the data set with respect to the distribution defined by the parameters. If we increase the dimensions and/or increase the powers of the variables (such as x^3) then the number of parameters grows very quickly, but the model is always a codimension-one shape. (And again we can also use functions other than polynomials.) As the number of parameters increases, the model will fit the data points better, but it will also become more complicated and the risk of overfitting will increase.
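Here is a rough sketch of what this looks like in practice, again assuming NumPy. With the Gaussian (least squares) choice of distribution, fitting the six parameters is a linear least-squares problem in the terms x^2, xy, y^2, x, y and 1. The helper names and the test data are hypothetical. The count of parameters uses the standard fact that a polynomial of degree at most k in n input variables has "n + k choose k" terms.

```python
from math import comb

import numpy as np

def quadratic_design_matrix(x, y):
    """One column per term: x^2, xy, y^2, x, y, 1."""
    return np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])

def fit_paraboloid(x, y, z):
    """Least-squares estimate of the six coefficients of the quadratic."""
    coeffs, *_ = np.linalg.lstsq(quadratic_design_matrix(x, y), z, rcond=None)
    return coeffs

def num_polynomial_parameters(num_inputs, degree):
    """Number of coefficients of a general polynomial of the given degree."""
    return comb(num_inputs + degree, degree)

# Made-up, noise-free data lying exactly on a known quadratic surface.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = rng.uniform(-1.0, 1.0, size=50)
true_coeffs = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 4.0])
z = quadratic_design_matrix(x, y) @ true_coeffs

print(fit_paraboloid(x, y, z))          # should be close to true_coeffs
print(num_polynomial_parameters(2, 2))  # the six parameters above
print(num_polynomial_parameters(10, 3))
```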
So how do we select the correct number of parameters, or level of complexity? There’s no hard and fast rule for this, and this is the area where data analysis becomes more of an art than a science. If you have experience with a certain type of data (domain expertise) or have a rough theoretical model for the data, this can often guide your choice of a regression model. This is a problem where I suspect that a better understanding of the geometry could lead to some interesting new approaches. In general, though, the standard practice is to divide your data set into two sets, one called the training set T and the other called the evaluation set E. We will think of the training set as the existing data, and the evaluation set as the “new” data that we want the regression model to predict. In the pictures below, we’ve left the training set colored blue, but changed the evaluation set to orange.
If we run regression on the training set with a relatively simple curve, as on the left, the curve still stays fairly close to the evaluation set. However, if we run regression with the more flexible curve, using only the data points from the training set, then the resulting curve (on the right) passes through every training point but is farther from the evaluation points.
In general, for each choice of a type of function (line/parabola/etc.), we’ll choose the parameters by maximizing the probability of the training set T for the resulting distribution. The resulting model may be overfit or underfit, but we can check this by calculating the probability of the evaluation set E with respect to each resulting distribution. The one with the best value for E is the least likely to be overfit. So even if we do this for a higher dimensional data set in which we can’t actually see the resulting “curves”, the scores for the evaluation set would suggest which curves suffer from overfitting. If we do this for many different types of functions and choose the one that does the best job of predicting the evaluation data, we can be relatively confident that this model is a good middle ground between overfitting and underfitting.
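The whole procedure can be sketched in a few lines of Python, assuming NumPy and polynomial models of increasing degree. For a Gaussian distribution with a fixed width, a higher probability for E is the same as a lower mean squared difference, so the sketch scores each model that way. The data, the 25/15 split and the range of degrees are arbitrary choices made for illustration.

```python
import numpy as np

def mean_squared_difference(coeffs, x, y):
    """Average squared vertical difference between the data and the curve."""
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

# Made-up data: an S-shaped curve plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.size)

# Split into a training set T and an evaluation set E.
# The x-values were drawn in random order, so slicing gives a random split.
x_train, y_train = x[:25], y[:25]
x_eval, y_eval = x[25:], y[25:]

degrees = range(1, 9)
train_errors, eval_errors = {}, {}
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on T only
    train_errors[degree] = mean_squared_difference(coeffs, x_train, y_train)
    eval_errors[degree] = mean_squared_difference(coeffs, x_eval, y_eval)

# Choose the type of function that best predicts the evaluation set.
best_degree = min(eval_errors, key=eval_errors.get)

for degree in degrees:
    print(degree, train_errors[degree], eval_errors[degree])
print("chosen degree:", best_degree)
```

The training error can only go down as the degree goes up, which is why it cannot be used to choose the model. The evaluation error is the one that tells you when the extra flexibility has stopped helping.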