In this post, we’ll warm up our geometry muscles by looking at one of the most basic data analysis techniques: linear regression. You’ve probably encountered it elsewhere, but I want to think about it from the point of view of geometry and, particularly, the distributions that I introduced in my previous post. Recall that our goal is to infer a probability distribution from a set of data points. Linear regression follows a very common pattern among modeling algorithms: We choose a basic form that we want the distribution to have, then we choose the distribution that best fits the given data among all distributions of this form.
In two-dimensional linear regression, the general form for a model is a distribution concentrated along a line. A line is determined by two parameters – its slope and its y-intercept – and we want to find the parameters that determine the best fit line for a given set of points. We know that the data points probably won’t all fall right on any one line, so there will always be some error. This is why we have to allow for a blurry probability distribution, as in the last post.
For any given line, we can define a distribution that is equal to one along the line and decreases as we move away from the line. In particular, the probability will be defined by the Gaussian function $e^{-d^2}$ (where $d$ is the distance from the line), so that as we move away from the line, the probability will follow the bell curve shown in the figure above. The right side shows a graph of the distribution function: the height of the graph is the probability at each point. This looks like a bell curve extended along the line.
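Here is a minimal sketch of such a distribution in Python. It assumes the Gaussian has the unnormalized form exp(-d²), with d the perpendicular distance to the line; the function names and the sample line y = x are made up for illustration.

```python
import math

def distance_to_line(x, y, slope, intercept):
    """Perpendicular distance from (x, y) to the line y = slope * x + intercept."""
    return abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)

def line_density(x, y, slope, intercept):
    """Assumed Gaussian falloff exp(-d^2): equal to 1 on the line, smaller farther away."""
    d = distance_to_line(x, y, slope, intercept)
    return math.exp(-d ** 2)

# Hypothetical line y = x, evaluated at three points of increasing distance.
on_line = line_density(2.0, 2.0, slope=1.0, intercept=0.0)
near = line_density(2.0, 2.5, slope=1.0, intercept=0.0)
far = line_density(2.0, 5.0, slope=1.0, intercept=0.0)
```

The three values decrease in that order, which is the "bell curve extended along the line" seen from a single cross-section.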
The regression algorithm picks out the one line, out of all possible lines, for which the data points best fit the corresponding distribution. But how do we determine how good the fit is? A given distribution assigns to each point a probability between 0 and 1, indicated below by how dark the point is. You can think of this as the probability that the given point will be randomly chosen. The probability that a collection of points would be chosen at random is the product of their individual probabilities. The regression algorithm chooses the distribution that assigns the highest probability to the given data points. In other words, it chooses the distribution in which the data points are overall in the darkest possible parts of the distribution.
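As a rough sketch of this scoring rule, the snippet below multiplies the per-point values for two candidate lines. The data points, the two lines and the exp(-d²) form of the distribution are all assumptions chosen for illustration.

```python
import math

def line_density(x, y, slope, intercept):
    """Assumed falloff exp(-d^2), where d is the distance from (x, y) to the line."""
    d = abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)
    return math.exp(-d ** 2)

def likelihood(points, slope, intercept):
    """Product of the individual values: how well the line's distribution fits the points."""
    return math.prod(line_density(x, y, slope, intercept) for x, y in points)

# Made-up points lying close to the line y = x.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.9)]

good = likelihood(points, slope=1.0, intercept=0.0)   # line running through the points
bad = likelihood(points, slope=-1.0, intercept=3.0)   # line cutting across them
```

The line through the points scores much higher than the one cutting across them, so it is the one a regression would prefer.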
For example, in the figure above, the regression algorithm would choose a line something like the one on the left rather than the one on the right for the blue data points. You can think of this as adjusting two parameters – the slope of the line and its y-intercept – until the probability value is maximized. This is kind of like tuning an old-fashioned analog radio: As you move the knob back and forth, the signal gets stronger and weaker and you stop when the signal is as strong as possible. You might also imagine printing the distribution onto a sheet of clear plastic, placing it over the data points and moving it around (translating and rotating) until you find the position where the points are in the darkest region possible.
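The knob-tuning picture can be taken literally as a brute-force search. The sketch below tries every slope and intercept on a coarse grid and keeps the pair with the highest probability. The data, the grid range and step, and the exp(-d²) form are assumptions for illustration only.

```python
import math

def likelihood(points, slope, intercept):
    """Product over all points of exp(-d^2), with d the distance to the candidate line."""
    total = 1.0
    for x, y in points:
        d = abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)
        total *= math.exp(-d ** 2)
    return total

# Made-up points scattered around the line y = 2x + 1.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

# The two "knobs": slopes and intercepts from -5.0 to 5.0 in steps of 0.1.
candidates = [(s / 10, b / 10) for s in range(-50, 51) for b in range(-50, 51)]

# Keep the setting where the "signal" (the probability) is strongest.
best_slope, best_intercept = max(candidates, key=lambda p: likelihood(points, *p))
```

The search lands near slope 2 and intercept 1, but only to the resolution of the grid, and it gets expensive quickly as the grid gets finer.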
In practice, the regression algorithm doesn’t actually try all the different parameter values, because we can switch to a log scale (which turns maximizing the product of the probabilities into minimizing the sum of the squared distances) and then apply some tricks from calculus to calculate the ideal line directly. If you’re interested in the details of this, there are plenty of good statistics books out there. Since I want to focus on the geometric intuition, I’m going to skip over it here.
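For readers who want to see the log-scale step written out, here is a short derivation. It assumes the Gaussian has the form $e^{-d^2}$ and writes $d_i$ for the distance from the $i$-th of $n$ data points to the candidate line; this notation is introduced here for illustration.

```latex
\log \prod_{i=1}^{n} e^{-d_i^2}
  = \sum_{i=1}^{n} \log e^{-d_i^2}
  = -\sum_{i=1}^{n} d_i^2
```

Because the logarithm is an increasing function, the line that maximizes the product on the left is exactly the line that minimizes the sum of squared distances on the right, which is why this kind of fit is often called least squares.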
There are two ways to generalize two-dimensional linear regression: We can change two-dimensional to higher-dimensional and/or we can change the line to a more flexible shape. I’ll introduce more flexible shapes in the next post, so for now let’s consider higher-dimensional linear regression.
The usual goal of regression is to predict the value of one variable/dimension based on the value(s) of the other variable(s). If the two dimensions represent height and age of a collection of trees, then the best fit line will allow you to estimate the age of a new tree based on its height. This works because on the line, each x-value determines a unique y-value. In three dimensions, a two-dimensional plane has a similar property: If we know the values of any two of x, y or z then we can predict the third. Three-dimensional regression therefore involves fitting a plane to a data set with a distribution that is equal to one on the plane and follows a bell curve as we get farther from the plane. For example, if the data set records the height, width and age of a collection of trees, the best fit plane will allow you to predict the age of a tree based on its height and width (or predict the height of a tree based on its width and age, etc.).
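To make the prediction step concrete, here is a small sketch for the two-dimensional tree example. It uses the standard least-squares formulas, which measure each point's error vertically (in the predicted variable, age) rather than as perpendicular distance to the line. The tree measurements are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); errors are measured vertically."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented measurements that happen to satisfy age = 2 * height + 1 exactly.
heights = [2.0, 4.0, 6.0, 8.0]   # metres
ages = [5.0, 9.0, 13.0, 17.0]    # years

slope, intercept = fit_line(heights, ages)

# Estimate the age of a new tree that is 5 metres tall.
predicted_age = slope * 5.0 + intercept
```

With real data the points would not sit exactly on a line, and the prediction would be the height of the best fit line above the new tree's x-value.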
Notice that for both two-dimensional and three-dimensional linear regression, the distribution is concentrated near a shape whose dimension is one less than the whole space (a one-dimensional line or a two-dimensional plane, respectively). The technical term for the dimension of a space minus the dimension of a shape in that space is the codimension of the shape. So in two- and three-dimensional linear regression, we’re looking for a shape whose codimension is equal to one. In general, in a codimension-one shape, the value of one variable will be determined by the other variables.
In four-dimensional space, a shape with codimension one will be three-dimensional. So four-dimensional linear regression (i.e. regression with four variables) involves distributions concentrated near three-dimensional hyperplanes. There’s essentially no hope of visualizing this, but the two- and three-dimensional cases should give you some idea of how it works. Each point is defined by four variables, say x, y, z and w, and once you’ve found the best fit hyperplane, if you know any three of the values then you can predict the fourth. Higher-dimensional linear regression is similar, and in each dimension the goal is to find the best distribution concentrated near a hyperplane whose dimension is one less than that of the whole space.
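The same idea can be sketched in code for any number of variables. The example below fits a hyperplane w = c0 + c1·x + c2·y + c3·z by solving the least-squares normal equations with plain Gaussian elimination. It is an illustration under stated assumptions: errors are measured in the predicted variable w, the data are invented to lie exactly on a known hyperplane, the system is assumed to be non-singular, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting (assumes a unique solution)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_hyperplane(inputs, targets):
    """Least-squares coefficients [c0, c1, ..., ck] for target = c0 + c1*x1 + ... + ck*xk."""
    rows = [[1.0] + list(p) for p in inputs]  # leading 1 carries the constant term
    k = len(rows[0])
    # Normal equations: (A^T A) c = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(ata, atb)

def predict(coeffs, point):
    """Value of the fitted hyperplane above the given point."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], point))

# Invented four-variable data lying exactly on w = 1 + 2x - y + 0.5z.
inputs = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (2, 1, 3)]
targets = [1 + 2 * x - y + 0.5 * z for x, y, z in inputs]

coeffs = fit_hyperplane(inputs, targets)
w = predict(coeffs, (3, 2, 4))  # predict the fourth value from the other three
```

Nothing in the code depends on there being exactly three input variables, so the same functions fit a line or a plane if each input has one or two entries instead.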
References: In the comments, Saeid suggested that I add references. I don’t know the statistics literature very well, but the following books look helpful, based on Amazon.com. If there are any books you like for their description of regression, please let me know in the comments.
- The Geometry of Multivariate Statistics – Thomas D. Wickens
- Statistical Methods: A Geometric Primer – David Saville & Graham Wood
- Statistical Methods: The Geometric Approach – David Saville & Graham Wood
- Advanced Data Analysis from an Elementary Point of View – Cosma Rohilla Shalizi (with a link to a PDF of the first draft)
- The Elements of Statistical Learning – Trevor Hastie, Robert Tibshirani and Jerome Friedman