In this post, we’ll warm up our geometry muscles by looking at one of the most basic data analysis techniques: linear regression. You’ve probably encountered it elsewhere, but I want to think about it from the point of view of geometry and, particularly, the distributions that I introduced in my previous post. Recall that our goal is to infer a probability distribution from a set of data points. Linear regression follows a very common pattern among modeling algorithms: We choose a basic form that we want the distribution to have, then we choose the distribution that best fits the given data among all distributions of this form.

In two-dimensional linear regression, the general form for a model is a distribution concentrated along a line. A line is determined by two parameters – its slope and its *y*-intercept – and we want to find the parameters that determine the best fit line for a given set of points. We know that the data points probably won’t all fall right on any one line, so there will always be some error. This is why we have to allow for a blurry probability distribution, as in the last post.

For any given line, we can define a distribution that is equal to one along the line and decreases as we move away from the line. In particular, the probability will be defined by the Gaussian function $e^{-d^2}$ (where $d$ is the distance from the line), so that as we move away from the line, the probability follows the bell curve shown in the figure above. The right side shows a graph of the distribution function: the height of the graph is the probability at each point. This looks like a bell curve extended along the line.
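To make this concrete, here is a minimal sketch of such a distribution in Python. It assumes the Gaussian function has the form $e^{-d^2}$, with $d$ the perpendicular distance from the point to the line $y = mx + b$. The function names and the sample points are made up for illustration.

```python
import math

def distance_to_line(x, y, slope, intercept):
    """Perpendicular distance from (x, y) to the line y = slope * x + intercept."""
    return abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)

def density(x, y, slope, intercept):
    """Value e^(-d^2) of the blurry distribution at the point (x, y)."""
    d = distance_to_line(x, y, slope, intercept)
    return math.exp(-d ** 2)

# The line y = x: a point on the line and a point off the line.
on_line = density(1.0, 1.0, slope=1.0, intercept=0.0)
off_line = density(1.0, 3.0, slope=1.0, intercept=0.0)
print(on_line, off_line)
```

The value is exactly one on the line and falls off along a bell curve as the point moves away from it.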

The regression algorithm picks out the one line, out of all possible lines, for which the data points best fit the corresponding distribution. But how do we determine how good the fit is? A given distribution assigns to each point a probability between 0 and 1, indicated below by how dark the point is. You can think of this as the probability that the given point will be randomly chosen. The probability that a collection of points would be chosen at random is the product of their individual probabilities. The regression algorithm chooses the distribution that assigns the highest probability to the given data points. In other words, it chooses the distribution in which the data points are overall in the darkest possible parts of the distribution.
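The scoring rule described above can be sketched as follows. This is only an illustration: it assumes each point's probability is $e^{-d^2}$ for perpendicular distance $d$, and the data points and the two candidate lines are invented for the example.

```python
import math

def density(point, slope, intercept):
    """e^(-d^2), where d is the perpendicular distance from point to the line."""
    x, y = point
    d = abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)
    return math.exp(-d ** 2)

def likelihood(points, slope, intercept):
    """Product of the individual probabilities of all the points."""
    return math.prod(density(p, slope, intercept) for p in points)

# Made-up data lying close to the line y = x.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.9)]

good_fit = likelihood(points, slope=1.0, intercept=0.0)   # line through the points
poor_fit = likelihood(points, slope=-1.0, intercept=3.0)  # line crossing them
print(good_fit, poor_fit)
```

The line that runs through the points gets a much higher score than the line that cuts across them, which is the sense in which the points sit in a darker part of the first distribution.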

For example, in the figure above, the regression algorithm would choose a line something like the one on the left rather than the one on the right for the blue data points. You can think of this as adjusting two parameters – the slope of the line and its *y*-intercept – until the probability value is maximized. This is kind of like tuning an old-fashioned analog radio: As you move the knob back and forth, the signal gets stronger and weaker and you stop when the signal is as strong as possible. You might also imagine printing the distribution onto a sheet of clear plastic, placing it over the data points and moving it around (translating and rotating) until you find the position where the points are in the darkest region possible.
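The knob-turning picture can be imitated directly with a brute-force search. The sketch below is not how regression is computed in practice; it simply tries a grid of slopes and intercepts and keeps the pair with the highest probability, assuming the $e^{-d^2}$ form with perpendicular distance $d$. The data, the grid range and the step size are arbitrary choices for illustration.

```python
import math

def likelihood(points, slope, intercept):
    """Product over all points of e^(-d^2), d = perpendicular distance to the line."""
    total = 1.0
    for x, y in points:
        d_squared = (slope * x - y + intercept) ** 2 / (slope ** 2 + 1)
        total *= math.exp(-d_squared)
    return total

# Made-up data scattered around the line y = 2x + 1.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

# Knob settings to try: values from -5.0 to 5.0 in steps of 0.1.
candidates = [i / 10 for i in range(-50, 51)]

# Turn both knobs through every setting and keep the strongest "signal".
best_slope, best_intercept = max(
    ((m, b) for m in candidates for b in candidates),
    key=lambda mb: likelihood(points, mb[0], mb[1]),
)
print(best_slope, best_intercept)
```

The search lands close to the line the data were scattered around, at the cost of evaluating every one of the roughly ten thousand candidate lines.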

In practice, the regression algorithm doesn’t actually try all the different parameter values because we can switch to a log scale (which turns the product of the probabilities into a sum of squared distances, up to a sign) and then apply some tricks from calculus to directly calculate the ideal line. If you’re interested in the details of this, there are plenty of good statistics books out there. Since I want to focus on the geometric intuition, I’m going to skip over it here.
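For readers who want to see the log-scale step written out, here is a short sketch. The notation is an assumption made for this illustration: $p_i = e^{-d_i^2}$ is the probability assigned to the $i$-th data point and $d_i$ is its distance from the line.

$$
\log \prod_{i=1}^{n} p_i \;=\; \sum_{i=1}^{n} \log e^{-d_i^2} \;=\; -\sum_{i=1}^{n} d_i^2
$$

Because the logarithm is an increasing function, the line that maximizes the product of the probabilities is the same line that minimizes the sum of squared distances $\sum_i d_i^2$, which is where the name "least squares" comes from.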

There are two ways to generalize two-dimensional linear regression: We can change two-dimensional to higher-dimensional and/or we can change the line to a more flexible shape. I’ll introduce more flexible shapes in the next post, so for now let’s consider higher-dimensional linear regression.

The usual goal of regression is to predict the value of one variable/dimension based on the value(s) of the other variable(s). If the two dimensions represent the height and age of a collection of trees, then the best fit line will allow you to estimate the age of a new tree based on its height. This works because on the line, each *x*-value determines a unique *y*-value. In three dimensions, a two-dimensional plane has a similar property: If we know the values of any two of *x*, *y* or *z* then we can predict the third. Three-dimensional regression therefore involves fitting a plane to a data set with a distribution that is equal to one on the plane and follows a bell curve as we get farther from the plane. For example, if the data set records the height, width and age of a collection of trees, the best fit plane will allow you to predict the age of a tree based on its height and width (or predict the height of a tree based on its width and age, etc.).
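Here is a rough sketch of the tree example using NumPy's `numpy.linalg.lstsq`. Two caveats: the measurements are invented (the ages were generated to lie exactly on a known plane, so the fit can be checked by eye), and the sketch uses the standard least squares convention of measuring error as the difference in the predicted variable rather than the perpendicular distance to the plane.

```python
import numpy as np

# Made-up tree measurements: height (m), width (m) and age (years).
# The ages are generated from the plane age = 2*height + 10*width + 1.
height = np.array([3.0, 5.0, 8.0, 10.0, 12.0])
width = np.array([0.2, 0.3, 0.5, 0.4, 0.9])
age = 2.0 * height + 10.0 * width + 1.0

# Each row is (height, width, 1); the column of ones gives the plane its offset.
design = np.column_stack([height, width, np.ones_like(height)])

# Least squares fit of age = a*height + b*width + c.
coefficients, *_ = np.linalg.lstsq(design, age, rcond=None)
a, b, c = coefficients

def predict_age(h, w):
    """Read the age off the fitted plane for a tree of height h and width w."""
    return a * h + b * w + c

predicted = predict_age(6.0, 0.4)
print(a, b, c, predicted)
```

With real, noisy measurements the points would not lie exactly on any plane, and the fitted coefficients would only approximate the underlying relationship.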

Notice that for both two-dimensional and three-dimensional linear regression, the distribution is concentrated near a shape whose dimension is one less than the whole space (a one-dimensional line or a two-dimensional plane, respectively). The technical term for the dimension of a space minus the dimension of a shape in that space is the *codimension* of the shape. So in two- and three-dimensional linear regression, we’re looking for a shape whose codimension is equal to one. In general, in a codimension-one shape, the value of one variable will be determined by the other variables.

In four-dimensional space, a shape with codimension one will be three-dimensional. So four-dimensional linear regression (i.e. regression with four variables) involves distributions concentrated near three-dimensional hyperplanes. There’s essentially no hope of visualizing this, but the two- and three-dimensional cases should give you some idea of how it works. Each point is defined by four variables, say *x*, *y*, *z* and *w*, and once you’ve found the best fit hyperplane, if you know any three of the values then you can predict the fourth. Higher-dimensional linear regression is similar, and in each dimension the goal is to find the best distribution concentrated near a hyperplane whose dimension is one less than that of the whole space.
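The "know three, predict the fourth" property is just a matter of solving the hyperplane's equation for the missing variable. The sketch below assumes the hyperplane is written as $a_0 x_0 + \dots + a_{n-1} x_{n-1} = c$; the function name `solve_missing` and the example hyperplane are hypothetical, chosen only to illustrate the idea.

```python
def solve_missing(normal, offset, known):
    """Fill in the one missing coordinate of a point on the hyperplane
    normal[0]*x0 + ... + normal[n-1]*x(n-1) = offset.

    `known` lists the coordinates in order, with None marking the unknown one.
    """
    missing = [i for i, value in enumerate(known) if value is None]
    if len(missing) != 1:
        raise ValueError("exactly one coordinate must be unknown")
    i = missing[0]
    if normal[i] == 0:
        raise ValueError("the hyperplane does not determine this coordinate")
    # Contribution of the known coordinates to the left-hand side.
    partial = sum(n * v for n, v in zip(normal, known) if v is not None)
    return (offset - partial) / normal[i]

# Hypothetical hyperplane in (x, y, z, w): x + 2y - z + 4w = 10.
normal = [1.0, 2.0, -1.0, 4.0]

w = solve_missing(normal, 10.0, [1.0, 2.0, 3.0, None])  # predict w from x, y, z
y = solve_missing(normal, 10.0, [1.0, None, 3.0, 2.0])  # predict y from x, z, w
print(w, y)
```

The same function works in any number of dimensions, as long as the coefficient of the variable being predicted is not zero.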

**References:** In the comments, Saeid suggested that I add references. I don’t know the Statistics literature very well, but the following books look helpful, based on Amazon.com. If there are any books you like for their description of regression, please let me know in the comments.

- The Geometry of Multivariate Statistics – Thomas D. Wickens
- Statistical Methods: A Geometric Primer – David Saville & Graham Wood
- Statistical Methods: The Geometric Approach – David Saville & Graham Wood
- Advanced Data Analysis from an Elementary Point of View – Cosma Rohilla Shalizi (with a link to a PDF of the first draft)
- The Elements of Statistical Learning – Trevor Hastie, Robert Tibshirani and Jerome Friedman

You are missing the negative sign in the Gaussian distribution in front of the squared distance.

Good call! Thanks, it should be correct now.

Thank you for sharing your thoughts on the geometry of large data sets. It would be nice if you could add a few references (textbooks, papers, pre-prints, …) to your blog posts.

That’s a good idea. I haven’t really looked at enough statistics textbooks to know which ones to recommend, but I’ll look into it and add some references. For later topics where there are more obvious references, I’ll definitely include them. Thanks!

Very nice series of posts. I just have one minor comment.

There is a subtlety here. When you rotate your grayscale layer representing the Gaussian, it changes the scale and variance of vertical slices. Classic least squares uses a fixed variance normalized gaussian distribution vertically. So you probably would get a slightly different fit with your visual approach than the standard least squares regression. It’s pre-coffee early so I might have this backwards but I think the “rotating the paper” fit will bias the regression toward zero slope a bit.

Hmm… That’s a good point. In practice, the least squares algorithm may use the difference in the y-value between the line and the point, rather than the distance from the line to the point and the difference will depend on the slope of the line. The general shapes of the distributions will be the same for these two methods, but the values will be different and the conversion between values will depend on the slope.

Another way to think about it is that using the difference in y-values rather than the distance corresponds to skewing the distribution rather than rotating it. Skewing is a little less intuitive than rotating, so I’m going to leave it in the post as is. But in my next post on general least squares regression, I’ll define the distribution using the difference in y-value. (In fact, the distance to the curve doesn’t work in the more general context.) Thanks for pointing this out!

But as far as bias goes: If you use difference in y-value then the distribution becomes more tightly concentrated (relative to distance to the line) as the slope increases. So I think this should make the method using y-value difference more biased towards horizontal lines than the method based on distances. (Do I have this right?) The method based on distance is independent of coordinates, so in principle it shouldn’t be biased in terms of slope.

Yeah I think that’s right. By bias I meant with respect to the y-value difference likelihood, which I was considering the classic least squares problem. You’re right that it’s independent of coordinates and I guess that’s the point: that x is an independent variable and y is a dependent variable. There’s no natural distance in this x-y plane since x and y are fundamentally different. I totally agree that skewing is hard to get an intuition on, and that the post is very nice as is.

This is an excellent (and free!) textbook by Cosma Shalizi:

*Advanced Data Analysis from an Elementary Point of View*

http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

It was written for an undergraduate course in data analysis offered in Carnegie Mellon’s statistics department, but in my opinion it could serve as a textbook for anything from an advanced undergraduate to an early graduate-level course.

Thanks! I added it to the list.

Couldn’t 4-dimensional data be visualized as 3-dimensional data changing over time? (Ditto for visualizing 3-dimensional data in 2 dimensions, 2 in 1, or 1 in 0?)

That’s a good question. It’s often useful to visualize a shape in four-dimensional space by taking three-dimensional cross-sections and displaying them over time. For a data set made up of discrete points, however, each data point would only show up at a single point in time. So these cross-sections would just show up as occasional blips flashing on the screen. If you had a huge amount of data, there might be enough blips to notice a pattern as they flashed on the screen, but in general this probably wouldn’t work so well.

This approach would probably work better for visualizing models/probability distributions, which actually have meaningful cross-sections, since they “fill in the gaps” between the data points.

It’d be an interesting cognitive science experiment to see how much the ability to “fill in the gaps” and/or recognize/remember/compare patterns in a data set is affected by transforming data from N visual dimensions to N-1 visual dimensions plus time. I imagine it’d be affected quite a bit by the exact details of the presentation, e.g. speed, persistence, fade in/out, accumulation, etc.

The problem of correctly scaling the speed for that sort of presentation reminds me (very) tangentially of David Attenborough’s “Private Life of Plants” series, in which sped-up time-lapse recordings of plants make their patterns of movement immediately obvious, where trying to pick up the same patterns while watching them in real time would be well beyond the average person’s attention span. (It’d certainly be well beyond mine, anyway! 🙂)

Great blog, thanks for sharing.

Question about the following statement:

“For any given line, we can define a distribution that is equal to one along the line and decreases as we move away from the line.”

Isn’t the ENTIRE area under a given probability density function curve (integral from negative infinity to positive infinity) equal to one by definition? If so, how is the peak of the distribution equal to one?

That’s a great question, Justin, and it just so happens that I addressed it in my recent post on Continuous Bayes’ Theorem. The short answer is that the probability of selecting a given point isn’t defined by the value of the density function. In fact, the probability of picking any given point is zero. Instead, the probability of picking some point in a finite region is given by the integral of the density function over that region. Because of the way integrals work, a density function may be 1, or even greater than 1 at certain points in a region while the integral over that region is less than 1.

However, there is a different problem with the way I defined the probability density function for the regression line: Since the line is infinitely long, the integral of the density function over the whole plane is infinite. (It should be 1.) But since I’m only using it as a way of getting intuition about the least squares idea, you’ll hopefully be willing to overlook that…
