I recently finished reading Nate Silver’s book The Signal and the Noise, which has gotten me thinking about how exactly one should interpret models/probability distributions, and the predictions they make. (If you’ve read this book or plan to read it, I also recommend reading Cathy O’Neil’s review of it.) Ultimately what a model does is make a claim about the probability that a certain statement is or is not true. I always found this idea slightly troubling, since any fact is either true or false; to say that a true statement has a 70% probability of being true seems kind of meaningless, even if you don’t know that it’s true. When you’re making predictions about future events, where the statements aren’t yet true or false at all, this seems even more problematic. But it turns out that one can make philosophical sense of these sorts of statements by making a slight adjustment, namely saying that something has a 70% probability of being true **given what we know about it**. To explain what I mean by this, I want to introduce the idea of a configuration space.

Let’s say that we have a horizontal robot arm that is attached to a square table by a joint that can spin 360 degrees. The arm consists of two pieces, connected by a second joint that can also spin horizontally 360 degrees, as on the left in the figure below.

Here’s the game: This arm is behind a screen so that we can’t see it. The arm is programmed to randomly select two angles, then move its joints to those angles. Once it’s done, it beeps and we have to guess the coordinates of the hand at the end of the arm, with respect to the table. This is, of course, going to be difficult since we have essentially no information to go on. However, we’re not completely in the dark because we know from the mechanics of the arm that there are certain places it is more likely to be than others.

If you ever played with a Spirograph (a toy from when I was a child that I was pleasantly surprised to discover you can still buy today) you know that when you move in two circles like this, you spend much more time near the outside and near the center than in the area in between. So the probability distribution defined by the possible locations of the hand looks something like the picture on the right of the image above: It is darker along the edge of the circle of possible locations of the hand, as well as in the very center of the circle, both areas where the hand is more likely to be. So if we wanted to guess the coordinates, our best bet would be right in the middle of the table.
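We can check this intuition with a quick simulation. Here’s a minimal sketch in Python, with the simplifying assumption that both segments of the arm have length 1: it samples random joint angles, computes where the hand lands, and compares how densely samples fall near the center, in the middle of the annulus, and near the outer edge.

```python
import math
import random

random.seed(0)

def hand_position(a, b, l1=1.0, l2=1.0):
    """Forward kinematics: joint angles -> hand coordinates on the table."""
    x = l1 * math.cos(a) + l2 * math.cos(a + b)
    y = l1 * math.sin(a) + l2 * math.sin(a + b)
    return x, y

# Sample many random configurations, as the arm in the game would.
N = 200_000
radii = []
for _ in range(N):
    a = random.uniform(0, 2 * math.pi)  # shoulder angle
    b = random.uniform(0, 2 * math.pi)  # elbow angle
    x, y = hand_position(a, b)
    radii.append(math.hypot(x, y))

def density(r_lo, r_hi):
    """Fraction of samples per unit area in the annulus r_lo <= r < r_hi."""
    count = sum(r_lo <= r < r_hi for r in radii)
    area = math.pi * (r_hi ** 2 - r_lo ** 2)
    return count / (N * area)

print("center:", density(0.0, 0.2))
print("middle:", density(0.8, 1.2))
print("edge:  ", density(1.8, 2.0))
```

The printed per-area densities come out highest near the center, lowest in the middle band, and elevated again near the outer edge, matching the dark-center, dark-rim picture described above.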

As I hinted at above, this robot arm is an example of what’s called a *configuration space*: We have a number of variables (the angles of the arms) that define different possible configurations. The configuration space is a way of thinking about all possible configurations of a system (such as the robot arm) in a unified way.

This is similar to the idea behind the data spaces that we’ve seen in past posts, made up of all possible data points, but with one important difference: The data spaces we’ve seen were all vector spaces, with their points uniquely defined by their coordinates. In this configuration space, however, each configuration is defined by two angles. The set of possible values for each angle makes up a circle rather than a line, so the “shape” of the configuration space is actually a two-dimensional torus (the surface of a doughnut) like the one shown to the right. You can think of this as the result of dragging the smaller circle defined by the elbow angle around the larger circle defined by the center angle.

Meanwhile, each configuration in the configuration space defines a vector (given by two coordinates) in the vector space defined by the table. Note that except for the points right at the inner edge and the outer edge, each vector is actually defined by two different configurations/positions of the arm: one in which the elbow joint bends to the left and one in which it bends to the right.
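This two-to-one behavior is easy to make concrete. Here’s a sketch of the inverse computation, again assuming unit-length segments: given coordinates on the table, it recovers both configurations of the arm (elbow bent one way or the other) that map to that point, and checks them against the forward map.

```python
import math

def hand_position(a, b, l1=1.0, l2=1.0):
    """Forward map: configuration (two angles) -> point in the vector space."""
    x = l1 * math.cos(a) + l2 * math.cos(a + b)
    y = l1 * math.sin(a) + l2 * math.sin(a + b)
    return x, y

def inverse_kinematics(x, y, l1=1.0, l2=1.0):
    """Return the (generically two) configurations that map to (x, y)."""
    r2 = x * x + y * y
    cos_b = (r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    cos_b = max(-1.0, min(1.0, cos_b))  # guard against tiny rounding errors
    solutions = []
    for b in (math.acos(cos_b), -math.acos(cos_b)):  # elbow one way, then the other
        a = math.atan2(y, x) - math.atan2(l2 * math.sin(b), l1 + l2 * math.cos(b))
        solutions.append((a, b))
    return solutions

# Any point strictly between the inner and outer edges has two preimages:
for a, b in inverse_kinematics(1.2, 0.5):
    print((a, b), "->", hand_position(a, b))
```

Only at the inner edge (elbow fully folded) and the outer edge (arm fully extended) do the two solutions coincide, which is exactly where the probability distribution in the earlier picture is darkest.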

In topology, this is called a *map* or *function*: a way of identifying each point in one space (the doughnut) with a single point in some other space (the vector space). It is similar to the traditional notion of a map as a piece of paper in which each point in some geographical region is identified with some point on the paper. Just as paper maps sometimes distort the relationships between the points that they’re representing, such as when one makes a flat map of the earth, the map defined by the robot arm distorts different parts of the doughnut-shaped configuration space in order to fit it into a flat square. This distortion is recorded by the probability distribution, which becomes denser in the places that are squeezed together (near the outside and inside of the region) and less dense in the places that get stretched.

But what does this have to do with the sorts of data analysis that I’ve described in the past on this blog? If we’re trying to predict what restaurant someone will want to go to for lunch or whether or not a text message is spam, there isn’t a mechanical configuration space behind a curtain determining the answer.

However, from a more philosophical point of view, it can often be useful to think of situations in this way. For example, let’s imagine hypothetically that there was a configuration space of all possible people. This certainly wouldn’t be a vector space since people are much more complicated than that. And it would be infinitely more complex than the doughnut defined by the robot arm. But the actual structure of this configuration space doesn’t matter. What does matter is that when we start recording data about people, such as their food preferences, we are essentially building a map from this (hypothetical) configuration space to the data space.

Because we don’t know the structure of this “configuration space of people” the way we understand the configuration space of the robot arm, we can’t explicitly determine what the probability distribution defined by the way this map stretches and squishes points looks like. That’s why we have to build an approximation of it based on the data points that we think of as being sampled from it.

But the key to this perspective is that when we have a map from a configuration space to a data space, a lot of points in the configuration space get squished together in the data space. This happens with the robot arm because each vector of coordinates comes from two different arm positions. So, for example, if we knew the coordinates and wanted to predict the angle of the elbow, there would be two possibilities, each of them equally likely. In other words, **given our knowledge of the coordinates**, there would be a 50% chance of either of the two possible angles being correct.

So this gives us one way to interpret probabilistic statements about the present. But what about the future? Well, if we’re willing to believe that a given configuration space is deterministic, i.e. that the configuration in ten minutes is completely determined by the current configuration, then a probabilistic statement about the current configuration can easily be translated into a probabilistic statement about the future. For example, if we knew that each joint in the robot arm rotated at a constant (known) speed then knowing its current configuration would tell us what the configuration would be in ten minutes. But if all we know is the current coordinates, then there are two possible current configurations, and thus two possible configurations it could be in ten minutes from now. So that would give us a probabilistic statement about the future.
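Here’s a small sketch of that idea, with unit-length segments and made-up rotation rates (the rates are purely hypothetical, chosen just to illustrate the point): two configurations that produce identical coordinates now can evolve, deterministically, to quite different coordinates later.

```python
import math

def hand_position(a, b):
    # Forward map, assuming unit-length segments.
    return (math.cos(a) + math.cos(a + b),
            math.sin(a) + math.sin(a + b))

def evolve(config, omega1, omega2, t):
    """Deterministic time evolution: each joint rotates at a constant rate."""
    a, b = config
    return (a + omega1 * t, b + omega2 * t)

# Two configurations with the same hand coordinates right now:
# (a, b) and its mirror (a + b, -b), elbow bent the opposite way.
c1 = (0.3, 1.0)
c2 = (c1[0] + c1[1], -c1[1])

p1_now = hand_position(*c1)
p2_now = hand_position(*c2)

# Evolve both forward "ten minutes" at hypothetical rates (rad/min)...
t, w1, w2 = 10.0, 0.11, 0.07
p1_later = hand_position(*evolve(c1, w1, w2, t))
p2_later = hand_position(*evolve(c2, w1, w2, t))

# ...and the two possible futures no longer agree.
print("now:  ", p1_now, p2_now)
print("later:", p1_later, p2_later)
```

So knowing only the coordinates now gives a 50/50 distribution over two configurations, and the deterministic dynamics carry that forward into a 50/50 distribution over two future positions.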

A better example of this is weather forecasting. (And it was while reading the weather chapter of *The Signal and the Noise* that I really started thinking about this.) The weather is determined by the movement of massive numbers of molecules in the Earth’s atmosphere, by photons arriving from the sun, and by a handful of other physical phenomena. From the perspective of Newtonian physics, this makes the weather one giant configuration space: just like the robot arm, but instead of two joints, the moving parts are all of the particles in the atmosphere bouncing off of each other.

According to Newtonian physics, this model is deterministic: If we could know the positions and velocities of all these particles at a given time, we could (in theory) determine exactly what their positions and velocities will be in ten minutes, or better yet tomorrow morning. Quantum physics, on the other hand, suggests a bit of non-determinism, but let’s not worry about that right now.

When weather forecasters create their models, they can’t keep track of every single air molecule. Instead they take measurements from a relatively small number of weather stations throughout the world. These measurements define a map from this massive and unwieldy configuration space to a data space defined by the temperature readings, etc.

So, when NOAA states that there is a 20% chance of rain next Tuesday, they are, in effect, saying the following: Of all the possible configurations that the atmosphere could be in, given these readings, 20% of them would lead to rain on Tuesday and the other 80% wouldn’t. In other words, there is a 20% chance of rain, **given what is known about the weather configuration space.** In theory at least, that’s what we should expect other types of forecasts to mean as well, though as *The Signal and the Noise* beautifully explains, in practice some types of forecasts are better than others.
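This "fraction of consistent configurations" reading can be sketched as a toy ensemble forecast. Everything here is hypothetical and wildly simplified (a single hidden "humidity" number, a made-up rain threshold, nothing like an actual NOAA model): the point is only the shape of the computation, where we sample hidden configurations consistent with a noisy reading, evolve each one deterministically, and report the fraction that end in rain.

```python
import random

random.seed(1)

def toy_forecast(reading, n_ensemble=10_000):
    """Toy ensemble forecast.

    Samples hidden atmospheric states consistent with a noisy station
    reading, applies a deterministic toy evolution to each, and returns
    the fraction of states that lead to rain.
    """
    rains = 0
    for _ in range(n_ensemble):
        # A hidden configuration consistent with the reading:
        # true humidity = reading plus measurement noise (sd 0.1, assumed).
        humidity = reading + random.gauss(0, 0.1)
        # Deterministic toy dynamics: it rains if amplified humidity
        # crosses a (made-up) threshold of 1.0.
        if humidity * 1.5 > 1.0:
            rains += 1
    return rains / n_ensemble

print("chance of rain:", toy_forecast(0.6))
```

The returned fraction is exactly a statement of the form above: of the configurations consistent with what we measured, this share of them lead to rain.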

You might be interested in reading about configuration spaces of fluids here: http://johncarlosbaez.wordpress.com/2012/03/12/fluid-flows-and-infinite-dimensional-manifolds/

It turns out (from the work of Arnold) that the weather follows geodesics on this configuration space. The most remarkable thing is that the uncertainty of weather predictions follows from a curvature calculation – the hyperbolic metric on the configuration space forces the geodesics to diverge, no matter how close they started. This is exactly the problem of weather prediction – we cannot sample with infinite precision.

Thanks. That’s a nice description of a configuration space with a deterministic time evolution. Also, a nice example of how calculus makes it possible to approximate a finite but massive configuration space (in this case made up of all the molecules in the ocean) with a continuous but relatively small system (the PDEs that govern fluid mechanics).
