Optimization is a topic that has come up in a number of posts on this blog, but that I’ve never really addressed directly. So, I thought it was about time that I gave it its own post. The term “optimization” has a number of different, often very specific meanings, and I’m probably not going to do any of them justice. Instead, I’ll start with a fairly vague sense of the term: optimization means picking one option out of some collection of possible options in a way that maximizes (or minimizes) some measure. In other words, finding the “best” option, with respect to some carefully defined meaning of “best”. But as you’ll see below, we’ll actually be interested in a much more specific notion of optimization.

Let’s start off with an example: Imagine you’re designing a robot with a solar-powered battery charger. Whenever the robot’s battery is low, you want it to stop whatever it’s doing and find a sunny spot to recharge. How would you design this part of the robot’s behavior?

There are obviously many different possibilities and I’m going to describe one that will foreshadow later ideas in the post. But first note why this counts as optimization: The collection of possible options is all the places where the robot can stop. The measure that we’re trying to maximize is the amount of sunlight at that location.

We’ll put four light sensors on our robot, one at each corner of its square base. So, if the robot is in a patch of shade, but one of its sides is closer to the edge of the shade, the light sensors on that side of the robot will read a slightly higher level of light, as on the left in the Figure below. Since the robot will want to move out of the shade, whenever one of its light sensors reads a higher level of light than the others, the robot should move in the direction of that sensor. If it ever gets to a spot where all four sensors read the same level of light, it will stop, since it has no reason to pick any one particular direction.

A similar problem would be to design a robot that tries to find the highest altitude in a hilly terrain, as on the right of the Figure. In fact, from an abstract perspective, if we imagine plotting the light level as a function, these two problems are essentially the same: the highest level of light would be the highest point on the hill defined by the function. If the robot is on the slope of a hill, then the altitude sensor that reads the highest altitude would be on the side of the robot that points up the hill. So in order to move up the hill, the robot should always move towards the sensor that reads the highest altitude/light level.
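The sensor rule described above can be sketched in a few lines of Python. This is a minimal sketch, not code from the post: the function name `sensor_step`, the square’s half-width, and the step size are all my own choices for illustration.

```python
import math

def sensor_step(f, x, y, half_width=0.5, step=0.1):
    """One move of the four-sensor robot: sample the light level f at the
    four corners of the square base and step toward the brightest corner.
    If all four sensors read (nearly) the same, stay put."""
    corners = [(x - half_width, y - half_width),
               (x + half_width, y - half_width),
               (x - half_width, y + half_width),
               (x + half_width, y + half_width)]
    readings = [f(cx, cy) for cx, cy in corners]
    if max(readings) - min(readings) < 1e-9:  # all sensors agree: stop
        return x, y
    bx, by = corners[max(range(4), key=lambda i: readings[i])]
    # unit vector from the robot's center toward the brightest corner
    dx, dy = bx - x, by - y
    norm = math.hypot(dx, dy)
    return x + step * dx / norm, y + step * dy / norm
```

Repeatedly calling `sensor_step` moves the robot into the neighborhood of the brightest spot, though it may wander slightly around it rather than landing exactly on top.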

Each of these two robots is mimicking an algorithm called *gradient descent*, which gets its name from a mathematical tool called the *gradient* that you might have learned about in multivariable calculus. Given a function of two variables, the gradient defines a vector that points in the direction in which the “hill” is steepest. Technically, the robots are mimicking gradient *ascent* since they’re trying to go up instead of down – in principle, gradient descent and gradient ascent are the same process, but people usually refer to it as gradient *descent* because it’s traditional to try to minimize a function (such as some sort of cost).

The gradient descent algorithm works the same way these robots do: Given a function, start at some initial point and figure out in which direction the function is “steepest”. Then move the point in that direction and repeat the process. Eventually (if you set things up correctly) you’ll get to a point where the function is close to flat. When that happens, you stop and choose whatever point you ended at.
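The loop just described can be sketched in Python. This is a minimal sketch under my own assumptions (the function name `gradient_ascent`, the learning rate, and the stopping tolerance are illustrative choices, not anything from the post):

```python
def gradient_ascent(grad, start, learning_rate=0.1, tolerance=1e-6, max_steps=10000):
    """Start at some initial point, repeatedly step in the direction of the
    gradient (the steepest 'uphill' direction), and stop once the gradient
    is nearly zero, i.e. the function is close to flat there."""
    point = list(start)
    for _ in range(max_steps):
        g = grad(point)
        # length of the gradient vector; near zero means near a flat spot
        if sum(gi * gi for gi in g) ** 0.5 < tolerance:
            break
        point = [p + learning_rate * gi for p, gi in zip(point, g)]
    return point
```

To run gradient *descent* instead of ascent, you would subtract the gradient step rather than add it.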

Before we get into how this applies to data analysis, there are a few things I should point out: First, as we’ve noted, either of our two robots will stop when all four sensors read the same. For the hill-climber, this means that it’s on a horizontal patch of ground – if it’s lucky, this will be at the top of the highest hill. (As a side note: If you took calculus, you may recall that maximum and minimum points of a function occur when the derivative/slope is zero/horizontal. With the robots, this is the same idea but in two dimensions instead of one. In fact, you calculate the gradient by combining two (partial) derivatives.)
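In symbols (for readers who want them): for a function of two variables, the gradient combines the two partial derivatives into a single vector, and the “flat spots” where the robots stop are exactly the points where this vector is zero:

```latex
\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right),
\qquad \text{flat spot:}\quad \nabla f(x, y) = (0, 0)
```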

Second, note that if there are multiple hill tops in the terrain, or multiple light spots surrounded by shade, as in the Figure below, then our robot could easily end up stuck in a spot that is not the overall best, but is better than any of the nearby spots. If our robots had cameras and could get a view of the distant terrain, they might be able to figure this out and move to a better spot, but in actual gradient descent problems, that usually isn’t a possibility. Optimization problems in which there is exactly one maximum (or minimum, if that’s your thing) are called *convex problems*, and these tend to be much easier to solve. In particular, gradient descent will always lead you to the solution of a convex problem, and in some cases you can avoid gradient descent altogether and calculate the solution directly. (More on that later.)
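You can see the “stuck on the wrong hill” behavior in a tiny one-dimensional example. This is an illustrative sketch of my own (the terrain function and names are made up): the terrain has two hilltops of different heights, and gradient ascent climbs whichever hill its starting point sits on.

```python
def ascend(grad, x, learning_rate=0.01, steps=5000):
    # one-dimensional gradient ascent from a given starting point
    for _ in range(steps):
        x += learning_rate * grad(x)
    return x

# a "terrain" with two hilltops of different heights (near x = -0.93 and x = 1.06)
f = lambda x: -(x**2 - 1)**2 + 0.5 * x
grad = lambda x: -4 * x * (x**2 - 1) + 0.5

left_peak = ascend(grad, -0.5)   # starts on the slope of the lower hill
right_peak = ascend(grad, 0.5)   # starts on the slope of the higher hill
```

Starting at -0.5, the robot climbs to the lower peak and stays there, even though a strictly better spot exists on the other hill.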

Third, note that while our robots are moving around in a two-dimensional landscape, the gradient is defined for as many variables as we want, and we can carry out gradient descent for any function of any number of variables. (It’s just harder to visualize.)

But why would we want to? This blog isn’t about designing solar-powered robots; it’s about data analysis, and the optimization problems that we’ve discussed in past posts seem very different. For example, in regression, we want to choose a probability distribution that maximizes the total value of all the points in our data set. This differs from the robots in two ways: First, we’re calculating the values of a whole set of points instead of just one. Second, we’re moving the function (i.e. the probability distribution) around while leaving the points fixed. Most of the other types of optimization we’ve seen have a similar feel.

This is where the magic of configuration spaces comes in. Recall that a configuration space is a way of thinking about all possible configurations of a complicated system as a single object/space, in which each configuration corresponds to a single point in the space. In the configuration spaces post, we mostly thought about abstract configuration spaces that we could only implicitly study by taking data samples. But there are also very simple, concrete configuration spaces.

For example, consider the configuration space of all lines in the plane. Every line is defined by a slope and a y-intercept. That’s two numbers, which we can think of as coordinates defining a point in the plane. In other words, the configuration space of lines in the plane is itself a plane. (Technically this leaves out vertical lines, but let’s not worry about that right now.)

Each line also defines a probability distribution concentrated along the line. In the post on linear regression, I described how we can calculate the likelihood of each probability distribution by adding up the values of the data points with respect to the distribution. We can think of this as a function on the two-dimensional plane of probability distributions concentrated along lines.

Recall that the regression line is simply the line for which the corresponding probability distribution has the highest likelihood calculated in this way. From the perspective of configuration spaces, this line is a point where the likelihood function – a function on the plane, just like the light level or the elevation that our robots were interested in – is maximal. So by calculating the gradient of this function and carrying out gradient descent, one can find the best regression line the same way the robots found their ideal spots.
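Here is what that looks like concretely for simple linear regression. This is a minimal sketch of my own, not code from the post: it descends on the (slope, intercept) plane, minimizing the mean squared residual, which for a Gaussian model is equivalent (up to sign and constants) to maximizing the likelihood.

```python
def regression_by_descent(xs, ys, learning_rate=0.01, steps=20000):
    """Gradient descent on the configuration space of lines: each point
    (m, b) in the plane is a candidate line y = m*x + b, and we descend
    on the mean squared residual of the data."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # partial derivatives of the mean squared residual w.r.t. m and b
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b
```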

As it turns out, though, linear regression is a convex problem; there’s always a single local maximum. Moreover, the equation for the likelihood, and the resulting equation for the gradient, are simple enough that this maximum point can be calculated directly, without bothering with the repeated process that gradient descent requires. However, many of the other optimization problems that come up in data analysis – notably logistic regression and expectation maximization – require proper gradient descent.
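For one-variable least squares, that direct calculation is the familiar closed-form formula. A minimal sketch (the function name `fit_line` is mine):

```python
def fit_line(xs, ys):
    """Closed-form least-squares line: the unique maximum of the Gaussian
    likelihood, computed directly instead of by gradient descent."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance of x and y divided by variance of x
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Because the problem is convex, this single formula lands exactly on the same line that gradient descent would eventually converge to.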

So gradient descent (on configuration spaces) ends up playing a very large role in data analysis. In fact, now that you know what to look for, you can go back through the earlier entries on this blog, and see if you can spot all the places where gradient descent was hiding in the background. (I haven’t done this myself, but I bet it’s in most of them.)

One of the most entertaining optimization problems I encountered was when I implemented Poincare duality on triangulated manifolds. For very small manifolds the algorithm worked fine, but once we got to around 6 four-dimensional simplices, the algorithm started munching up all the system memory. So I traced through the computations and discovered my Smith Normal Form algorithm was doing row operations on matrices that started off fairly sparse (homology computation matrices) and ended with matrices whose entries had thousands of digits. The computations were using arbitrary-precision integers, and those were eating up the memory. So I modified the Smith Normal Form procedure to look for row operations that locally minimize the matrix entries. Ever since then, in these computations the matrices seem to stay more or less sparse throughout the entire computation, whether for small or large triangulations.
