Genetic algorithms and symbolic regression

A few months ago, I wrote a post about optimization using gradient descent, which involves searching for a model that best meets certain criteria by repeatedly making adjustments that improve things a little bit at a time. In many situations, this works quite well and will always or almost always find the best solution. But in other cases, it’s possible for this approach to fall into a locally optimal solution that isn’t the overall best, but is better than any nearby solution. A common way to deal with this sort of situation is to add some randomness into the algorithm, making it possible to jump out of one of these locally optimal solutions into a slightly worse solution that is adjacent to a much better one. In this post, I want to explore one such approach, called a genetic algorithm (or an evolutionary algorithm), which I’ll illustrate with a specific type of genetic algorithm called symbolic regression. I first heard about this in an article about a small company in Somerville, MA called Nutonian that has built a whole data science platform around it.

The goal of symbolic regression is to optimize a non-linear regression model. The outcome of the process will be an equation that defines a function from our input features to an output value. For example, if we have two input features – $x$ and $y$ – then the model may look something like $m = 3x + y^2 + 4x^3y + \log(y)$. To measure the quality of a given model, we would use it to predict values for a training set, then add up the squares of the differences between the predicted values and the actual values. We want a model for which this sum is as small as possible.
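As a concrete sketch of that fitness measure (the function and variable names here are my own, not from any particular library), the sum of squared errors for a candidate model might look like:

```python
import math

# Sum of squared errors over a training set of (x, y, actual) triples;
# lower is better. Any callable model of the inputs will do.
def sse(model_fn, training_set):
    return sum((model_fn(x, y) - actual) ** 2
               for x, y, actual in training_set)

# Score the example model m = 3x + y^2 + 4x^3*y + log(y):
def m(x, y):
    return 3 * x + y ** 2 + 4 * x ** 3 * y + math.log(y)

# One point exactly on the model, one slightly off it (made-up data):
data = [(1.0, 1.0, 8.0), (1.0, 2.0, 16.0)]
error = sse(m, data)
```

The first point contributes zero error; the second contributes the small squared gap between the model’s prediction and 16.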

Notice that the plus signs in the equation naturally decompose it into pieces. In my earlier post on non-linear regression, we approached this problem by deciding what kinds of pieces we wanted to consider – for example, all polynomial terms with powers up to 3 – and then using standard optimization (gradient descent, though I didn’t call it that at the time) to select the best coefficients. This poses the risk of getting stuck in a locally optimal solution, but there’s also another problem: When we add these different types of elements into the equations, we’re effectively creating a higher-dimensional problem. If we want to allow a lot of different types of elements in the final equation, things can become computationally unruly. For example, if we start with 10 features instead of 2, then there are $3^{10} - 1 = 59{,}048$ possible terms with powers up to 2. If our data has 100 features, then the number is roughly a 5 with 47 zeros behind it.
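To see where these counts come from: each of the $n$ features can appear in a term with power 0, 1, or 2, giving $3^n$ combinations, minus one for the all-zero constant. A quick check:

```python
# Each of n features can take power 0..max_power in a term, so there are
# (max_power + 1)**n combinations, minus 1 for the all-zeros constant.
def num_terms(n_features, max_power=2):
    return (max_power + 1) ** n_features - 1

print(num_terms(10))   # 59048
print(num_terms(100))  # roughly a 5 followed by 47 zeros
```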

But the subtle point here is that even though we want as many options as possible for the types of elements allowed in our solution, we’ll usually want the solution itself to contain only a small number of elements from this large set of possibilities. This, as it turns out, is also something that a genetic algorithm excels at.

As the name suggests, the idea behind a genetic algorithm is to mimic the natural selection process that produces genetic codes, i.e. DNA. It’s common to think of DNA as a collection of genes whose presence or absence determines the traits of a plant or animal. This is a gross oversimplification of the biology, but that’s never stopped us before.

To solve an optimization problem with a genetic algorithm, you start by defining the genes of the thing that you want to optimize. For symbolic regression, the “genes” are simply the pieces of the equation between the plus signs. So the equation $m = 3x + y^2 + 4x^3y + \log(y)$ from above has four genes – $3x$, $y^2$, $4x^3y$ and $\log(y)$.
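One simple way to put such genes in code (this tuple scheme is my own illustration, not a standard representation) is as (coefficient, power of $x$, power of $y$, optional wrapper function) tuples:

```python
import math

# A gene is (coefficient, x_power, y_power, wrapper); wrapper is None or
# a unary function applied to the monomial before scaling by coefficient.
def gene_value(gene, x, y):
    coeff, xp, yp, wrapper = gene
    base = x ** xp * y ** yp
    return coeff * (wrapper(base) if wrapper else base)

# m = 3x + y^2 + 4x^3*y + log(y) as four genes:
model = [(3.0, 1, 0, None),
         (1.0, 0, 2, None),
         (4.0, 3, 1, None),
         (1.0, 0, 1, math.log)]

def evaluate(model, x, y):
    return sum(gene_value(g, x, y) for g in model)
```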

In the biological world, we get new genetic codes when existing codes combine with each other, mutating slightly in the process. We will once again ignore the biological complexity and define our own simple ways to combine and mutate the things that we’re interested in. There are a few different ways we could combine two equations, such as randomly picking a few elements from each of the original equations and combining them. So if we start with equations $m = x + 3y^2 + \sin(x)$ and $m = y + \log(x^2)$, we might combine them to form $m = 3y^2 + \sin(x) + \log(x^2)$.
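Sketched in code, with genes written as plain strings and a 50/50 keep-or-drop rule (one simple choice among many; real implementations vary):

```python
import random

# Combine two parents by keeping each gene from either one with
# probability 1/2; fall back to a single random gene so the child
# is never empty.
def combine(parent_a, parent_b, rng):
    pool = parent_a + parent_b
    child = [g for g in pool if rng.random() < 0.5]
    return child if child else [rng.choice(pool)]

rng = random.Random(0)
a = ["x", "3y^2", "sin(x)"]
b = ["y", "log(x^2)"]
child = combine(a, b, rng)
```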

Next we need a way to “mutate” our equations, and again there are some fairly simple options. For example we could randomly delete and/or add a small number of elements to the equation, or we might adjust one or more of the coefficients slightly. The combined equation might become $m = 3y^2 + 1.5\sin(x) + \log(x^2) + x^3$. Note that even if each equation that we start with only has a small number of elements, we can randomly choose elements to add from an arbitrarily large set of options without running into computational problems.
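A mutation step along these lines might look like the following, where genes are (coefficient, label) pairs and the probabilities are arbitrary illustrative choices:

```python
import random

# Mutate by (possibly) deleting a gene, adding one sampled from a large
# candidate pool, and nudging a random coefficient. Note that the pool
# can be enormous: we only ever sample from it, never enumerate it.
def mutate(genes, candidate_pool, rng, p=0.3):
    genes = list(genes)
    if rng.random() < p and len(genes) > 1:
        genes.pop(rng.randrange(len(genes)))
    if rng.random() < p:
        genes.append(rng.choice(candidate_pool))
    if rng.random() < p:
        i = rng.randrange(len(genes))
        coeff, label = genes[i]
        genes[i] = (coeff * rng.uniform(0.8, 1.2), label)
    return genes

rng = random.Random(1)
pool = [(1.0, "x^3"), (1.0, "cos(y)")]
mutated = mutate([(3.0, "y^2"), (1.0, "sin(x)")], pool, rng)
```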

Once we know how to combine and mutate our regression models, we can start the Darwinian process. We’ll begin by randomly creating a small initial “gene pool” of regression models. Then we’ll take a pair of models, combine them, mutate the result, add it back into the original gene pool and repeat.
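The loop itself is short. In this sketch the models are just lists of numbers whose sum should approach a target, so the whole thing can run end to end; the pair is picked uniformly at random for now, and picking it more cleverly is exactly the question taken up next:

```python
import random

# Bare-bones evolutionary loop: pick a pair, combine, mutate, add the
# child back, repeat. Toy fitness: squared distance of the gene sum
# from a target value (lower is better).
def evolve(pool, fitness, combine, mutate, steps, rng):
    for _ in range(steps):
        a, b = rng.sample(pool, 2)           # uniform pick, for now
        child = mutate(combine(a, b, rng), rng)
        pool.append(child)
    return min(pool, key=fitness)

target = 10.0
fitness = lambda m: (sum(m) - target) ** 2
combine = lambda a, b, rng: [g for g in a + b if rng.random() < 0.5] or [a[0]]
mutate  = lambda m, rng: [g + rng.uniform(-0.5, 0.5) for g in m]

rng = random.Random(0)
pool = [[rng.uniform(0, 5)] for _ in range(20)]
best = evolve(pool, fitness, combine, mutate, 300, rng)
```

Every initial model is a single gene below 5, so its fitness is above 25; after a few hundred steps the best model in the pool does considerably better.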

The question, of course, is how to decide which pairs of models to combine each time. Survival of the fittest implies that only the “best” models should be allowed to pass on their genes, and as described above, we have a way to determine how “fit” each model in our gene pool is. However, if we always combined the best two models, we would likely fall into the very trap that we were trying to avoid; even though there would be some amount of randomness from the combination and mutation, for the most part we would get lots of models similar to the top two models we started with. These might all be centered around a locally optimal solution that is not globally optimal.

So to fix this issue we’ll add in a third element of randomness: Rather than always selecting the best two models to combine, we will randomly pick them, but with a bias that makes the better models more likely to get selected. That way the models in the gene pool will tend towards better solutions, but they will also spread out randomly and explore possible solutions that are quite a bit different from the best solutions.
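One standard way to implement this bias is tournament selection: sample a few models uniformly at random and keep the fittest of that small group. Better models win more tournaments, but every model has some chance. (Fitness-proportional “roulette wheel” selection is another common option; the sketch below uses tournaments, with lower fitness meaning better.)

```python
import random

# Tournament selection: draw k models at random (without replacement)
# and return the fittest; larger k means stronger bias toward the best.
def tournament(pool, fitness, rng, k=3):
    return min(rng.sample(pool, k), key=fitness)

rng = random.Random(2)
pool = list(range(100))   # stand-in "models"; fitness = the value itself
picks = [tournament(pool, lambda m: m, rng) for _ in range(500)]
```

The average of the picks lands well below the uniform mean of 49.5, which is the bias we wanted, while large values still show up occasionally.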

An additional advantage of this approach is that it can be done in parallel; while gradient descent always looks at a single variant of the one best-so-far solution, a genetic algorithm can explore an unlimited number of different possible solutions all at the same time.

Genetic algorithms are commonly used for feature selection, where each “gene” is a feature in a high-dimensional data set, and the goal is to select a small set of features determining the best lower-dimensional representation of the data set. For this kind of application, the tough part is deciding what measure to use for the “fitness” of each set of features. I used symbolic regression as my example above because the fitness function is straightforward. Though, in fact, you can think of symbolic regression as a form of feature selection in a high-dimensional kernel space for a low-dimensional data set.
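For feature selection the genome is even simpler: one bit per feature. A minimal sketch, leaving the fitness function abstract since, as noted, that’s the hard part:

```python
import random

# One bit per feature: True means the feature is kept. Crossover and
# mutation act on the mask; fitness would score a model trained on the
# selected columns (not shown here).
def random_mask(n_features, rng):
    return [rng.random() < 0.5 for _ in range(n_features)]

def selected(features, mask):
    return [f for f, keep in zip(features, mask) if keep]

rng = random.Random(3)
mask = random_mask(5, rng)
```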

Of course, there are some optimization problems that genetic algorithms can’t help you with, particularly if you can’t break your problem down into “genes” such that a solution is uniquely determined by which genes are on or off. For example, if you’re training a complex mixture model on a low-dimensional data set, a genetic algorithm probably won’t help you much. You can deal with this to some extent by running gradient descent many different times from different random starting points. (This also has the advantage that you can run these all in parallel.) There are also more general randomization techniques such as simulated annealing that don’t require a notion of “genes”, though they do have other constraints. But those will have to wait for a future post.