In last week’s post, I described a classification algorithm called a decision tree that defines a model/distribution for a data set by cutting the data space along vertical and horizontal hyperplanes (or lines, in the two-dimensional example that we looked at). In practice, decision trees are rarely used on their own – it’s much more common to combine them in an algorithm called a *random forest*. The word *forest* comes from the fact that the algorithm combines a large number of (decision) trees. In this post, I’ll explain how this algorithm works and how, exactly, random forests are random.

But first, I should explain where the term *tree* in *decision tree* comes from, which I didn’t want to try and squeeze into the last (already long) post. You can think of the decision tree algorithm in terms of a flow chart: You ask the first question, then depending on the answer you either move down and left or down and right to a different second question, as in the Figure on the right. Each question defines a place where two “branches” form a fork, and the resulting diagram looks like the roots of a tree. (Maybe it should be drawn upside down so it looks like the branches of a tree…) For a given point, if we follow the sequence of answers down to the end of a branch, the color at the tip of the root tells us the final answer for how we should classify it.

As I described last time, the key to a good decision tree is the choice of questions, i.e. which dimension/feature we ask about, and the choice of threshold, i.e. where we make the cutoff between a yes answer and a no answer. There isn’t a standard way to determine this, but the rough idea is that the new regions defined by each question should have either a lot more greens than blues or a lot more blues than greens. That way, we’re probably making progress towards cutting the data space into regions of all blue or all green.

One common way to make this precise is by calculating the *entropy* of the regions. When you read the word entropy, you may think of something along the lines of disorder, and the way we are using it here is related to that. Roughly speaking, the entropy measures how hard it will be for us to correctly predict the color of a data point based solely on the fact that it is in a given region. To calculate the entropy of a region, we first find the proportion of green data points out of the total number of data points. For example, if there are 10 data points and three of them are green, then the proportion is .3. We then feed this number into the entropy function, which is shown in the figure on the left. The function takes the value 0 when the proportion is either 0 or 1, i.e. all the points in the region are the same color, so we can very easily predict what color any one of them will be. The function reaches its maximum when the proportion is .5, i.e. there are the same numbers of green and blue data points, so there is no good way to make a prediction: for either choice, we will be wrong half the time.
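To make this concrete, here is a minimal sketch in Python of the binary entropy function described above (the function name `entropy` and the example values are mine, not from the post):

```python
import math

def entropy(p):
    """Binary entropy of a region in which a fraction p of the points are green."""
    if p == 0 or p == 1:
        return 0.0  # all one color: perfectly predictable
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# 3 green points out of 10, as in the example above
print(entropy(0.3))   # about 0.881
print(entropy(0.5))   # 1.0, the maximum: no good way to predict
print(entropy(1.0))   # 0.0, all green: trivially predictable
```

Note that the formula is symmetric in `p` and `1 - p`, which is the symmetry between green and blue mentioned below.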

Note that the entropy function is symmetric if we interchange 0 and 1, so it will give us the same answer if we use the proportion of blue data points instead of the proportion of green data points. If we have more than two classes, then the entropy calculation is a little more complicated, but let’s not go into that now.

As we see, the entropy measures the extent to which there are more blues than greens or more greens than blues, taking on smaller values as we get closer to either all green or all blue. Going back to decision trees, the goal should be to choose our dimensions/features and thresholds so that the entropy in each of the resulting regions is as low as possible. In practice, it is common to measure the change in entropy from the original region to each of the new regions that result from the cut, rather than the absolute entropy. For each fork in the decision tree, we can then search over all the dimensions/features and all the possible cutoffs, and choose the dimension and cutoff that causes the largest drop in entropy. The algorithm stops when there are no new questions that will cause the entropy of the final regions to decrease. In particular, this will be the case if each of the final regions contains all blue data points or all green data points. (Though that’s not the only situation in which the algorithm will stop.)
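The greedy search over dimensions and cutoffs can be sketched in a few lines of Python (the helper names `entropy` and `best_split` are mine; real implementations choose candidate thresholds and weight the child regions more carefully):

```python
import math

def entropy(labels):
    """Entropy of a region, given its list of labels (1 = green, 0 = blue)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # proportion of green points
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(points, labels):
    """Search every dimension and candidate cutoff; return the split
    with the largest drop in (weighted average) entropy."""
    n = len(points)
    parent = entropy(labels)
    best = None  # (entropy drop, dimension, cutoff)
    for d in range(len(points[0])):                # each dimension/feature
        for t in sorted({p[d] for p in points}):   # each candidate cutoff
            left = [l for p, l in zip(points, labels) if p[d] <= t]
            right = [l for p, l in zip(points, labels) if p[d] > t]
            if not left or not right:
                continue
            child = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            drop = parent - child
            if best is None or drop > best[0]:
                best = (drop, d, t)
    return best

pts = [(1, 5), (2, 4), (3, 1), (4, 2)]
lab = [1, 1, 0, 0]  # two greens, two blues
print(best_split(pts, lab))  # → (1.0, 0, 2): cutting dimension 0 at 2 separates the colors perfectly
```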

This is an example of what’s called a *greedy algorithm*: at each step, the algorithm makes the decision that maximizes its immediate reward, i.e. the drop in entropy. However, as we all know, making the decision that maximizes immediate rewards does not always pay off in the long run. And indeed, this is often the case with decision trees. Ideally, we would be able to devise a strategy for choosing questions that isn’t too greedy, but is still self-serving enough to come up with a good final solution.

In fact, there is another related problem: if we keep asking questions until each region is all green or all blue, then we will be at severe risk of overfitting since, as we saw last time, each of the final regions may only contain one or two data points. This is reminiscent of the nearest neighbors classification algorithm, in which we formed a Voronoi cell around each individual data point (and which we fixed by switching to the K-nearest neighbors algorithm). For decision trees, we might be able to fix this by stopping the questions before we get all the way to cells that are all blue or all green, but then we have to decide what the stopping point is – if we stop too early then our model may be too simple to capture the structure of the data. (We may end up with underfitting!)

Random forests are the most common way to deal with these two issues. The idea behind a random forest is to generate a large number of decision trees, but to modify the way we construct the trees in two ways: First, rather than always selecting the question and the threshold that maximally reduce the entropy (or whatever criterion we use), which would produce the same tree over and over, we add an element of randomness to the process. For example, we might randomly select which dimension/feature each question will be about, but then still choose the threshold that maximizes the drop in entropy, over all the possible thresholds for that one dimension. (Update:) Another way to add randomness is a procedure called bootstrapping or *bagging*, in which we train the decision trees on subsets sampled from the original data set.
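The two sources of randomness can be sketched like this (a toy illustration rather than Breiman’s exact procedure; the function names `bootstrap_sample` and `random_features` are mine):

```python
import random

def bootstrap_sample(points, labels):
    """Bagging: draw n points with replacement from the training set,
    so each tree sees a slightly different version of the data."""
    n = len(points)
    idx = [random.randrange(n) for _ in range(n)]
    return [points[i] for i in idx], [labels[i] for i in idx]

def random_features(num_features, k):
    """Pick k of the dimensions at random; only these are considered
    when choosing the next question in the tree."""
    return random.sample(range(num_features), k)

random.seed(0)  # for a reproducible illustration
pts = [(1, 5), (2, 4), (3, 1), (4, 2)]
lab = [1, 1, 0, 0]
sample_pts, sample_lab = bootstrap_sample(pts, lab)
print(sample_pts, sample_lab)   # a resampled data set of the same size
print(random_features(2, 1))    # the one dimension this question may ask about
```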

Second, we will limit how deep each of the resulting trees is allowed to go, for example by placing a limit on the total number of questions each tree is allowed to ask. This means that each individual tree may not divide the data space up enough to get a consistently good answer on its own. The hope, however, is that the combined information of all the trees will lead to an accurate distribution. But this raises the question: how do we combine the trees?

The standard way to combine the trees is through a voting process, reminiscent of the method for multi-class classification: Given a new data point, we will run it through each decision tree in the forest and record how many trees vote for it to be a blue point and how many vote for it to be a green point. Whichever color gets the most votes, that’s the decision that the random forest makes. As usual, we will want to understand how this affects the probability distribution that is implicit in the final classification algorithm. An example with a random forest consisting of two decision trees is shown in the Figure below.
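The voting step itself is simple. Here is a toy sketch in which each “tree” is just a function from a point to a color (the three one-question stumps are made up for illustration):

```python
from collections import Counter

def forest_predict(trees, point):
    """Each tree votes for a color; the forest returns the majority."""
    votes = Counter(tree(point) for tree in trees)
    return votes.most_common(1)[0][0]

# Three toy "trees", each a single-question decision stump
trees = [
    lambda p: "green" if p[0] <= 2 else "blue",
    lambda p: "green" if p[1] >= 3 else "blue",
    lambda p: "green" if p[0] + p[1] <= 6 else "blue",
]

print(forest_predict(trees, (1, 5)))  # all three vote green → "green"
print(forest_predict(trees, (4, 1)))  # two of three vote blue → "blue"
```

With an odd number of trees there are no ties in a two-class problem, which is one reason forests usually combine many trees rather than the two shown in the figure below.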

The distributions defined by two decision trees are shown on the left and middle. The final distribution, shown on the right, comes from super-imposing the decision boundaries of the two individual trees. Each region in the resulting distribution is colored green if it is contained in green regions in both of the original two trees. It’s colored blue if it is contained in two blue regions, and grey if it is in one of each. These grey regions are ties – if I had combined three trees instead of two, there wouldn’t have been any ties, but it also would have been a lot harder to see what was going on.

In practice, most random forests combine much larger numbers of individual decision trees and operate on much higher dimensional data sets. However, the process is essentially the same: the regions in the final model/distribution come from superimposing the decision boundaries from the individual trees. We have to be careful about the curse of dimensionality, of course, for example by making sure that the total number of questions among all the trees is much larger than the number of dimensions in the data space. You can kind of see in the example that the resulting distribution looks slightly more rounded than the originals. In fact, it’s again reminiscent of K-nearest neighbors classification, where we ended up cutting the data space into many more regions, but used a criterion for coloring the regions that looked farther afield, and thus smoothed out some of the artifacts of noisy data. A random forest does something very similar.

So, a random forest gives us another way to combine a number of simple models/distributions into a single more sophisticated one. As with neural networks, the random forest combines the decision boundaries of all the simple models. However, there is a big difference in training: in a neural network, the simple models are all trained simultaneously, taking into account the way that they will be combined later. A random forest, on the other hand, trains each decision tree independently.

There are tradeoffs between the two methods, and it’s tempting to assume that the more sophisticated method will always win. As it turns out, however, random forests are a very popular method because in practice they work very well a lot of the time, and the small improvements that might come from switching to a more complicated method such as a neural network are often not worth the potential headaches that come along with them.

You didn’t mention bagging at all, which is IMO a very important part of RF training.

Breiman uses the Gini coefficient over entropy as the node split criterion, though I am not sure that choice matters much.

Breiman suggests not setting any limit on the depth of individual trees when doing classification, and splitting until the leaves have <5 data points in regression. The general idea/intuition is to have individual trees accept high variance to get low bias, and then to average out the variance with many individual trees, which are encouraged to be diverse via random feature selection and bagging of their input data sets.

That’s a good point. I added a sentence about bagging to the paragraph about adding randomness, with a link to the wikipedia page. I don’t want to go into too much detail about it in this post, because it’s really more related to resampling, which is a topic for a later date. Thanks!
