In the last few posts, we explored how neural networks combine basic classification schemes with relatively simple distributions, such as logistic distributions with line/plane/hyperplane decision boundaries, into much more complex and flexible distributions. In the next few posts, I plan to discuss other ways of doing this, starting this week with decision trees. These are a general framework that can be used for regression (predicting the value of one feature/dimension based on the values of all the others) or for classification (predicting the class of a new data point from a finite set of possibilities). But to keep things simple, I will focus on using them for classification in this post and the next.
The idea behind a decision tree is very simple, and can be compared to a game of 20 questions. Lets go back to our original classification example: Someone who is new to town has asked you for a restaurant recommendation and you want to use information about them to decide which restaurant to suggest. In 20 questions, we can ask the person yes/no questions, such as Do you like spicy food? Are you willing to spend more than $15? Are you in a hurry? The answer to each question should both give us a better idea of the best answer, and can also be used to determine what the next question will be. In other words, if they answer yes to the first question then the second question might be different than if they had answered no. At some point, we will get to the point where asking more questions won’t narrow things down any more, and at that point we make the best guess based on what we’ve learned.
So, how do we translate this into an algorithm? First, we will translate each yes/no question that we might ask into a number on a scale. For example, instead of asking whether the person likes spicy food, you could record their preferred spice level, say on a scale from 0 to 5. Then the original question becomes something like Is your preferred spice level greater than 2? Other questions might be What is your lunch budget? How much time do you have? The answer to each of these questions is on a number scale, and can be converted into one of the original yes/no questions by picking a specific cut off and asking whether the number is higher or lower than that cutoff. But here’s the important question: How do we decide which questions to ask, which cutoffs to pick, and what the final answer will be?
Following the theme of this blog, we’ll look for answers by translating the problem into geometry. We’ll start by asking a number of people who have tried some nearby restaurants to rate themselves on these different scales, then ask them what their favorite was. We can then think of each person’s responses as the coordinates of a data point, which we will draw in the data space, choosing the color of the point based on their favorite restaurant. If we were to do this with two questions, we would have two-dimensional data, which might look like the data points on the left in the Figure below. (There are only two colors, so this town only has two restaurants.)
Like the other classification algorithms that we’ve looked at, the decision tree algorithm will divide data the space into a number of regions and then color each region blue or green. To decide if a new person would like a given restaurant, the algorithm just needs to determine whether their data point is in a blue region or a green region. Before we try to understand how it decides the color of each region, we need to figure out what the regions look like and we’ll do this by figuring out when two data points will end up in the same region.
This is pretty straightforward: two data points will end up in the same region if they give the same answers to all the questions. We already know what the coordinates of the new point are. We ask the point questions by checking whether a certain feature/coordinate is greater than or equal to a certain value. Lets go one question at a time, and to make things simple, sticking with our two-dimensional data for now. If the first question is about the y-coordinate, then it divides the data into two sets: all those with y-coordinate above the threshold and all those below.
The decision boundary is the set of all points whose y-coordinates are exactly equal to the threshold, i.e. a horizontal line like the one shown on the left in the Figure above. One region is all the points above this line and the other is all the points below. If the question was about the x-coordinate, we would have a similar picture, but with a vertical line instead of a horizontal line. Notice that for the horizontal line that we chose, there are mostly (though not all) blue points above the line and mostly green below. So the line does a good job of narrowing things down, but to narrow it down completely, we’ll need to ask more questions.
On to the second question. Remember that this one will depend on the answer to the first question, so the set of points above the first decision boundary will be asked one question and the points below the decision boundary might be asked a different one. The question that we ask the lower region will divide it along a second horizontal or vertical line and a second vertical or horizontal line will divide the upper region, as on the right in the picture above. These questions happen to be about the x-axis, producing vertical decision boundaries, and note that the upper vertical line is different from the lower one.
The third round of questions allows us to divide each of the resulting four regions along a horizontal or vertical line and so on. In this case, three of the regions are already completely blue or completely green, so we don’t need to ask any more questions. The upper left region, however, has one blue point and one green, so we can ask it one more question, dividing it into two regions for a total of five, as on the left in the Figure below. In general, though, we may need to ask more questions. (And unlike in the game, we’re allowed to go past 20.)
Once we’ve determined all the questions, we need to color each region either green or blue. Again, this is pretty straightforward: Each region contains some number of blue and green points from the original data set, and we’ll color the region based on whichever color is more common. This is reminiscent of how we colored the regions in KNN, though the shapes of the regions determined by KNN are very different. In this example, each region is all green or all blue, but depending on the particular algorithm you use to construct your decision tree, this may not be the case. In theory, there may be some regions with an equal number of each color, but as I hinted at above, one generally chooses the decision boundaries so that each region has either mostly blue or mostly green data points.
As usual, it is much harder to visualize the regions defined by a decision tree for higher dimensional data, but we can use what we know about the two-dimensional case to get a rough idea. The decision boundaries in higher dimensions will be planes (in three dimensions) or hyperplanes (in four or more dimensions.) They will be “horizontal” and “vertical”, in some sense, so they will cut the data space into regions that are cubes (in three dimensions) or higher dimensional cubes in general.
This picture is reminiscent of the picture from the hierarchical approach to multi-class classification I discussed a few weeks ago. In fact, they’re essentially the same, but with two important differences: First, the decision boundaries in the multi-class classification scheme were allowed to be more complicated than just vertical and horizontal lines/planes/hyperplanes. There are versions of the decision trees algorithm that use general regression at each each step, creating more interesting decision boundaries, but from what I can tell, these are less commonly used. (If you happen to know this isn’t the case, please leave a comment below!) In particular, one of the main appeals of the constrained type of decision tree described above is that they can be expressed as a number of very simple rules that can be understood by a person.
The second difference is that each of the final regions in the multi-class schemes ended up with a different label. In the decision tree, however, we’ll usually have multiple regions with the same class. This allows the algorithm to define more sophisticated distributions, somewhat reminiscent of the way that we saw neural networks combining simple distributions. However, in the case of the decision tree, the combination is all done at once, rather than incrementally by the later layers of neurons.
Also, you may have noticed that I’ve been rather vague about exactly how we choose each question in the tree. That’s because there isn’t a single methods that is well agreed upon, and in fact, there are a number of very different criteria that have been proposed. There’s a good reason for this, but this post is already long enough, so it will have to wait until next week.