Before we dive into nearest neighbor classification, I want to point out a subtle difference between the regression algorithm that I discussed a few posts back and what I will write about today. The goal of regression was to find a model that allows us to predict one feature/value of a new data point based on the other features/values. So, for example, we might want to predict someone’s height based on their age, weight and some other information. The key is that height is measured on a continuous scale – it can take any numerical value in some range. What we’ll talk about today involves predicting a discrete value from a finite number of choices. This might be a yes/no question, or there might be a small number of choices that don’t necessarily fall on a number scale. The difference between modeling and classification is one of the historical distinctions between data mining (statistics) and machine learning (computer science), and stems from the fact that in artificial intelligence, the goal is for the computer to make a decision rather than just a prediction.
In this post and later on, I will use the term modeling to refer to problems where you’re predicting one or more values on a continuous scale, though it is often used in a more general sense than this. I’ll use the term classification for the problem of assigning new data points to a discrete set of classes. The difference between classification and modeling is fairly subtle, but it’s similar to another issue we’ve already seen: In the car insurance example from an earliest post, we noted that some of the features were numbers – such as age, number of accidents, and total amount of claims – while other features – such as street address, or how the customer was referred to the insurance company – would not logically be described as a number. For example, for street address, the important factor might be which neighborhood a customer lives in, and there will be a finite number of neighborhoods. The different neighborhoods could be translated into numbers but they wouldn’t logically be measured on a number scale.
In the case of modeling vs. classification, the distinction is whether the prediction that we’re making (as opposed to the features of the initial data) logically falls on a number scale or into a set of discrete/finite choices. A simpler example would be if we were trying to predict a customer’s behavior at lunch time. If our goal was to predict how much money a given customer would spend to eat out at lunch, that would be modeling. If the goal was to decide what type of restaurant they would go to, that’s classification. Again, the difference is that there is no natural way to assign numbers to the different types of restaurants so that the relationships between the numbers represent the relationships between the restaurants.
The simplest classification algorithm is the Nearest Neighbors algorithm (though much more common is the K-Nearest Neighbors algoirhm (KNN) which we’ll look at in a couple of weeks.) In fact, this is an algorithm that you may use yourself without realizing it. Lets consider the example of predicting where someone will go to lunch, or rather we’ll think of it as deciding where to recommend that someone who’s visiting town should go to lunch. You could simply recommend your own favorite restaurant, but there’s a reasonable chance that the other person has different tastes than you. A more objective approach would be to think of all the local people that you know and try to decide which local person is most similar to the visitor. The similarity may be based on what you know about their eating preferences, or about other indirectly related or unrelated information, but however you choose the most similar local, you should recommend that person’s favorite restaurant to the visitor.
A computer can very easily do the same thing. We would need a data set of local lunch goers – Each data point would be a person, and the coordinates/features of each point would represent different attributes of that person, such as age, annual income, preferred spice level and any other values that might help us predict their favorite restaurant. Each data point also has a label, which in this case is the person’s favorite restaurant. We can then determine how “similar” two people are by calculating the geometric distance between the corresponding data points, the same distance that we calculated when doing regression earlier. Note that deciding how to calculate this distance, is a major and non-trivial step, related to how much importance we place on each feature/dimension. This is called normalization or feature selection/extraction/engineering. There are also ways of calculating distance involving non-numeric features. I will explain this in my next post, but for now lets assume that we have a reasonable notion of “distance”.
To predict a new person’s favorite restaurant, we will use their information to determine a new data point, then find the point in the original data set that is closest to this new point and choose the corresponding person’s favorite restaurant (i.e. their label). In other words, we classify each new person by assigning them to a class, with one class for each restaurant. This is the Nearest Neighbor algorithm.
Keeping with the usual theme on this blog, lets see if we can understand the geometric picture, i.e. the distribution that is implied by the Nearest Neighbor algorithm. We’ll start with two points, each with a different label (i.e. a different favorite restaurant), and we’ll draw the picture in two dimensions, as on the left of the figure below, so that we can visualize it. To make things easier to see, we will translate the classes into colors, in this case blue and green. So the color of each data point in the figure indicates which of the two classes it is in.
In two dimensions, there is a line of points exactly half way between our two data points. In other words, any point on that line is equally similar to both the blue data point on the left and the green data point on the right. If this was three-dimensional data then there would be a plane between the two points, and for higher dimensional data, there would be a hyperplane between the two points. But the important thing about this line/plane/hyperplane is that everything to the left of it will be closer (more similar) to the blue point (and thus get labeled blue), while everything to the right will be closer to the green point.
So the grey probability cloud that represented our distribution when we were doing regression has been replaced with a multi-colored patchwork. That’s because we’re no longer attempting to predict what types of data points can appear in our distribution (predicting one variable based on the others), but attempting to classify any point that does occur. There is probably some underlying distribution of all the points, but the nearest neighbors algorithm does not take it into account. Notice also that the transition between green and blue is very abrupt. If we wanted to be more precise, we might measure the distance from a new data point to the transition from green to blue, and use this to measure our “confidence” that we made the right choice. This can be thought of as having a slow transition from blue to green rather than an abrupt change.
In the middle picture, we have three data points. Between each pair of points is a line indicating all the new points that are the same distance from the two data points. These three lines come together in a point that is the same distance from all three data points. If we have more than three data points, as in the picture on the right, then we will see this same type of behavior repeating, with lines between pairs of points, coming together three or more at a time. The lines around each data point make a polygon called a Voronoi cell, made up of all the points that are closer to this data point than to any other data point. Each Voronoi cell gets a label (either blue or green in this case) and any new point that shows up in that Voronoi cell gets that label. So the distribution defined by Nearest Neighbors in two dimensions is made up of these Voronoi cells.
In three dimensions, things are only a little more complicated. For three data points, we get three planes, one between each pair of data points. These three planes come together in a line of points that are the same distance from all three data points. If we have four data points then we get four such lines (one for each data point that we leave out) and they will meet in a point that is the same distance from all four data points. The planes defined by pairs of these four points also meet at this point. If you were to stand at one of these points and look around, you would see a number of polygons, one between you and each neighboring data point. These polygons meet along their edges, with each edge along a line defined by your data point and two others. Thus the Voronoi cells around three-dimensional data points are polyhedrons, such as tetrahedrons, cubes and pyramids, but also much more complicated and much less symmetric ones like the one shown in the Figure to the right.
In higher dimensions, the Voronoi cells will be higher dimensional versions of polygons and polyhedrons. These are called polytopes in general. But rather than trying to visualize what they look like (There are tricks to do this for four-dimensional polytopes, but after that it gets ugly pretty fast.) I recommend just keeping the two- and three-dimensional pictures in your mind when you think about the higher dimensional cases. As always, this is slightly risky because of the curse of dimensionality, but it’s better than nothing.
There are two important things to note about the Nearest Neighbors algorithm. First, you can see that in the picture of the Voronoi cells above, there is one blue point that is isolated from the rest. This could be a result of noise, or an error in the data. (Maybe someone accidentally checked the wrong box under favorite restaurant when they filled out their questionnaire.) On the other hand, it might represent an actual structural feature of the data. Since there’s only one data point representing it, it’s probably noise, but the Nearest Neighbors algorithm doesn’t take that into account. So points near that one noisy data point will probably be misclassified. Two posts from now, I’ll discuss K-Nearest Neighbors, which will address this problem.
Second, note that Nearest Neighbors does not require us to initially assume anything about the distribution of data. This is in contrast to regression, in which we had to make a choice about what type of curve to fit to the data. This flexibility comes with some major trade-offs: On the one hand, flexibility can be very useful in the real world, where data often doesn’t fit a nice simple model. In such a situation, nearest neighbors should still give us a reasonable answer. On the other hand, finding a simple model, as in regression, gives us nice summary of the data that a human might be able to understand and interpret. A simple model also allows us to make fast, computationally inexpensive predictions. Nearest neighbors requires us to keep the original data on hand and depending on how we define the distance between points, it can be very slow to compare each new data point to each of the original data points. This type of trade-off is essentially unavoidable in the world of data analysis and each of the algorithms that we’ll look at in upcoming posts will implicitly choose a different balance between flexibility and simplicity/efficiency.