Two posts back, I introduced the Nearest Neighbor classification algorithm and described how it implicitly defines a distribution made up of Voronoi cells around the data points, with each Voronoi cell labeled according to the label of the point that defines it. We saw that this algorithm is very flexible because it does not assume that the data fits a specific model. However, it is also vulnerable to noise: If there are a few mislabeled points in the initial data set then new points near these will be misclassified. This can be thought of as a form of overfitting. In this post, I’ll introduce the K-Nearest Neighbors (KNN) algorithm and explain how it can help to reduce this problem.

Recall that for the Nearest Neighbor algorithm, we classified a new data point by calculating its distance to all the existing data points, then assigning it the same label as the closest labeled data point. The example that I used was to predict which nearby restaurant a given person visiting a new town would like most. In this case, the Nearest Neighbor algorithm would correspond to thinking of all the local people you know who have eaten at nearby restaurants, deciding which of them is most similar to the visitor who is looking for a place to eat lunch, then recommending that the visitor go to this person’s favorite restaurant.

The KNN algorithm is very similar, except that we start by choosing a number K. When a visitor asks for a lunch recommendation, you again think of all the people you know who have eaten at the nearby restaurants. But instead of choosing just one person who seems most similar to the visitor, you choose K people who are most similar to the visitor. They will probably have a few different favorite restaurants between them, but hopefully one will be the most popular within this group, and that’s the one you recommend. (Lets not worry about the case of a tie for now.) We would expect this algorithm to be more effective because it reduces the influence of oddballs in the data set.

We translated the Nearest Neighbor algorithm into a geometric picture in which each of the original data points (the locals who know their favorite restaurants) was surrounded by a *Voronoi cell* of all the points in the whole space that are closer to it than to any other data point. This is shown on the left in the picture below for a relatively simple data set.

Recall that the edges around each Vornoi cell are made up of the points that are an equal distance between two data points. For the KNN algorithm, we have a very similar picture (shown on the right of the Figure), but this time the edges between the cells are no longer just defined by the two closest points. The picture on the right is (an approximation of) the resulting distribution when K=3. In higher dimensions, there is a very similar picture, but with planes/hyperplanes instead of lines. I won’t try to draw it, but we can get a pretty good idea of how higher dimensional KNN works while just picturing the two-dimensional case.

The picture indicates two important differences between the Nearest Neighbor algorithm and the KNN algorithm. First note that while there are a lot more lines involved in the picture on the right, the overall structure of the distribution is simpler: There’s no longer a hole in the middle of the green region and a big bump in the line between the green and blue. So, switching from Nearest Neighbor to KNN has the effect of “smoothing out” the distribution. Of course, it is only smooth in the large scale – There are still corners in the walls between the two regions.

Second, note that the lower right blue point is now entirely in the green region. So if we were to make a lunch recommendation for the person corresponding to this data point, Nearest Neighbors would get it right, but KNN would get it wrong. On the other hand, if this data point is an anomaly/oddball or the result of noise or an error in data entry, then KNN will probably be more accurate for new data points near that one. This is exactly what we mean by overfitting: when a model is very good a predicting the existing data but very bad at predicting new data. We should never expect a model to be 100% accurate with real world data, since we can’t expect the geometric structure of the data to perfectly reflect reality. So it’s OK if our model misclassifies a few data points.

Switching to KNN reduces the risk of overfitting, but it adds the complication of having to choose the best value for K. In particular, if we have a very large data set but we choose K to be too small, then we will still run the risk of overfitting. For example, for K = 3 and a really large data set, there’s a reasonable chance that there will be two noisy data points that are close enough to each other to outvote the correct data points in some region. On the other hand, if we choose K to be too large, then we may end up smoothing things out too much and eliminating some important details in the distribution. In the extreme case, if we choose K to be equal to the number of data points, and there are more blues than greens, then the whole distribution will be blue.

In other words, we need to choose K large enough to avoid overfitting, but small enough to avoid oversimplifying the distribution. This is similar to the trade-off we faced when deciding how many parameters to include in our regression curve – More parameters would allow for a more detailed distribution, but would increase the risk of overfitting. In fact, this is just another version of the same basic question at the heart of all data analysis – How do we decide which of the possible distributions suggested by the data set is the correct one?

There are many different methods for finding a good K, including heuristics (rules of thumb) for calculating K based on the number of data points in the data set. Another approach is to divide your data into training and evaluation sets and choose the value of K that does the best job of classifying the evaluation data set based on the training set. (I discussed this approach for regression as well.) Or, depending on the context in which you’re using KNN, you may be able to use some other kind of trial and error approach.

In practice, KNN (like Nearest Neighbor) is too computationally expensive for many applications, particularly for large data sets. It is at the extreme end of the tradeoff between computational efficiency and model flexibility. It is often necessary to use a classification method that replaces the flexibility of KNN with a more rigid but computationally more reasonable model. In upcoming posts we will look at some other algorithms that fall closer to the middle of this scale, but in each case we will see the same tradeoffs between computational efficiency vs. model flexibility and between accuracy vs. avoiding overfitting.

Pingback: Linear Separation and Support Vector Machines | The Shape of Data

Pingback: Kernels | The Shape of Data

Pingback: Decision Trees | The Shape of Data

Pingback: Random forests | The Shape of Data

Pingback: Intrinsic vs. Extrinsic Structure | The Shape of Data

Pingback: The shape of data | spider's space

Pingback: Graphs and networks | The Shape of Data

Pingback: Cast study 2: Tokens in census data | The Shape of Data