In the last post, on nearest neighbors classification, we used the “distance” between different pairs of points to decide which class each new data point should be placed into. The problem is that there are different ways to calculate distance depending, for example, on how we compare different dimensions/features. To illustrate the point, consider the example from last time of a nearest neighbors algorithm to make restaurant recommendations. Two of the factors we discussed were annual income (which will be useful for guessing how expensive the restaurant should be) and preferred spice level, say on a scale from 0 to 5. The problem is that because spice level is measured on a much shorter scale than income, two people with spice levels 0 and 5, but incomes that are different by $1,000 will be considered closer than two people with the same spice level but incomes that are different by $2,000. To make the algorithm more effective, we need to adjust things so that differences in income don’t outweigh differences in spice level (or vice versa).
One way we might do this would be to divide everyone’s salary by $10,000 (in the data set, not in real life!) so that a difference of one spice level would have the same impact on the distance as a difference of $10,000 in salary. This will make our notion of distance a little more reasonable. However, I just chose the number $10,000 based on intuition – it might have been better to scale by a larger number or a smaller number. We really need a more objective way to choose how much to scale each dimension by, and there are a number of possible choices. But before we look at some of them, lets consider the geometric picture.
Dividing the salary dimension by 10,000 has the effect of shrinking the data set in the direction corresponding to salary. That’s because for two people who have all the same values, but their salaries are different by $10,000, the distance between them will become 1 instead of 10,000. You can see the effect of this on the Voronoi cells for one example below. As expected, this makes a drastic difference in the middle of each picture, where it switches from green to blue depending on how we scale the dimensions.
Scaling the dimensions will have a similar impact on most of the data analysis algorithms that we’re going to look at, though it’s most obvious for Nearest Neighbors. So choosing the right scale for each dimension (i.e. how much to multiply it by) is a vital step in the process of preparing data. One common approach is to scale things so that the possible values for each variable are between 0 and 1. For example, we would divide each person’s spice preference by 5 so the possible choices become 0, .2, .4, .6, .8 and 1, rather than 0,1,2,3,4 and 5. If we were keeping track of people’s ages, we might subtract 18 from each person’s age, then divide by 82 so that someone who is 18 gets the value 0 and someone who is 100 gets the value 1. This leaves out some potential ages, but it seems reasonable that most people who are choosing restaurants would be between the ages of 18 and 100. A slightly more objective approach would be to look at the highest and lowest values for each dimension in our initial data set and scale so that the lowest becomes 0 and the highest becomes 1.
Things get more complicated if we try to do this for annual income. Incomes can range from zero to hundreds of millions (or perhaps more). However, the range in which most people would fall is much much smaller than the entire possible range. So if we have a single person in our data set with an annual income of $100 million, and we divide by this amount, then almost all of the points in our data set will end up between 0 and .001, which will make the income dimension essentially useless.
One way to minimize this issue is to scale by dividing by the variance of the values in a given dimension, rather than the entire range. The variance of a set of numbers measures how spread out they are in a way that takes into account all the numbers, not just the highest and lowest. So, if there are one or two numbers that are much higher or lower than the rest, the variance minimizes the impact that these outliers have on the scaling factor. For a detailed description of variance (and examples), see the wikipedia page for it.
In fact, one could go a step further by noting that the principal components defined in my previous post on PCA are the directions (not necessarily parallel to features) in which the variance of the data set is maximal. If we scale the data along the principle component directions so that each principle component has length one, then the data will be very evenly round. This is a more complicated linear transformation than just scaling individual features, and it leads us in the direction of kernels, a topic I will write about in much greater depth later on.
There are also more interesting ways to measure the “distance” between data points than the standard vector distance (i.e. Euclidean distance.) For example, when comparing customers of an online store, one might define the similarity between two customers as the number of types of items that both customers have bought in common divided by the total number of types of items either of them has bought. This value is equal to one if two customers buy the exact same thing and 0 if they haven’t bought any of the same things. (Also, dividing by the total number of purchases takes into account the fact that some customers buy more than others.) If we define the distance to be one minus the similarity, then two customers with more of the same purchases will be closer to each other than two customers with only a few purchases in common.
For now, though, the key is to note that there are a lot of options and some very important decisions to make, even with an algorithm as simple as Nearest Neighbors clustering. There is no one best way to scale the different dimensions: The best choice for how to normalize your data will depend on the type of data, what you plan to do with it and how you will evaluate the results.