In the last post, I discussed how one can analyze a data set from the point of view of geometry, by thinking of each data point as coordinates in a high-dimensional space. The thing is, when we’re analyzing these data sets, we’re usually not interested in the points in the data set so much as the points that aren’t in the data set. For example, if we’re using past customer data to predict whether a new potential customer will buy a product, the new customer is not in our initial data set. So the goal of analyzing the data points from past customers is to understand the points that may correspond to new customers. The same goes for determining if a tumor is benign, if a certain blood pressure drug will work on a given patient, or any other application of data analysis.
In other words, we want to think of our data set as a small number of representatives sampled from some larger set of potential data points. So, sticking with the theme of this blog, I want to think about this set of potential data points as a geometric object that fills in the gaps between the data points that we do have. The main questions of data analysis boil down to one: How do we infer the structure of this object from our small number of samples?
For example, consider the two-dimensional data set shown on the left in the figure. It looks like if we fill in the gaps between the points, we’ll get a filled-in ellipse, as in the picture in the middle.
However, you might notice that the points in the data set are denser near the middle. There are two possible reasons for this. The first possibility is that our data has noise – small errors in each coordinate that were introduced when we collected the data. Because of this, some of the points in the data set may actually be outside the ellipse that we’re interested in. You can think of this as making the ellipse blurry, as on the right side of the picture.
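We can simulate this first possibility directly. The sketch below (my own illustration, not from the post; the ellipse dimensions and noise level are made up) draws points from a filled ellipse and then perturbs each coordinate with small Gaussian errors, so that some points end up outside the ellipse we started from:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Sample 500 "true" points uniformly from the filled ellipse
# (x/2)^2 + y^2 <= 1 (semi-axes 2 and 1), via rejection sampling.
points = []
while len(points) < 500:
    x, y = rng.uniform(-2, 2), rng.uniform(-1, 1)
    if (x / 2) ** 2 + y ** 2 <= 1:
        points.append((x, y))
true_points = np.array(points)

# Add small Gaussian measurement noise to every coordinate.
noisy_points = true_points + rng.normal(scale=0.15, size=true_points.shape)

# Some of the noisy points now fall outside the original ellipse --
# this is the "blur" around the edge of the shape.
inside = (noisy_points[:, 0] / 2) ** 2 + noisy_points[:, 1] ** 2 <= 1
print(f"{(~inside).sum()} of {len(noisy_points)} noisy points lie outside the ellipse")
```

The larger the noise scale, the blurrier the boundary of the ellipse becomes.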
The second possibility is that our customers (or whatever type of data we’re looking at) are more likely to come from closer to the center of our ellipse. So, for example, the points near the center may correspond to the core demographic. People who are similar but not quite in this demographic are still likely to become customers, just not quite as likely. So again, we can think of this as a blurry ellipse in which the darkness of a given point indicates how likely a person with the corresponding attributes is to become our customer.
So in either case, we want to think of the underlying geometric object as a sort of cloud of probability rather than a well-delineated object. This “cloud” is called a probability distribution. (Technically, it’s the probability density associated to the distribution, but to keep things simple, I will just call it the distribution – see the comments for more.) It assigns to each point a value greater than or equal to 0, which we can think of as the probability that the point is in the distribution. This is indicated by a shade of gray, with black being very large values and white being 0. The elliptic distribution on the right of the figure is a very popular one, called a Gaussian distribution.
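To make this concrete, here is a minimal sketch of evaluating the density of a two-dimensional Gaussian at a point. The mean and covariance values are illustrative choices, not anything from the post; the covariance is stretched along the x-axis so the level sets are ellipses:

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Density of a multivariate Gaussian at point x (always >= 0)."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    exponent = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(exponent) / norm

# An elliptical Gaussian: wider along x than along y (illustrative values).
mean = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0], [0.0, 1.0]])

# Largest at the center, falling off toward the edges -- exactly the
# "blurry ellipse" picture, with darker shades near the middle.
print(gaussian_density([0, 0], mean, cov))  # peak value, at the center
print(gaussian_density([3, 0], mean, cov))  # smaller value, farther out
```

The density never hits exactly zero, so the “edge” of the ellipse is a gradual fade rather than a sharp boundary.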
The problem is that there are lots of different ways one could fill in the gaps between the points. In fact, the nature of probability causes there to be lots of apparent gaps between data points even where there aren’t gaps in the distribution. The second figure shows three different probability distributions that one might infer from the same data set – four points (shown in blue) that look like the corners of a rectangle. The one on the far right is probably wrong – it sticks too closely to the existing data, so it won’t correctly predict any new data points. This is an example of what’s called overfitting. On the other hand, it’s much harder to say which of the two other distributions is more reasonable. The only way we could tell if one was more accurate than the other would be to collect more data.
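One concrete way to see how different choices fill in the gaps differently is kernel density estimation, where the inferred distribution is an average of Gaussian bumps centered on the data points. (This is my choice of method for illustration; the post doesn’t commit to any particular one, and the rectangle coordinates are made up.) A single “bandwidth” knob controls how much the bumps spread out, and turning it too low reproduces the overfitting picture:

```python
import numpy as np

# Four data points at the corners of a rectangle, as in the figure.
data = np.array([[-1.0, -0.5], [1.0, -0.5], [-1.0, 0.5], [1.0, 0.5]])

def kde(x, data, bandwidth):
    """Kernel density estimate: average of 2D Gaussian bumps on the data."""
    x = np.asarray(x, dtype=float)
    sq_dists = np.sum((data - x) ** 2, axis=1)
    kernels = np.exp(-sq_dists / (2 * bandwidth ** 2))
    kernels /= 2 * np.pi * bandwidth ** 2  # normalize each 2D Gaussian bump
    return kernels.mean()

center = np.array([0.0, 0.0])  # middle of the rectangle, away from every point

# Tiny bandwidth: the estimate sticks too closely to the existing data --
# essentially zero density between the points, the overfitting picture.
print(kde(center, data, bandwidth=0.1))

# Larger bandwidth: the gaps get filled in, giving one smooth blob.
print(kde(center, data, bandwidth=1.0))
```

Both estimates are perfectly consistent with the four observed points; only more data could tell us which bandwidth comes closer to the true distribution.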
Any analysis of a data set will require us to make a decision like this at some point. (Sometimes this decision will be implicit rather than explicit, hidden in the other choices that we make.) Avoiding overfitting is a skill that anyone who works with large data sets needs to learn, primarily from practice and experience.
Of course, if you’re analyzing a high-dimensional data set, you can’t look at the data the way we did here, so it’s much harder to decide if a given distribution is reasonable. Instead, it’s up to the algorithm. Any algorithm that you choose has a different way of inferring a probability distribution from a data set. (Again, constructing a probability distribution will often be done implicitly, hidden beneath the surface.) Each algorithm has a different way of avoiding overfitting, either built into it or built into the instructions for using it.
When we look at different tools and algorithms in upcoming posts, I will describe them in terms of the properties of the probability distributions that they (implicitly or explicitly) construct and what these properties tell us about the output that the algorithms produce.