Now that I’m finally getting the hang of my new job, I thought I’d try to get back to blogging regularly. It’s been very interesting seeing things from the perspective of software engineering, which is all about building systems around the algorithms that come out of computer science, statistics, mathematics, etc. In particular, the goal is usually to use the simplest suitable algorithm for a job, since it’s likely to be just one piece of a large, complex system, and simpler algorithms are easier to write, to debug and to maintain. As a result, it’s often tempting to use heuristics when a theoretically sound algorithm is unknown or would be overly complex. (Standard disclaimer: These comments represent my own opinions and have not been approved or endorsed by my employer.)
By a heuristic, I mean a simple algorithm that makes intuitive sense, but may only give an approximately correct answer. As an example, heuristics often come up when turning a vector data set into a similarity graph. Once we’ve calculated the distances between pairs of data points, we want to connect each data point to its K nearest neighbors for some number K. But how do we choose the value of K? Ideally, we would always have a theoretically sound method for choosing K based on what we know about where the data came from and what we’re going to do with the resulting graph. However, in practice the more common technique is to use a generic heuristic that can apply to any data set. For example, we might let K be the log of the number of data points in the set. This value is much easier to calculate, and seems to work reasonably well in most cases.
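To make that heuristic concrete, here’s a minimal sketch of building a K-nearest-neighbor graph with K = log(n) as the fallback. The function name and the brute-force distance computation are my own choices for illustration (fine at toy scale, not something you’d use on a large data set):

```python
import numpy as np

def knn_graph(points, k=None):
    """Connect each point to its k nearest neighbors.

    If k is not given, fall back on the log-of-n heuristic
    described above (rounded down, with a floor of 1).
    """
    n = len(points)
    if k is None:
        k = max(1, int(np.log(n)))  # the generic heuristic
    # Brute-force pairwise Euclidean distances (n x n matrix).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor
    # Indices of the k nearest neighbors of each point.
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return {i: list(neighbors[i]) for i in range(n)}

rng = np.random.default_rng(0)
graph = knn_graph(rng.normal(size=(100, 2)))
# With n = 100, the heuristic gives k = int(log(100)) = 4.
```

Note that the resulting graph is directed ("i is among j’s nearest neighbors" isn’t symmetric); in practice you’d often symmetrize it before clustering.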
The advantages of heuristics should be obvious: they tend to be simple to explain/understand and simple to code. The problem, of course, is that without a theoretical justification, it can be hard to tell whether your heuristic is really as good as you think it is. For example, in some informal experiments a few years ago with different graph clustering algorithms and different values of K, my collaborator Sean Bowman found that in many data sets, the value of K that gave the best results (i.e., the discovered clusters were closest to the actual labels) often differed substantially from the value suggested by the standard heuristic.
Now, in this case, a heuristic may be necessary because there may not be any theoretical basis for choosing a particular value of K. However, there are other situations where a theoretically sound method exists but is too slow or too complex. In these situations, you actually have a choice to make, between an impractical approach that you are sure will work and a practical approach that may not.
In a number of cases like this, it turns out, it’s possible to come up with an algorithm that is simple enough that it feels like a heuristic, but still manages to closely approximate the behavior of the theoretically sound behemoth. The computer science literature is sprinkled with these little gems, but for this post I want to focus on one (not particularly profound) example, just to get the point across.
Let’s say we have a database with a million entries, and we’re given a criterion that allows us to label each entry as either type A or type B. We don’t know how many entries of each type there are in the database and, moreover, calculating the type of an entry takes long enough that we don’t want to do it for the entire data set. So here’s the question: How can we generate a random sample of 50 entries of type A?
The theoretically sound solution to this problem would be to go through the whole database, count the entries that are of type A, pick 50 distinct numbers between 1 and N (where N is the number of type-A entries), and take the corresponding type-A entries to be our sample. However, this requires that we calculate the type of every single entry in the database, and we said we wanted to avoid doing that.
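As a sketch (with a toy stand-in for the slow labeling criterion, since nothing above pins one down), the theoretically sound version might look like:

```python
import random

def exhaustive_sample(entries, is_type_a, k=50):
    """Theoretically sound version: classify every entry,
    then draw k uniformly at random from the type-A ones."""
    type_a = [e for e in entries if is_type_a(e)]  # checks ALL entries
    return random.sample(type_a, min(k, len(type_a)))

# Toy database: even numbers play the role of type A.
random.seed(0)
db = list(range(1_000))
sample = exhaustive_sample(db, lambda e: e % 2 == 0)
```

The `is_type_a` callable here is purely illustrative; the point is that the list comprehension evaluates it a million times on the real database.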
So let’s try a heuristic solution: We’ll repeatedly pick a random number from 1 to 1 million, then determine the type of the corresponding entry in the database. (Let’s assume we have a system in place to guarantee we never pick the same entry a second time.) If the entry is of type A, we’ll add it to our sample. If it’s of type B, we’ll throw it away and move on to the next random number. We’ll keep doing this until we have a sample of 50 type-A entries.
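The heuristic version can be sketched the same way. Here the "never pick the same entry twice" bookkeeping is a simple set of seen indices, which is my own stand-in for whatever system the description assumes:

```python
import random

def heuristic_sample(entries, is_type_a, k=50):
    """Probe random entries, keep the type-A hits, stop at k.

    `seen` stands in for whatever mechanism guarantees we never
    pick the same entry a second time.
    """
    sample, seen = [], set()
    n = len(entries)
    while len(sample) < k and len(seen) < n:
        i = random.randrange(n)
        if i in seen:
            continue  # already checked this entry
        seen.add(i)
        if is_type_a(entries[i]):
            sample.append(entries[i])
    return sample, len(seen)

# Toy database of a million entries; even numbers play type A.
random.seed(1)
db = list(range(1_000_000))
sample, checked = heuristic_sample(db, lambda e: e % 2 == 0)
```

Unlike the exhaustive version, `checked` here will typically be on the order of a hundred rather than a million.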
First, notice that (in most cases) we won’t need to check the types of all entries in the database. If we’re unlucky enough to get a lot of type Bs before we get our 50 type As, then we’ll have to do a lot of checking. In fact, if there are fewer than 50 type-A entries, we’ll have to go through the whole database to get our (partial) sample. However, if there are a reasonably large number of As, then we should do OK. For example, if half of the entries are type A, there’s a good probability that we’ll need to check only around 100 entries (which is much better than 1 million).
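For intuition on that figure: if a fraction p of the database is type A, the number of probes needed to collect k hits is (ignoring the small without-replacement correction) a negative-binomial random variable with mean k/p, so p = 1/2 and k = 50 gives about 100 checks on average:

```python
# Expected number of probes to collect k type-A hits when a
# fraction p of the database is type A (negative-binomial mean,
# treating probes as independent draws).
def expected_checks(k, p):
    return k / p

# Half the entries type A: about 100 probes for 50 hits.
assert expected_checks(50, 0.5) == 100.0
# Only 1% type A: about 5,000 probes -- still far below a million.
assert expected_checks(50, 0.01) == 5000.0
```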
But there’s a more important question: Is the sample that we end up with still random, or will this approach bias the sample in some way? For example, will entries that are earlier in the database be more likely to show up in the sample? Less likely? Your immediate intuition should be to answer no: there doesn’t seem to be any aspect of the process that would favor one set of type-A entries over another. In particular, while this algorithm technically knows what order the entries appear in, it only uses this information to keep track of them. As far as the selection process is concerned, they could be getting pulled out of a jumbled bag, without any initial ordering.
But that’s just intuition, and intuition can be wrong, so let’s double-check. We can think of our initial theoretically sound algorithm as follows: We pull out all the type-A entries, randomly arrange them into some order, then pick the first 50 entries in the resulting list. Since any particular order has an equal probability of being the chosen order, each type-A entry has an equal probability of appearing in the first 50 slots of the chosen list. Note that in practice, even with the theoretically sound algorithm, we don’t actually calculate the ordering beyond the first 50. However, we know that if we were to calculate the whole thing, we’d still end up with the same first 50 entries. So we can pretend that constructing the whole ordering was part of our actual solution.
With the heuristic algorithm, we can apply a similar analysis. We’ll think of the heuristic algorithm as choosing an ordering on the entire database, both type-A entries and type-B. Then we’ll pull out the type-A entries into a separate list, keeping their order the same. Then we’ll choose the first 50 entries from this list and take that as our sample. So that brings us to the question: Is every ordering of the type-A entries equally likely to result from choosing a random order for the entire database? The answer turns out to be yes, though it’s probably not worth going into the technical details of a proof here.
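In place of a proof, here’s a quick empirical sanity check at toy scale (the entry names are made up): shuffle the whole database, keep the type-A entries in their induced order, and see whether each one is equally likely to land in the sample.

```python
import random
from collections import Counter

# Toy check: 6 entries, 3 of type A ("a0", "a1", "a2"); sample 1.
# If the heuristic is unbiased, each type-A entry should be
# chosen about a third of the time.
random.seed(2)
db = ["a0", "b0", "a1", "b1", "a2", "b2"]
counts = Counter()
trials = 30_000
for _ in range(trials):
    order = db[:]
    random.shuffle(order)                       # random order on everything
    type_a = [e for e in order if e[0] == "a"]  # keep relative order
    counts[type_a[0]] += 1                      # "sample" of size 1

# Each of a0, a1, a2 should land near 10,000 of the 30,000 trials.
```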
The point of all this is that the heuristic algorithm in this case turns out to give essentially the same result as the theoretically sound algorithm, and will (in most cases) be significantly more efficient. However, in order to see that the results are the same, we needed to understand both algorithms in a different and much deeper way.
So here’s my (slightly self-serving) hypothesis: While we often think of the theoretical side of data analysis as the nitpicker, reminding the data scientists to mind their error bars, a strong theoretical understanding of different algorithms, such as the ones I’ve discussed on this blog, can actually lead to simpler solutions based on better heuristics. I’ll try to continue to make the case for this hypothesis in future posts (which will hopefully be more regular from now on).