In my last post, I described the PageRank algorithm that was the basis for the original Google search. This week, I want to use PageRank to motivate the MapReduce framework for distributed data analysis. These days, MapReduce is beginning to fall out of fashion and is being replaced by new technologies, many of which allow data stream analysis or are graph based. But most of these new distributed frameworks build, in one way or another, on the basic ideas of MapReduce. And as I’ll explain in future posts, you can see traces of the basic Map Reduce concepts in many of the new frameworks.
MapReduce is a batch processing framework, which means that it starts with a fixed collection of data, runs a sequence of steps on it, then outputs an answer. If the data changes or new data is added after the process is started, you have to run the whole thing again from scratch in order to incorporate the changes. That’s why, for example, it used to take a couple of days for new websites to show up in Google search results; once a Google spider found a new page, it wouldn’t be added to the search results until the next time the PageRank algorithm was run on the entire set of known web pages. Nowadays, Google uses a streaming framework that allows search results to be updated incrementally. I’ll discuss a few different approaches to streaming data analysis in future posts.
Last time, I described the PageRank algorithm in terms of tubes that carried water between buckets representing different websites. One can think of the algorithm as keeping track of how much water would be poured into each bucket during each minute that the system runs. (The water pumped out of the bucket is proportional to the water level in the bucket, which is much easier to keep track of.) Because the number of websites that we would want to keep track of is too large for a single computer to handle, we will divide it up on a distributed network. Each computer will be responsible for keeping track of the water level and the outgoing links for a subset of the entire list of websites.
But a website recorded on one computer will often have a link/tube to a website stored on a different computer. In order for the second computer to correctly calculate the amount of water coming in to this website, the first computer will have to send it the information about this link/tube.
So, you can think of MapReduce as a system for organizing the messages between the different computers on the network about how the pieces of data that they’re responsible for interact with each other. There are a number of other types of problems that require a similar sort of messaging. The most common example used to explain MapReduce is an algorithm that counts how many times each word appears in a large collection of texts. The idea is to have each computer in the network be responsible for some number of text files, and some collection of words. Each computer makes a list of how many times each word appears in the texts that it’s responsible for. Then it has to send these totals to the computers responsible for each word, which calculates the final total.
The word count example is reminiscent of the basic distributed learning framework that I discussed two posts back, in the sense that each computer calculates partial sums that are sent to a single computer to calculate the total. The difference here is that instead of having one master computer that computes one final sum, we have multiple final sums, calculated by multiple computers.
The MapReduce framework formalizes this in terms of three steps: Map, Combine and Reduce. The Combine step is optional, and is in many cases the same as the Reduce step, which is why it gets left out of the name. We’ll think about the steps in terms of sending messages between the computers in the network. The Map step can be thought of as writing down the list of messages that need to get sent. The Combine step consolidates the messages so that there is less to send across the network (since this is usually the bottleneck.) The Reduce step then processes all the messages received by each computer.
The step that’s missing from this is, of course, the sorting and delivering. This is called the Shuffle step and is all handled by the Map/Reduce framework. In fact, you can think of MapReduce as a program that has been almost entirely written, except that three routines have been left out, namely Map, Combine and Reduce. All you have to do is to fill in those three functions so that they fit into the existing program. Lets see how we might do this for PageRank.
First, for the Map step, we want to create the messages that need to get sent out. In other words, for each website/bucket, we want to send a message to each website/bucket that it has a link/tube to and tell it how much much water is coming in along that tube. Many of these different websites will be stored on the same computer, possibly even the computer that’s currently doing the processing. But the Map routine doesn’t have to worry about that, since it’s handled by the shuffle step. We’ll address each message to the website/bucket, and let the MapReduce framework decide which computer on the network to send it to.
In practice, the input to each function might be a filename or a link to an entry in a database, either of which will have to be processed in some way to extract the data. But to make the following pseudo-code easier to read, I’ve assumed that the input is given directly.
// input: water_level - the current water level the site being processed links - a list of site ids that are linked to from the current site // output: a list, each member of which is a pair (site id, message) function Map(water_level, links) // the output is split evenly between the links output = (0.7 * water_level)/size(links) message_list = new List for each (link_id in links) message_list.add(link_id, output) return message_list
The MapReduce framework will call this function for each website in the data set on each computer, then send the resulting messages to the computers that are responsible for each link id. These computers, in turn, run the reduce function, which in the case of PageRank calculates the new water level for each site. Again, in practice the output of this function would have to be saved to disk or entered into a database, but to make the pseudocode simpler, I’m just making the new water level the return value of the Reduce function.
// input: water_level: the current water level of this site incoming: a list of incoming messages // output: the new water level function Reduce(water_level, incoming) new_level = water_level for each (water_amount in incoming) new_level += water_amount return new_level
Now, you may have noticed that I left out the Combine step. Recall that the combine step is meant to reduce the number of messages that need to be transmitted across the network. In particular, it’s worth noting that during the Map step, MapReduce doesn’t send the messages across the network as they’re produced. With networks, it’s almost always faster to send one long message than to send lots of short ones. So once all the Map steps have been completed on a given computer, the framework sorts them all based on what computer they’ll be sent to (usually using a trick such as a hash function) and then sends one big packet of messages to each computer. During this process, it also has the option of running the Combine step.
In our case, what would the Combine step do? Lets say that after running all the Map steps for the different websites that are stored on computer A, we end up with three different links to the same website on computer B. Well, rather than including these three different messages in the packet of messages sent to computer B, and letting computer B add them all up, it would make more sense to have computer A add them up, and just include one message in the packet. So, the Combine step will be essentially identical to the Reduce step – adding up all the values that are sent to each site.
This Combine step is possible because addition is associative; we get the same sum no matter what order we add the numbers in. Not every Reduce step will have this property, so not every MapReduce algorithm will have a Combine step (and theoretically some may have a Combine step that is significantly different from the Reduce step.) Thus the MapReduce framework will only run a Combine step if it’s explicitly told to do so.
Note that if we were to implement PageRank under MapReduce in this way, it would only give us one iteration. To get to the final answer, we would have to run through the Map-Combine-Shuffle-Reduce cycle multiple times, as well as adding in a way to keep track of when to stop.
In fact, many algorithms end up requiring different combinations and sequences of Map-Combine-Shuffle-Reduce steps, so it has become common to think of these as separate building blocks that can be arranged and stacked in lots of different ways. This is the idea behind Google’s FlumeJava, as described in this white paper (PDF). The FlumeJava library is not publicly available, but there is at least one open source clone of it.
But perhaps more importantly, echoes of these Map-Combine-Shuffle-Reduce steps appear in many of the newer distributed learning frameworks, even those that replace batch processing with streaming, and those that are specifically designed for graphs. I plan to describe some of these new frameworks in upcoming posts, and I’ll try to point out the hints of MapReduce that arise.