## Distributed Learning

So far on this blog, my focus has been on conventional algorithms for data mining and machine learning, i.e. algorithms that can run on a standard desktop or laptop computer with a single processor. However, one of the trends in modern data analysis is parallel computing – the use of networks of machines (often called clusters, but unrelated to the clusters that we’ve seen previously on this blog) that work together, dividing up the computation and allowing the operator to work with much larger data sets. This is one of the main ideas behind the buzzword “Big data.” As it turns out, many conventional machine learning/data mining algorithms can be efficiently translated into parallel algorithms, and in this post, I want to give an introduction to this process. There won’t be a lot of geometry directly involved in this, but the better we understand the conventional algorithms (say, via geometry), the easier it will be to effectively translate them into parallel/distributed algorithms.

Before we get into the basics of parallel/distributed machine learning, there’s one thing that I should point out: If you start with a poorly designed/poorly optimized algorithm, you will generally get a much bigger gain in efficiency by optimizing the algorithm than by just adding machines to your parallel cluster. So, before trying to parallelize an algorithm it’s best to first make sure that the conventional algorithm is designed as efficiently as possible.

The most common setup for parallel computing these days is to have a large number of individual machines, each with its own processor, connected to its own memory and hard drive(s). Each processor may have two or more “cores” that can work in parallel. However, unlike with traditional super computers, the processors in different machines do not have direct access to each other’s data. (Originally, the term super computer referred to machines with a large number of parallel processors and centralized, shared memory, though nowadays it’s also often applied to large clusters with distributed memory.) Because of the distributed memory, individual processors must explicitly communicate with each other, usually through ethernet or one of its faster relatives. This setup has the advantage that it is easy to add machines to a cluster, plus the machines that are used are relatively inexpensive (so called “commodity” machines.)

On the other hand, having so many (relatively cheap) machines means that you are essentially guaranteed to have some number of them not working properly at any given time. So, any software that runs on such a cluster has to include measures that allow it to keep running even if some of the machines fail. Usually, this means keeping multiple copies of data, regularly checking on the machines involved, and having procedures for switching parts of the process to healthy processors when necessary. But for this post, I’m going to gloss over this aspect.

Because each machine in a cluster has its own hard drive, it is common to store a (large) data set across multiple machines. Usually, each machine has a subset of the data points/rows of the data set. Because communication across ethernet is slow compared to communication between each processor and its local memory and hard drive, the goal is usually to have each processor access only its own data as much as possible. There will often be one machine that is designated as the leader, while the rest are workers. The general idea behind creating a parallel algorithm is to have each of the worker machines summarize its local data in some way (depending on the algorithm), and then send the summary to the lead machine in as concise a way as possible so that the lead can effectively combine all the different summaries that it receives.

You can think of the lead machine like the CEO of a big corporation who gets reports from all the managers below her. If the manager sends the CEO too much detail about the operations of their unit, the CEO won’t have the time to go through it all. If the manager sends too simple a message, the CEO won’t have enough information to effectively compare it to the reports from the other units. So the job of each manager who reports to the CEO is to find the right balance between detail and brevity.

A distributed algorithm has to find the exact same balance. Depending on the algorithm, there will be some minimal amount of information that the lead processor needs to get from each worker. However, any information beyond this will make the algorithm slower.

To get an idea of how this can work in practice, I want to look at principal components analysis (PCA). We’ll think of the algorithm slightly differently than how I described it in the previous post, by breaking it into four steps. First, we find the centroid/average of the data. Second, we translate the data by subtracting this centroid from each data point. Third, we calculate the covariance matrix. Fourth, we calculate the eigenvectors of the covariance matrix. We’ll look at how to implement each of these steps in turn.

For step two, the master sends the centroid vector to each worker machine. Each worker is then responsible for translating its own data points.

Step three gets a little tricky because calculating the covariance matrix involves multiplying two other matrices. In the conventional algorithm, we have to calculate $X^T X$ where $X$ is the matrix whose rows are the data points and whose columns are the features (after subtracting the centroid from each row). The matrix $X^T$ is the transpose matrix that we get from switching the rows and columns of $X$. In general, multiplying two matrices requires us to multiply together numbers from different parts of the two matrices. This could be problematic if each machine in our cluster only has access to certain parts of the matrix.

But if we look closely at this particular calculation, $X^T X$, it turns out that we can do it in a way that only requires each machine to use its locally stored data. We carry out the matrix multiplication by multiplying numbers from $X^T$ by numbers from $X$, then adding them up in a specified way. But we always multiply numbers from column i of $X^T$ by numbers in row i of $X$. Because $X^T$ is the transpose of $X$, this means that we’re always multilpying together two numbers from the same data vector.

In particular, we can redefine the calculation of $X^T X$ as follows: For each data vector $v$, we will calculate the matrix $v^T v$ in which the number in row i, column j is the product of the number in dimension i of $v$ and the number in dimension j. If we calculate this matrix for each data point (i.e. each row in $X$) and add them all up, the result will be exactly $X^T X$.

So we can carry out step three in much the same way as step one: Each processor calculates $v^T v$ for each of its stored locally data points, then adds them all together to form a single matrix. Each processor then sends this one matrix to the master, which adds them all together. Because matrix addition, like vector addition, is associative, the final sum is exactly $X^T X$.

For the final step, we have to calculate the eigenvectors of this matrix. However, note that the size of this final matrix depends on the number of features, rather than the number of data points. The usual assumption when dealing with distributed data like this is that the number of features will remain a manageable size, while the number of data points will grow too large for a single processor. Since the size of the final matrix is determined by the number of features, it will usually be small enough that we can just use a conventional algorithm on the lead machine for the last step.

So, in the end, we repeatedly used one trick, in slightly different forms, to make PCA parallel. Chu et. al. [1] have shown that you can use this same trick – having workers compute partial sums and then sending them to the master for the total sum – for a wide range of algorithms, including many of the algorithms that I’ve discussed previously on this blog.

Interestingly, one of the harder algorithms to adapt to a distributed cluster is the back-propagation method for training a neural network. The problem is that, by design, this algorithm requires adapting the weights of the neural network to one data point at a time. You can’t process the next data point until you’ve finished processing the current one, so you can’t have multiple processors work simultaneously. One way to fix this issue is a technique called batch training, in which you use multiple data points each time you update the weights. There are also some more advanced techniques for pre-training a neural network that make the back-propagation step go faster, and many of these techniques are easier to adapt to a distributed system. But that’s a topic for a future post.

[1] Map-reduce for machine learning on multicore, CT Chu, SK Kim, YA Lin, YY Yu, G Bradski, AY Ng, K Olukotun, NIPS, 2006. (The link has a lapsed security certificate but seems to be safe.)