As I noted previously, while there are many papers in the research literature that describe interpretable models, or ways of adding a layer of interpretability on top of existing models, most of these papers are not explicit about what they mean by interpretability. Two exceptions that I’ve found (though there are probably others) are Ribeiro, Singh and Guestrin’s paper on Local Interpretable Model-agnostic Explanations (LIME) and Kulesza, Wong, Burnett and Stumpf’s paper on Principles of Explanatory Debugging (PED). The properties I’ll propose below are taken from these two papers, with different names and a few modifications.

It’s also worth mentioning a distinction that Zack Lipton makes in his Mythos of Model Interpretability, between transparency – explaining the way the whole model works – and post-hoc interpretability – explaining how individual predictions were made, without necessarily describing the process for calculating them. Even though most of what I’ve written previously on this blog has focused on transparency, this post and the rest of this series of posts will focus on post-hoc interpretability.

So, here are the seven properties. A post-hoc model interpretation should be:

**Concise.** Both the LIME paper and the PED paper point out that an explanation should not overwhelm the user with details about how a prediction was made. A central goal of machine learning is to deal with the complexity that we can’t or don’t want to incorporate into our mental models. If an interpretation forces the user to consider all the features and coefficients that went into the model, that defeats the purpose. An explanation should minimize the cognitive load required of the user.

For example, if a prediction can be explained by indicating 100 or 1000 factors that contributed to the value, it would be more concise to have the explanation algorithm pick out the top 5 according to some notion of importance, and only present those. Of course, if you make an explanation too concise, it may cease to be useful. So you’ll always want to balance making an explanation concise against the other desirable properties, particularly:
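As a minimal sketch (the feature names and attribution scores below are hypothetical), condensing a full attribution list down to a concise explanation might look like:

```python
# Sketch: condense a full set of feature attributions into a concise
# explanation by keeping only the top-k features by absolute importance.
# All feature names and scores here are made up for illustration.

def concise_explanation(attributions, k=5):
    """Return the k features with the largest absolute contribution."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

attributions = {"age": 0.8, "income": -1.2, "zip_code": 0.05,
                "tenure": 0.4, "balance": -0.9, "num_accounts": 0.1,
                "last_login_days": -0.3}

for feature, score in concise_explanation(attributions, k=5):
    print(f"{feature}: {score:+.2f}")
```

The cutoff `k` is exactly where the concise/complete tradeoff lives: a smaller `k` reduces cognitive load but hides more of the model.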

**Faithful.** The explanation should accurately describe the way the model made the prediction. For example, you might try explaining the predictions from a neural network by separately training a linear model, then presenting the features that contributed to the linear model’s prediction (which are relatively easy to determine) as the features that *probably* had the most impact on the neural net’s prediction. This explanation would not be faithful, since it does not describe the actual model. The authors of PED use the term *sound* for this property, and the LIME paper calls it *local fidelity*.
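The unfaithful version trains the linear model separately; LIME recovers *local* fidelity by fitting the linear model to the black-box model’s own predictions on perturbed copies of the instance being explained. A minimal sketch along those lines, where the `black_box` function and every number are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)

def black_box(X):
    # Pretend this is an opaque model we can only query for predictions.
    return np.tanh(2 * X[:, 0] - X[:, 1]) + 0.1 * X[:, 2] ** 2

instance = np.array([0.5, -0.2, 1.0])                     # prediction to explain
samples = instance + 0.1 * rng.standard_normal((500, 3))  # perturb nearby points
targets = black_box(samples)                              # query the black box itself

# Weighted least squares: weight each sample by its closeness to the instance.
weights = np.exp(-np.sum((samples - instance) ** 2, axis=1) / 0.02)
A = np.hstack([samples, np.ones((500, 1))])               # add an intercept column
Aw = A * weights[:, np.newaxis]
coef, *_ = np.linalg.lstsq(Aw.T @ A, Aw.T @ targets, rcond=None)

print(coef[:3])   # the local importance of each feature
```

Because the surrogate is fit to the black box’s own outputs near the instance, its coefficients describe what the actual model does there, which is what faithfulness asks for.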

**Complete.** An explanation is *complete* if it explains all the factors and elements that went into a prediction. In order to be concise, an explanation should never be absolutely complete, but you can measure how complete (or how close to complete) it is. The authors of the PED paper have an earlier paper, Too Much, Too Little, or Just Right?, that examines the tradeoff between a model being concise, faithful (sound) and complete.

**Comparable.** An explanation should help you compare different models to each other by examining how they handle individual examples. The LIME authors call this being *model-agnostic*. This is relatively straightforward for explanation algorithms that can be applied to multiple types of models and define the same form of explanation for all of them. It’s trickier for interpretable models that provide an explanation in a form that’s specific to that type of model. But if the explanation gives enough insight into how the prediction was made, it could still be comparable.

**Global.** The explanation should indicate how each individual prediction fits into the overall structure of the model. The LIME authors call this a *global perspective*. It’s related to the notion of *transparency* I mentioned above – understanding how the whole model works. However, a global explanation only has to indicate enough of the model’s structure to provide reasonable context for each individual prediction. Is the predicted value especially high or low? Were there a lot of nearby points in the training set? Is the variance between these data points high? Is the prediction based on extrapolating from lots of unrelated data points (and thus less reliable)?

**Consistent.** Users will typically interact with a model over and over again. Ideally, they should learn more about the model over time and make better decisions as a result. This is a major theme in the PED paper, since it focuses on debugging models through repeated explanations. An explanation algorithm is *consistent* if each successive explanation helps the user to better understand later predictions. Perhaps more importantly, the user should never perceive a contradiction between different explanations. For example, if a feature increases a risk prediction for one data point but decreases it for another data point, the explanations should include enough information for the user to understand why.

**Engaging.** The PED paper notes that an explanation should encourage a user to pay attention to the important details, and they point to research showing (not surprisingly) that users who pay more attention to explanations become familiar with the model faster and ultimately make better decisions. Some of this is covered by the earlier properties: If the explanations aren’t concise, the users’ eyes will glaze over. If they aren’t faithful and consistent, the users will be distracted by the inconsistencies. But there’s more to being engaging than just those properties, and it should be considered separately.

This last property has a lot to do with how the explanation is presented rather than the explanation itself, so you could argue that it’s more of a User Experience (UX) issue. But in case you didn’t notice, this whole post has been secretly about UX – how users interact with an interface that happens to have an ML model behind it. Interpretability is as much a UX problem as it is a machine learning problem, which is part of what makes it both so interesting and so difficult.

The research literature identifies many different reasons to want model interpretability, many of which are described in Zack Lipton’s Mythos of Model Interpretability. Zack organizes these into four categories – Trust, Causality, Transferability and Informativeness. (Read the paper for what these mean.) I’m going to suggest a slightly different set of categories based on the question “If I have an interpretable model, what should it allow me to do?” I’m not suggesting that an interpretable model should do all the things below, but it should do at least some of them. Though if you think this scheme leaves something out, please let me know in the comments!

So, here it is: An interpretable model should allow you to…

**Identify and Mitigate Bias.** All models are biased – both algorithmic models and mental models. In fact, algorithmic models can magnify the bias of our mental models, as Cathy O’Neil has written about extensively. You can never completely eliminate bias, but you can often fix the more egregious forms, or at least choose not to use the models that are biased in unacceptable ways. This is roughly what Zack and others refer to as *trust*. It seems to be the most common motivation for interpretability described in the literature, probably because it’s of primary interest to the people who are developing the algorithmic model, convincing others to use it, and then writing papers about it.

In fact, the ability to understand the biases in a model can be more important than accuracy for model adoption. For example, the introduction of this paper describes a case where a large healthcare project chose a rule-based model over a more accurate neural network for predicting mortality risk for patients with pneumonia. The decision was made after they discovered a rule in the rule-based system suggesting that having asthma lowered one’s risk of dying from pneumonia. It turned out this was true, but only because pneumonia patients with pre-existing asthma were consistently given more aggressive treatments that led to better outcomes. However, the model would have suggested they required less treatment since they were at lower risk. The group running the project realized that the neural network probably had similar biases, but they had no way of telling what or how bad they were. So they decided to go with the model whose biases they could recognize, and thus mitigate.

Recognizing the biases in our mental models is notoriously difficult, but recognizing bias in a black-box algorithmic model is even harder. With our mental models, we can use self-reflection and the ability to recognize new factors to reduce bias. Algorithmic models can train on larger data sets and treat all data points equally, but they can’t self-reflect. The right type of interpretability could allow us to apply “self-reflection” to algorithmic models and get the best of both worlds.

**Account for context.** This is essentially what I wrote about in my last post, so I won’t go into additional detail here. As I described previously, an algorithmic model can never account for all the factors that will affect the decision that the user finally makes. An interpretable model that helps you understand how the factors that are included in the model led to a prediction should allow you to adjust how you use the prediction based on these additional factors.

**Extract Knowledge.** Algorithmic models often have a form that’s incompatible with mental models. Mental models are made up of relatively simple causal relationships, augmented by flexible, subconscious intuition. Algorithmic models are essentially probability distributions that measure correlations between rigidly defined values. However, if you look at a series of predictions from a model, the pattern recognition parts of your brain won’t be able to keep from trying to extract rules to add to your mental models. The problem is that these patterns may not be real, particularly if the set of examples you look at is biased.

An interpretable model should help you to determine if the patterns that appear to be present in the model are really there, or just artifacts of a biased set of examples. This is similar to identifying bias, as described above, except that here you’re learning from the model rather than evaluating it. With an interpretable model, you should be able to combine the strong pattern recognition and simplification skills of the human mind with the algorithmic model’s ability to learn from massive amounts of data. For example, there’s now a fair amount of research on causal inference in algorithmic models. A causal algorithmic model should be more compatible with mental models that rely heavily on causal reasoning, making it even easier to extract rules that can be incorporated into your mental models. Zack’s *Mythos* paper includes causality as a motive for interpretability, though this isn’t the only type of rule you might want to extract.

**Generalize.** Algorithmic models are trained on carefully collected datasets to solve narrowly defined problems. Mental models are trained on a fire hose of input and applied to vaguely defined problems that they usually weren’t trained for. If you find an algorithmic model that works well for the problem it was trained on, you may be tempted to apply it to other problems, given how well that seems to work for mental models. In some cases, it might work for algorithmic models too, but in many cases it won’t. An interpretable model should help you determine if and how it can be generalized.

For example, let’s say that the pneumonia risk model described above works out really well and you want to use it to predict risks for other types of lung infections. A good approach to interpretability might tell you that the model relies on properties of pneumonia that are different for other infections, so you’d better create a new model for them. In fact, you could argue that this is what happened in the case described above: They tried to use a model that was trained to predict mortality risk for the problem of predicting which patients required the most care. In this case, they didn’t even realize they were generalizing until they tried to interpret one of the models.

So, there’s my proposal for the main motivations of interpretability: An interpretable model should allow you to identify and mitigate bias, account for context, extract knowledge and generalize. There’s a fair amount of overlap between these, but they capture the different types of motivations I’ve seen in the literature. In my next few posts, I’ll use these motivations as a lens through which we can look for ways to tell when models are interpretable.

The most common way that you delegate a part of your thought process to an algorithmic model these days is probably with personalized recommendation systems. For example, the ratings you see on a movie streaming site are often calculated based on what you’ve watched or liked in the past. Such a system might look at things like the genre of each movie, its actors and director, or how much “viewers similar to you” liked it.

But when you actually select a movie, you consider a number of factors that aren’t part of the model, such as the kind of mood you’re in, how much time you have or who you’re watching it with. The video streaming site could try to account for some of these by adding additional factors into the model, but they can never get all of them. No matter how complex they make the model, there will always be factors that could not have been anticipated when it was trained. If you completely delegate the decision to the model, picking the highest rated movie without accounting for these external factors, you’ll probably be in for a rude surprise.

Your mental models, on the other hand, can adapt to account for new and unexpected factors when you make the decision. So when you select a video, you have to combine the algorithmic recommendation with your own mental model of what type of movie you would like in the current context. Your mental model will include complex relationships between some of the factors used in the algorithmic model and the contextual factors that aren’t included. The better you can understand how the algorithmic model used the different factors to arrive at its prediction, the better equipped you will be to adjust the algorithmic recommendation based on contextual factors.

Imagine you see a musical comedy that is rated 4.2 out of 5 stars. From the number alone, you don’t know if that score is for its music or for its humor. If you’re in the mood for a comedy, you don’t want to pick a movie that isn’t very funny, but has great music. So with just the number, you’ll probably have to come up with your own estimation of how much you’ll like the movie, ignoring the algorithmic rating entirely. You effectively have to choose between using the algorithmic model without context or using your mental model without the help of the algorithm.

For the rating to be useful, it needs to come with additional hints about how it was calculated. For example, the result might point to a similar movie that you previously watched or rated highly. Or it might point to the factor that most contributed to the rating, such as the genre or the lead actor. While neither of these completely explain how the rating was calculated, they give you some amount of insight, which you can use to mentally adjust the rating based on additional context. You still don’t want to delegate the entire decision to the model, but you can delegate a part of the thought process. A model that can produce such insights is often called *interpretable*, though this term is used with a wide range of meanings in the literature.

In this example, even without hints for interpreting the predictions, you can probably still gain some information from the algorithmic model because you have a very good mental model of the types of movies that you like. But if you’re trying to understand a more complex and less intuitive system or situation – financial markets, human health, politics – you will have a less reliable mental model and will need to rely much more heavily on whatever information you can get from the algorithmic model. If we want algorithmic models to be successful in these types of contexts, we need to be able to present their predictions in ways that allow users to seamlessly and accurately interpret them, so they can delegate more of the decision making process while minimizing the risk of a nasty surprise.

In some sense, an interpretable model pokes holes in the barrier between the algorithmic model and your mental model. The ideal, of course, would be to break down the barrier entirely, so that you can fully incorporate the information from the algorithmic model into your mental model’s assessment. That’s probably impossible, but I’m convinced that we can poke significantly larger holes than have been made so far.

In my next few posts, I will discuss a number of different ways that researchers have tried to understand what interpretability means and to develop interpretable models. This is a subtle problem at the boundary between psychology and technology, with many directions that are waiting to be explored. I’m very excited to see how this field develops over the next few years.

Let’s start with what I mean by a “typical” RNN. In my post on RNNs, I mentioned three basic types of operations in the computation graph: 1) Outputs from nodes (which may be earlier or later in the network) can be concatenated into higher-dimensional vectors, for example to mix the current input vector with a vector that was calculated in a previous step. 2) Vectors can be multiplied by weight matrices, to combine and transform the concatenated vectors. 3) A fixed non-linear transformation can be applied to the output from each node in the computation graph.

The limitation with the RNNs defined by these three operations stems from the fact that while the weight matrices are updated during the training phase, they’re fixed while each sequence of inputs is processed. So each step in the input sequence is combined with the information stored from earlier steps in the same way each time. In some sense, the network is forced to remember the same things about each step in every sequence, even though some steps may be more important, or contain different types of information, than others.

An LSTM, on the other hand, is designed to be able to control what it remembers about each input, and to learn how to decide what to remember in the training phase. The key additional operation that goes into an LSTM is 4) Output vectors from nodes in the computation graph can be multiplied component-wise. In other words, the values in the first dimension of each vector are multiplied together to get the first dimension of a new vector. Then the second dimensions of the two vectors are multiplied, and so on.

This is not a linear transformation, in the sense that you can’t get the same result by concatenating the two vectors, then multiplying by a weight matrix. Instead, this is more like treating one of the input vectors as a weight matrix that you’ll multiply the other output vector by. But unlike the weight matrices in a typical RNN, this “weight matrix” vector is determined by a computation somewhere else in the network, so it’s determined when the new data is processed, rather than fixed throughout the evaluation phase.

This “weight matrix” vector is in many ways not as impressive as one of the built-in weight matrices in a typical RNN. It’s equivalent to a matrix with the values of the vector along the diagonal and the rest of the entries equal to zero. So it can’t do anything fancy to the other vector. Instead, you should think of it as a sort of filter that decides what parts of the other vector are important. In particular, if the “weight matrix” vector is zero in a given dimension then the result of multiplication will be zero in that dimension, no matter what the value was for that dimension in the other vector. If it’s equal to 1, the output value is exactly the value of the other vector in that dimension. (And a non-linear transformation is often applied to make sure the “weight matrix” values are very close to either 0 or 1.) So the “weight matrix” vector chooses what parts of a second vector get passed on to the next step. Because of this, a node in the computation graph where a “weight matrix” vector gets multiplied by a data vector is often called a *gate*.
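The gate operation is easy to sketch in numpy; the particular vectors below are arbitrary examples:

```python
import numpy as np

# A gate multiplies two vectors component-wise: the "weight matrix" vector,
# with entries near 0 or 1, selects which dimensions of the data vector pass
# through. It is equivalent to multiplying by a diagonal matrix.
gate = np.array([1.0, 0.0, 1.0, 0.0])    # keep dimensions 0 and 2
data = np.array([0.5, 2.0, 3.0, 0.7])

filtered = gate * data                    # component-wise product
via_diag = np.diag(gate) @ data           # the same thing as a matrix product

print(filtered.tolist())   # [0.5, 0.0, 3.0, 0.0]
```

Note that the result really is identical to multiplying by the diagonal matrix, which is why the gate can filter but never mix dimensions.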

An LSTM uses this fancy new fourth operation to create three gates, illustrated in the Figure below. This shows the inside of a single cell in an LSTM, and we’ll see farther down how this cell gets hooked up on the outside.

The LSTM cell has two inputs and two outputs. The output at the top (labeled *out*) is the actual RNN output, i.e. the output vector that you will use to evaluate and train the network. The output on the right (labeled *mem*) is the “memory” output, a vector that the LSTM wants to record for the next step. Similarly, the input on the bottom (labeled *in*) is the same as the input to a standard RNN, i.e. the next input vector in the sequence. The input on the left (also labeled *mem*) is the “memory” vector that was output from the LSTM cell during the previous step in the sequence. You should think of the LSTM as using the new input to update the value of the memory vector before passing it on to the next step, then using the new memory value to generate the actual output for the step.

When a new input vector comes in through the bottom of the LSTM, it is multiplied by a weight matrix that puts it into the same form as the memory vector. We want to combine this with the memory vector using gates like the ones described above. So the input vector is also concatenated with the previous cycle’s memory vector, and this concatenation is multiplied by three different weight matrices: one controls what the cell “remembers” about the input, one controls what the cell “forgets” from memory, and one controls what part of the current memory is output at the top. The results of each multiplication are fed through a non-linear transformation that I didn’t include in the Figure.

Then these vectors are fed into gates, defined by our new network operation (indicated by a circle with a dot in it) as shown in the Figure. The middle gate filters the memory vector from the previous step and the bottom gate filters the transformed input vector. These two gated vectors are then added together to produce the memory vector for this step. In addition to becoming the memory vector that is sent to the next step in the LSTM, the memory vector is also filtered by the top gate to produce the actual output from the LSTM.

The key step in this process is how the memory vector and the transformed input vector are independently gated before being added together. In the simplest setup, each of the “weight matrix” vectors would have all their values 0 or 1, and these would be complementary between the two gates, so that each dimension gets the value from one or the other. The values, which are calculated from both the current input and the current memory vector, would thus determine which of the dimensions in the memory vector should be passed on to the next step, and which should be replaced with the corresponding value from the transformed input vector. But in practice, the network gets to “learn” whatever behavior is most effective for producing the desired output patterns, so it may be much more complex.
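The whole cell can be sketched in a few lines of numpy. The matrix names, the dimensions, and the choice of sigmoid for the gates and tanh for the transformed input are my own stand-ins for the unlabeled pieces of the Figure, not a faithful transcription of it:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_mem = 3, 4            # hypothetical input and memory dimensions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix puts the input into memory form; three more, applied to
# the concatenated (input + previous memory) vector, drive the remember,
# forget and output gates. Random values stand in for trained weights.
W_in = rng.standard_normal((d_mem, d_in))
W_remember = rng.standard_normal((d_mem, d_in + d_mem))
W_forget = rng.standard_normal((d_mem, d_in + d_mem))
W_output = rng.standard_normal((d_mem, d_in + d_mem))

def lstm_cell(x, mem_prev):
    concat = np.concatenate([x, mem_prev])
    remember = sigmoid(W_remember @ concat)         # what to keep from the input
    forget = sigmoid(W_forget @ concat)             # what to keep from memory
    output = sigmoid(W_output @ concat)             # what part of memory to emit
    candidate = np.tanh(W_in @ x)                   # the transformed input
    mem = forget * mem_prev + remember * candidate  # two gated vectors, added
    out = output * mem                              # the top gate filters memory
    return out, mem

out, mem = lstm_cell(rng.standard_normal(d_in), np.zeros(d_mem))
print(out.shape, mem.shape)
```

Production LSTM variants typically add further non-linearities (e.g. a tanh on the memory before the output gate), but the gating structure is the part described above.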

Speaking of which, let’s quickly look at how an LSTM is trained, using the idea of unrolling that I described in my previous post. As you probably picked up from the discussion above, we hook up an LSTM externally by adding an edge from the memory output on the right, back around to the memory input on the left. This is shown on the left in the Figure below. This edge is a bit unwieldy to draw, wrapping around behind the cell. But once you unroll the network, as on the right of the Figure, it forms a nice, neat, horizontal edge from each step in the LSTM to the next.

As with a standard RNN, you can use unrolling to understand the training process, by feeding the whole sequence of inputs to the network at once, and using back propagation to update the weight matrices based on the desired sequence of outputs.

Note that LSTMs are fairly “shallow” as neural networks go, i.e. there aren’t that many layers of neurons. In fact, if you ignore the gates, there is a single lonely weight matrix between the input vector and the output vector. Of course, there are a number of ways one could modify an LSTM to make it more flexible, and a number of people have experimented with “deep LSTMs”. But for now, I’ll leave that for the readers’ imaginations, or maybe a future post.

Recall that a neural network is defined by a directed graph, i.e. a graph in which each edge has an arrow pointing from one endpoint to the other. In my earlier post on RNNs, I described this graph in terms of the classical neural network picture in which each vertex is a neuron that emits a single value. But for this post, it’ll be easier to describe things in the tensor setting, where each vertex represents a vector defined by a row/layer of neurons. That way, we can think of our network as starting with a single vertex representing the input vector and ending at a single output vertex representing the output vector. We can get to every vertex in the graph by starting from this input vertex and following edges in the direction that their arrows point. Similarly, we can get from any vertex to the output vertex by following some path of edges.

A standard (non-recurrent) feed-forward network is a directed acyclic graph (DAG) which means that in addition to being directed, it has the property that if you start at any vertex and follow edges in the directions that the arrows point, you’ll never get back to where you started (acyclic). As a result there’s a natural flow through the network that allows us to calculate the vectors represented by each vertex one at a time so that by the time we calculate each vector, we’ve already calculated its inputs, i.e. vectors on the other ends of the edges that point to it.

In an RNN, the graph has cycles, so no matter how we arrange the vertices, there will always be edges pointing backwards, from vertices whose vectors we haven’t yet calculated. But we can deal with this by using the output from the previous step.

For example, the intermediate vector *h* in the Figure to the right is calculated from the input vector *x* and from the previous value of *h* itself, with the combination multiplied by a weight matrix *W*. The circle with the ‘c’ in it represents concatenating the vectors *x* and *h*, which means creating a new (higher-dimensional) vector where the first half of the entries come from the input vector *x* and the second half come from *h*.

When the first input value *x₁* arrives, we don’t yet have a previous value of the intermediate vector *h* to use, so we’ll just use the zero vector of the appropriate dimension, and we’ll let *h₁* be the value that we calculate. Similarly, we can calculate the first output value *o₁* by multiplying *h₁* by the matrix *V*.

Then comes the second value in the input sequence, which we’ll call *x₂*. When we go to calculate the new intermediate value, we haven’t yet calculated *h₂*, but we do have *h₁* lying around from the last step. So we’ll calculate *h₂* using *x₂* and *h₁*. Then we can calculate the output *o₂* from *h₂*. And *h₂* will be used to calculate *h₃* in the next step, and so on.
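This step-by-step computation can be sketched in a few lines of numpy. The names and dimensions below (*x* for inputs, *h* for the intermediate vector, *W* and *V* for the weight matrices) are placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, d_o = 3, 5, 2    # hypothetical dimensions for x, h and the output

W = rng.standard_normal((d_h, d_x + d_h))  # mixes the input with the previous h
V = rng.standard_normal((d_o, d_h))        # turns h into the step's output

def rnn_step(x, h_prev):
    h = np.tanh(W @ np.concatenate([x, h_prev]))  # concatenate, multiply, squash
    o = V @ h
    return h, o

h = np.zeros(d_h)        # the zero vector stands in for the missing first h
outputs = []
for x in rng.standard_normal((4, d_x)):   # a sequence of four input vectors
    h, o = rnn_step(x, h)
    outputs.append(o)
print(len(outputs), outputs[0].shape)
```

Note that the same `W` and `V` are reused at every step; only the vectors flowing through them change.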

In order to better understand what’s going on here, let’s draw a new graph that represents all the values that we’ll calculate for the vertices in the original graph. So, in particular, *h₁* and *h₂* will be two separate vertices in this new graph, and an edge goes from *h₁* to the concatenation operator that leads into vertex *h₂*. This is shown in the Figure below for the first four steps in an input sequence.

The first thing you should notice about this graph is that it has multiple inputs – one for each vector in the input sequence – and multiple outputs. The second thing you might have noticed is that this graph is acyclic. In particular, the cycle that characterized the original graph has been “unrolled” into a longer path that you can follow to the right in the new graph.

(For any readers who have studied topology, this is nicely reminiscent of the construction of a universal cover. In fact, if you unroll infinitely in both the positive and negative directions, the unrolled graph will be a covering space of the original graph. And as noted above, it will be acyclic (in terms of directed cycles, though not necessarily undirected cycles), which is analogous to being simply connected. So maybe there’s some category theoretic sense in which it really is a universal cover… but I digress.)

It turns out you can always form a DAG from a cyclic directed graph by this procedure, which is called *unrolling*. Note that in the unrolled graph, we have lots of copies of the weight matrices. These are the same at every step in the sequence, so they don’t get unrolled.

However, we do need to update the weight matrices in order to train the neural network, and this is where the idea of unrolling really comes in handy. Because the unrolled network is a DAG, we can train it using back-propagation just like a standard neural network. But the input to this unrolled network isn’t a single vector from the sequence – it’s the entire sequence, all at once! And the target output we use to calculate the gradients is the entire sequence of output values we would like the network to produce for each step in the input sequence. In practice, it’s common to truncate the network and only use a portion of the sequence for each training step.

In the back-propagation step, we calculate gradients and use them to update the weight matrices. Since we have multiple copies of each weight matrix, we’re probably going to get different gradients for each copy. But we want all the copies of each matrix to stay the same, so we’ll combine all the gradients, usually by taking an average, and use this to update the base matrix that all the copies are taken from.
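A sketch of that combining step, with random arrays standing in for the gradients that back-propagation would actually produce:

```python
import numpy as np

# Sketch: each unrolled copy of a weight matrix gets its own gradient from
# back-propagation; the shared base matrix is updated with their average, so
# every copy stays identical. The gradients below are random stand-ins.
rng = np.random.default_rng(2)
W = rng.standard_normal((4, 7))                                  # the base matrix
per_copy_grads = [rng.standard_normal((4, 7)) for _ in range(5)] # 5 unrolled steps

avg_grad = np.mean(per_copy_grads, axis=0)
learning_rate = 0.01
W -= learning_rate * avg_grad   # one update applied to the single shared matrix

print(W.shape)
```

Because there is only one base matrix, this single averaged update is equivalent to keeping all five copies in sync by hand.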

In practice, you don’t necessarily need to explicitly construct the unrolled network in order to train an RNN with back-propagation. As long as you’re willing to deal with some complex bookkeeping, you can calculate the gradients for the weight matrices directly from the original graph. But nonetheless, unrolling is a nice way to think about the training process, independent of how it’s actually done.

But rather than start with the statement of Bayes’ Theorem, I want to use an old math teacher trick (which I realize many students hate) of trying to derive it from scratch, without stating what we’re trying to derive. Instead, we’ll start by modifying a problem that I described in an earlier post on probability distributions.

Let’s pretend we have a robot arm with two joints: The first is fixed to the center of a table and spins horizontally. There’s a bar from this first joint to the second joint, which also spins horizontally and is attached to a second bar. The two bars are the same length, as shown on the left in the Figure below. The game is to randomly pick a pair of angles for the two joints, then try to guess the *x* and *y* coordinates of the hand at the end of the arm. As I described in the earlier post, we can think of the (density function of the) probability distribution of all the possible *(x, y)* coordinates, and it would look something like what’s shown on the right of the Figure. Here, darker colors indicate larger values of the function.
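A quick way to see where this density comes from is to simulate the game. The sketch below assumes bars of length 1 (an arbitrary choice) and approximates the density with a histogram of sampled hand positions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
# Two equal-length bars (length 1 here) with uniformly random angles at each
# joint; the hand lands at the sum of two unit vectors.
a = rng.uniform(0, 2 * np.pi, n)
b = rng.uniform(0, 2 * np.pi, n)
x = np.cos(a) + np.cos(b)
y = np.sin(a) + np.sin(b)

# A normalized 2D histogram of the hand positions approximates the density.
hist, _, _ = np.histogram2d(x, y, bins=50, range=[[-2, 2], [-2, 2]])
density = hist / n
print(density.sum())   # 1.0: every sample lands inside the 4x4 square
```

Plotting `density` would reproduce the darker-is-denser picture from the Figure.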

In the earlier post, we determined that the best place to predict that the hand would be is near the center, since that’s where the probability density is highest. But now we’re going to modify the game a bit: What if each time after spinning the wheel, we are told something about the *x*-coordinate, either its exact value or a small range? Then how would that change our prediction? Note that if we’re given the exact *x*-value and asked to predict *y*, then this is essentially the regression problem, except that the probability distribution doesn’t look anything like the one we used in the post on regression.

But let’s start with the case where we’re given a range, for example if we knew that the *x*-value was between 1/4 and 1/3. Then the probability of getting any point with an *x*-value outside this range would be zero, so the probability density that we would use to guess the *y*-value would be equal to zero for all points with *x*-values outside this range. So it would look something like the Figure on the right.

But dropping those values to zero isn’t enough to get the new distribution; the problem is that when we add this restriction on *x*, the probability of any point with an *x*-value in the correct range will increase. The question is: By how much will they increase?

In order to answer this question, we need to look more closely at what the probability density function really means. The first thing to note is that the value of the probability density function at a point is not the probability of choosing that point. In fact, the probability of picking any one point is zero, since there are infinitely many possible *x* and *y* values.

In order to understand the meaning of the probability density function, we need to use integrals, but (as usual) we can avoid much of the technical detail by describing things in terms of the geometry that underlies those integrals. In particular, we’re going to think of our probability density function as describing the elevations of a mountain whose base is the square in which our robot arm rotates. But it won’t be a normal looking mountain – because of the way the density function looks, it’ll have a high peak in the middle, surrounded by a deep moat, then a high circular ridge (shorter than the central peak) around the outside.

I’ve attempted to draw this on the left, but you’re probably better off using your imagination to picture it. Once we’ve transformed our density function into this mountain, we can replace the word “integral” with “volume” and we’ll be able to calculate some probabilities.

Now, as noted above, if we pick one specific point, the probability that the hand will end up there is zero. However, if we pick a particular region *A* of the square, such as a rectangle defined by a range of *x*-values and a range of *y*-values, then there may be a non-zero probability that the hand will stop within *A*. (Though everything I will say below also holds true for more complex shapes *A*, as well as for shapes in higher-dimensional probability spaces.) In particular, the density function is defined specifically so that the probability will be equal to the volume of the part of the mountain above the shape *A*.

In other words, if we were to take a band saw to the mountain, following the outline of *A*, then the volume of the piece that we cut out would be equal to the probability of the robot’s hand stopping within that region. We’ll call this volume/probability *P(A)*.
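We can make the volume-equals-probability idea concrete with a small numerical sketch. The density below is a made-up radial “ridge” on a square, not the actual robot-arm distribution, but the mechanics are the same: sum density values times cell area over a region to approximate the volume above it.

```python
import numpy as np

# A toy density on the square [-1, 1] x [-1, 1]: a circular "ridge"
# (just an illustration, not the real robot-arm density).
n = 400
xs = np.linspace(-1, 1, n)
ys = np.linspace(-1, 1, n)
X, Y = np.meshgrid(xs, ys)
density = np.exp(-10 * (np.sqrt(X**2 + Y**2) - 0.5) ** 2)

# Normalize so the volume of the entire mountain is 1.
cell_area = (2 / n) ** 2
density /= density.sum() * cell_area

# P(A) for the vertical strip A with 0.25 <= x <= 0.33: the volume of
# the piece of the mountain we'd cut out above A.
in_A = (X >= 0.25) & (X <= 0.33)
p_A = density[in_A].sum() * cell_area
print(p_A)  # a number strictly between 0 and 1
```

The band-saw metaphor becomes a boolean mask: select the grid cells inside *A*, and add up the little columns of mountain above them.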

So, not only is the value of the probability density function at a point *not* the probability of getting that point (since it’s always zero), the value of the density function at a point doesn’t even need to be less than one. In particular, if there is a region *A* with a small area but a very high (though still less than 1) probability, the values of the density function would need to be very high in order to get the appropriate volume.

If we choose region *A* to be the entire square then *P(A) =* 1 because the arm is constrained to stay within that region. So the volume of the entire mountain is 1. If we choose a smaller region *A*, and an even smaller region *B* contained in *A*, then we’ll get a piece with smaller volume *P(B) < P(A)*, and thus lower probability as we would expect.

But now let’s return to the original question of how to modify our density function after we’ve narrowed down the set of possible outcomes to a smaller range *A* of *x*-values. This function will define a different mountain that is low and flat everywhere outside of *A*, but has the same elevations as the original within *A*, as on the left in the Figure below. We want to modify this function to give us a new probability density defining a volume function which we’ll write *P( · |A)*, where the dot can stand for any region of the square. Since we know the robot hand landed in *A*, the overall probability, i.e. the volume *P(A|A)* of the new mountain, should be 1. However, since all we did was flatten the parts of the mountain outside the region, its volume is initially quite a bit less than 1.

In order to get the correct volume, we’ll need to scale the function up, i.e. multiply each value of the function by a constant *k*. The resulting function will define a mountain more like the one shown on the right above. For any region *B* contained in *A*, we’ll have *P(B|A) = kP(B)*. Since we want *P(A|A) = 1 = P(A)/P(A)*, the only possible value for *k* is *1/P(A)* and we get *P(B|A) = P(B)/P(A)*.

This is the case when the region *B* is contained in *A*. But what if it isn’t? For example, if we want to predict a range of *y*-values once we know the robot hand is in a certain range of *x*-values, then *A* will be a vertical strip of the square, and *B* will be a horizontal strip of the square, with the two intersecting in a smaller rectangle. (But as I noted above, we could just as easily let *A* and *B* be arbitrary blobs in the square or even blobs in a higher-dimensional space, but let’s not get too complicated…)

So, if we want to calculate *P(B|A)* in this case, we need to consider two parts of *B* separately: The density function above the part of *B* that is outside of *A* will all get flattened to zero, so *P(B|A)* is completely determined by the part of *B* inside of *A*, i.e. the intersection *A ∩ B*. In other words, we have *P(B|A) = P(A ∩ B)/P(A)*. Note that there’s no difference between *A* and *B* in this formulation, so we also have *P(A|B) = P(A ∩ B)/P(B)*. We can solve both equations for *P(A ∩ B)* to get *P(B|A)P(A) = P(A ∩ B) = P(A|B)P(B)*. Finally, if we divide both sides by *P(A)*, we get Bayes’ Theorem:

*P(B|A) = P(A|B)P(B)/P(A)*

Of course, for the problem we started out with, the original equation *P(B|A) = P(A ∩ B)/P(A)* may sometimes be more useful. But in the standard setting of Bayes’ Theorem, *P(A ∩ B)* is the probability that both events happen (or both statements are true) so it might be harder to calculate.
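If you want to convince yourself of the identities without any geometry, a quick Monte Carlo check works too. Here I sample from a uniform distribution on the unit square (a stand-in; the identities don’t depend on which density we use), with *A* a vertical strip and *B* a horizontal strip:

```python
import random

random.seed(0)

# Sample N points uniformly from the unit square.
N = 100_000
points = [(random.random(), random.random()) for _ in range(N)]

def in_A(p):  # a vertical strip of x-values
    return 0.25 <= p[0] <= 0.5

def in_B(p):  # a horizontal strip of y-values
    return 0.4 <= p[1] <= 0.9

count_A = sum(in_A(p) for p in points)
p_A = count_A / N
p_B = sum(in_B(p) for p in points) / N
p_AB = sum(in_A(p) and in_B(p) for p in points) / N

# Estimate P(B|A) directly: among points that landed in A, how many in B?
p_B_given_A = sum(in_B(p) for p in points if in_A(p)) / count_A

# Check P(B|A) = P(A ∩ B)/P(A), and Bayes: P(B|A) = P(A|B) P(B) / P(A).
p_A_given_B = p_AB / p_B
print(abs(p_B_given_A - p_AB / p_A) < 1e-9)
print(abs(p_B_given_A - p_A_given_B * p_B / p_A) < 1e-9)
```

Both checks hold essentially exactly, because all four estimates reduce to the same ratios of counts.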

For extra credit, take a minute to think about how you might calculate the probabilities of different *y*-values if we knew the exact value of *x* rather than a range. I’ll give you two hints: First, note that the probability density function over the vertical line defined by a single *x*-value defines a single-variable function like you might find in Calculus I and II, and there is some area (rather than a volume) below this function. Second, note that you can take *A* to be a small rectangular strip around the line defined by the *x*-value, calculate its volume, then make the strip smaller and smaller and take a limit.

But this post is already long enough, and I expect that most of my readers either don’t want to read about limits, or would rather work it out themselves. (For me, it’s both.) So I’ll leave it there.

Recall that the standard view of an artificial neural network is a directed graph of neurons, where each neuron calculates a weighted sum of inputs from other neurons, then applies a non-linear function to determine its own output. Many neural networks have neurons arranged into rows or layers, with the neurons from one layer connected to the neurons in the next layer according to some pattern.

In my last post, I pointed out that you can think of each neuron as actually being two neurons – a linear neuron that calculates the weighted sum, which it sends to a non-linear neuron that applies the non-linear function to the output from the linear neuron. From this perspective, the linear neurons in each layer collect the output from the previous layer, and the non-linear neurons send their outputs to the next layer. As I described last time, you can then think of the output from each layer as a vector with one dimension/feature for each neuron. The connections between successive layers define a matrix such that the outputs of the linear neurons in one layer define a vector that’s equal to the outputs from the non-linear neurons of the previous layer multiplied by this matrix.

For a basic feed-forward network, we just have a sequence of layers, one after the other, as in the Figure below. I’ve indicated which parts of the Figure correspond to these vectors and matrices, and it’s possible to translate the diagram into the equations that describe how these all relate to each other. But the translation can be a bit tricky, and for more complex networks such as convolutional networks and RNNs, it becomes even harder to understand how the network functions from this perspective.

TensorFlow improves on this by dropping the biological analogy in favor of a graph that directly encodes the mathematical relationships between the elements. A TensorFlow graph for the neural network in the above Figure is shown in the Figure below. Instead of individual neurons, the elements of this graph are vectors, matrices and operations, with edges indicating how the operations are applied. (The v’s are vectors, W’s are matrices, and circles/ellipses are operators.) To figure out how each element is calculated, you simply follow the arrows backwards.

You can create a graph like this in TensorFlow by writing a script in Python. This graph includes operators for matrix multiplication and a non-linear operator (such as the sigmoid or the ReLU), but TensorFlow has a number of other operators as well, such as convolutional multiplication and pooling. Plus, since TensorFlow is open source, anyone can write their own operators. There are a number of tutorials on the official site where you can find details and examples.

Once you have this graph, you can ask TensorFlow to evaluate it for a given collection of inputs, but the more interesting part is, of course, the training. In other words, we want to select the values in the weight matrices by incrementally adjusting them via back-propagation. This involves evaluating data points (following the graph forward), then determining the error and pushing it back to the weight matrices by calculating gradients. TensorFlow is able to do this automatically because each of the operators is required to provide a pre-calculated gradient function. TensorFlow combines these using the chain rule, so if you tell it which of the vectors and matrices you want it to update, it can run back-propagation automatically.
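To give a feel for what “combine the gradient functions with the chain rule” means, here is a hand-rolled sketch for a single-neuron graph (this is an illustration of the idea, not TensorFlow’s actual internals): each operator contributes its own gradient, and multiplying them together gives the gradient for the weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass for a one-neuron "graph": out = sigmoid(v . w)
v = np.array([1.0, 2.0, -1.0])   # input vector
w = np.array([0.5, -0.3, 0.8])   # weights we want to train
z = v @ w                        # the matrix-multiply operator
out = sigmoid(z)                 # the non-linear operator

# Backward pass: each operator supplies a pre-calculated gradient,
# and the chain rule strings them together.
d_out_d_z = out * (1 - out)      # gradient of the sigmoid operator
d_z_d_w = v                      # gradient of the multiply operator w.r.t. w
grad_w = d_out_d_z * d_z_d_w     # d(out)/d(w) by the chain rule

# Sanity-check against a finite-difference approximation.
eps = 1e-6
numeric = np.array([
    (sigmoid(v @ (w + eps * np.eye(3)[i])) - out) / eps for i in range(3)
])
print(np.allclose(grad_w, numeric, atol=1e-4))  # True
```

A real graph just repeats this operator by operator, following the arrows backwards.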

So that should give you an idea of how TensorFlow allows you to define neural networks, and other types of models, in terms of graphs of vectors, matrices and operators. But often it’s useful to create a neural network where each layer of neurons isn’t just a single row. For example, if you’re working with images, then the input layer would be more naturally described as a rectangular grid of values. Of course, this rectangle could be encoded as a vector, but it’s more natural to think of it as a matrix, particularly for something like a convolutional net, where the sliding windows are defined in terms of a rectangle. In fact, if it’s an RGB image then you really want to think of the input as three parallel rectangles, forming a rectangular box of values.

Now, I’m about to start using the term “dimension” in a way that’s a bit different than usual, so I want to be especially careful. Recall that a vector is defined by a list of numbers of some specified length. The set of all possible vectors of a given length define a vector space whose dimension is the length that we chose. But we’re going to say that every vector, no matter its length, is a *one-dimensional tensor*. So a one-dimensional tensor can define a vector space of any dimension you want. The one-dimensional part refers to the fact that we write the values of the vector along a one-dimensional line.

A matrix, on the other hand, is a grid of numbers with a certain number of rows and a certain number of columns. The set of all matrices of a given size also defines a vector space, whose dimension is the number of rows times the number of columns. But we’ll still say that a matrix is a two-dimensional tensor. (One dimension is rows. The other is columns.) So, as promised, we have two different meanings of the word “dimension” – one for the dimension of the space defined by a vector or a matrix, one for the way in which the values are arranged when they’re written down.

Similarly, the rectangular box of values defined by the three rectangular layers of our RGB image defines a three-dimensional tensor, since we think of the values as being arranged into a three-dimensional shape. The space of all possible images defines a vector space whose dimension is much larger (three times the number of pixels to be precise), but it’s still a three-dimensional tensor.

To be even more precise about this, each of the features that make up a vector can be specified by a single index *i*. Each “feature” in a matrix is specified by two indices, *i* and *j*. Each feature in the rectangular box for the RGB image is specified by three coordinates *i, j, k*. These are one-, two- and three-dimensional tensors, respectively. But there’s no reason to stop there. For example, if we want to keep track of the connections/weights between two layers, we’ll need to index them by both the indices for the layer where they start and the indices for the layer where they end. For example, the weight from neuron *i, j, k* of one layer to neuron *x, y, z* of the next layer is defined by the indices *i, j, k, x, y, z*. This is a six-dimensional tensor.
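numpy makes the two meanings of “dimension” easy to see side by side: `ndim` counts the indices needed to pick out one value (the tensor dimension), while `size` counts the total number of values (the dimension of the underlying vector space). A quick sketch, using a hypothetical 32×32 RGB image:

```python
import numpy as np

v = np.zeros(100)            # a vector of length 100
m = np.zeros((10, 10))       # a 10 x 10 matrix
img = np.zeros((32, 32, 3))  # an RGB "image": 3 parallel rectangles

# Tensor dimension: how many indices you need to address one value.
print(v.ndim, m.ndim, img.ndim)  # 1 2 3

# Vector-space dimension: the total number of values.
print(v.size, m.size, img.size)  # 100 100 3072
```

Note that `v` and `m` live in vector spaces of the same dimension (100), even though one is a one-dimensional tensor and the other is two-dimensional.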

TensorFlow is designed to handle tensors of any dimension, and the operators that can be used to combine them. This, combined with the abstract and general nature of its approach to defining computation graphs makes it an extremely powerful and flexible platform for building machine learning models.

First, recall that each neuron in a neural network has a number of inputs and a single output. The inputs come from either the features of an input vector, or from the outputs of other neurons. The neuron stores a weight for each input, and it calculates its output by multiplying each input value by a weight, adding up the products, adding a constant defined by another weight, then applying a fixed (non-linear) function to this value. One trains a neural network by modifying the weights on the inputs to each neuron so that the overall output is closer to what you want it to be.

Before we go any further, we’re going to split each of our neurons into two: We’ll call the first one a linear neuron – this will have the same inputs as the original neuron, but its output will simply add together the input-times-weight values, plus the constant weight. We’ll call the second neuron a non-linear neuron – it will take the output from the linear neuron as its only input, and apply the fixed non-linear function to it. So if we take a neural network and replace each of the original neurons by a linear neuron feeding into a non-linear neuron, we’ll get the same functionality as the original network.

An example of this is shown below. Adding arrow heads to show input/output made the picture too noisy, so you’ll have to pretend they’re all pointing to the right. In the top network, we have four input features feeding into a layer with two (standard) neurons, which in turn feed into a layer with three neurons, then a single neuron that combines their output to determine the output of the entire network. Below that, we have the same network after we’ve replaced each standard neuron with a linear neuron feeding into a non-linear neuron.

Notice that in the network after the split, the outputs from the non-linear neurons in a given row feed into the linear neurons in the next row. Let’s focus, for a moment, on the connections from the non-linear neurons in the first non-input row, to the linear neurons in the second non-input row.

The output values from the non-linear neurons define a vector that we’ll call *V*. The outputs from the linear neurons define another vector *W*. For each of the linear neurons in this row, there is one weight for each of the non-linear neurons in the previous row, so these weights define a vector with the same dimension/number of entries as *V.* In fact, we have one such vector for each of the linear neurons, and if we stack these vectors next to each other, we get a matrix, which we’ll call *M*.

So, let’s just quickly review this: We have the vector *V* of outputs from the row of non-linear neurons and the vector *W* of outputs from the successive row of linear neurons. We calculate *W* by multiplying the entries in the vector *V* by certain entries in the matrix *M* and then adding them together in a certain way. Well, if you happen to remember how matrix multiplication works, and you think carefully about the way we’re multiplying and adding the neuron outputs and weights, they turn out to be the same thing. In other words, it just so happens that the vector *W* defined by the row of linear neurons is the result of the (matrix) multiplication *V* x *M*. (I’m thinking of *V* as being horizontal and the vectors of the same dimension that make up *M* as vertical.)

We can think about this as a function/transformation from the space of all possible vectors *V* to the space of all possible vectors *W*. Under this interpretation, matrix multiplication defines a *linear transformation*: Every straight line in the first vector space will be sent to either a straight line or a point in the second space. (In particular, if the dimension of *V* is higher than that of *W* then a lot of lines will have to collapse down to points, but this can happen no matter what.) So there’s our geometric interpretation of the connections between the row of non-linear neurons and the next row of linear neurons: a linear transformation.

Oh, except there’s one small detail we forgot: Each neuron doesn’t just multiply and add the inputs from the previous row of neurons. It also adds in a constant weight. The set of constant weights defines yet another vector *C*, with the same dimension as *W*, and adding in all the constant weights corresponds to adding *C* to the product *V* x *M*. But it turns out this isn’t such a big deal – it just means that we get an affine transformation instead of a linear transformation – straight lines still stay straight or collapse to points. However, the zero vector doesn’t necessarily go to the zero vector. So we still get a nice interpretation of the connections between the layers.
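As a quick numerical check (with arbitrary small random numbers, not a trained network), computing each linear neuron by hand gives exactly the same result as the single affine transformation *V* x *M* + *C*:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=3)       # outputs of 3 non-linear neurons
M = rng.normal(size=(3, 2))  # one column of weights per linear neuron
C = rng.normal(size=2)       # the constant weights

# Neuron by neuron: multiply each input by its weight, add them up,
# then add the neuron's constant weight.
by_hand = np.array([sum(V[i] * M[i, j] for i in range(3)) + C[j]
                    for j in range(2)])

# The same computation as one affine transformation.
W = V @ M + C
print(np.allclose(by_hand, W))  # True
```

This is the whole content of the vector/matrix interpretation: the per-neuron arithmetic and the affine map are one and the same.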

This brings up an interesting point about the importance of the non-linear neurons. If we were to remove the non-linear neurons and create a neural network entirely of linear neurons, then we could interpret the entire thing as a succession of linear transformations. So, for example, to calculate the output from the third layer, we would multiply the vector from the first layer by the matrix for the connections to the second layer, then multiply the resulting vector by the matrix for the connections from the second to the third layer. However, it turns out that combining linear transformations like this, which is equivalent to multiplying matrices, can only produce new linear or affine transformations. So we wouldn’t get anything with three layers that we couldn’t get with two. It’s the non-linear part of the neurons that allows neural networks to define arbitrarily complex probability distributions.

But let’s return to the linear part. In addition to being a useful way to think about the layers in a neural network, it turns out these linear transformations also allow you to do some fun tricks such as transforming sparse data into dense data.

Sparse data in this context means data in which a typical data point has most of its features equal to zero, and its structure is determined more by which features are non-zero, than by what their values actually are. A good example of this is bag-of-words (BOW) vectors – the non-zero features reflect what the words in the “bag” are, and depending on the exact type of BOW, their actual values may or may not give you additional information such as word count.

The problem with vectors like these is that they’re floppy (that’s not a technical term): Because there are so many dimensions, there’s a serious risk of overfitting, and the curse of dimensionality makes distances less meaningful. So to really understand the geometric structure of such data, we need a meaningful way to embed the data points into a lower-dimensional space.

It turns out one can do this by first trying to solve a different problem. Let’s say you wanted to predict, for any given word, what other words are most likely to appear just before and just after it in a large corpus of text. In other words, given the word “apple” and the word “pie”, you want a model that tells you the probability, if you were to randomly select an occurrence of the word “apple” in the text, that “pie” would be one of the three words right before it or of the three words right after. (You can replace three with any number, but to simplify the discussion below, I’m going to stick with three.)

The most straightforward way to do this with a neural network is to have a row of inputs, with one input for each word, then a row of output neurons, one for each word, with every input feature attached to every output neuron. If you input a vector with value 1 for “apple” and value zero for everything else, you want the “pie” neuron to output the probability described above. We can achieve this by setting the weight on the edge from input “apple” to neuron “pie” to this probability (and we can calculate the probability directly by counting the number of occurrences in the corpus).
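Here’s a hedged sketch of that counting, on a tiny made-up “corpus” (real corpora have millions of words, and real implementations are more careful about tokenization), using a window of three words on each side:

```python
from collections import Counter

corpus = ("i baked an apple pie and ate the apple pie "
          "then i bought an apple phone").split()
window = 3

pair_counts = Counter()
occurrences = Counter()
for i, word in enumerate(corpus):
    occurrences[word] += 1
    neighbors = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
    for other in set(neighbors):  # count each neighbor once per occurrence
        pair_counts[(word, other)] += 1

# Probability that "pie" is within the window of a random "apple":
# it happens for 2 of the 3 occurrences of "apple" in this corpus.
p = pair_counts[("apple", "pie")] / occurrences["apple"]
print(p)  # 2/3
```

Each such probability becomes the weight on the edge from the input feature for one word to the output neuron for the other.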

But this uses only sparse representations of the data, with one feature per word for both the input features and the output neurons. In order to get a dense representation, we will force our neural network to generate one by inserting a smaller row of neurons between the input and the output. A Figure representing the original neural network and the network with the inserted row is below. In practice, the input and output rows would consist of thousands of features, and the middle row would have hundreds of neurons, but this should give you the idea.

For this new network, there isn’t an immediately obvious way to set the weights, so we need to train it like any other neural network. This is technically unsupervised learning since there are no labels involved, but the neural network doesn’t know that: You train the network by feeding in each word in the corpus of text and comparing the output to the bag-of-words vector defined by the three words before and three words after it. (Again, you can replace three with any number.) If the output is not what you expected, you adjust the weights using gradients/back-propagation, then repeat with the next word and so on.

Once we do this, the middle layer gives us a lower-dimensional representation of the data. In particular, while the Figure doesn’t show the neurons split into linear and non-linear pieces, the connections from the input features to the linear parts of the middle layer neurons define a linear transformation (a matrix) from the sparse input space to the lower-dimensional space defined by the middle neurons.

Given any word such as “apple”, we can form the input vector that has 1 for the corresponding feature and zero for all the others. The linear transformation defined by the first set of weights takes this sparse vector to a (dense) vector in the space defined by the middle neurons which also represents this word. In particular, recall that the neural network was trained to predict the nearby words for each input. In order for it to do this effectively, it must choose a linear transformation that sends words that commonly have similar neighbors to nearby vectors in the middle space. So synonyms are likely to be sent to very close neighbors in the middle space, and related words will be slightly farther apart.

This is roughly how word2vec works, and there are other similar schemes for training a neural network with a lower-dimensional row of neurons. Moreover, in practice this type of approach turns out to generate an embedding with even more structure than one might otherwise expect. In addition to placing synonyms near each other, word2vec also places pairs of words with similar relationships in the same relative positions as each other. For example, if you draw lines between (the lower-dimensional vectors representing) names of countries and the names of their capitals, you find that you get lines of the same length and slope (i.e. they define the same vector) so if you calculate “Paris” – “France” + “Germany” with vector arithmetic, you get a vector that is closer to “Berlin” than to any other word. (Stated another way, the vector “Paris” – “France” is extremely close to the vector “Germany” – “Berlin”.)

So, unlike the sparse bag-of-words vectors, in which any two words are treated equally, the dense word2vec representation of each word is closely related to its meaning. The geometric structure of the set of word2vec representations should therefore reflect the semantic structure of the language in a way that the sparse vectors never could. Note that the neural network doesn’t actually “know” the meanings of the words, and the training doesn’t explicitly involve the meanings in any way. However, the neural network is able to infer relationships between the “meanings” of the words based on the context provided by the corpus of text, and the vector/matrix interpretation of the connection weights in the network allow us to extract these learned relationships.

Recall that a standard (artificial) neural network is defined by a graph of neurons (the big circles on the right), each of which takes either the features of a given data point (the small circles), or the outputs from other neurons, and calculates its own output from these values. In this way, each neuron defines a probability distribution on the space of possible input vectors. The density function of the distribution (shown inside each circle) is defined by the value that the neuron would output for any given input vector.

We can think of each probability distribution, and thus each neuron, as defining a “concept” such that its output for a given input vector defines the probability that the input vector represents the concept. In the figure, the high-probability regions are shown in white. The neurons that are connected directly to the input data define relatively simple concepts/probability distributions, while later neurons combine these simple concepts/distributions into more complex ones. In the figure, the far-right neuron’s concept is the union of the other two – it is represented by any vector that represents one or the other.

Implicit in this definition is the idea of a Directed Acyclic Graph, or DAG: Directed means that each edge in the network/graph has an arrow on it, pointing from a neuron whose output is used to the neuron that uses it. Acyclic means that if you follow these arrows, you can never go in a loop – all paths must move away from the input features and eventually make it to the output of the entire network, as with the top graph in the figure below.

If you can follow the edges of a directed graph and get back to a vertex that you already visited, this is called a *cycle*. Notice how the two blue edges in the bottom graph create cycles. These are problematic for neural networks because when you calculate the output of a neuron, you need to know the outputs of all the neurons that point into it. If there’s a cycle, there’s no neuron that you can calculate first, before all the other neurons. In the graph above, you could reorder the vertices so that the blue edges point to the right, but then some of the other edges would have to point left.

Whenever you do have a DAG, you can order the vertices (neurons) so that all the edges pointing into each vertex come from earlier vertices, i.e. all the edges point to the right. If you order the neurons in a standard neural network like this, you can calculate the outputs of the neurons in this order and know all the values needed to calculate each neuron’s output by the time you get to it.
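This ordering is exactly a topological sort, and Python’s standard library can produce one. A minimal sketch, with a hypothetical four-neuron DAG and a stand-in “neuron” calculation:

```python
from graphlib import TopologicalSorter

# edges[n] lists the neurons whose outputs n consumes (a made-up DAG).
edges = {"a": [], "b": ["a"], "c": ["a", "b"], "d": ["b", "c"]}

# Order the vertices so every edge points "to the right".
order = list(TopologicalSorter(edges).static_order())
print(order)  # ['a', 'b', 'c', 'd'] -- each neuron after all its inputs

# Evaluate in that order: every value we need is already computed.
outputs = {}
for n in order:
    outputs[n] = 1 + sum(outputs[m] for m in edges[n])  # stand-in "neuron"
print(outputs["d"])  # 7
```

If the graph had a cycle, `static_order()` would raise an error instead, which is the algorithmic face of the problem described above: no neuron can be calculated first.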

A recurrent neural network is a neural network that is not a DAG. So as noted above, there’s no natural order on the neurons in which all the arrows point forward. But, then, how do we calculate the output values of the neurons?

It turns out the best thing to do is to carry on, pretending that there’s nothing wrong. In particular, we start by picking an ordering for the vertices, dropping the condition that the arrows defined by the edges have to point towards later vertices. This ordering will allow us to calculate the output values of all the neurons for a given input vector – for each neuron in the sequence, we use the output values from the earlier neurons that have already been calculated, and treat the outputs from the later neurons as if they were set to zero or some other default value.

But then what’s the point of the arrows that point backwards? Well, remember that the goal of recurrent neural networks is to deal with sequential data, in which we get one vector after another. For example, if we’re analyzing text, we might use bag-of-words with a moving window – use the first five words for the first vector, the second through the sixth word for the next vector, then the third through seventh and so on.

When we process the first vector, we do exactly what we described above – using a default value for the output of any neuron with an arrow that points “backwards” in the ordering. But then, when we process the second vector and get to a neuron with an arrow pointing backwards, we discover that we now have an output value for it – the value that was set when we processed the first input vector. So the neuron output values determined by the second vector in the sequence are affected by the first vector in the sequence. We repeat this for the third input vector, ending up with output values that are affected by the first two input vectors as well. So the backward arrows function kind of like memory cells within the recurrent neural network, remembering which “concepts” the earlier inputs represented.
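In code, a backward arrow just means a neuron reads the value its source produced on the previous input vector, with zero (or some other default) standing in before the first step. A bare-bones sketch, with made-up sizes and random weights rather than a trained network:

```python
import numpy as np

def step(x, h_prev, W_in, W_rec):
    # The new outputs depend on the current input AND on the outputs
    # from the previous step -- those are the backward arrows.
    return np.tanh(W_in @ x + W_rec @ h_prev)

rng = np.random.default_rng(1)
W_in = rng.normal(size=(2, 3))   # weights on the current input vector
W_rec = rng.normal(size=(2, 2))  # weights on the backward arrows

h = np.zeros(2)  # default value, used before any input has been seen
for x in rng.normal(size=(4, 3)):  # a sequence of four input vectors
    h = step(x, h, W_in, W_rec)
print(h.shape)  # (2,)
```

After the loop, `h` has been shaped by every vector in the sequence, which is the “memory cell” behavior described above.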

Now, there are other, simpler ways we could create a model that takes into account more than one of these moving window vectors at a time. For example, we could concatenate the first *N* of the vectors together for some *N*, creating an input vector with *N* times as many dimensions as we started with. This is similar in principle to an n-gram. This would put the input back in standard vector form, so we could use any old model on it.

A recurrent neural network is a more complex solution than this, but it has two big advantages over techniques like n-grams or this concatenation scheme: First, the recurrent neural network works the same way no matter how long the sequence of inputs is. In other words, you don’t have to choose a value *N* before you start using it.

Second, and perhaps more importantly, a recurrent neural network reuses the same neurons for all the inputs in the sequence, allowing the overall network to be smaller/simpler. As each input vector comes in, it gets “compressed” into a set of concepts, defined by the neurons whose output will feed back into the next cycle of the network.

Any scheme for turning a sequence into a vector would need to use some method to reduce the number of dimensions, such as restricting to the most common n-grams. The recurrent neural network does this implicitly, as part of the same training process (whose description will have to wait for a later post) as the rest of the network. So the process is much more natural, and doesn’t require as many arbitrary decisions.

Let’s start with the way traditional CPUs work, keeping in mind that I’m not a hardware expert, so much of what I’m going to say will be intentionally vague. Whenever your computer is running, your CPU is endlessly following a list of very simple instructions involving external inputs and outputs (RAM, hard disk, your Wifi card, etc.) and a small amount of memory that’s internal to the CPU called *registers*. The number of registers is usually pretty small – for example, Intel’s fancy Core i7 processor has 16 64-bit registers.

The instructions that the CPU follows are along the lines of “Add the values in registers 1 and 2, then save the result in register 3” or “Copy the value at the memory location defined by register 1 into register 2” or “If the value of register 1 is greater than the value in register 2 then jump to the instruction number saved in register 3.” So if, for example, you wanted to add together two vectors in a 100-dimensional space, you would have to read each coordinate for each vector from RAM into a register, add the numbers, then save each value back into RAM.
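Here’s a rough sketch of that coordinate-by-coordinate process in Python. The explicit loop and temporary variables stand in for the register shuffling described above (this is an analogy, not what the hardware literally executes):

```python
# Two vectors in a 100-dimensional space.
a = [float(i) for i in range(100)]
b = [float(i) for i in range(100)]
result = [0.0] * 100

for i in range(100):
    x = a[i]            # "read a coordinate from RAM into a register"
    y = b[i]            # "read the matching coordinate"
    result[i] = x + y   # "add, then save the value back into RAM"
```

One coordinate per pass through the loop, so on the order of 100 consecutive operations for a single vector addition.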

Many modern CPUs have multiple cores, each of which is simultaneously and independently doing what I described above. In theory, this could speed things up a bit by doing multiple coordinates at the same time, but in practice, coordinating multiple cores is complicated enough that it’s more common to have the different cores working on completely different tasks rather than different parts of the same task. Also, the number of cores tends to be small (between 2 and 6 is pretty typical).

A second type of parallelism that many processors can take advantage of is what’s called Single Instruction, Multiple Data (SIMD). With SIMD, a single instruction operates on several data values at once. So, a processor might add the first four values of the vectors in a single cycle, then the next four, and so on. This can cut the number of cycles dramatically, but the number of values handled per instruction is limited by the width of the SIMD registers, usually to around 4 or 8, so we’re still far from a 100-times speedup.
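In the spirit of the addition loop above, SIMD is like processing the coordinates in fixed-size chunks, with each chunk standing in for one “wide” instruction (again, an analogy rather than actual hardware behavior; the chunk size of 4 is just an example):

```python
a = list(range(100))
b = list(range(100))
result = []

CHUNK = 4  # pretend each chunk is handled by one wide instruction
for start in range(0, 100, CHUNK):
    # 25 "wide" steps instead of 100 scalar additions.
    chunk = [a[i] + b[i] for i in range(start, start + CHUNK)]
    result.extend(chunk)
```

Fewer steps, but the chunk size is fixed by the hardware, so the savings top out quickly.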

Instead, the speedup comes from two major ways in which GPUs differ from CPUs. The first is that rather than having a small number of registers, a GPU has a large chunk of internal memory that it can operate on directly. So if, say, you’re going to do a lot of processing involving a collection of vectors that fits into the GPU’s internal memory, then you can save the time of shuffling the values back and forth to/from RAM. Of course, this alone only gives you a small speedup, since passing values to/from memory only takes a fraction of a CPU’s time.

The big speedup comes from the fact that each time a GPU performs an operation, it can do it many times simultaneously. And it’s more than 2 or 6. Instead, 64 seems to be a typical size for the number of operations a GPU can do in parallel. Rather than an instruction like “Add register 1 to register 2” like the CPU had, a GPU instruction may be something like “Add the values in locations 1-64 to the values in locations 65-128, and save them in locations 129-192.” And this operation is done in a single step, simultaneously by 64 separate circuits within the GPU. In other words, you can think of a GPU as having a row of CPUs that (unlike the multiple cores in a CPU) all follow the same instruction at the same time on different parts of the internal memory.

So now, when we add those 100-dimensional vectors, instead of reading in 200 values, adding them in 100 separate cycles, then transferring 100 values back to RAM for a total on the order of 100 consecutive operations (not to mention a bunch of overhead I’m glossing over), we only need two cycles of the GPU. We would still need to transfer the values in and out of the GPU’s internal memory, but if we’re doing a lot of processing on the same vectors, we can minimize this time by keeping them in the GPU’s memory until we’re done with them.
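As an analogy for the GPU version, here’s the same vector addition expressed as a single whole-array operation with NumPy. (NumPy runs on the CPU, but the programming model – one operation applied to a whole block of data at once – is the relevant parallel:)

```python
import numpy as np

a = np.arange(100, dtype=np.float64)
b = np.arange(100, dtype=np.float64)

# One "instruction" over the whole block of data, rather than
# 100 scalar additions in an explicit loop.
result = a + b
```

The per-coordinate loop has disappeared from the code entirely, just as it disappears from the GPU’s instruction stream.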

So tasks that involve doing the same thing at the same time to lots of different data (such as vector and matrix operations) can be done much faster on GPUs. In fact, it’s because matrix operations are so important to computer graphics that GPUs were designed this way. Note that GPUs tend to be slower than CPUs in terms of the number of cycles per second, plus they lack many optimization features that modern CPUs have. So for tasks that can’t take advantage of parallelism – i.e. almost everything other than vector and matrix operations – CPUs are much faster. That’s why the computer you’re working on right now has a CPU at its center instead of a GPU.

But the processes involved in training and evaluating a neural network happen to fit very nicely into the vector/matrix genre. The “knowledge” in a neural network is defined by the weights on the connections between neurons. For example, in a network with rows of neurons, the weights between successive rows are defined by a matrix in which the entry at position *(i, j)* is the weight from the *i*th neuron in one row to the *j*th neuron in the next row. Each row, in turn, defines a vector, and we calculate the output from each neuron by multiplying the outputs of the previous row by this matrix, then applying a non-linear function to the resulting vector. We do this for each successive row until we get to the end of the network. Training the network via back-propagation is another process involving these same vectors and matrices.
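That row-by-row process can be sketched as a loop of matrix multiplications. The layer sizes and random weights here are a toy example, and `tanh` stands in for whatever non-linear function the network uses:

```python
import numpy as np

rng = np.random.default_rng(1)
layer_sizes = [8, 5, 3]  # toy network: three rows of neurons

# weights[k][i, j] = weight from neuron i in row k to neuron j
# in row k+1, matching the (i, j) convention in the text.
weights = [rng.standard_normal((m, n))
           for m, n in zip(layer_sizes, layer_sizes[1:])]

x = rng.standard_normal(layer_sizes[0])  # outputs of the first row
for W in weights:
    x = np.tanh(x @ W)  # matrix multiply, then a non-linear function
```

Every step is a matrix-vector product – exactly the kind of operation a GPU can do in one or two wide cycles.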

As a result, it’s possible in practice to work with much larger neural networks than would be otherwise possible, even after a few more decades of Moore’s Law. This is important, for example, in image processing where the first row alone (i.e. the input) contains thousands of neurons. Things still get tricky when the networks get too big to fit in the memory of a single GPU. At that point multiple GPUs are required to store the network, and data must be transferred between them, which becomes the major bottleneck. But that’s a whole different story. For now, this is at least the rough idea behind why GPUs have been one of the main drivers of the recent success of large-scale neural networks.
