Note: I’ve started announcing new posts on twitter (@jejomath) for anyone who wants updates when new posts appear.
In the last two posts, I described how a single neuron in a neural network encodes a single, usually simple, classifier algorithm, and how multiple neurons can be linked together to combine these simple classifiers into geometrically more complex shapes. In particular, the second of these posts explained the evaluation phase of a general neural network. In this week’s post, I’ll explain how our understanding of the evaluation phase can be used to explain the training phase of a general network, through an algorithm called back propagation.
We’ll start with the simple neural network from last week’s post with three neurons shown below. Recall that each of the first two neurons defines a line that divides the two-dimensional data space into two regions. Together, these two neurons divide the plane into four regions and the third neuron determines which of these four regions get labeled 0 or 1.
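To make the evaluation phase concrete, here is a minimal sketch of this three-neuron network in Python. The parameter values below are made up purely for illustration; the posts don’t specify any particular numbers, only that each first-layer neuron is a logistic classifier with a linear decision boundary.

```python
import math

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs):
    # A single neuron: weighted sum of inputs plus a bias, then the sigmoid.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def evaluate(point, params):
    # Forward pass through the three-neuron network:
    # the two first-layer neurons each define a line in the plane;
    # the third neuron combines their outputs into a final value in (0, 1).
    h1 = neuron(params["w1"], params["b1"], point)
    h2 = neuron(params["w2"], params["b2"], point)
    return neuron(params["w3"], params["b3"], [h1, h2])

# Hypothetical parameters: two decision lines plus a combining neuron.
params = {
    "w1": [1.0, -1.0], "b1": 0.0,
    "w2": [1.0, 1.0],  "b2": -1.0,
    "w3": [3.0, 3.0],  "b3": -4.0,
}

print(evaluate([0.5, 0.2], params))  # a shade of gray between 0 and 1
```

The four combinations of the two hidden neurons’ outputs (roughly low/low, low/high, high/low, high/high) correspond to the four regions the first two neurons carve out of the plane.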
As I described in the first post on neural networks, we can train a neural network one point at a time by a method very similar to the maximum likelihood estimation that we used for regression: For each new data point, we evaluate it using the neural network. If the answer we get isn’t correct, then we determine how to adjust the parameters to make the answer closer to the correct one. We then adjust the parameters accordingly and repeat for the next data point.
For a single neuron, with a logistic distribution, this is very straightforward: The decision boundary is a line, plane or hyperplane that separates the data space into two sides with the different labels. As in the first post, if a data point is on the wrong side then we move the decision boundary towards the point, in the hope of eventually moving it so that the point is on the correct side. If a data point is on the correct side then we move the decision boundary away from it in order to increase the margin of error in the final distribution. This behavior was determined by the slope (or technically, the gradient) of the distribution at the data point, which told us which direction would increase or decrease the value.
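For a single logistic neuron, one such update can be sketched in a few lines. This is just an illustration of the idea described above, not code from the post; the learning rate and the form of the error term are standard choices for logistic models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_neuron(weights, bias, point, label, rate=0.1):
    # One training step for a single logistic neuron.
    # `label` is 0 or 1; the prediction is a value in (0, 1).
    z = sum(w * x for w, x in zip(weights, point)) + bias
    pred = sigmoid(z)
    # If the point is badly misclassified, `error` is large and the
    # boundary moves toward the point. If the point is already on the
    # correct side, `error` is small and only nudges the boundary,
    # increasing the margin around the point.
    error = label - pred
    new_weights = [w + rate * error * x for w, x in zip(weights, point)]
    new_bias = bias + rate * error
    return new_weights, new_bias
```

Repeating this update on a misclassified point gradually drags the decision boundary until the point lands on the correct side, exactly the behavior described above.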
For the distribution defined by a general neural network, we can think of the final distribution as a composition of functions, in the sense of calculus. From this perspective, the chain rule (or rather the multi-dimensional chain rule that one usually sees in Calculus 3) allows you to calculate the gradient of the final distribution in terms of the distributions defined by the individual neurons. This is how most neural networks determine how to adjust the parameters of each neuron, a procedure called back propagation.
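For the small three-neuron network above, the chain rule calculation can be written out by hand. The sketch below is a standard textbook version of back propagation with sigmoid neurons and a squared-error gradient, not the author’s specific implementation; the parameter names and learning rate are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(params, point, label, rate=0.5):
    # Forward pass through the three-neuron network.
    z1 = params["w1"][0] * point[0] + params["w1"][1] * point[1] + params["b1"]
    z2 = params["w2"][0] * point[0] + params["w2"][1] * point[1] + params["b2"]
    h1, h2 = sigmoid(z1), sigmoid(z2)
    z3 = params["w3"][0] * h1 + params["w3"][1] * h2 + params["b3"]
    out = sigmoid(z3)

    # Backward pass: the chain rule expresses the gradient of the error
    # as a product of factors, one for each neuron the signal passed through.
    d3 = (label - out) * out * (1 - out)        # gradient at the third neuron
    d1 = d3 * params["w3"][0] * h1 * (1 - h1)   # propagated back to neuron 1
    d2 = d3 * params["w3"][1] * h2 * (1 - h2)   # propagated back to neuron 2

    # Gradient step on every parameter, moving each decision boundary.
    params["w3"] = [params["w3"][0] + rate * d3 * h1,
                    params["w3"][1] + rate * d3 * h2]
    params["b3"] += rate * d3
    params["w1"] = [params["w1"][0] + rate * d1 * point[0],
                    params["w1"][1] + rate * d1 * point[1]]
    params["b1"] += rate * d1
    params["w2"] = [params["w2"][0] + rate * d2 * point[0],
                    params["w2"][1] + rate * d2 * point[1]]
    params["b2"] += rate * d2
    return out
```

Notice that the gradients `d1` and `d2` for the first-layer neurons are computed from `d3`: the error signal is literally propagated backwards through the network, which is where the name comes from.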
For the example neural network above, let’s see what would happen when a point is misclassified. In the last post, we considered parameters on the neural network that produced decision boundaries on the first two neurons making the four regions shown on the left below. If we choose a decision boundary on the third neuron as on the right, then the final distribution will be the one shown in the center.
The black point shown in the middle picture is misclassified, so if it shows up during the training phase, we will need to adjust the parameters. The first two neurons both see the point as being on the wrong side of their decision boundaries, so each of these decision boundaries gets moved towards the point, as indicated by the arrows in the middle picture. The last neuron also sees the point as being on the wrong side, so it also moves its decision boundary towards the point, as on the right. If the data point were white, it would have exactly the opposite effect on all three neurons.
To understand what adjusting this last neuron does to the final distribution, recall that the upper region in the middle picture, which appears to be white, is actually a very light shade of gray. Moving the decision boundary on the right towards the upper right corner has the effect of making this region slightly darker. Thus if there are enough black data points in this region, the final neuron will eventually switch the region to black, while white data points in one of the other corners may switch a different region to white.
So this neural network learns in two ways: The first two neurons try to find the decision boundaries that will split the data space into the best four regions. Then the third neuron tries to choose the shades for the four regions that best match the data.
In more complex neural networks with more layers of neurons, the neurons in the first layer will still determine the regions, while the later neurons will determine different ways to combine them. At first it may seem strange that we would need more than one layer to choose the regions, but remember that each of these later neurons is very restricted in the ways that it can combine the regions, the same way the initial neurons are restricted to linear decision boundaries.
Experts often describe this as each neuron recording a single concept. The early neurons record very primitive, very specific concepts. Then later neurons combine these specific concepts into more general concepts. For example, in the neural network that Google Research recently announced for image processing, we would expect the first layer of neurons to correspond to very basic (and very common) shapes. Later neurons record how these shapes can be combined into specific objects from specific directions, such as a tabby cat’s ear from the front, or a tabby cat’s ear from the side, and so on. (Cat pictures seem to be pretty popular these days.) Then a later neuron might combine the neurons corresponding to different parts of a tabby cat from different angles, to record a picture of a single tabby cat in one particular position (probably sleeping) from one particular angle. Then a later neuron might combine all the earlier neurons for different angles of a tabby to record the concept of a tabby cat. A later neuron might combine the neurons for all different types of cats, and so on. (Of course, Google’s neural network may combine specific concepts/pictures in a completely different way, but you get the idea.)
The more layers of neurons you have, the more complex the final “concepts” recorded by the network can be. However, one issue with early neural networks was that as you add layers, the back propagation method does not adjust the early layers as much. In other words, the regions defined by the first layer of neurons tend to stay put, while the later neurons attempt to make up for this by switching the shades of the poorly chosen regions back and forth.
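A quick numeric check makes it plausible why the early layers train so slowly: each layer the error signal passes through multiplies it by a sigmoid derivative, which is at most 1/4 (attained where the sigmoid’s output is 1/2). So, ignoring the weight factors, the gradient reaching the first layer shrinks geometrically with depth.

```python
# Each sigmoid factor in the chain-rule product is s * (1 - s), which
# is at most 0.25. Stacking layers shrinks the backpropagated signal
# geometrically -- the "vanishing gradient" behavior described above.
# (The weights also contribute factors; this bound only tracks the
# sigmoid derivatives.)
MAX_SIGMOID_DERIVATIVE = 0.25

for depth in [1, 3, 5, 10]:
    bound = MAX_SIGMOID_DERIVATIVE ** depth
    print(f"{depth} layers deep: gradient factor at most {bound:.7f}")
```

With ten layers the factor is already below one in a million, so the first layer barely moves on any single training point.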
The recent resurgence of interest in neural networks (or deep learning, a slightly broader area in which neural networks are a major part) has been partially fueled by more sophisticated training techniques that, when combined with back propagation, ensure that the early neurons are properly trained. (Another major factor in the resurgence of neural networks is the availability of computing power, which makes it possible to cycle through the training data more times with back propagation, which also greatly improves the final result.) However, a discussion of these methods will have to wait for a far future post. Instead, in the next few weeks I plan to discuss some other algorithms that combine simple models/distributions to make more complex ones.