Convolutional neural networks

Neural networks have been around for a number of decades now and have seen their ups and downs. Recently they’ve proved to be extremely powerful for image recognition problems. Or, rather, a particular type of neural network called a convolutional neural network has proved very effective. In this post, I want to build off of the series of posts I wrote about neural networks a few months ago, plus some ideas from my post on digital images, to explain the difference between a convolutional neural network and a classical (is that the right term?) neural network.

First, let me quickly review the idea behind a neural network: We start with a collection of neurons, each of which takes a collection of input values and uses them to calculate a single output value. Then we hook them all together, so that the inputs to each neuron are attached to either the outputs of other neurons or to coordinates/features of a data point that is fed into the network.

When you input a data point into a neural network, the outputs of the first level of neurons are calculated, then they feed into the later neurons and so on until all the neurons have set their outputs based (directly or indirectly) on the input data. Abstractly, we can think of each neuron in a neural network as representing an “idea”. The output of the neuron should be a value close to 1 if that “idea” is present in a given input data point, and close to 0 otherwise. The earlier neurons will represent relatively simple, low-level ideas, while the later neurons represent higher-level, more abstract ideas that are combinations of the ideas defined by the earlier neurons. This perspective is kind of hard to grasp in general, but it starts to make sense in more specific contexts, such as analyzing pictures.

If we’re making a neural network to analyze images, then the input to the neural network will be a vector like we saw in the post on digital images: each dimension will represent how light one of the pixels is (or one of its RGB values if it’s a color image, but for simplicity let’s stick to grey-scale). We saw that when we encoded images as vectors this way, vectors that were nearby in the data space corresponded to images that matched up very closely. We can use this fact to understand how the neurons in a neural network respond to an image.

The standard way for a neuron to compute its output is to take a weighted sum of its input values, then apply a function with a steep drop-off that sends all values below some threshold to values near 0, and all values above that threshold to values near 1. By a weighted sum, I mean that each input is multiplied by a preset value (or, rather, a value that is set during the training phase), then the results are all added together. These preset values define a vector with the same number of features/dimensions as the vector defined by the input values, and this weighted sum is essentially a dot product of the two vectors.
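To make the weighted-sum-then-threshold recipe concrete, here is a minimal sketch of a single neuron. The logistic (sigmoid) function stands in for the "steep drop-off" function, and the `steepness` parameter, the threshold and the weight values are all assumptions I picked for illustration, not anything fixed by the description above.

```python
import math

def sigmoid(z):
    """Smooth step: inputs well below 0 map near 0, well above 0 map near 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, threshold, steepness=10.0):
    """Weighted sum of the inputs (a dot product), then a steep threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(steepness * (weighted_sum - threshold))

# Hypothetical weights; in a real network these are set during training.
weights = [0.5, -0.25, 1.0]

strong = neuron_output(weights, [1.0, 0.0, 1.0], threshold=1.0)  # sum is 1.5
weak = neuron_output(weights, [0.0, 1.0, 0.0], threshold=1.0)    # sum is -0.25

print(strong, weak)
```

The first input lands above the threshold, so the output is close to 1; the second lands below it, so the output is close to 0.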

If you don’t remember what a dot product is, don’t worry – all you need to know is that (under the appropriate assumptions that I’ll gloss over) the resulting value is higher when the two vectors are closer together, and goes to zero as the vectors move farther apart. (Gory details: The dot product of two unit vectors is the cosine of the angle between them, and cosine is close to one for small angles, then goes to zero as the angle increases, up to a right angle.)
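If you'd like to see the cosine behaviour for yourself, the following sketch builds unit vectors in the plane at a few angles and takes their dot product with a fixed reference vector. The helper names (`dot`, `unit_vector`) and the particular angles are my own choices for illustration.

```python
import math

def dot(u, v):
    """Multiply matching coordinates and add up the results."""
    return sum(a * b for a, b in zip(u, v))

def unit_vector(angle_degrees):
    """Length-1 vector in the plane at the given angle from the x-axis."""
    a = math.radians(angle_degrees)
    return [math.cos(a), math.sin(a)]

reference = unit_vector(0)
similarities = {angle: dot(reference, unit_vector(angle))
                for angle in (0, 10, 45, 90)}

for angle, value in similarities.items():
    print(angle, round(value, 3))
```

The printed values shrink steadily from 1 at zero degrees to (essentially) 0 at a right angle.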

So, for the neurons that get their input directly from each incoming data point, we can interpret this as follows: The fixed vector of weights defines an image. The neuron calculates its output by comparing the input image to this fixed image, and if these images are reasonably close, its output is close to 1. Otherwise, it’s close to 0. If we want these front-line neurons to encode basic, low-level ideas, then we probably don’t want them to do this with the whole image. Instead, we can have each of these neurons hooked up to the pixels that form a small rectangle somewhere in the image, and each one will check whether the part of the image in that rectangle matches some fixed image (that was “learned” during the training process). So, for example, if the fixed image looks like an eye, as in the Figure on the right, then the neuron will output a value close to 1 if there is a similar-looking eye in its particular rectangle. There might be another neuron that checks if the image in the rectangle looks like a hand (or a paw?) and so on.

But here’s the problem: Since each neuron is hooked up to a particular rectangle of pixels, what happens if there’s an eye in the picture just a little bit to the left of the rectangle that our neuron is looking at (such as on the right side of the Figure)? Well, its rectangle will only see part of the eye, which won’t match very well. (The dot product only works well if the images match up pixel-to-pixel.) So we need another neuron to be in charge of checking whether the rectangle just to the left of the first neuron’s rectangle has an eye in it. In fact, we would need to fill the whole image with rectangles like this, and have a separate neuron for each of them.
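Here is a toy version of that problem, as a rough sketch. The 3x3 "eye" template (a bright ring around a dark centre), the 5x6 image size and the helper names are all made up for illustration; the neuron is assumed to always watch the rectangle whose top-left corner is at row 1, column 2.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def crop(image, top, left, height, width):
    """Flatten the height x width rectangle at (top, left) into a vector."""
    return [image[r][c] for r in range(top, top + height)
            for c in range(left, left + width)]

# Hypothetical 3x3 "eye": bright ring (1s) around a dark centre (0).
template = [1, 1, 1,
            1, 0, 1,
            1, 1, 1]

def image_with_eye(top, left, rows=5, cols=6):
    """Blank image with the template drawn at (top, left)."""
    image = [[0] * cols for _ in range(rows)]
    for i in range(3):
        for j in range(3):
            image[top + i][left + j] = template[3 * i + j]
    return image

aligned = image_with_eye(1, 2)   # eye exactly inside the watched rectangle
shifted = image_with_eye(1, 1)   # same eye, one pixel to the left

score_aligned = dot(template, crop(aligned, 1, 2, 3, 3))
score_shifted = dot(template, crop(shifted, 1, 2, 3, 3))
print(score_aligned, score_shifted)
```

Moving the eye by a single pixel is enough to cut the raw score sharply, even though the picture contains exactly the same eye.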

But this introduces two new problems: First, we no longer have a single neuron representing a single “idea” – we have thousands of neurons representing the same idea, which means we need a much larger neural network to encode a relatively small number of ideas.
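A quick back-of-the-envelope count shows how fast this grows. The sizes below (a 100x100 image and a 10x10 rectangle, slid one pixel at a time) are hypothetical numbers chosen only to make the arithmetic concrete.

```python
def rectangle_positions(image_h, image_w, rect_h, rect_w):
    """Number of places a rect_h x rect_w rectangle fits inside the image."""
    return (image_h - rect_h + 1) * (image_w - rect_w + 1)

# Hypothetical sizes, not taken from any particular network.
positions = rectangle_positions(100, 100, 10, 10)
weights_per_neuron = 10 * 10
total_weights = positions * weights_per_neuron

print(positions)       # 8281 neurons, all encoding the same idea
print(total_weights)   # 828100 separately trained weights
```

So under these assumptions a single idea like "an eye" costs thousands of neurons and hundreds of thousands of weights, compared with the 100 weights needed to describe the eye itself.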

Second, remember that the weights used by each neuron come from a training process in which data is fed into the network, and the output is compared to a desired value. When the output is different from the desired value, the weights are changed in a way so that the output will be closer to correct next time. Because of the way that dot products work, this means that the first level of neurons will try to adjust their weights to match the data points that come in during the training phase. In order for one of our neurons that’s watching a particular rectangle to end up with weights that define the image of an eye, you need to have lots of input images with an eye in that particular rectangle. In order for the trained neural network to recognize an eye in a future image, there must have been a training image with an eye in the exact same spot.
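The weight adjustment can be sketched for a single neuron as follows. I'm assuming a sigmoid output, a squared-error score and plain gradient descent with no bias term; none of these choices is fixed by the description above, and the learning rate is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, inputs, target, rate=0.5):
    """One gradient-descent step on the error 0.5 * (output - target)**2."""
    output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    # Chain rule: d(error)/d(weight_i) = (output - target) * sigmoid' * input_i
    delta = (output - target) * output * (1.0 - output)
    return [w - rate * delta * x for w, x in zip(weights, inputs)]

pattern = [1.0, 0.0, 1.0]      # hypothetical training input
weights = [0.0, 0.0, 0.0]
for _ in range(100):
    weights = train_step(weights, pattern, target=1.0)

print(weights)
```

Each step nudges the weights in the direction of the input vector itself, so the weights attached to bright pixels grow while the weight attached to the dark pixel never moves. That is the sense in which a neuron's weights come to "match" the data it sees in its own rectangle.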

A convolutional neural network attempts to fix these two issues by allowing a single neuron to watch many different rectangles in each incoming image. In other words, for each of the incoming rectangles, a convolutional neuron will calculate a dot product between its fixed vector and the vector defined by this rectangle, then it will combine all the values in some way, such as by adding them together or taking a maximum.
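As a rough sketch of that idea, the function below slides one weight vector over every rectangle of the image, takes the dot product at each position, and combines the results with `max` (or `sum`, if you pass it in). The toy 3x3 template, the 6x7 image and all the function names are assumptions for illustration, and I've left out the final threshold function to keep the numbers easy to read.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def crop(image, top, left, height, width):
    """Flatten the height x width rectangle at (top, left) into a vector."""
    return [image[r][c] for r in range(top, top + height)
            for c in range(left, left + width)]

def conv_neuron(image, weights, height, width, combine=max):
    """Score every rectangle with the same weights, then combine the scores."""
    rows, cols = len(image), len(image[0])
    scores = [dot(weights, crop(image, top, left, height, width))
              for top in range(rows - height + 1)
              for left in range(cols - width + 1)]
    return combine(scores)

# Hypothetical 3x3 "eye": bright ring around a dark centre.
template = [1, 1, 1,
            1, 0, 1,
            1, 1, 1]

def image_with_eye(top, left, rows=6, cols=7):
    image = [[0] * cols for _ in range(rows)]
    for i in range(3):
        for j in range(3):
            image[top + i][left + j] = template[3 * i + j]
    return image

top_left = conv_neuron(image_with_eye(0, 0), template, 3, 3)
bottom_right = conv_neuron(image_with_eye(3, 4), template, 3, 3)
print(top_left, bottom_right)
```

With `max` as the combining step, the score is the same whether the eye sits in the top-left or the bottom-right corner, and only one set of nine weights is needed.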

Let’s say we have a convolutional neuron that takes the maximum over all these rectangles and whose fixed vector defines the image of an eye. Then the output from this neuron will be close to 1 whenever there is an image resembling an eye (at the correct scale) anywhere in the picture. So this neuron captures the low-level idea of “an eye” in a much more useful way than a standard neuron.

Things get slightly more complex during the training process, but recall that the standard way of training a neural network is mostly agnostic to how the neurons calculate their output values. In particular, you only need to be able to determine what changes to the weights in the neuron will improve the overall score for any given input vector. This is slightly more complex for a convolutional neuron than for a standard neuron, but once you get it right, the first layer of neurons in a convolutional neural network will begin adapting themselves to small images that commonly appear anywhere in the input images that you use to train the network.
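One way to see why the extra complexity is manageable: for a neuron that takes a maximum, only the best-matching rectangle affects the output, so the weight update can be computed from that one rectangle. The sketch below assumes a max-combining neuron with no threshold function, a squared-error score and an arbitrary learning rate; it is an illustration of the idea rather than how any particular library does it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def all_patches(image, height, width):
    """Every height x width rectangle of the image, flattened to a vector."""
    rows, cols = len(image), len(image[0])
    return [[image[r][c] for r in range(top, top + height)
             for c in range(left, left + width)]
            for top in range(rows - height + 1)
            for left in range(cols - width + 1)]

def train_step(image, weights, height, width, target, rate=0.1):
    """One gradient step on 0.5 * (max_score - target)**2."""
    # Only the winning rectangle influences the max, so only it gets a gradient.
    best = max(all_patches(image, height, width),
               key=lambda patch: dot(weights, patch))
    error = dot(weights, best) - target
    return [w - rate * error * x for w, x in zip(weights, best)]

# Hypothetical 5x6 image containing a bright ring with a dark centre.
image = [[0, 0, 0, 0, 0, 0],
         [0, 0, 1, 1, 1, 0],
         [0, 0, 1, 0, 1, 0],
         [0, 0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0, 0]]

weights = [0.1] * 9
updated = train_step(image, weights, 3, 3, target=1.0)
print([round(w, 3) for w in updated])
```

After one step the weights on the ring have grown and the centre weight is unchanged, so the weight vector has started to look like the small image it found, wherever in the picture that image happened to be.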
