Neural networks, linear transformations and word embeddings

In past posts, I’ve described the geometry of artificial neural networks by thinking of the output from each neuron in the network as defining a probability density function on the space of input vectors. This is useful for understanding how a single neuron combines the outputs of other neurons to form a more complex shape. However, it’s often useful to think about how multiple neurons behave at the same time, particularly for networks that are defined by successive layers of neurons. For such networks – which turn out to be the vast majority of networks in practice – it’s useful to think about how the set of outputs from each layer determines the set of outputs of the next layer. In this post, I want to discuss how we can think about this in terms of linear transformations (via matrices) and how this idea leads to a tool called word embeddings, the most popular of which is probably word2vec.
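As a rough illustration of the layer-to-layer idea, here is a minimal sketch in plain Python. It assumes each layer multiplies its input vector by a weight matrix, adds a bias, and applies a sigmoid; the function name `layer_forward` and the particular weights are made up for this example and are not taken from the post.

```python
import math

def layer_forward(weights, biases, inputs):
    """Compute one layer's outputs: a matrix-vector product plus bias, then a sigmoid."""
    outputs = []
    for row, bias in zip(weights, biases):
        # Each row of the matrix holds the incoming weights of one neuron.
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

# Hypothetical layer with 2 neurons reading 3 inputs (a 2x3 matrix).
W = [[0.5, -1.0, 0.25],
     [1.5, 0.0, -0.5]]
b = [0.0, 0.1]
x = [1.0, 2.0, 3.0]

hidden = layer_forward(W, b, x)
print(hidden)
```

The linear part is entirely in the matrix `W`: stacking layers means feeding `hidden` into another call with a different matrix.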

Posted in Neural Networks | 10 Comments

Recurrent Neural Networks

So far on this blog, we’ve mostly looked at data in two forms – vectors in which each data point is defined by a fixed set of features, and graphs in which each data point is defined by its connections to other data points. For other forms of data, notably sequences such as text and sound, I described a few ways of transforming these into vectors, such as bag-of-words and n-grams. However, it turns out there are also ways to build machine learning models that use sequential data directly. In this post, I want to describe one such approach, called a recurrent neural network.
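To make the vector-based approaches concrete, here is a small sketch of bag-of-words and n-gram counting in plain Python. The helper names and the whitespace tokenizer are assumptions chosen for brevity, not anything defined in the earlier posts.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(text.lower().split())

def ngrams(text, n=2):
    """Count each run of n consecutive words, which keeps some local order."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

first = "the dog chased the cat"
second = "the cat chased the dog"

# Same words, different order: the bags match, but the bigrams do not.
same_bag = bag_of_words(first) == bag_of_words(second)
same_bigrams = ngrams(first, 2) == ngrams(second, 2)
print(same_bag, same_bigrams)
```

Both representations discard most of the sequence structure, which is the gap that a model working on the sequence directly is meant to fill.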

Posted in Neural Networks | 5 Comments

GPUs and Neural Networks

Artificial neural networks have been around for a long time – since either the 1940s or the 1950s, depending on how you count. But they’ve only started to be used for practical applications such as image recognition in the last few years. Some of the recent progress is based on theoretical breakthroughs such as convolutional neural networks, but a much bigger factor seems to be hardware. It turns out that small neural networks aren’t that much better than many simpler machine learning algorithms. Neural networks only excel when you have much more complex data and a large, complex network. But up until recently, the available hardware simply couldn’t handle such complexity. Moore’s law helped with this, but an even bigger part has been played by a type of chip called a GPU, or Graphics Processing Unit. These were originally designed to speed up computer graphics, but they can also be used for other types of processing. In some cases, GPUs can be as much as 100 times as fast as standard CPUs. However, it turns out you only get this speedup with a fairly narrow category of tasks, many of which happen to be necessary for processing neural networks. In this post, I want to discuss what types of task these are and why GPUs are so much faster at them.

Posted in Neural Networks | 11 Comments

Genetic algorithms and symbolic regression

A few months ago, I wrote a post about optimization using gradient descent, which involves searching for a model that best meets certain criteria by repeatedly making adjustments that improve things a little bit at a time. In many situations, this works quite well and will always or almost always find the best solution. But in other cases, it’s possible for this approach to fall into a locally optimal solution that isn’t the overall best, but is better than any nearby solution. A common way to deal with this sort of situation is to add some randomness into the algorithm, making it possible to jump out of one of these locally optimal solutions into a slightly worse solution that is adjacent to a much better one. In this post, I want to explore one such approach, called a genetic algorithm (or an evolutionary algorithm), which I’ll illustrate with a specific type of genetic algorithm called symbolic regression. I first heard about this in an article about a small company in Somerville, MA called Nutonian that has built a whole data science platform around it.
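The local-optimum problem is easy to reproduce in one dimension. The sketch below is not a genetic algorithm; it uses plain gradient descent plus random restarts, a simpler way of adding randomness, on a made-up function f(x) = x^4 - 3x^2 + x that has one shallow and one deep minimum. The function, learning rate, and step counts are all assumptions picked for illustration.

```python
import random

def f(x):
    """Toy objective with a shallow minimum near x = 1.13 and a deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent: repeatedly step downhill from the starting point."""
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

# Starting on the right-hand slope, descent settles into the shallow minimum.
x_local = descend(2.0)

# Random restarts: try several random starting points and keep the best result.
rng = random.Random(0)
candidates = [descend(rng.uniform(-2.0, 2.0)) for _ in range(20)]
x_best = min(candidates, key=f)

print(x_local, f(x_local))
print(x_best, f(x_best))
```

A genetic algorithm goes further than this by recombining and mutating a whole population of candidate solutions rather than restarting each one independently.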

Posted in Modeling, Regression | Leave a comment


I’m going to start this post with a confession: Up until a few days ago, the only thing I knew about p-values was that Randall Munroe didn’t seem to like them. My background is in geometry, not statistics, even though I occasionally try to fake it. But it turns out that a lot of other people don’t like p-values either, such as the journal Basic and Applied Social Psychology which recently banned them. So I decided to do some reading (primarily Wikipedia) and it turns out, like most things in the world of data, there’s some very interesting geometry involved, at least if you know where to look.

Posted in Fundamentals | 12 Comments

Convolutional neural networks

Neural networks have been around for a number of decades now and have seen their ups and downs. Recently they’ve proved to be extremely powerful for image recognition problems. Or, rather, a particular type of neural network called a convolutional neural network has proved very effective. In this post, I want to build on the series of posts I wrote about neural networks a few months ago, plus some ideas from my post on digital images, to explain the difference between a convolutional neural network and a classical (is that the right term?) neural network.

Posted in Classification, Feature extraction | 3 Comments

Precision, Recall, AUCs and ROCs

I occasionally like to look at the ongoing Kaggle competitions to see what kind of data problems people are interested in (and the discussion boards are a good place to find out what techniques are popular). Each competition includes a way of scoring the submissions, based on the type of problem. An interesting one that I’ve seen for a number of classification problems is the area under the Receiver Operating Characteristic (ROC) curve, sometimes shortened to the ROC score or AUC (Area Under the Curve). In this post, I want to discuss some interesting properties of this scoring system, and its relation to another similar measure – precision/recall.
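For readers who want to see the two measures side by side, here is a minimal sketch in plain Python. It relies on a standard fact about the AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function names, labels, scores, and the 0.5 threshold are invented for the example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

def precision_recall(labels, scores, threshold):
    """Precision and recall when every score at or above the threshold is called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical classifier output: 1 = positive class, higher score = more confident.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores, threshold=0.5)
print(auc, precision, recall)
```

Note the difference in inputs: the AUC depends only on how the scores rank the examples, while precision and recall depend on a specific threshold.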

Posted in Classification | 3 Comments