In the last post, I introduced the Support Vector Machine (SVM) algorithm, which attempts to find a line/plane/hyperplane that separates the two classes of points in a given data set. This algorithm adapts elements of linear regression, a statistical tool (namely, approximating the data set with a relatively simple model) to the classification problem, which has its roots in computer science. As we saw, one of the problems with SVM is that the output ultimately depends on a relatively small number of data points, namely the support vectors. In this post, I’ll discuss logistic regression, an algorithm that attempts to fix this problem by adapting the regression more directly to the problem of classification.

Recall that in linear regression, we used lines in the plane (or planes/hyperplanes in higher dimensions) to describe different probability distributions. We then chose the line such that the corresponding distribution assigned the maximal probability to the given data points. This general technique is often called *maximum likelihood estimation*. (Correction: I originally said *expectation maximization* rather than *maximum likelihood*. See the comments for an explanation of the difference.)

For linear regression, we defined each distribution to have probability 1 directly on the line and to have the probability go to 0 as we moved farther from the line (following a Gaussian bell curve, shown on the left in the Figure below). This way, maximizing the probability was equivalent to drawing the line as close as possible to the points. But if we have two classes of points, we don’t want the line to be close to all the points – we want it to be in the space between the two classes of points, in some sense as far as possible from the points. In particular, we need to define a maximum likelihood problem that takes into account the two different types of data points.
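To make the linear regression picture concrete, here is a minimal Python sketch (the function name is my own, not from the post) of the bell-curve height assigned to a point at a given distance from the line, with the peak normalized to 1:

```python
import math

def gaussian_height(distance, sigma=1.0):
    """Height of the bell curve at a given distance from the regression line (peak normalized to 1)."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2))

# The assigned probability falls off as a point moves away from the line,
# so maximizing it is the same as pulling the line toward the points.
print([round(gaussian_height(d), 3) for d in (0.0, 0.5, 1.0, 2.0)])
```

Since the height only depends on the distance, the best line under this model is the one that keeps the points as close as possible.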

We’ll do this by thinking of one of the classes of data (such as the blue points) as being at height one above the plane, while the other class (the green points) sits right in the plane. We’ll then choose a distribution that is close to the plane on one side of the line, but above the plane on the other side. A cross section of this distribution is a curve called the *logistic function,* shown in the center of the figure. (This is where logistic regression gets its name.) The 3D picture, which applies to two-dimensional data, is shown on the right. For higher dimensional data, we have a very similar situation, though (as always) it’s much harder to visualize.
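The logistic curve in that cross section is easy to write down. A quick Python sketch (not from the original figure) shows how it rises smoothly from height 0 on one side of the line to height 1 on the other:

```python
import math

def logistic(z):
    """The logistic (sigmoid) function: near 0 for large negative z, near 1 for large positive z."""
    return 1.0 / (1.0 + math.exp(-z))

# Heights transition smoothly from 0 to 1 as we cross the dividing line at z = 0
print([round(logistic(z), 3) for z in (-4, -2, 0, 2, 4)])
# → [0.018, 0.119, 0.5, 0.881, 0.982]
```

Here `z` is the (signed) distance from the dividing line, so the function sits near the plane on one side and near height one on the other.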

Recall that we found the best distribution for linear regression by choosing the one in which the data points are in the darkest possible parts of the distribution. Alternately, we can think of this as putting all the data points at height one above the plane, then choosing the distribution where the Gaussian function is as close as possible to these points, as on the left in the second Figure. For logistic regression, since we’ve put our blue points at height one and our green points at height zero, we will again want to choose the distribution that minimizes the distances from the distribution function (this time, the one based on the logistic curve) to the points above and in the plane. We can think of this as having the data points pull (or push?) on the distribution.

In some ways, this is reminiscent of the Support Vector Machine algorithm, in which we expand a region around the dividing line/plane/hyperplane, until it bumps into data points on either side, then allow the data points to “push back” on the region, forcing the line/plane/hyperplane into a position that maximizes the buffer on either side of the line. As a result, only the data points that the buffer touches (i.e. the support vectors) push on it.

With logistic regression, on the other hand, every data point pushes on the distribution, though not with equal force. Like springs, the points closer to the dividing line/plane/hyperplane (not to mention the points on the wrong side) push harder, because the vertical line from the point to the distribution function is longer. (This is also reminiscent of how I described PCA.) The green points push the dividing line towards the blue points and the blue points push it back towards the green. The distribution that the logistic regression algorithm chooses can be thought of as the equilibrium point of all these forces.
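We can put a number on each point’s “push”: it is the gap between the point’s height (0 or 1) and the height of the distribution function above it, which is exactly the residual that appears in the log-likelihood gradient. A small Python sketch (the setup is my own illustration, not from the post):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Dividing line x1 = 0, with blue (label 1) predicted on the positive side.
w, b = [1.0, 0.0], 0.0

# Three blue points: far on the correct side, near the line, and on the wrong side.
for (x1, x2), y in [((5, 0), 1), ((1, 0), 1), ((-1, 0), 1)]:
    p = sigmoid(w[0] * x1 + w[1] * x2 + b)
    print((x1, x2), round(abs(y - p), 3))  # the "force" this point exerts
```

The far, well-classified point barely pushes at all, the point near the line pushes moderately, and the misclassified point pushes hardest, just like springs stretched by different amounts.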

Another way to think about it, closer to the picture in the linear regression post, is that instead of a distribution that looks like a dark cloud concentrated near the line, we have a distribution that is mostly blue on one side of the line, mostly green on the other side, and has a transition in the middle. We then choose among all the models corresponding to the different lines, by picking the one in which the green data points are in the greenest parts of the distribution and the blue points are in the bluest parts of the distribution. In the example above, you can see that the distribution on the left includes blue points in the green part of the distribution and vice versa. The distribution on the right does a much better job of matching blue to blue and green to green, so this would be closer to the model chosen by logistic regression.
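This “pick the model that colors the points best” idea is just comparing log-likelihoods. Here is a minimal Python sketch (the data and the two candidate lines are my own toy example, not the figure from the post) scoring a line that runs between the classes against one that cuts through them:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, b, points, labels):
    """Log of the probability the model assigns to the labeled points."""
    total = 0.0
    for (x1, x2), y in zip(points, labels):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)  # model's probability that the point is blue
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Green points (label 0) on the left, blue points (label 1) on the right of x1 = 0.
points = [(-2, 0), (-1, 1), (2, 0), (1, 1)]
labels = [0, 0, 1, 1]

between = ([1.0, 0.0], 0.0)   # vertical line running between the two classes
through = ([0.0, 1.0], 0.0)   # horizontal line cutting through both classes
print(log_likelihood(*between, points, labels))  # higher (closer to 0) is better
print(log_likelihood(*through, points, labels))
```

The line between the classes assigns a much higher probability to the data, so logistic regression prefers it, matching the right-hand distribution in the figure.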

It’s worth mentioning that while there is a simple equation for linear regression that immediately gives you the parameters of the best fit line/plane/hyperplane, there is no such equation for logistic regression. Instead, the algorithm is usually implemented by choosing a line/plane/hyperplane at random, then calculating which direction to move it to improve the fit (or, equivalently, what direction the forces of the data points are pushing it in). The process is then repeated on the new line/plane/hyperplane over and over until the combined forces get close enough to equilibrium that repeating further won’t make a noticeable difference.
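The whole iterative procedure fits in a few lines of Python. This is a minimal sketch of gradient ascent on the log-likelihood (function names, learning rate, and step count are my own choices, not from the post); each update moves the line in the direction the data points are collectively pushing:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(points, labels, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood for a dividing line w[0]*x1 + w[1]*x2 + b = 0."""
    random.seed(0)  # start from a (reproducible) random line
    w = [random.uniform(-1, 1), random.uniform(-1, 1)]
    b = random.uniform(-1, 1)
    for _ in range(steps):
        gw1 = gw2 = gb = 0.0
        for (x1, x2), y in zip(points, labels):
            p = sigmoid(w[0] * x1 + w[1] * x2 + b)
            # Each point's "push" on the line is proportional to how wrong the model is about it.
            gw1 += (y - p) * x1
            gw2 += (y - p) * x2
            gb += (y - p)
        w[0] += lr * gw1
        w[1] += lr * gw2
        b += lr * gb
    return w, b

# Green points (label 0) on the left, blue points (label 1) on the right of x1 = 0.
points = [(-2, 0), (-1, 1), (-1.5, -1), (2, 0), (1, 1), (1.5, -1)]
labels = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(points, labels)
print(w, b)
```

When the combined pushes `(y - p)` roughly cancel out, the updates stop changing anything noticeable and we’ve reached the equilibrium described above.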

So this reduces the problem of susceptibility to noise that we saw with SVM. If the data is very noisy/inaccurate then logistic regression will, of course, give you a lousy answer. But unlike SVM, if there are a small number of mislabeled points, outliers, or points with a few incorrect coordinates, then the remaining correct points will (hopefully) outweigh them. On its own, logistic regression still restricts us to a relatively simple model, namely dividing the space of data points along a single line/plane/hyperplane. We were able to modify the linear regression algorithm to handle more general models by replacing the equation of the line with a more complicated equation describing a curve, or in general replacing the plane/hyperplane with a higher dimensional curved shape. However, rather than doing this directly with logistic regression (and even with SVM), it is more common to use a general method called a *kernel*. But a discussion of kernels will have to wait for a future post.

I think you have your terminology slightly wrong. As far as I know, Expectation Maximization is a technique to fill in incomplete data (like cluster membership in KMeans). OTOH Maximum Likelihood Estimation is the technique/idea to use the probability a model assigns to real data as the objective function to optimize when searching over models (eg, weights for logistic regression). EM is used in latent variable models when doing MLE. But you can optimize simple models using MLE without EM.

Thanks for pointing that out. (You can tell my background is not in statistics.) Wikipedia seems to confirm this: expectation maximization is a way of assigning labels to unlabeled data points, based on a number of existing distributions. Maximum likelihood is a process like what I describe above, where you fit a model by maximizing the probability it assigns to given labeled data. I’ve made the appropriate changes above.

You still say expectation maximization in the paragraph below.

I’ve personally found plots of the loss functions themselves interesting/enlightening.

http://research.microsoft.com/en-us/um/people/manik/projects/trade-off/hinge.html

E.g., you can see that the hinge and logistic losses aren’t that different: the hinge loss ignores well-classified points entirely, whereas the logistic loss only asymptotes in that direction.

Got it. Thanks!

Hi Jesse,

This is a great resource for beginners like me to start thinking about topology and data analysis. I’ve been following the Table of Contents posts up until this point and have understood everything so far (being new to maths that says something about how well you have written this). But I just couldn’t get my head around this bit:

“We’ll do this by thinking of one of the classes of data (such as the blue points) as being at height one above the plane, while the other class (the green points) are right in the plane. We’ll then choose a distribution that is close to the plane on one side of the line, but above the plane on the other side.”

I’m really struggling to visualise what you mean here (maybe I’ve been staring at it too long!), and how it relates to the figures above and below this bit. Is it possible for an extra graphic or another way to phrase this to help mortals like myself understand?

Many thanks, please keep writing!