I’m going to start this post with a confession: Up until a few days ago, the only thing I knew about p-values was that Randall Munroe didn’t seem to like them. My background is in geometry, not statistics, even though I occasionally try to fake it. But it turns out that a lot of other people don’t like p-values either, such as the journal Basic and Applied Social Psychology which recently banned them. So I decided to do some reading (primarily Wikipedia) and it turns out, like most things in the world of data, there’s some very interesting geometry involved, at least if you know where to look.

The goal of a p-value is to answer the following question: Given a probability distribution and a collection of data points, how likely is it that these points were sampled from the distribution? If the data points are unlikely to have occurred, this can be interpreted as evidence that the data points came from a different distribution. For example, let’s say we were to measure the heights of 1000 adults from the US, then calculate the average height and the standard deviation. We could use this to define a Gaussian (i.e. bell curve) probability distribution, centered at the average and with the same standard deviation, which should describe the heights of the overall population. This is shown below as the graph of the Gaussian function above the points, and as a heat map of the density function below the points.

Now, it just so happens that 100 out of the 1000 adults that we measured were left-handed. If we take the average height of these 100 adults, we’re bound to get something at least slightly different from the overall average. But let’s say the average height of the left-handed adults is a full two inches higher than the average for the overall population. Can we conclude that on average, left-handed adults are taller than right-handed adults? In other words, should we conclude that the left-handed samples follow a different distribution whose peak is two inches higher? Or is it more likely that this was just a statistical quirk – a random act of probability? A slightly more precise question is: what’s the probability that a completely random sample of 100 people out of the original 1000 would have an average height two inches taller than the overall average? If this probability is high, then the higher left-handed average is probably just a quirk. If not then… well, this is where the controversy lies.

Note that the answer to this question is actually zero, because of a kind of silly technicality: In this situation, there are infinitely many possible numbers we could get, so the probability of getting any one particular number is zero. In order to get a non-zero probability, you have to look at a range of numbers. Note that I may have occasionally written things in past posts that suggested otherwise, but if you go back and read them, you’ll see that I was intentionally vague on the matter.

The standard way to deal with this issue when it comes to p-values is to ask what’s the probability that a random sample would have the found value, or a more extreme value. In the height example, this means we want to calculate the probability that a random sample of any 100 adults would have an average height 2 inches *or more* above the overall average. This probability is the *p-value* of the result. A lower p-value means that it’s less likely that a given difference was only the result of a statistical quirk, and is often interpreted to mean that the phenomenon it suggests must be real. The problem, as the XKCD comic points out, is that by definition, some fraction of the time, a result with a low p-value will, in fact, be caused entirely by a statistical quirk.

But rather than getting into the pedantic business of interpreting p-values, let’s get to the geometry. It turns out things get interesting when you try to actually calculate the probabilities involved. To do this, we’re going to have to start small and work our way up. In fact, we’ll start with one: What is the probability that a single person chosen at random will be two or more inches taller than the overall average?

Recall that we decided to model the heights of the whole population as a probability distribution with a Gaussian density function, i.e. a bell curve like the one shown below. The peak of the bell curve is at the overall average height, and two inches taller than that is just a bit to the right of the peak. The bell curve is carefully constructed so that the region below it has area exactly 1. The probability of choosing someone whose height is two or more inches above the average is the area of the smaller region below the bell curve and to the right of this plus-two-inches mark. Note that even though this region is infinitely wide, its area is finite, and in fact less than one. Once you wrap your head around that (or just ignore it and move on), this is conceptually pretty simple, even if the actual calculation may not be fun.
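For readers who want to see the one-person calculation, here is a minimal sketch. The post never fixes a standard deviation, so the value of 3 inches below is purely a made-up assumption, as are the names `upper_tail`, `SIGMA` and `OFFSET`. The area to the right of a mark under a bell curve can be computed with the complementary error function from Python's standard library.

```python
import math


def upper_tail(z):
    """Area under the standard bell curve (mean 0, std 1) to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


SIGMA = 3.0   # hypothetical standard deviation in inches (not from the post)
OFFSET = 2.0  # how far above the average we are looking, in inches

# Rescale the two-inch mark into units of standard deviations,
# then measure the area to the right of it.
p_one = upper_tail(OFFSET / SIGMA)
print(f"P(one person is 2+ inches above average) = {p_one:.4f}")
```

With a different assumed standard deviation the number changes, but the picture stays the same: the probability is the area of the right-hand tail.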

On to the next step: Two people. We would expect the chances of picking two people whose average is at least two inches above the overall average to be slimmer than picking just one, since either both people would have to be at least one or two inches above average, or maybe one of the people would have to be at least four inches above average, and the other one not too much below average. But coming up with an explanation for this in terms of probability distributions is a bit trickier.

If we choose two people, then their two heights will define a vector in the two-dimensional space of all possible pairs of heights. This is sort of like a configuration space. We would like to understand the probability of choosing a vector in different regions of this space. In other words, we want to understand the probability distribution on all possible pairs of heights. It turns out that for any point *(x, y)* in the plane, the density function for this two-dimensional probability distribution will be simply the height of the bell curve at *x* times the height of the bell curve at *y*. Because of the properties of exponentials, this turns out to define a two-dimensional Gaussian distribution like the one shown as a heat map below.
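Written out, and assuming the two heights are drawn independently from the same bell curve with mean μ and standard deviation σ (symbols that the post itself does not use), the product of the two one-dimensional densities is:

```latex
f(x, y)
  = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
    \cdot
    \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}
  = \frac{1}{2\pi\sigma^2} e^{-\frac{(x-\mu)^2 + (y-\mu)^2}{2\sigma^2}}
```

The exponents add because of the product rule for exponentials, so the density depends only on the squared distance from the point (μ, μ). That is why the heat map is made of concentric circles.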

Once we have this density function, we can imagine graphing it in three-dimensional space above the two-dimensional data plane, though I won’t try to draw this. The result is a mountain with a single peak. The probability of picking a pair of heights in a given region in the two-dimensional configuration space is the volume of the region below the mountain graph, and above that region.

The region we’re interested in is the set of all points whose average is two or more inches above the overall average height of the population. This turns out to be the region above and to the right of the diagonal line shown in the Figure that passes just above and to the right of the peak of the mountain. In fact, the closest it ever gets to the peak is 2√2 (just under 3) inches away.

Comparing the probability defined by this volume to the probability defined by the earlier area below the bell curve is a bit tricky, since the dimensions are different. But it turns out that because the line only comes within 2√2 inches of the two-dimensional peak, the probability it defines is the same as the area below the original bell curve, but starting 2√2 inches to the right of the peak, rather than the two inches away we originally had. As you can see in the Figure below, because of the way bell curves are shaped, this means that the new probability is much lower. Understanding why these last calculations work is a bit tricky, and you can either try to prove it yourself, or take my word for it. But conceptually, at least, this idea of the volume below the mountain should again seem pretty straightforward.
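If you would rather check this claim numerically than prove it, the following sketch compares the two sides. It assumes a made-up average of 68 inches and a made-up standard deviation of 3 inches (neither is from the post), and it assumes the two heights are independent draws. It computes the distance from the peak to the diagonal line, takes the one-dimensional tail area starting at that distance, and compares it with a direct simulation of pairs of heights.

```python
import math
import random


def upper_tail(z):
    """Area under the standard bell curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


MU, SIGMA, OFFSET = 68.0, 3.0, 2.0  # hypothetical values, in inches

# Distance from the peak (MU, MU) to the line x + y = 2 * (MU + OFFSET),
# using the usual point-to-line distance formula.
distance = abs(MU + MU - 2 * (MU + OFFSET)) / math.sqrt(2)

# One-dimensional tail area starting `distance` inches right of the peak.
p_formula = upper_tail(distance / SIGMA)

# Direct simulation: draw pairs of heights and count how often their
# average is at least OFFSET inches above MU.
rng = random.Random(0)
trials = 200_000
hits = sum(
    (rng.gauss(MU, SIGMA) + rng.gauss(MU, SIGMA)) / 2 >= MU + OFFSET
    for _ in range(trials)
)
p_simulated = hits / trials

print(f"distance to the line:   {distance:.4f} inches")
print(f"tail area from formula: {p_formula:.4f}")
print(f"simulated probability:  {p_simulated:.4f}")
```

The two probabilities should agree up to simulation noise, and both should be smaller than the one-person probability for the same assumed standard deviation.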

And that means we can go the next step to three people. Again, we would expect the probability to go down even more, but we’ll look for a geometric explanation. With three heights, we’re now looking at a point in three-dimensional space. We’ll end up with a probability distribution, which we can either think of as a Gaussian cloud in three dimensions that’s really dark at its center and gets lighter farther out, or as a graph in four dimensions sitting “above” our three-dimensional configuration space. (I won’t try to draw either.) The second one is a bit tricky, but it allows us to talk about the four-dimensional volume below the graph, and above some region in the three-dimensional configuration space.

The region that we’re interested in is on one side of a plane in this three-dimensional space. On the other side of the plane is the center of the Gaussian cloud, exactly 2√3 inches away from the plane. Comparing a four-dimensional volume to the two-dimensional area is even harder than comparing a three-dimensional volume to area, but it turns out we can use the same trick that I alluded to earlier: The probability defined by this region turns out to be equal to the area under the bell curve and to the right of the mark that’s 2√3 inches to the right of the center of the original bell curve, and this is even smaller than the earlier two areas we looked at.
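The distance from the center of the cloud to the plane follows from the standard point-to-plane distance formula. Writing μ for the overall average height (a symbol the post does not use), the plane is the set of triples whose average is exactly μ + 2, and the center is the point (μ, μ, μ):

```latex
\text{plane: } x + y + z = 3(\mu + 2),
\qquad
d = \frac{\lvert \mu + \mu + \mu - 3(\mu + 2) \rvert}{\sqrt{1^2 + 1^2 + 1^2}}
  = \frac{6}{\sqrt{3}}
  = 2\sqrt{3} \approx 3.46
```

Note that μ cancels, so the distance depends only on the two-inch offset and the number of people.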

Hopefully, you’ve started to notice a pattern, which will save us from trying to walk through the next step to four people. The key is that for any number *n* of people that we want to randomly choose, we can think of the sets of possible heights as defining a configuration space with a high-dimensional Gaussian blob, and a region defined by a hyperplane that is 2√*n* inches away from its center. Then we can translate this into a two-dimensional area under the original bell curve, and see that the area shrinks as *n* grows.
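The pattern can be sketched as a small function. It is only an illustration under stated assumptions: the *n* heights are treated as independent draws from the same bell curve, the standard deviation of 3 inches is made up, and the name `p_value_for_mean` is hypothetical. The hyperplane sits at a distance of the offset times √*n* from the center, and the probability is the one-dimensional tail area starting at that distance.

```python
import math


def upper_tail(z):
    """Area under the standard bell curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def p_value_for_mean(n, offset, sigma):
    """Probability that the average of n independent heights is at least
    `offset` above the overall average, for a bell curve with std `sigma`."""
    distance = offset * math.sqrt(n)  # distance from the center to the hyperplane
    return upper_tail(distance / sigma)


SIGMA = 3.0  # hypothetical standard deviation in inches (not from the post)

for n in (1, 2, 3, 100):
    print(f"n = {n:3d}: p = {p_value_for_mean(n, 2.0, SIGMA):.3g}")
```

For any fixed standard deviation the printed probabilities shrink as *n* grows, which is the geometric statement that the hyperplane moves farther from the center of the blob.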

As usual, I’m not going to do the calculations to work out the area for this particular example. In fact, I can’t do this anyway, since I never made up a standard deviation for the original sample of 1000 adults. But rest assured that if I had then this description and a bit of calculus would be enough to calculate the exact area, a.k.a. the p-value for the observation of the 100 left-handed adults. And then you’d be free to interpret or misinterpret as you please.

Very nice take on the p-value.

Reblogged this on RANDOM THOUGHTS and commented:

This post contains a nice geometric intuition about p-value. Worth a read.

Actually the answer is not zero, unless you wander off into continuous theory land. There are 1000 data points, and 1000 Choose 100 possible outcomes. These outcomes can be summarized (means) and the summaries ordered. It is then possible to count the number equal to 2 or in any range you are interested in. No probability involved. In practice one would sample from the set of combinations to whatever degree of accuracy you like. See Randomization and Permutation distributions, or resampling in general.

That’s a good point – I was referring to the continuous case when I said the probability was zero, i.e. after we switch from the original sample of 1000 points to a normal distribution inferred from the points.

True, but real data is finite and discrete. If you look at an empirical sampling distribution (via permutation, subsampling or bootstrap) they tend to be discrete and non-monotone in the tails, and usually deviate from a nice normal distribution. Normal theory is great for deriving general properties.
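The resampling approach described in this exchange can be sketched as follows. Everything here is hypothetical: the 1000 heights are synthetic stand-in data generated from a made-up bell curve, the function name `resampled_p_value` is invented for illustration, and the threshold of half an inch is chosen only so that the estimate is not simply zero. The idea is to draw many random subsets of 100 from the finite data set and count how often the subset average is at least as extreme as the threshold.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-in for the 1000 measured heights (not real data).
heights = [rng.gauss(68.0, 3.0) for _ in range(1000)]
overall_mean = statistics.fmean(heights)


def resampled_p_value(data, subset_size, threshold, draws, rng):
    """Fraction of random subsets (drawn without replacement) whose
    mean is at least `threshold`."""
    hits = 0
    for _ in range(draws):
        subset = rng.sample(data, subset_size)
        if statistics.fmean(subset) >= threshold:
            hits += 1
    return hits / draws


# Estimate how often a random subset of 100 has a mean at least
# half an inch above the overall mean.
p = resampled_p_value(heights, 100, overall_mean + 0.5, 5000, rng)
print(f"estimated p-value: {p:.4f}")
```

More draws give a more accurate estimate, at the cost of more computation. No normal distribution is assumed once the data are in hand.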

Just wanted to add some more encouragement, I really enjoy your blog! Thanks for writing this up.

Thanks for this. I finally feel like I understand p-values.

I didn’t know that p-values could be described in this much detail.

thank you so much 🙂

Reblogged this on Sequencing QC and data analysis blog and commented:

A good explanation of p-values

Lovely. Is your argument along these lines? The joint pdf is a function of (x^2 + y^2), so rotationally symmetric, so you can marginalize parallel to x-axis instead of that plane, which just recovers the original standard normal in your second figure. For a practical education in p-values, I highly recommend this recent blog post: https://liorpachter.wordpress.com/2015/05/26/pachters-p-value-prize/

Yes, that sounds like the argument I had in mind. The key is that projecting the two-dimensional Gaussian in any direction produces the same one-dimensional Gaussian because of rotational symmetry.
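The projection claim can be checked numerically with a short sketch. It assumes a two-dimensional Gaussian centered at the origin with a made-up standard deviation of 3 in each coordinate, and the helper name `projected_std` is hypothetical. Projecting the sampled points onto a unit vector at any angle should give a one-dimensional spread close to the same standard deviation.

```python
import math
import random
import statistics

rng = random.Random(1)
SIGMA = 3.0  # hypothetical standard deviation of each coordinate

# Sample points from a two-dimensional Gaussian centered at the origin.
points = [(rng.gauss(0.0, SIGMA), rng.gauss(0.0, SIGMA)) for _ in range(100_000)]


def projected_std(points, angle):
    """Standard deviation of the points after projecting them onto the
    unit vector pointing in direction `angle` (in radians)."""
    ux, uy = math.cos(angle), math.sin(angle)
    return statistics.pstdev([x * ux + y * uy for x, y in points])


for angle in (0.0, math.pi / 4, math.pi / 3):
    print(f"angle {angle:.3f}: projected std = {projected_std(points, angle):.3f}")
```

The angle π/4 corresponds to the direction perpendicular to the diagonal line in the two-person example, which is the projection used to turn the volume into an area under the original bell curve.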