One of the common themes that I’ve emphasized so far on this blog is that we should try to analyze high dimensional data sets without being able to actually “see” them. However, it is often useful to visualize the data in some way, and having just introduced principal component analysis, this is probably a good time to start the discussion. There are a number of types of visualization that involve representing statistics about the data in different ways, but today we’ll look at ways of representing the actual data points in two or three dimensions.
In particular, what we want to do is to draw the individual data points of a high dimensional data set in a lower dimensional space so that the hard work can be done by the best pattern recognition machine there is: your brain. When you look at a two- or three-dimensional data set, you will naturally recognize patterns based on how close different points are to each other. Therefore, we want to represent the points so that the distances between them change as little as possible. In general, this is called projection, the term coming from the idea that we will do the same thing to the data as you do when you make shadow puppets: We project a high dimensional object (such as your three-dimensional hands) onto a lower dimensional object (such as the two-dimensional wall).
We’ve already used linear projection implicitly when thinking about higher dimensional spaces. For example, I suggested that you think about the three-dimensional space that we live in as being a piece of paper on the desk of someone living in a higher dimensional space. In the post about the curse of dimensionality, we looked at the three-dimensional cube from the side and saw that it was two-dimensional, then noted that a four-dimensional cube would look like a three-dimensional cube if we could look at it “from the side”.
When we look at an object from the side like this, we are essentially ignoring one of the dimensions. This is the simplest form of projection, and in general we can choose to ignore more than one dimension at a time. For example, if you have data points in five dimensions and you want to plot them in two dimensions, you could just pick the two dimensions that you thought were most important and plot the points based on those. It’s hard to picture how that works because you still have to think about the original five-dimensional data. But it is similar to taking a two-dimensional data set and throwing away one of the dimensions, as in the left and middle pictures in the Figure below. You can see the shadow puppet analogy too: In the figure on the left, the light is to the right of the data, while in the middle, it’s above.
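In code, this simplest form of projection is just column selection. Here is a minimal NumPy sketch (the data and the choice of which two features to keep are my own hypothetical assumptions, not from the post):

```python
import numpy as np

# Hypothetical five-dimensional data set: 9 points, 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 5))

# Projecting onto two chosen feature axes just means keeping those
# two columns and throwing the other three away.
keep = [0, 3]          # the two features we guess are most important
X_2d = X[:, keep]

print(X_2d.shape)      # (9, 2) -- ready to scatter-plot
```

All the information in the three discarded columns is simply gone, which is exactly the loss described above.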
Notice that the two-dimensional data set appears to be made up of three groups of three data points, but when we project onto the vertical or horizontal axis, we see one group of six and one group of three. This is because in each of these directions, one of the groups of three is hiding behind another. So both projections are deceptive, and in both we lose information. This is not surprising, since we’re essentially throwing away all the information in one of the dimensions.
In particular, when we look at the vertical projection on the left of the Figure, each point could actually be anywhere along the horizontal line that it sits in. This is similar to the fact that when you see a dark spot in a shadow puppet, you know it comes from some part of your hand, but you don’t know how far from the wall that part of your hand is: there is a whole line of possibilities from the light bulb to the wall. That’s precisely why we can make such a variety of apparent shapes with shadow puppets. The more dimensions we throw away, the worse it gets: If we project from five dimensions to two, then each point in the two-dimensional plot could be anywhere in a three-dimensional hyperplane in the original five-dimensional space.
We can sometimes improve things a little by projecting along an angle rather than directly along the features. This is called a linear projection (even though, in general, we may be projecting onto a plane or a hyperplane rather than a line). For example, if we project onto the diagonal line shown in the picture on the right of the Figure, the three groups of three data points actually look like three separate groups in the projection. The groups don’t look as well separated, but at least they’re no longer hiding behind each other.
From any angle, the projection is the result of crushing/flattening the data, so we still lose information and the shape of the data will be distorted in some way. The way to minimize the distortion is to crush along the directions where the data is the least spread out and to preserve the dimensions along which it is the widest. In the previous post on Principal Component Analysis, I discussed how to find the directions in which a data set is the most “spread out”, and we called the vectors pointing in these directions the principal components. So this suggests that projecting onto the longest two or three principal components (depending on whether we want a two-dimensional or three-dimensional projection) should produce a relatively good picture of the data.
The picture that results from calculating the principal components, then using them to create a linear projection, is often called a PCA projection. For example, this is (pretty close to) the projection on the right in the Figure above. Like any form of projection, there will still be distortion, but using the principal components should (at least in theory) minimize it.
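Putting the two posts together, a PCA projection can be sketched in a few lines of NumPy: center the data, take the top eigenvectors of the covariance matrix as the principal components, and project onto them. The data here is a made-up stretched Gaussian blob, chosen only so the spread along each direction differs:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a Gaussian blob in 5 dimensions, stretched so
# that some directions are much more "spread out" than others.
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 0.5, 0.3, 0.1])

Xc = X - X.mean(axis=0)                  # center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order

# The two longest principal components are the last two columns,
# so reverse and keep the top two.
components = eigvecs[:, ::-1][:, :2]

# Linear projection onto the plane spanned by those components.
X_2d = Xc @ components
print(X_2d.shape)                        # (200, 2)
```

By construction, the first coordinate of the projection captures more of the data’s spread than the second, which is exactly the sense in which this projection minimizes distortion.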
Note that because PCA assumes a relatively simple model for the data set (a Gaussian blob), it essentially ignores much of the structure of data sets in which the underlying distribution is curved in some way. So the PCA projection will not go out of its way to preserve this structure. I mentioned in the PCA post that there are advanced techniques, such as something called a kernel, that can be used to make PCA work better on data sets with a curved/more complicated structure. There are also projection algorithms unrelated to PCA that are specifically designed for curved data sets, but those will have to wait for future posts.