At this point, I think it will be useful to introduce an idea from geometry that is very helpful in pure mathematics, and that I find helpful for understanding the geometry of data sets. This idea is the difference between the intrinsic structure of an object (such as a data set) and its extrinsic structure. Have you ever gone into a building, walked down a number of different halls and through different rooms, and when you finally got to where you were going and looked out the window, you realized that you had no idea which direction you were facing, or which side of the building you were actually on? The intrinsic structure of a building has to do with how the rooms, halls and staircases connect up to each other. The extrinsic structure is how these rooms, halls and staircases sit with respect to the outside world. So, while you’re inside the building you may be very aware of the intrinsic structure, but completely lose track of the extrinsic structure.
You can see a similar distinction with subway maps, such as the famous London tube map. This map records how the different tube stops connect to each other, but it distorts how the stops sit within the city. In other words, the coordinates on the tube map do not represent the physical/GPS coordinates of the different stops. But while you’re riding a subway, the physical coordinates of the different stops are much less important than the inter-connectivity of the stations. In other words, the intrinsic structure of the subway is more important (while you’re riding it) than the extrinsic structure. On the other hand, if you were walking through a city, you would be more interested in the extrinsic structure of the city since, for example, that would tell you the distance in miles (or kilometers) between you and your destination.
Data sets also have both intrinsic and extrinsic structure, though there isn’t a sharp line between where the intrinsic structure ends and the extrinsic structure begins. These are more intuitive terms than precise definitions. In the figure below, which shows three two-dimensional data sets, the set on the left has an intrinsic structure very similar to that of the middle data set: Both have two blobs of data points connected by a narrow neck of data points. However, in the data set on the left the narrow neck forms a roughly straight line. In the center, the neck curves around, so that the entire set roughly follows a circle.
If you were somehow shrunk down so that you could walk around “inside” the data set and could only see nearby data points, you might think of the two blobs in each data set as rooms, and the narrow neck as a hallway. If you walked from one room to the other, you might not notice whether or not the hallway was curving. So as in the building example, you would have a hard time telling the difference between the two sets from “inside” of them. Thus the difference between the two data sets is mostly a matter of extrinsic structure, rather than intrinsic structure.
On the other hand, the set on the right has a very similar extrinsic structure to the data set in the middle: Both sets roughly follow a circle, so from far away, we might not notice the difference. However, the data set on the right consists of a single circular neck/hallway, without any blobs. Thus the intrinsic structures of the center data set and the right data set are very different.
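One way to make the “walking around inside” picture concrete is to compare two distances between the ends of a neck: the straight-line distance (extrinsic) and the distance you cover by stepping from point to nearby point (intrinsic). The sketch below is only an illustration under assumptions that are not taken from the figure: each neck is idealized as 51 evenly spaced points with no blobs, and the names `neighbor_graph` and `walking_distance`, as well as the `radius` value, are made up for this example.

```python
import heapq
import math


def neighbor_graph(points, radius):
    """Link every pair of points that lie within `radius` of each other."""
    adj = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= radius:
                adj[i].append((j, d))
                adj[j].append((i, d))
    return adj


def walking_distance(adj, start, goal):
    """Shortest path length through the graph (Dijkstra's algorithm)."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry, a shorter route was already found
        for nxt, step in adj[node]:
            new_dist = dist + step
            if new_dist < best.get(nxt, math.inf):
                best[nxt] = new_dist
                heapq.heappush(queue, (new_dist, nxt))
    return math.inf  # goal is not reachable from start


n = 51  # assumed number of points along each neck
# Straight neck: points on a segment of length pi.
straight = [(math.pi * k / (n - 1), 0.0) for k in range(n)]
# Curved neck: points on a half circle of radius 1, so its arc length is also pi.
curved = [(math.cos(math.pi * k / (n - 1)), math.sin(math.pi * k / (n - 1)))
          for k in range(n)]

results = {}
for name, pts in (("straight", straight), ("curved", curved)):
    graph = neighbor_graph(pts, radius=0.1)  # only "nearby" points are linked
    results[name] = {
        "intrinsic": walking_distance(graph, 0, n - 1),
        "extrinsic": math.dist(pts[0], pts[-1]),
    }
    print(name, results[name])
```

With these assumed points, the walking distance comes out at roughly 3.14 for both necks, while the straight-line distance between the ends drops from about 3.14 to 2 once the neck curves. From “inside” the two necks look the same; only the view from outside tells them apart.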
As we noted above, in real life, we will sometimes be more interested in intrinsic structure and sometimes we’ll be more interested in extrinsic structure. Similarly, we will sometimes be more interested in the intrinsic structure of a data set and sometimes in its extrinsic structure. More precisely, we will be interested in both, but will focus more on one or the other at different stages in the analysis. For example, a simple model fitting algorithm, such as regression, logistic regression or SVM, gives us a description of the extrinsic structure of the data set by giving us a best fit hyperplane or a decision boundary. However, these algorithms don’t tell us anything about the intrinsic structure. In some sense, these algorithms assume that the intrinsic structure is relatively simple and can lead to very inaccurate models if it isn’t. So ideally, we would want to understand the intrinsic structure of the data set before we apply such an algorithm.
More flexible algorithms, such as KNN, neural networks and decision trees, are able to adapt to the intrinsic structure of a data set, and thus produce a more accurate model. (Though each algorithm has parameters that affect how flexible it can be in adapting to different intrinsic structures.) Of course, they encode the intrinsic and extrinsic structure in a way that is essentially unreadable by humans. Somewhere in the middle is the mixture model algorithm: The output of this algorithm reflects the extrinsic structure of the data set, but in order to use a mixture model, you first need to choose the types of simple models that will make up the mixture. This boils down to deciding what intrinsic structure we want the final model/distribution to have. If that intrinsic structure matches the intrinsic structure of the data set then the final model will be reasonably accurate. So again, an approach like this will work better if you first spend some time trying to understand the intrinsic structure of the data.
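To see the contrast between a straight-line decision boundary and a more flexible method such as KNN, here is a rough sketch on a toy data set. Everything in it is an assumption made for illustration: the data are two noise-free rings, one inside the other, and the “simple” model is a nearest-centroid rule, which I’m using only as a stand-in for a linear decision boundary (it is not logistic regression or an SVM). The function names are hypothetical.

```python
import math
from collections import Counter


def make_ring(center, radius, count):
    """Evenly spaced points on a circle (an idealized, noise-free data set)."""
    cx, cy = center
    return [(cx + radius * math.cos(2 * math.pi * k / count),
             cy + radius * math.sin(2 * math.pi * k / count))
            for k in range(count)]


# Assumed toy data: class "in" is a small ring, class "out" is a ring around it.
data = ([(p, "in") for p in make_ring((0.3, 0.0), 0.5, 20)]
        + [(p, "out") for p in make_ring((0.0, 0.0), 2.0, 20)])


def centroid(points):
    """Average position of a list of 2-D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def nearest_centroid_accuracy(data):
    """Stand-in for a linear model: the boundary is a single straight line."""
    labels = {lab for _, lab in data}
    centers = {lab: centroid([p for p, other in data if other == lab])
               for lab in labels}
    hits = sum(min(centers, key=lambda c: math.dist(p, centers[c])) == lab
               for p, lab in data)
    return hits / len(data)


def knn_accuracy(data, k=3):
    """Leave-one-out k-nearest-neighbors with a majority vote."""
    hits = 0
    for i, (p, lab) in enumerate(data):
        others = [(math.dist(p, q), other)
                  for j, (q, other) in enumerate(data) if j != i]
        votes = Counter(other for _, other in sorted(others)[:k])
        hits += votes.most_common(1)[0][0] == lab
    return hits / len(data)


linear_acc = nearest_centroid_accuracy(data)
knn_acc = knn_accuracy(data)
print(f"nearest centroid: {linear_acc:.2f}, kNN: {knn_acc:.2f}")
```

On this toy data, KNN labels every point correctly because it only looks at nearby points, while no single straight line can separate a ring from what sits inside it, so the nearest-centroid rule gets many points wrong. Real data will be noisier, and the gap will depend on the data and on the choice of `k`.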
The distinction between intrinsic and extrinsic structure is closely related to the difference between local and global structure: The local structure of something is the structure that you see when you zoom in very close, while the global structure is what you see when you look at it from far away. For example, the local structure of your shirt is a tangle of crisscrossing threads. Its global structure is a two-dimensional shape with four holes (one for your head, two for your arms and a big hole at the bottom for your waist). The extrinsic structure of an object is essentially the same as its global structure. However, an object’s intrinsic structure isn’t the same as its local structure. Instead, it’s more or less what you get by combining all the local structures and fitting them together. So the intrinsic structure of your shirt is also two-dimensional with four holes, but the intrinsic structure doesn’t know whether the shirt is folded up, hanging, or crumpled in a pile. The intrinsic/extrinsic dichotomy is also closely related to the difference between geometry and topology, but that’ll have to wait for a later post.
Note that some data sets don’t have a natural extrinsic structure, or at least not one that makes sense. For example, consider how we would analyze a data set of books based on their subjects. We can all agree that two books about physics will be very similar, and a book about chemistry will be slightly less similar to either of the physics books. However, it’s not obvious whether a book on poetry would be closer to a physics book than a book on management would be. In other words, it is clear what the distances between nearby books should be (the local/intrinsic structure), but not what the distances between far away books should be (the global/extrinsic structure).
We could define an extrinsic structure on books, for example by counting how many times each word appears in each book, and defining vectors based on this word count. However, it is unlikely that the resulting vectors would accurately reflect the structure that we want. There are various ways to modify this structure, such as combining synonyms or disambiguating different uses of the same word, but there is no canonical approach. In fact, when data analysts choose these more complicated structures, the goal is essentially to find an extrinsic structure that best reflects the (relatively) well-defined intrinsic structure.
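The word-count idea can be sketched in a few lines. This is a toy version under stated assumptions: the five “books” are made-up strings of a few words each, and I compare the count vectors with cosine similarity, which is just one common choice and not something the approach requires.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the full text of five books.
books = {
    "physics_a": "energy mass force energy motion particle",
    "physics_b": "force motion energy field particle wave",
    "chemistry": "molecule reaction energy bond atom particle",
    "poetry": "love rose night moon heart night",
    "management": "team budget plan strategy team goal",
}

# Word-count vectors: one Counter per book, keyed by word.
vectors = {title: Counter(text.split()) for title, text in books.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (1 = same direction)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


for other in ("physics_b", "chemistry", "poetry", "management"):
    sim = cosine_similarity(vectors["physics_a"], vectors[other])
    print(f"physics_a vs {other}: {sim:.2f}")
```

In this toy setup the nearby books get graded similarities (the second physics book scores higher than the chemistry book), but the poetry and management books both score exactly zero against the physics book. The word counts capture which books are close, and say nothing useful about which of the far-away books is farther.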
Another example of a data set with no extrinsic structure is a social network: The connections in the network (friends on Facebook, followers on Twitter, etc.) define the nearby points, similar to a subway map. However, there generally won’t be a natural extrinsic structure. (I’ll go into more detail about graphs and networks in next week’s post.) As with the books example, there are different ways to impose an extrinsic structure on a data set of this form, but there isn’t a single best extrinsic structure.
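As a small illustration of a data set that comes with connections but no coordinates, the sketch below stores a made-up friendship network as an adjacency list and measures how many links separate two people, using breadth-first search. The names and links are hypothetical, and hop count is only one possible notion of distance on a network.

```python
from collections import deque

# Hypothetical friendship links; the names are made up for illustration.
friendships = [("ana", "ben"), ("ben", "cat"), ("cat", "dan"), ("ana", "eve")]

# Build an undirected adjacency list: each person maps to a set of friends.
network = {}
for a, b in friendships:
    network.setdefault(a, set()).add(b)
    network.setdefault(b, set()).add(a)


def hop_distances(network, start):
    """Breadth-first search: number of links on a shortest chain from `start`."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in network[person]:
            if friend not in hops:
                hops[friend] = hops[person] + 1
                queue.append(friend)
    return hops


print(hop_distances(network, "ana"))
```

Nothing here places anyone at a position in space. The only information is who is next to whom, which is exactly the subway-map situation.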
The main way in which the intrinsic structure of a data set is usually explored in practice is by searching for clusters, or clumps of data points. There are a number of different ways in which clusters are defined, and a corresponding number of algorithms for finding them, some of which I’ll cover in upcoming posts. Clusters can be densely packed sets of data points surrounded by less dense parts of the data set. Or, they can be blobs of data points that are either separated from the rest of the data set or connected by relatively thin bottlenecks, as in the figure above. As you can see, both of these descriptions are in terms of intrinsic properties of the data set.
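Here is a minimal sketch of the “separated blobs” notion of a cluster: two points belong to the same cluster if you can get from one to the other through a chain of short steps. This is only one of the many possible definitions, not necessarily one of the algorithms covered in the upcoming posts, and the points, the `radius` value and the name `find_clusters` are all assumptions made for the example.

```python
import math

# Hypothetical 2-D data: two clumps that are far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]


def find_clusters(points, radius):
    """Group points that are linked by chains of steps no longer than `radius`."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]  # start a new cluster from any remaining point
        cluster = []
        while stack:
            i = stack.pop()
            cluster.append(i)
            # Unvisited points within one step of point i join the same cluster.
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= radius}
            unvisited -= near
            stack.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)


clusters = find_clusters(points, radius=0.5)  # the radius is an assumed scale
print(clusters)
```

The function only ever looks at distances between nearby points, so moving one clump farther away as a whole would not change the result. A thin bottleneck of points between the clumps would merge them under this definition, which is one reason other definitions of a cluster exist.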
Up until now, the posts on this blog have focused primarily on extrinsic structure, with the goal of building models/distributions. These algorithms mostly fall into the category of supervised learning or predictive analytics: Building models that answer specific questions about data, by using a training data set for which the answer is already known. The next few posts will focus on unsupervised learning or descriptive analytics – methods whose goal is not to build a model or answer a specific question, but rather to better understand the structure as a preliminary step before building a model. Finding clusters is one of the main problems in this area and will be the focus of the upcoming posts. Throughout these posts on unsupervised learning, the intrinsic structure will be an underlying theme.