In this post, we’ll start working on understanding how computers and their programmers actually go about analyzing large, high-dimensional data sets. I’ll start by describing three different ways that one can think about data, each of which suggests a different set of tools and possibilities.

As an example, let’s consider an insurance company’s customer database. Like any database, this is a list of records, each record representing a customer. Each record has a number of entries with information about that customer. Our database will have the following entries: customer ID, name, address, age, number of years driving, number of traffic tickets in the past five years, number of accidents in the past five years, and dollar amount of insurance claims paid since the account was started.

In order to get in a more general mindset, we will refer to the information in this database as a *data set*, since the analysis that we’re going to talk about doesn’t actually depend on how it’s stored.

The database description suggests a picture of the data as a stack of index cards that we can flip through, looking at each card in turn. For example, if we want to know the address of a specific customer, we sort through the stack until we find that card. If we want to know how many customers are between the ages of 16 and 21, we sort through the stack (or have a computer sort through it), pull out all the cards on which the age is in that range, then count them. These searches can, of course, be very sophisticated, but this is the basic database functionality.
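As a rough sketch of this "stack of index cards" view, the two searches above might look like the following in Python. The records, field names, and helper functions are all invented for illustration; a real database would use its own query language and indexes rather than a loop over every card.

```python
# Hypothetical customer records; every name, field, and value here is invented.
customers = [
    {"id": 101, "name": "A. Rivera", "address": "12 Elm St", "age": 19},
    {"id": 102, "name": "B. Chen", "address": "48 Oak Ave", "age": 34},
    {"id": 103, "name": "C. Okafor", "address": "7 Pine Rd", "age": 21},
    {"id": 104, "name": "D. Novak", "address": "93 Maple Dr", "age": 57},
]


def find_address(records, customer_id):
    """Flip through the cards until the matching customer ID turns up."""
    for record in records:
        if record["id"] == customer_id:
            return record["address"]
    return None  # no card with that ID


def count_in_age_range(records, low, high):
    """Pull out every card whose age is in [low, high] and count them."""
    return sum(1 for record in records if low <= record["age"] <= high)


print(find_address(customers, 103))
print(count_in_age_range(customers, 16, 21))
```

Both helpers look at one card at a time, which is exactly the limitation of this first view: it answers questions about individual records but says nothing about the data set as a whole.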

The second way we can picture this data is as a collection of numbers. Let’s say we form a hypothesis that the average number of tickets for drivers under 30 will be less than the average number for drivers over 30. Well, that’s easy: we sort through the cards, add up the necessary numbers, then divide. Once the calculation is done, we’ve either confirmed or refuted the hypothesis, at least for the customers in this data set. Again, we could do more advanced analysis, like calculating which streets have the highest accident rates, or even calculating how, on average, age correlates to accident rates. (In statistics, this last one is called *regression*.)
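Here is a minimal sketch of that "add up, then divide" check. The `(age, tickets)` pairs are made up, and putting drivers who are exactly 30 in the older group is my own assumption, since the hypothesis as stated doesn’t say where they belong.

```python
from statistics import mean

# Hypothetical (age, tickets in the past five years) pairs, invented for illustration.
drivers = [(19, 3), (24, 2), (28, 1), (35, 1), (47, 0), (62, 2)]


def average_tickets_by_group(records, cutoff=30):
    """Return (average for ages below cutoff, average for ages at or above it)."""
    younger = [tickets for age, tickets in records if age < cutoff]
    older = [tickets for age, tickets in records if age >= cutoff]
    return mean(younger), mean(older)


under_30, over_30 = average_tickets_by_group(drivers)

# The hypothesis from the text: younger drivers average fewer tickets.
hypothesis_holds = under_30 < over_30
print(under_30, over_30, hypothesis_holds)
```

Notice that the computer only does the arithmetic here. The cutoff of 30 and the decision to compare tickets rather than something else both came from us.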

For this type of analysis, the computer is doing a lot of the work, but in each case, we have to present it with a hypothesis (e.g. that number of accidents should correlate to age) that we’ve come up with on our own. The goal of modern data analysis is for the computer to help us make new hypotheses, or possibly make the hypotheses itself. In order to understand how this might happen, we have to look at the data set in a third way.

If we didn’t have a computer to look for a correlation between age and number of accidents, we might have tried to do it ourselves with a pencil and some graph paper. We could draw horizontal and vertical axes (one for age, one for number of accidents in the past five years), then plot one point for each record in the database. The horizontal position of each point will be the age entry on the card and the vertical position will be the number of accidents (or vice versa if you drew your axes differently than I did). These points are shown on the left in the figure below. They all lie close to a line that tells us the correlation between age and number of accidents, though in general we could try to find a more complicated curve that they follow. (I’ll write more about this in a future post.)

You probably did something along these lines in high school, or maybe in a college statistics course. Even if you didn’t realize it at the time, what you were doing was actually geometry: points and lines, right? In other words, what you did was to translate the numbers in the data set into points in the plane (i.e. the piece of paper) and then summarize them with a line. The statistical algorithm called *regression*, which looks for a correlation between the two sets of numbers, attempts to find a line similar to the one you would draw by sight.
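To make the "summarize the points with a line" step concrete, here is a sketch using ordinary least squares, which is one common way of choosing the line (I’m assuming that flavor of regression here; there are others). The ages and accident counts are invented, and `fit_line` is just a name I picked for this example.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: age on the horizontal axis, accidents on the vertical axis.
ages = [20, 30, 40, 50, 60]
accidents = [4, 3, 3, 1, 1]

slope, intercept = fit_line(ages, accidents)
print(f"accidents is roughly {slope:.2f} * age + {intercept:.2f}")
```

A negative slope says that, in this made-up sample, older drivers tend to have fewer accidents. The line plays the same role as the one you would draw by eye through the points on the graph paper.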

Our hypothesis about the correlation between age and number of accidents defined a two dimensional geometric problem, but we actually have a lot more numbers than that: age, number of years driving, number of traffic tickets in the past five years, number of accidents in the past five years, and amount of insurance claims make a total of five numbers. There is other information in the database, such as the customers’ names and addresses, that isn’t a simple number, but we’ll leave that for a future blog post. For now, if we want to work without any explicit hypothesis, we need to figure out what these five numbers mean geometrically.

When we had two numbers, we could just plot them along the x and y axes on a piece of paper. That is, each number becomes a coordinate – the first tells us how far to go to the right and the second tells us how far up to go. We could do something similar if we had three numbers: The first coordinate tells us how far to the right we go, the second coordinate tells us how far along the y axis on the paper, then the third coordinate tells us how far up (above the paper) to go. So, our points are no longer “in” the piece of paper – they’re now in the three-dimensional space around the paper, as on the right side of the picture.

That’s great for three types of numbers, but in our example we have five coordinates. If we lived in five dimensional space then we could easily picture these as spatial coordinates in the world that we lived in. Unfortunately, we can only picture three dimensions. (Some people claim that they can visualize more dimensions, but I don’t believe them.) That’s going to make things difficult, but if we’re willing to engage in a little suspension of disbelief, we may not be completely stuck.

Let’s say that our universe was actually sitting in a three dimensional piece of paper on the desk of a supernatural creature who lives in five dimensional space. That creature would see the two extra dimensions as sticking out of its piece of paper, similar to how we saw the third dimension sticking out of our two-dimensional piece of paper. So even though we can’t “see” the geometry of these points, we can believe that they represent geometric points in some hypothetical five dimensional space.
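Even though we can’t picture such a point, a computer has no trouble storing one: it is just a list of five coordinates. The sketch below turns a record into a five-coordinate point and measures the straight-line distance between two points with the same formula that works in the plane. The field names and values are hypothetical, and using Euclidean distance on the raw numbers is an assumption for illustration only, since dollars and years are on very different scales.

```python
import math

# Hypothetical names for the five numeric entries described in the text.
NUMERIC_FIELDS = ("age", "years_driving", "tickets_5y", "accidents_5y", "claims_paid")


def to_point(record):
    """Keep only the numeric entries, in a fixed order, as a tuple of coordinates."""
    return tuple(float(record[field]) for field in NUMERIC_FIELDS)


def distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# An invented record; the name and address are dropped because they aren't numbers.
record = {"id": 101, "name": "A. Rivera", "age": 19, "years_driving": 2,
          "tickets_5y": 3, "accidents_5y": 1, "claims_paid": 1200}

point = to_point(record)
print(len(point), point)

# The same function handles two dimensions and five dimensions.
print(distance((0, 0), (3, 4)))
print(distance((0, 0, 0, 0, 0), (1, 1, 1, 1, 0)))
```

Nothing in `distance` cares how many coordinates there are, which is a small hint of how ideas from two and three dimensions can carry over to spaces we can’t see.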

In the upcoming blog posts, we will attempt to turn what we know about points in two- and three-dimensional space into techniques that can tell us information about these higher dimensional spaces (even if we can’t see them) and thus to analyze the higher dimensional data that lives in these spaces. In some cases, that will include visualization – attempting to represent the data in lower dimensional space in a way that still captures its high dimensional structure. But more often, we will look at ways to analyze its structure while leaving the data in its higher dimensional setting.

Wonderful blog. I look forward to reading the entire series.

Very good introduction! You managed to give a simple introduction to the abstract topic “data”. Really appreciate this.

Good blog, interestingly written. Thank you for posting.
