Case study 2: Tokens in census data

In last week’s post, I presented a brief tutorial on loading and analyzing the classic IRIS data set, which consists of four length measurements from each of 150 iris flowers. That data set was relatively easy to deal with because each set of four measurements can be thought of as a vector in the four-dimensional data space. In this and the next few posts, I will look at data sets with less obvious structure, and consider different ways of translating the data into vectors that can be analyzed with geometric methods. This week I want to look at a collection of census data from the UCI Machine Learning Repository. This data set consists of (anonymized) demographic data reported by 48,842 Americans on the 1990 US census. Each line consists of a list of numbers indicating things like age and tokens representing things like employment type. The tokens in each column are taken from a relatively short list. For example, under employment type, the options include private, self-employed-inc, Federal-gov and a few others. So in this post, I want to consider the question: How can we translate the token data into vectors?

The simplest thing to do would be to just assign a number to each token and then translate them directly into vectors that way: In the employment column, for example, we mught set private = 1, self-employed-inc = 2, Federal-gov = 3, etc. The problem is that this forces us to make some of the options closer to each other than others. In this example, Federal-gov would be 2 away from private, but only 1 away from self-employed-inc. So in particular, lets say we have three people, Andy, Bob and Carl, who have all the same demographic data except that Andy works for a private company, Bob is self employed and Carl works for the federal government. Then any model we build with this scheme will see the difference between Andy and Carl as being twice as big as the difference between Andy and Bob or Bob and Carl.

While there may be some situations where something like this would be appropriate, we would be much better off developing a structure where the model can choose the relations between the three categories. In other words, by translating the tokens directly into numbers, we are essentially placing them along a one-dimensional line, as on the left in the Figure below, which forces us to put them in some order. It would be better if we could arrange them in a more even handed way, such as the triangle in the middle of the Figure. If we could do this, then the vectors corresponding to Andy, Bob and Carl would all be equally far apart in the data space.


In order to figure out how to do this, lets pretend for a second that these three – private, self-emp-inc and Federal-gov – are the only three options for the employment category. If we want to place them in a triangle, we could replace the original one-dimensional line for the employment category with a plane, and figure out the coordinates of the corners of an equilateral triangle. But it turns out there’s a better option: Instead of a two-dimensional plane, we will replace the original employment line with three dimensions, one for each of the possible tokens. In these three dimensions, the token private will correspond to the vector (1,0,0), the token self-emp-inc will correspond to the vector (0,1,0) and the final token Federal-gov will correspond to (0,0,1). These vectors are shown on the right in the Figure above.

So, lets pretend that the only two pieces of data for each person in the census data set was their age and their employment type (and these are the only three possible employment types). Then each person in the data set would define a four-dimensional vector. Andy, who is 22 years old and works for a private company gives us the vector (22,1,0,0). If Bob and Carl are also 22, they define the vectors (22,0,1,0) and (22,0,0,1). It may initially look like we have still made Andy and Bob closer than Andy and Carl, since the the 1s are closer, but when we look at these as points in space, the order of the features/dimensions won’t matter. The distances between the three points will all be the same.

The nice thing about this scheme is that it is very easy to adapt it to larger numbers of options for each category. In the actual census data set, there are 8 options for employment, so the employment category will correspond to eight different dimensions/features in our final vectors. Overall, in the census data set, there are 14 categories, but after we do the conversion, we get a data set with 105 vectors. (Fun fact: If we do this for a category with four tokens, the resulting vectors will define a tetrahedron in four-dimensional space. For larger numbers of tokens, the shape defined by the vectors is called a simplex.)

Before we look at the python code to do this, I want to point out one potential problem: Most of the numerical features in the data set are on a very large scale (For example, age can range into the low 100s) but the category data is all either 0 or 1. So if we leave the numerical values as they are, the distances between the data points will be almost entirely determined by the numerical values rather than the tokens. This may not make a huge difference for linear classifiers like logistic regression and SVM (though it will make some difference) but it can be a huge problem for an algorithm like PCA or KNN that relies heavily on distances.

One way to try and fix this is to multiply all the values in each of the numerical dimensions by a small constant so that they have a smaller range. This is called scaling or normalizing the features. We might scale things so that each numerical feature ranges from 0 to 1 as well, or so they they all have variance or standard deviation one. The method that works best will depend on the data set and what methods are being used for the final analysis. To find the best approach you need a combination of experience/domain knowledge and trial and error. Since I’m more interested in getting a general feel for the data than finding the perfect model, I’m just going to ignore the numerical features for the rest of this post, which leaves us with 99 categorical features coming from the tokens.

For the python code, we’ll start with a list of tokens for the algorithm to look for. Below is a snippet from it, and you can see the whole list by downloading the complete script from my recently-created github repository. Notice that the lists for the numerical features are empty; this is how the script decides which features to treat as numerical (and in this case, to ignore.)

tokens = [[],
          ['Private', 'Self-emp-not-inc', 'Self-emp-inc', ...]
          ['Bachelors', 'Some-college', '11th', ...],

I produced this list by copying and pasting the list of tokens from the data set description on the UCI Repository. You could also write a few lines of code to automatically generate this list, but I thought I would try to stay consistent with the official list. I did, however, write a few lines of code to count how many tokens are in each category and create a list fcount, which stores the starting feature for each category. To see those lines, you can look at the full script. The interesting part is the code that converts the lists of tokens into features.

# Load the data as strings
print "Loading data..."
f = open("")
lines = [l for l in f]  # REad in all the lines
lines = lines[:-1]      # Remove the final empty line

# Create the empty data matrix
data = scipy.zeros([len(lines), count])

# Populate data matrix using the tokens
for i in range(len(lines)):     # For each line of data
    l = lines[i].split(',')     # Split the string into tokens
    for j in range(len(l)-1):   # For each category
        if l[j][1:] in tokens[j]:   # If the token is valid
            # Calculate what feature it defines
            k = fstart[j] + tokens[j].index(l[j][1:])
            data[i,k] = 1       # Set the feature to 1

Note that I left out the last column from each line of data. The last column is the labels, which in this case indicate whether or not each person has an annual income greater than $50,000. The problem associated with this data set is to predict this based on the demographic data, so lets see what we can do. First we’ll convert the original labels (<=50K and >50K) to 0 and 1, respectively.

# Create the empty labels vector
labels = scipy.zeros(len(lines))

# Convert the labels to numbers
for i in range(len(lines)):     # For each line of data
    l = lines[i].split(',')     # Split into tokens
    # Read the first character of the label
    if l[-1][1] == '>':
        labels[i] = 1.0
        labels[i] = 0.0

To get an idea of whether or not this method gives us a reasonable geometric structure, lets look at the PCA plot, like we did with the Iris data set last week. We can use more or less the same python code from last time, so I won’t display it again (though it is in the full script.)  Note that the data set contains 32,561 points of data, so it’s probably not a good idea to plot all of them – even if it doesn’t crash your computer, they will all be drawn on top of each other. So I decided to just plot the first 500 points, and got the following picture:

CensusPCA1The green points are the ones with income below $50K and the blue points are for income above $50K. You can see that there is, indeed, geometric structure to the data set. In fact, it looks like we have somewhere between two and four clusters. Moreover, the blue points are mostly in the two blobs/clusters on the right. However, it also kind of looks like we have two copies of roughly the same structure – an upside-down ‘L’ in the top left, and another one just below it. Since PCA looks for the directions of highest variance/standard deviation, this could be the result of one or two of the tokens where the number of data points that have the token and the number of data points that don’t are pretty evenly matched. (The more evenly matched they are, the higher the variance/standard deviation will be.)

We can try to fix this by normalizing the categorical features, i.e. by dividing each 1 in the data vectors by the standard deviation of the corresponding column. Here’s the code to do that. (And note that since the numbers are all 0s and 1s, we don’t need to square them.)

s = [1.0/math.pow(sum(data[:,i]),.5) \
            for i in range(data.shape[1])]
data =

The first line (which is split to make it appear correctly in the web page) creates a list of 1 divided by each of the standard deviations. The second line scales the data matrix by multiplying it by a matrix whose diagonals is this list of scale factors.


Here’s what the PCA plot looks like after we normalize the data in this way. The geometry in this set appears much less interesting than the original, and this is a good example of how big an impact scaling can have on a data set. But the real question is, whether the algorithms that we’re interested in will work better with the scaled or the unscaled categorical data. Notice that in the plots of both the unnormalized and the normalized vectors, the green and blue points appear to be concentrated in different places. More precisely, the green spots seem to be everywhere, but there are some areas where there are very few blues. As always, we can’t tell from the PCA plot whether this is actually the case, or whether it’s just an illusion caused by the projection.

The get a more objective idea of how separated the two classes are, we should run a classification algorithm on the data set. In the complete script, I randomly split the data into a training set with 80% of the data points and a test set with the remaining 20%. The script trains a logistic regression model on the training set, then evaluates it on the test set. When I ran it, I got an accuracy around 83% for the unnormalized vectors and 75% for the normalized vectors. (The exact numbers vary between runs because the 80% is randomly generated, but the the accuracy numbers are pretty consistent.) Note that normalizing the vectors the way we did is a linear transformation, so it doesn’t change the extent to which the two classes are linearly separated. Instead, since there is no hyperplane that completely separates the classes, normalization changes which of the many imperfect hyperplanes logisitc regression chooses. (And in this case, normalization makes it choose a worse one!)

An accuracy of 83% is not bad, especially for a first pass. With some more tweaking, it might be possible to get the score higher, but this certainly validates this approach to dealing with token data. In the next few weeks, we’ll look at more ways to translate data with even less apparent structure into data with a geometric structure.

This entry was posted in Feature extraction. Bookmark the permalink.

9 Responses to Case study 2: Tokens in census data

  1. Ryan Budney says:

    Beyond scaling there are all kinds of transformations people typically do to data. Take an image of a circle in the plane, considered as a plot of points. That has persistent homology. But if you take the Fourier transform of that image and consider the scatter-plot of Fourier coefficients, you do not retain the persistent homology. One needs some kind of “strongly felt” initial context for the data in which to make sense of persistent homology.

    • One of the funny things that makes data topology very different from topology is that there is no canonical form for the objects of study. Instead of thinking about it as studying the structure of data, it seems like it’s more accurate to think about creating structure from data. There are a lot of possible structures, and the “best” structure depends on the final goal. So you kind of need a functional definition of the structure, rather than an initial context.

  2. Pingback: Case study 3: Free form text | The Shape of Data

  3. Pingback: Case Study 6: Digital images | The Shape of Data

  4. Pingback: K-modes | The Shape of Data

  5. Walter says:

    When you say “We can try to fix this by normalizing the categorical features, i.e. by dividing each 1 in the data vectors by the variance of the corresponding column.” Here “column” refers to the feature column, e.g., private, self, and fed, right? Conceptually, I’m trying to wrap my head around the idea of normalizing (or having to normalize) categorical features. After deciding that normalizing is a good way to go, why divide by the column variance and not the standard deviation?

    Thanks, I’ve really enjoyed digesting your posts over the last few years!

  6. vinay says:

    “One way to try and fix this is to multiply all the values in each of the numerical dimensions by a small constant so that they have a smaller range. This is called scaling or normalizing the features.” In the above case even though we scale it down but relative distance ratio still remains right…how can this fix the issue?

    • We would scale each dimension by a different constant, so that each dimension ends up with the same range or the same standard deviation, or whichever measure you want to use. I see now that’s unclear from the wording. Is that what you were asking about?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s