Since the start of this blog, we’ve covered a lot of different algorithms that attempt to discover and summarize the geometric structure in a given data set. But as it turns out, this part of the data analysis is the (relatively) easy part. In the real world, most data starts out as what’s often referred to as “unstructured data”: Rather than being nicely arranged as rows and columns of numbers that can be interpreted as vectors in a high dimensional space, it comes in some other form. There will often be patterns in these other forms of data, but not the types of patterns to which we can immediately apply the sort of analysis that I’ve been describing for the past few months. So the first job of the data analyst (or the data scientist if you prefer) is generally to extract the structure from a set of “unstructured” data, i.e. to transform the patterns into a type of structure that we’re used to. This process is often called *feature extraction* or *feature engineering*. In the next few posts, I’m going to look at some specific data sets, with different initial levels of structure, and consider different possible ways to extract the kind of structure that we want from them.

This week, we’ll warm up with a data set that is already in vector form, in order to see what kind of structure we’ll be aiming for later on. The iris data set was introduced in 1936 by Ronald Fisher (using measurements collected by the botanist Edgar Anderson) and has become a classic example in data mining/machine learning. It consists of measurements taken from 150 iris plants, with 50 plants from each of three species. For each plant, there are four measurements in centimeters: the sepal length, sepal width, petal length and petal width. You can see the data set on the Wikipedia page, or download it from the Iris page on the UCI Machine Learning Repository (which has lots of other interesting data sets). The file iris.data is a text file with all the values. The first five lines from this file look like this:

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa

In each row, the first four entries are the four measurements described above and the fifth is the name of the species. (Farther down the file, there are 45 more lines of Iris-setosa, then 50 lines each of Iris-versicolor and Iris-virginica.) So, we want to think of each row as a vector followed by a label. If you have Python installed on your computer, you can load the data with the following script (which assumes the file iris.data is in the same folder as the script):

    # labelnames translates the species names into integers
    labelnames = {"Iris-setosa\n":0, "Iris-versicolor\n":1,
                  "Iris-virginica\n":2}
    data, labels = [], []        # Empty lists for the vectors, labels.
    f = open("iris.data", "r")   # Open the data file.
    for line in f:               # Load in each line from the file.
        vals = line.split(',')   # Split the line into individual values
        if len(vals) == 5:       # Check that there are five columns
            data.append(vals[:-1])              # Add data vector
            labels.append(labelnames[vals[4]])  # Add numerical label
    f.close()                    # Close the file

One thing to note about the script is that Python reads each line with a newline character (‘\n’) at the end, which stays attached to the last field of each line, in this case the label. So we have to either remove the final character, or include it in the dictionary entries. (I’ve done the latter above.) There’s also a line that checks that the data has been split into five values. That’s because the last line of the file is blank, and would cause an error otherwise. There are a number of Python packages with built-in methods for loading data (Pandas seems to be a popular one), but I like to do it by hand whenever possible, to make sure I know exactly what’s going on.
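For comparison, here is a rough sketch of the other option mentioned above: removing the trailing character instead of putting it in the dictionary keys. The function name `parse_iris_lines` is made up for this example, and it accepts any iterable of lines so that it can be tried on a couple of sample rows without the file. Unlike the script above, it also converts the measurements to floats right away.

```python
# Species names mapped to integer labels (no trailing "\n" in the keys).
LABELNAMES = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

def parse_iris_lines(lines):
    """Hypothetical helper: turn lines of iris.data into (data, labels)."""
    data, labels = [], []
    for line in lines:
        vals = line.strip().split(',')   # strip() drops the trailing newline
        if len(vals) == 5:               # Skips blank lines
            data.append([float(v) for v in vals[:4]])
            labels.append(LABELNAMES[vals[4]])
    return data, labels

# Two sample rows plus a blank line, standing in for the real file.
sample = ["5.1,3.5,1.4,0.2,Iris-setosa\n",
          "7.0,3.2,4.7,1.4,Iris-versicolor\n",
          "\n"]
data_sample, labels_sample = parse_iris_lines(sample)
print(data_sample, labels_sample)
```

To use it on the real file, something like `with open("iris.data") as f: data, labels = parse_iris_lines(f)` should work, since a file object yields its lines one at a time.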

Looking at the values in each vector doesn’t give us a lot of insight into the structure. We would be much better off if we could see them plotted in space. That’s going to be difficult since it’s four-dimensional data, but as we saw in the post on PCA, we can get a rough sense of the structure by projecting the data into two-dimensional space.

Below is the code that I used to compute the projection and plot it. To run this, you need to have the numpy and matplotlib packages installed. Matplotlib is tricky to install with the standard Python installation, so I recommend using the Anaconda Python bundle, which comes with all of the main data analysis packages installed. (I also recommend using the Spyder IDE that comes with Anaconda.)

    import numpy, matplotlib.pyplot

    colors = ['red','green','blue']           # List mapping class index -> color
    data = numpy.array(data, dtype="float")   # Convert data to an array
    M = (data-numpy.mean(data.T, axis=1)).T   # Mean-center the data
    # Find the matrix of principal components. eigh suits the symmetric
    # covariance matrix but returns eigenvalues in ascending order, so sort.
    [latent, coeff] = numpy.linalg.eigh(numpy.cov(M))
    order = numpy.argsort(latent)[::-1]       # Largest variance first
    latent, coeff = latent[order], coeff[:, order]
    coords = numpy.dot(coeff.T, M)            # Convert data to PC coordinates
    fig, ax = matplotlib.pyplot.subplots()    # Create matplotlib figure
    for i in range(data.shape[0]):            # Plot all the points
        ax.scatter(coords[0,i], coords[1,i], color=colors[labels[i]])
    matplotlib.pyplot.show()                  # Display the figure

And here’s the picture that we get, with the three classes 0, 1 and 2 drawn in red, green and blue, respectively.

The first thing we immediately see is that the first class (the red points) is very far away from the other two. The blue and the green appear mostly separated, though there appears to be some overlap. Of course, because of the projection it’s possible that they’re actually more separated than they appear: In three dimensions, one of the colors might be a lot closer to the camera than the other. In four dimensions there’s even more room for separation. It’s also hard to tell if the apparent separation is just a coincidence caused by the small number of points.
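One rough way to gauge how much a two-dimensional projection might be hiding is to check what fraction of the total variance the first two principal components capture. The sketch below is only an illustration: the function name `variance_fraction` is made up, and it runs on synthetic points rather than the iris measurements, so the printed number says nothing about the real data.

```python
import numpy

def variance_fraction(points, k=2):
    """Fraction of total variance captured by the top k principal components."""
    M = (points - points.mean(axis=0)).T            # Mean-center, one column per point
    eigvals = numpy.linalg.eigvalsh(numpy.cov(M))   # Eigenvalues in ascending order
    eigvals = eigvals[::-1]                         # Largest first
    return eigvals[:k].sum() / eigvals.sum()

# Synthetic stand-in: 200 four-dimensional points stretched along two axes.
rng = numpy.random.RandomState(0)
points = rng.normal(size=(200, 4)) * numpy.array([5.0, 3.0, 0.5, 0.2])
print(variance_fraction(points, k=2))
```

A value close to 1 would suggest that the flat picture is not discarding much, while a noticeably smaller value would be a hint that the remaining dimensions may hold more separation than the plot shows.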

The next thing you may have noticed is that each of the colors looks vaguely like an elongated blob, each following a different line. These don’t look exactly like the Gaussian blobs that I’ve drawn in past posts, but this is as close as we can expect to get with natural data. The elongated Gaussian blobs make sense for physical measurements of flowers, which will be different sizes depending on the growing conditions, but the widths and lengths should be roughly proportional.

So let’s try to determine if the green and blue sets are really separated from each other. In the PCA plot, there does appear to be a curve that separates green from blue. But given the small number of data points, we’ll have a pretty high risk of overfitting if we use a non-linear kernel. So we’ll stick to linear decision boundaries, i.e. trying to find a hyperplane in the four-dimensional data space between green and blue. Here’s the code to run the scikit-learn implementation of logistic regression on the green and blue data points:

    import sklearn.linear_model

    # Create the Logistic Regression object
    LR = sklearn.linear_model.LogisticRegression()
    LR.fit(data[50:], labels[50:])   # Find the hyperplane
    pred = LR.predict(data[50:])     # Predict the classes
    # Count the number of correct predictions
    correct = len([i for i in range(100) if pred[i] == labels[i+50]])
    print("Logistic regression correctly predicts", correct, "out of 100.")

The third line (not counting the comment) finds a logistic distribution that approximates the data, and the following line calculates the predicted classes of the blue and green points based on the distribution. Recall that in a logistic distribution, the decision plane is made up of the points where the distribution takes on the value 1/2. So the predicted class of each point tells us which side of the decision hyperplane it is on. The remainder of the script counts and reports how many of the green and blue data points have their color predicted correctly, indicating how accurately the hyperplane separates the data.
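To make the connection between the value 1/2 and the hyperplane concrete, here is a minimal sketch in plain Python. The weights `w` and offset `b` are made-up numbers chosen for illustration, not values fitted to the iris data, and the helper names are hypothetical. The point is that the logistic function equals 1/2 exactly where the linear score is zero, so thresholding the probability at 1/2 is the same as asking which side of the hyperplane a point is on.

```python
import math

def logistic(t):
    """Standard logistic function, 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_class(x, w, b):
    """Return 1 if the logistic model gives probability >= 1/2, else 0."""
    # The score is zero on the hyperplane and changes sign across it.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if logistic(score) >= 0.5 else 0

# Made-up weights for illustration only (not fitted to the iris data).
w, b = [1.0, -2.0], 0.5
print(logistic(0.0))                      # A score of zero: on the hyperplane
print(predict_class([3.0, 1.0], w, b))    # Score is  1.5, the positive side
print(predict_class([0.0, 1.0], w, b))    # Score is -1.5, the negative side
```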

We can do the same with a support vector machine by importing sklearn.svm and replacing sklearn.linear_model.LogisticRegression() with sklearn.svm.SVC(). (Note that SVC() uses a non-linear RBF kernel by default; passing kernel="linear" restricts it to a hyperplane, which may change the count slightly.) When I run this, I get an accuracy of 97/100 for logistic regression and 98/100 for the support vector machine. You can download the full script (with both logistic regression and SVM) from this link. So this tells us that even in four dimensions, the two classes overlap slightly and the PCA projection gives a reasonable impression of how far apart they are.

Note that we haven’t split the data into training and evaluation sets. That’s because we just wanted to see how close the green and blue data points are to being linearly separated. I’ll leave it as an exercise for the reader to add the Python code to split the data, train the linear models on the training set, then evaluate them on the remaining data points.
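As a starting point for that exercise, here is one possible sketch of the splitting step using only numpy. The helper name `split_indices`, the 30% evaluation fraction and the fixed seed are all arbitrary choices for illustration.

```python
import numpy

def split_indices(n, test_fraction=0.3, seed=0):
    """Hypothetical helper: shuffle range(n), return (train, test) index arrays."""
    rng = numpy.random.RandomState(seed)     # Fixed seed for a repeatable split
    order = rng.permutation(n)               # Random ordering of 0..n-1
    n_test = int(round(n * test_fraction))   # Number of points held out
    return order[n_test:], order[:n_test]

# 100 is the number of green and blue points used above.
train_idx, test_idx = split_indices(100, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```

With the arrays from the earlier scripts, one could then set `X = data[50:]` and `y = numpy.array(labels[50:])`, fit with `LR.fit(X[train_idx], y[train_idx])`, and compare `LR.predict(X[test_idx])` against `y[test_idx]`. Shuffling matters here because the file is sorted by species, so taking the first 70 rows would give a very lopsided training set.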

You should now have a rough idea of the type of structure that we will be looking for in other data sets. In upcoming posts, we will look at ways of turning other types of data sets into lists of vectors that we can visualize, or analyze with algorithms like logistic regression and SVM, in order to find their hidden structure.

Hi

I have a data set similar to data iris but it has a different format. Could you indicate how to set a data iris format?