In the past two posts, I’ve been looking at ways to turn “unstructured” data into vector data that can be analyzed with techniques like the ones that I’ve described elsewhere on this blog. One of the most common types of unstructured data that you’ll find on the internet is text, whether it’s news stories, tweets or spam e-mails. This can be very difficult to deal with, since the meaning behind the text generally depends heavily on things like the order that the words are in, their context and on subtle cultural subtexts. The field of natural language processing has come up with some very clever ways to deal with with a lot these issues, but that’s not what I’m going to write about today. There are many analysis problems that don’t require a complete understanding of the meaning of different pieces of text, so much as a general impression. So today, I want to look at a relatively simple way to deal with text data by generalizing the method from my last post for dealing with token-based data.
As with the last few posts, we’re going to be interested in classification problems. For example, lets say we have a collection of texts such as a number of different new stories or a number of tweets from twitter, and we want to decide which of them are about fishing. We should be able to get a fairly accurate prediction just by looking at what words appear in each text. For example if we see both the words “bait” and “fish”, there’s a good chance we’ve got a positive. There are some words where things could go either way. For example, is the word “tackle” referring to fishing equipment or to football? If we’re reading it on our own, we would be able to tell from context. But for a long enough text, even if we don’t know the context of each specific word, we should be able to make a prediction based on what other words appear. There will also be some words that make the text less likely to be about fishing. For example, the word “touchdown” would be a good negative word, to cancel out any appearance of “tackle” in the wrong context.
For a topic that you know well, you could probably pick out a number of words that will increase or decrease the chances that a text is about your topic. But our goal, of course, is to have the computer pick out these words, i.e. to build a model for deciding which texts are related to a given topic, or fit into some other type of classification.
A popular version of this problem is the detection of spam messages. So this week, we’ll look at the SMS Spam Collection data set from the UCI data archive. Each line in this data set consists of either the word spam or ham (the latter of which means a non-spam message), followed by the contents of an SMS message (i.e. a text message you would get on your phone.) The goal is to use the spam/ham label on a subset of the messages (the training set) to build a model that will correctly predict whether each of the remaining messages (the evaluation set) is spam or ham.
As I mentioned above, the most common way to do this is to use the token idea from my last post. In the census data set, there are a number of categories, each with a small number of possible values. Each data point consists of a list of tokens, one from each category. We turned this data into vectors by creating one column for each possible choice (rather than one for each category.)
When we deal with text, it initially seems like we’re in a very different situation: We don’t have any categories, and there are an essentially limitless number of possible words that could appear. But actually, things aren’t so different. For example, we could think of the text message as a single category, where instead of picking a single option/token for that category, we get to choose as many as we want. We still have a slight problem that in theory, there are an unlimited number of possible words that could appear in text messages. Luckily, there are a relatively small number of words that actually appear in the text messages in this particular data set, so we’ll just deal with those.
In other words, we can build our own list of possible tokens by recording all the different words that appear in all the text messages. Below is the python code to do this. Note that the line after the first comment strips all punctuation from the line and makes it lower case. This way any word that appears right before a comma or period, or at the beginning of a sentence doesn’t get counted as a separate. (I realize that proper punctuation is not very common in text messages, but you never know…) Here’s the code.
from collections import Counter f = open("SMSSpamCollection.data", "r") # Read each line of the data file, removing punctuation lines = [l.translate(string.maketrans("",""), string.punctuation).lower() for l in f] f.close() wordcount = Counter() # Tracks how often each word appears for l in lines: # For each text message for w in l.split()[1:]: # For each word in the text wordcount[w] += 1 # Record it in wordcount # Create a list of words that appear at least five times words = [w for w in wordcount if wordcount[w] > 5]
Two things to note: First, I changed the data file name from “SMSSpamCollection” to “SMSSpamCollection.data” so that it would have the same file extension as the other data files I’ve looked at. Second, I split the third line of code into two to make it fit in the browser window. If you copy this into a python file, you’ll have to unsplit it to make it work. You can also download the entire script from the Shape of Data github page.
After running this, I found that among the 5,574 text messages (of which 747 are spam), there are 9,663 different words that appear. Among these words, there are only 1,600 that appear more than five times. I decided to leave out these less frequent words, since we want a model that will focus on features that are likely to come up often in future text messages. For example, many of the less frequent words are URLs, which seem to appear in a lot of the spam messages and probably won’t be re-occurring.
Now that we have a list of tokens, we can translate the text messages into vectors. As with the last post, we will create a set of vectors with one dimension/feature for each possible token/word. For each text message and each word, we will set the corresponding entry in the vector to 1.0 if the word appears in the text message and to 0.0 otherwise. Another option would be to set each entry to the number of times the word appears in the text message, but we won’t explore that today. Here’s the python code:
# Create empty data and label holders data = scipy.zeros([len(lines),len(words)]) labels = scipy.zeros(len(lines)) for i in range(len(lines)): # For each text message l = lines[i].split() # Make a list of words for j in range(len(words)): # For each token if words[j] in l[1:]: # Set the entry to 1.0 if data[i,j] = 1.0 # the word appears if l == "spam": # Set the labels labels[i] = 1.0
We now have all the text message data in the same type of format that we used for both the Iris data set and the census data set, so we can analyze it the same way.
Again, note that we have thrown away a lot of information in this process. In particular, we no longer know what order the words appeared in, so we can’t reconstruct the original meaning of each text. In fact, in many cases we won’t be able to tell what part of speech a word is, since there are words like “mug” that can be used both as a noun and as a verb, depending on context. One way to deal with this issue would be to use a natural language processing algorithm like part-of-speech tagging which predicts how each word is being used and tags it as noun/verb/adjective/etc. In python, the NLTK package can do all of this relatively easily, but we won’t go into that today.
Once we have the data in vector form, we can create a PCA plot using the same code from the last post, which is shown below. You can see that there appears to be some real structure, and that the blue dots (the spam) are reasonably separated from the green dots (the ham).
To see how well separated they really are, we can run a classification algorithm. When I split the data into 80% training and 20% evaluation sets and ran logistic regression, I got an accuracy of 98.2%, which is pretty darn good.
We also have the option of scaling the the features, like we did with the census data. The resulting PCA plot is shown on the left. The combination of treating the words as tokens and scaling them based on how common they are is called tf-idf, which stands for token frequency – inverse document frequency. In other words, each feature represents how often each word appears in the given text, divided by (i.e. times the inverse of) how often the word appears in all the texts. You can do this automatically with the sci-kit learn tf-idf function, but like I’ve said before, I like to do it by hand to make sure I really know what’s going on. With logistic regression, I get an accuracy of 92.7% on the scaled data, which is still pretty good, but lower than with the unscaled features.
One interesting thing about using a linear model like logistic regression for this type of data is that the hyperplane that (roughly) separates the blue and the green points is defined by a vector, i.e. a list of numbers with one number for each word in our list. When we evaluate a new message (by taking a dot product), the model goes through this list of numbers and adds up all the numbers that correspond to words appearing in the message. If the sum is positive, it predicts that the message is spam. If the sum is negative, it predicts not spam. So the vector discovered by the logistic regression model basically carries out the scheme that I described at the beginning of this post: It assigns a weight to each word recording to what extent that word indicates that a message will be spam. Then it combines all those weights (which allows some words to cancel out others) to decide if the overall message is spam. And moreover, as we saw from the training and evaluation sets, this scheme actually works pretty well.