In the last few posts, I described how to turn data sets with varying amounts of structure into vector data that could be analyzed using standard data mining/machine learning algorithms. All the types of data that I looked at were static, i.e. each data point was a single instance of information, whether it was measurements from a single iris flower, census data from a single person or the contents of a single text message. In this post and the next, I will switch gears to data that varies with time, which is called time series data. In other words, we will want to take a sequence of readings and turn them into a single data point in some (possibly high dimensional) data space.
There are countless ways to analyze time series, and my goal here is just to describe a few initial ideas that one can try. We’ll take our cues from two sources: First, one of the most fundamental time series in the human experience is sound, which is transmitted into our ears as a (very fast) sequence of pressure changes on our ear drums. Second, each text message that we looked at in Case study 3 can be thought of as a time series in which each letter in the message is the next value in the time series.
The key to translating the text messages into data vectors was that we were willing to throw away some of the information in each text message, namely the order in which the words appeared. We were left with a handful of words/tokens for each text message, which we translated into a vector with each entry a 1 or 0, indicating whether or not each token is present in the given message. (This is called a bag of words representation.) Note that we didn’t completely discard the time information: the smallest unit of time in the series is the letters, and the words/tokens are are snippets of this series of letters. So the order of the letters within each token/snippet matters, but we discard the order in which the tokens/snippets occur relative to each other.
It is often useful to do something similar for more general types of time series: given a collection of tokens that can appear in some type of time series, we may want to ask which of these tokens appear in a particular time series. For example, if we have a collection of sound recordings (which as noted above, we can think of as time series) and a recording of a bird chirp (which will be our token), we can ask which recordings include that chirp (or a similar chirp). As with the text messages, we won’t necessarily care when in the recording the chirp occurs, just whether or not it’s there. If we had a collection of short token recordings, we could record which of them occur in a given longer recording as a binary vector analogous to the bag-of-words for a piece of text.
Note that there are essentially two parts to this process: First, we need to decide what tokens we’re looking for. Second, we need to decide whether or not they appear in a given time series. In this post, I’m going to focus on the second part. In my next post, I’ll discuss the first, which is much more technical and relies on understanding the second part.
So we would like a way to decide which tokens appear in a given series, so that we can translate each time series into a vector of 1s and 0s, like we did with the text messages. (Note that these vectors will all have the same dimension, namely the number of tokens that we’re looking for, independent of the length of the original time series.) In practice, we will often end up with feature values between 0 and 1, since the tokens may be very close to something in a given time series but not match exactly. For example, if our sound recording has a chirp from a different bird in the same species as the bird from the token, their chirps will probably be a little different. So we might want to record a 92% match as a .92, etc.
Most time series come in the form of a stream of numbers. In some series the numbers come at regular intervals. For example in sound recordings, each number is essentially the pressure on the membrane of a microphone caused by a sound wave at a given time. The amount of time between measurements is determined by the recording frequency, usually measured in kilahertz (kHz). A 44.1 kHz MP3 file has 1/44,100 seconds between each time step. When recording these types of time series, it is only necessary to include the reading at each time, as long as the frequency is recorded in the metadata.
In other time series, there are irregular intervals between readings. For example, the ticker for a stock symbol on the New York Stock Exchange records the price that was paid for the stock each time someone buys the stock from someone else. There may be large time gaps between sales, followed by bursts of rapid trading. A file that records this type of time series must include a time stamp for each reading, in addition to the value at that time.
There are also time series in which more than just one number is recorded at each time. For example, there are data sets (such as this one) that record position information from the gyroscope and accelerometer in a cell phone over a period of time. At each time, a six-dimensional vector with three readings from each sensor is recorded.
So lets return to the original question: How do we decide if a given token is represented in a given time series? Note that the token will be a short time series of the same form as the time series in question. This is where our ears come in. Even among those of us who are tone deaf, ears are exceptionally good at picking out frequencies in sounds. In fact, when your ears transmit sound data to your brain, they do so not as a sequence of pressures, but as a spectrum of frequencies. (This translation into frequencies is called a Fourier transform.) The translation is done not by neurons, but by a physical process that takes place in the part of the ear called the cochlea, using a physical phenomenon called resonance.
You should have an intuitive understanding of resonance if you’ve ever pushed someone on a swing (something I’ve done my fair share of in the last few years.) The first time you ever pushed someone on a swing, you probably figured out very quickly that you need to push them while they’re moving away from you, not towards you, as in the Figure below. This is simple physics: If you push them while they’re moving towards you, you will be acting against their momentum, and thus decrease their overall kinetic energy. If you push when they’re moving away from you, you will be adding to their kinetic energy. (In fact, the best time to push is when they’re in the middle of their swing and closest to the ground, but this is very hard to do.)
So if you want to increase the amount of kinetic and potential energy of the child in the swing (in response to their cries of “higher! higher!”) you have to push in the same rhythm/pattern as the swing. But the rate at which the swing goes back and forth is completely determined by the weight of the child and the length of the swing. No matter how hard you push or what rate you push at, the number of times they go back and forth per minute will be the same.
In other words, the pattern that the swing follows is fixed. You can think of this fixed pattern as a token. If two people of equal strength were pushing two children of equal weight, the person whose pushing pattern closest matched the pattern of the swing would end up pushing their child higher. These two pushing patterns play the role of the two time series that we want to check for the token. The higher the swing goes, the better the match for the token. A match between the two patterns is called resonance and the fixed frequency of the swing is called its resonance frequency. (Note that the musical term resonance as opposed to dissonance is more general, but closely related to the way we’re using the term here.)
The cochlea in your ear does this with thousands of little hairs called stereocilia. Each hair is a different length and thus naturally vibrates at a different frequency, the same way that the swing goes back and forth at a fixed rate. When a sound wave vibrates the fluid in the cochlea, it causes these little hairs to vibrate (like pushing the swing) and the amount that each stereocilum vibrates (how high the swing goes) depends on how closely the pattern of the sound matches the natural pattern of the stereocilium. So, each hair is in charge of a single frequency and the amount that that hair is vibrating at any given time tells the brain the extent to which its assigned frequency was present in the last few microseconds of sound. (The liquid in the cochlea makes sure that the stereoclia don’t keep vibrating for long after the sound ends.)
So, that’s all well and good, but now we have to figure out how to do the same thing with a computer. As it turns out, it’s not so hard. If we’re willing to skip over some interesting (but reasonably dense) physics and calculus, we can calculate the amount of energy that each push adds to the swing, or that each moment of the sound wave adds to the stereocilium by just multiplying the force of the push by the velocity of the swing/stereocilium. There are two possible directions for the push and for the velocity, so we have to choose one direction to be positive and the other negative. If the push and the velocity are both positive or both negative then the product is positive, so you’re adding energy. (Aren’t negative numbers great!) If one is positive and the other negative, i.e. the push is in the opposite direction from the velocity, then the product is negative, so you’re reducing the energy.
For simplicity, lets say our token is a time series of length five, with regular intervals. We can think of those five numbers as a vector (a,b,c,d,e). To decide if the token appears at a certain time in a longer time series, we will take the five readings starting at that given time and form a vector (v,w,x,y,z). To have this vector “push on” the token to see how much energy it creates, we just have to multiply the corresponding elements (which will give us the added energy at each individual time) then add them together (for the total energy). In other words, the value av + bw + cx + dy + ez tells us how closely they match.
If you remember your linear algebra, you may recognize this as the dot product of the two vectors. Recall that the dot product is used to calculate the angles between the vectors and has played a role in almost all of my earlier posts (though I didn’t always mention it by name.) That’s quite convenient because it means that we can just use the token (a,b,c,d,e) and the snippet (v,w,x,y,z) like any other pair of vectors. (Perhaps this was a waste of time to put all this energy into explaining resonance when we ultimately decided to just use the time series in the most direct and simple way, but at least we now know why this simple approach is reasonable.)
There is one more small detail: Remember that when we compared the swing pushers, I insisted that the pushers must have equal strength and the swing riders must have equal weight. It’s obvious that if one of the pushers is significantly stronger than the other, they will make the child go higher, no matter how bad their timing is (though it may not be a pleasant experience for the child.) We can overcome this by (drumroll please…) normalizing the vectors (a,b,c,d,e) and (v,w,x,y,z) to have length 1, i.e. so that the dot product of each vector with itself is 1. After we do this, their dot product will be exactly equal to the cosine of the angle between them. A larger cosine (more energy added to the swing) means a smaller angle, so having the patterns match closely will correspond to the two vectors being very close in the data space. This will always give us a value between 0 and 1 that tells us how closely the token matches the snippet from the original time series. If we want to decide whether or not the token appears anywhere in the time series, we’ll need to look at a number of overlapping snippets throughout the time series, compare each one to the token, and record the maximum dot product over all the snippets.
For series with irregular time intervals, things are a little complicated, but the same basic theory applies. Since I said my goal was just to give you a few ideas about how you might start analyzing time series, I’m going to leave that case alone for now.
Instead, lets look at a specific data set with time series with regular intervals. The Robot Execution Failures Data Set on the UCI repository has time series of force and torque measurements taken from a robot right after it detected a failure. Each time series is labeled with one of four types of failures, and the goal is to predict what type it was based on the time series.
One nice thing about this data set is that each time series is the same length, namely 15 pairs of force/torque readings, 21 milliseconds apart, starting from the time of failure detection. Since these are all the same length, we don’t need to cut snippets out. Instead, we will treat each time series as a single vector of dimension 15 * 6 = 90. We’ll look at methods involving snippets and tokens in the next post. For now, I just want to demonstrate that thinking of time series as vectors in this way is a reasonable procedure.
Each time series in the data set is arranged as a block of data such as the following:
normal -1 -1 63 -3 -1 0 0 0 62 -3 -1 0 -1 -1 61 -3 0 0 -1 -1 63 -2 -1 0 -1 -1 63 -3 -1 0 -1 -1 63 -3 -1 0 -1 -1 63 -3 0 0 -1 -1 63 -3 -1 0 -1 -1 63 -3 -1 0 -1 -1 61 -3 0 0 -1 -1 61 -3 0 0 -1 -1 64 -3 -1 0 -1 -1 64 -3 -1 0 -1 -1 60 -3 0 0 -1 0 64 -2 -1 0
The first line contains the label for the time series. In this example, ‘normal’ means there wasn’t a failure. In the file ‘lp1.data’, there are three other labels: ‘collision’, ‘obstruction’ and ‘fr_collision’. Each row contains six force and torque readings from a single point in time. As described above, we’re going to just concatenate these lines into one long vector. Here’s the python code to do this. (The file has been read in to a list of strings called ‘lines’ and the number of time series has already been calculated and stored in the variable ‘rows’.)
# Create an empty data matrix and labels list data = scipy.zeros([rows, 90]) labels =  # Cycle through each block of data for i in range(rows): # Read the label from the first line of the block labels.append(lines[i*18]) for j in range(15): # Cycle through the data lines d = lines[i*18 + j + 1].split('\t') # Read in the six numbers from each time step for k in range(6): data[i,j*6+k] = float(d[k+1])
There are only 88 lines of data in the file, but as noted above, we have 90 features. Whenever there are more features than data points, any classification algorithm is virtually guaranteed to suffer from overfitting. However, we can get a rough idea of how much structure this form of feature extraction produces by doing a PCA plot. Here it is:
The ‘normal’ time series are shown in green and form a very tight grouping. The ‘obstruction’ and ‘fr_collision’ series are shown in red and yellow, and appear to be well separated from the greens, and to a lesser extent from each other. The blue points are in the ‘collision’ class and it is hard to tell whether or not they’re separated from the greens. So lets do another plot with just the greens and blues. Here it is:
This is more or less a blow-up of the green/blue blob in the picture above. Again, the green ‘normal’ points are tightly packed. But this time the blue points are widely dispersed. They don’t appear to be linearly separated from the green, because they surround the green on all sides. But, of course, because of the way projection works, there’s no way to tell for sure unless we run a linear classifier like SVM or logistic regression on it. I don’t think it’s worth doing on this data set since, as I noted above, overfitting is going to be a huge problem. But if you’d like to, you can download all the code from the Shape-of-Data github archive and try it yourself. (Note that I changed the name of the robot data file from “lp1.data” to “robotlp1.data” so that it would be easier to remember what it is.)
So this suggests that forming longer vectors from time series (or snippets of time series) in this way does a reasonable job of creating the type of structure that we’re used to. As I noted above, the real trick is going to be to use this process with snippets/tokens in order to analyze time series of different lengths. But that will have to wait for my next post.