In the last post, I explained how to measure how similar two time series of the same length are using the idea of resonance from physics. The answer boiled down to calculating the dot product of the two time series, a measure that is commonly used in linear algebra for calculating the angle between two vectors. So this fits nicely into the general theme of this blog – thinking of data as geometric information (in this case vectors) and translating the analytical tools into geometric properties. In this post, I want to focus on the case when we have time series of possibly different lengths. In this case, we can’t take the dot product of two time series because there isn’t a one-to-one way to match up the times from the first series with the times from the second. Instead, we’ll use the idea of resonance to compare shorter snippets of the two time series using the idea of tokens from the text analysis case study.

As I described in the last post, if we have a short time series, we can compare it to a longer time series by cutting a number of (overlapping) snippets out of the longer time series and comparing the original short time series to each snippet. A large dot product between the short series and the snippet means that the two are a close match. So if we compare the short time series to a large number of snippets from the longer time series and one or more has a large dot product, then we know that something very close to the short time series appears in the long one.

Recall that in the text mining case study, we analyzed text messages by keeping track of which words appeared in them. By using the dot product as above with a number of different short time series, we can similarly determine which of the short time series “appear in” the longer one. To simplify things, I’m going to restrict our attention to time series with a fixed sample rate. The techniques can be modified to work for variable sample rates, but it gets a bit trickier.

As I described last time, a time series of a fixed length will be defined by a vector of some dimension determined by the sample rate of the time series (i.e. how much time there is between each step in the time series) and the dimension of the reading at each time step. Here’s a snippet of Python code that defines the *findToken* function, which takes two one-dimensional numpy arrays (the short time series and the long time series) and returns the largest dot product between the short series and any snippet of the long one.

def findToken(token, series):
    dot_prod = []
    for i in range(series.shape[0] - token.shape[0] + 1):
        dot_prod.append(sum([series[i + j] * token[j] for j in range(token.shape[0])]))
    return max(dot_prod)

Note that this code looks at every possible snippet of the desired length. If the time series that you’re looking at are very long, it may be necessary to space the snippets out farther to reduce the computational cost. Also note that we could speed up this code by replacing the *for* loops with vectorized numpy operations such as *numpy.correlate*, but this would make the code harder to read.

If we have a collection of short time series, we can think of it as a sort of vocabulary analogous to the words that may or may not appear in a text message. In the language of the earlier case studies, the short time series are *tokens* and we can use these tokens to translate each time series into a vector of fixed dimension. For example, if we have four short token time series, each longer time series gets translated to a vector *(a,b,c,d)* where the number *a* records the value returned by *findToken* applied to the first token time series and the longer time series. The number *b* records *findToken* of the second token time series and so on.
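As a concrete illustration of this translation, here is a minimal sketch: four made-up token series and one longer series, with the vector *(a,b,c,d)* computed by the same maximum-dot-product idea (the function is repeated here so the snippet is self-contained):

```python
import numpy as np

def findToken(token, series):
    # Maximum dot product of the token against every snippet of the series.
    return max(
        sum(series[i + j] * token[j] for j in range(token.shape[0]))
        for i in range(series.shape[0] - token.shape[0] + 1)
    )

# Four made-up token time series (the "vocabulary").
tokens = [np.array([1.0, 0.0, -1.0]),
          np.array([0.0, 1.0, 0.0]),
          np.array([1.0, 1.0, 1.0]),
          np.array([-1.0, 0.0, 1.0])]

# A longer series to tokenize.
series = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0])

# The fixed-dimension vector (a, b, c, d) described above.
vector = np.array([findToken(t, series) for t in tokens])
print(vector)
```

Each entry records how strongly the corresponding token "resonates" somewhere inside the longer series, regardless of where in the series the match occurs.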

When we tokenize a time series using the dot product like this, the short/token time series are often called *wavelets*. This term comes from the idea that the time series are waves (such as sound waves) so short waves should be called wavelets. As I discussed in the last post, this is essentially how your ears work, using physical resonance to cause a small hair in the cochlea to vibrate when you hear a certain frequency (namely its resonance frequency). The token time series corresponding to the hair is a sine wave with this frequency, except that the liquid in the cochlea causes the hair to vibrate only when the corresponding sound is present. Thus the wavelet defined by the hair is something more like a sine wave times a Gaussian curve, as in the Figure to the right.
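For illustration, here is one way to generate such a dampened sine wave in numpy. The frequency, sample count, and envelope width are made-up choices for the sketch, not values taken from the cochlea analogy:

```python
import numpy as np

def waveletToken(freq, n_samples=64):
    # A sine wave of the given frequency multiplied by a Gaussian
    # envelope centered at t = 0, as in the cochlea analogy.
    t = np.linspace(-1.0, 1.0, n_samples)
    envelope = np.exp(-(t ** 2) / 0.1)
    return np.sin(2 * np.pi * freq * t) * envelope

wavelet = waveletToken(freq=4)
# The Gaussian envelope kills the wave away from the center, so the
# endpoints are near zero while the middle oscillates.
```

Using this wavelet as a token in *findToken* would then pick out whether the longer time series contains a burst of oscillation at that frequency.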

So, if we have a collection of wavelets (a vocabulary), we can use the *findToken* function to translate each time series into a vector. But that still leaves us with a big problem: If we’re only given a collection of time series, how do we pick out the vocabulary?

There is no single best answer to this question. In some cases, domain-specific knowledge can be used to determine a good vocabulary of wavelets. For example, if your time series are sound recordings, then there is a good deal of mathematical theory that tells you to use dampened sine waves, like the one in the Figure above, of different frequencies. This is closely related to the Fourier transform. These wavelets have a number of nice properties that imply that, roughly speaking, you won’t lose very much information by using them.

But in cases where the type of time series is not as well understood from a theoretical perspective (which is probably the more common case), we need a way to pick the vocabulary directly out of the data. Note that this is what we did with the text message case study: When we decided which words would correspond to tokens, we didn’t load a dictionary and use all possible words. We made a list of the words that actually appeared in the data set of text messages. For time series, we’ll want to build our wavelets in a similar way.

The most obvious way to do this is to just look at all the snippets that appear in our different time series and use them as the vocabulary of wavelets. Here’s the code to do this, where I’m assuming *data* is a list of one-dimensional numpy arrays and *length* is the length that we want the snippets to be. As before, we may want to space out the snippets that we take from each time series, rather than take every possible snippet.

import numpy

def makeVocab(data, length):
    vocab = []
    for d in data:
        for i in range(d.shape[0] - length + 1):
            vocab.append(d[i:i + length])
    return numpy.array(vocab)

Note that all the wavelets returned by *makeVocab* are the same length. If we want to allow wavelets of different lengths, we would have to choose a small number of possible lengths, and run *makeVocab* for each length. Below, we’re going to assume that all the vocabulary time series have the same length, so we would have to apply the procedures separately to each of the resulting sets.
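As a sketch of that idea, we could keep a separate vocabulary per snippet length. The lengths 3 and 5 and the toy data below are just for illustration:

```python
import numpy as np

def makeVocab(data, length):
    # Collect every snippet of the given length from every series.
    vocab = []
    for d in data:
        for i in range(d.shape[0] - length + 1):
            vocab.append(d[i:i + length])
    return np.array(vocab)

# Two toy series of different lengths.
data = [np.arange(10.0), np.arange(7.0)]

# One vocabulary per length; each set would be shortened and matched
# separately, since the dot product requires equal lengths.
vocabs = {length: makeVocab(data, length) for length in (3, 5)}
```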

The *makeVocab* function is essentially a direct translation of what we did for the text messages, but there’s a problem: If we have two snippets that are almost identical, but just off by a little, they will appear as two different tokens/wavelets in the resulting vocabulary. When we compare these two wavelets to a longer time series (via *findToken*) we’ll get essentially the same value from both wavelets. In other words, the information we get from these two wavelets is redundant. This is part of a bigger problem that the vocabulary list returned by *makeVocab* will generally be a huge number of wavelets and will probably be computationally infeasible to use as features.

So we need to take our initial vocabulary of all the wavelets that occur and reduce it to a smaller set of wavelets that have two properties: 1) They must capture the commonly occurring behavior of the different time series in the data set so that we don’t lose too much information when we translate the time series into vectors. 2) They must be relatively distinct from each other so as to minimize redundancy. Below, I’m going to suggest one possible way to do this. This is by no means the only way, though I think the general framework of the discussion will be useful for understanding other methods of picking out a smaller vocabulary of wavelets.

As always on this blog, I’m going to translate things into a geometric problem. First, let’s consider the “space” of all possible time series of a fixed length. The set of all *theoretically* possible time series of that length is a vector space whose dimension is this fixed length. However, in practice the set of time series snippets that can *actually* occur will make up a relatively small portion of this vector space. In fact, we can think of the set of snippets that can actually occur as a probability distribution on the vector space of all snippets that could theoretically occur. The value of this distribution on a given vector will be (roughly speaking) the probability that the time series snippet represented by the vector will occur in any given time series of the sort we’re looking at.

From this perspective, the problem of finding a vocabulary of wavelets translates to finding a collection of vectors in the space of all theoretically possible snippets that are nicely distributed throughout the probability distribution representing all the actually occurring snippets. Of the two conditions I defined above, condition (1) means that any region of space where the probability is relatively high should have a nearby snippet in the vocabulary. Condition (2) means that no two snippets in the vocabulary should be too close to each other as vectors in this vector space.

But recall that we don’t know precisely what the probability distribution of actually occurring snippets is. Instead, what we have is a finite collection of actually occurring snippets from our data set of time series. We can think of these as a large set of points “sampled” from the space of snippets. Before we can find our smaller vocabulary of snippets, we need to translate these actually occurring snippets into a distribution (Does this sound familiar?) and then use the distribution to find our smaller vocabulary. But rather than trying to find an explicit representation for the distribution (as an equation or something like that), it turns out we can go directly from the large set of snippets to a smaller number of representatives that are relatively evenly distributed throughout the data set. The translation from large vocabulary to distribution to small vocabulary will be implicit in the one-step translation.

Now, if you’ve been following this blog closely, you may remember seeing a very similar problem a few months ago. I’ll give you a minute to think about it…

Did you get it? This is the problem that I suggested you could solve using the K-means algorithm. But there’s something funny going on here: I suggested using K-means to find a small number of representatives for a data set. But the snippets that we’re trying to find here aren’t representatives of the data set. Instead, they’re tokens that will define the dimensions/features of the data space that we will translate the different time series into. In other words, we’re using K-means for feature selection! Here’s the code to do this with scikit-learn’s K-means function.

from sklearn.cluster import KMeans

def shortenVocab(long_vocab, n_tokens):
    KM = KMeans(init='random', n_clusters=n_tokens)
    KM.fit(long_vocab)
    return KM.cluster_centers_

This is an example of the strange duality between data points and features that often comes up in data analysis. In fact, this is closely related to the problem that came up with Gaussian kernels: The space of all possible snippets can be thought of as a space of functions (i.e. a function whose domain is time), analogous to the space of all probability distribution functions that we used to define a Gaussian kernel. To define a finite dimensional Gaussian kernel, we needed to find a small number of points in the data space that were nicely distributed throughout the distribution defining the data set. By taking these data points as the centers of Gaussian functions, we effectively translated them into the features/dimensions of a Gaussian kernel.

To tie this all together, here’s the code to translate a collection of time series data into vectors using the functions I defined above. The parameters that the function *transformToVectors* takes are a list *data* of numpy arrays of arbitrary length, a number *length* recording how long the tokens should be and a number *n_tokens* recording the number of tokens/wavelets that we want in the final short vocabulary.

import numpy

def transformToVectors(data, length, n_tokens):
    long_vocab = makeVocab(data, length)
    short_vocab = shortenVocab(long_vocab, n_tokens)
    vector_data = numpy.zeros([len(data), n_tokens])
    for i in range(len(data)):
        for j in range(n_tokens):
            vector_data[i, j] = findToken(short_vocab[j], data[i])
    return vector_data
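To see the whole pipeline working together, here is a quick end-to-end run on made-up data: three noisy sine waves of different lengths. The functions are repeated so the snippet is self-contained, and the random seed and parameter choices are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

def findToken(token, series):
    # Maximum dot product of the token against every snippet of the series.
    return max(
        sum(series[i + j] * token[j] for j in range(token.shape[0]))
        for i in range(series.shape[0] - token.shape[0] + 1)
    )

def makeVocab(data, length):
    # All snippets of the given length from all series.
    vocab = []
    for d in data:
        for i in range(d.shape[0] - length + 1):
            vocab.append(d[i:i + length])
    return np.array(vocab)

def shortenVocab(long_vocab, n_tokens):
    # K-means cluster centers serve as the reduced wavelet vocabulary.
    KM = KMeans(init='random', n_clusters=n_tokens, n_init=10, random_state=0)
    KM.fit(long_vocab)
    return KM.cluster_centers_

def transformToVectors(data, length, n_tokens):
    long_vocab = makeVocab(data, length)
    short_vocab = shortenVocab(long_vocab, n_tokens)
    vector_data = np.zeros([len(data), n_tokens])
    for i in range(len(data)):
        for j in range(n_tokens):
            vector_data[i, j] = findToken(short_vocab[j], data[i])
    return vector_data

# Made-up data: three noisy sine waves of different lengths/frequencies.
rng = np.random.default_rng(0)
data = [np.sin(np.linspace(0, f * np.pi, n)) + 0.05 * rng.standard_normal(n)
        for f, n in [(4, 80), (8, 100), (4, 60)]]

# Every series, whatever its length, becomes a 5-dimensional vector.
vectors = transformToVectors(data, length=10, n_tokens=5)
print(vectors.shape)
```

Once the series are in this fixed-dimension form, any of the vector-based methods from earlier posts (distances, classifiers, clustering) can be applied to them directly.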

Unlike the last few posts, I haven’t suggested a particular data set to try this on. Mostly, this is because I haven’t found a great data set to use as an example. But it’s also because this post is probably long enough as is. If you know of a data set that you think would be good to try this on, please let me know in the comments. Or, if you want to try it yourself, you can download this code from the Shape of Data github page. The script on github runs the code on the robot arm data from the last post. If you find another data set that works well, you should also let me know in the comments!

Nice, I really liked this one. I don’t really know anything about signal processing, so I feel like I learned something from this. Thanks :).

In Python, you can continue lines as long as you are inside a parenthesis, so your code would work without unbreaking line 4 in your snippet.

Thanks. I always forget the details of the white space policies with python. I removed the instructions to combine the lines.

Let me see if I understand: you’re taking a long sequence and projecting its windows onto some set of base sequences to get a good approximation of the long sequence. And you can then compare two sequences of different lengths by projecting them both onto the wavelet basis and comparing with the usual dot product similarity.

Is this a direct translation of the more general (continuous) wavelet transforms? I’m not even sure what wavelet transforms do, but I imagine it’s this: take a small (simple) family of orthogonal continuous functions which might not necessarily form a basis, and project your function onto these pieces to get an approximation of the original signal which has some properties that mysteriously arise from the choice of wavelets.

On another note: have you heard of dynamic time warping for comparing sequences of different lengths? (http://jeremykun.com/2012/07/25/dynamic-time-warping/) It’s only messily geometric (it’s some kind of bizarre nonlinear transformation of the data), but it would be interesting to see how the two methods compare.

The method I described differs from the usual continuous wavelet transformation in two important ways:

1) The wavelets that come from K-means will almost never be an orthogonal basis. In fact, there’s no reason to expect them to be either orthogonal or a basis. (If you do want an orthogonal basis, you could produce one using a different algorithm, such as PCA on the windows/snippets.) For wavelet transformations, where the goal is to encode and decode the time series, orthogonality is important so that you can reconstruct the original time series from just the wavelet data. But here, we aren’t trying to do that.

2) The wavelet transform turns a given time series into a new time series in which the value at each time is the dot product of the corresponding window with the wavelet. Since I want to get a fixed length vector (rather than a new time series), I suggested just keeping track of the maximum of this function (though you could also keep track of its integral, or some other measure.)
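To make the contrast concrete, here is a hypothetical sketch (the function name and toy values are made up): the sliding dot products form the wavelet-transform-style output, and we can summarize them either by their maximum, as *findToken* does, or by their sum/integral:

```python
import numpy as np

def slidingDotProducts(token, series):
    # The wavelet-transform-style output: one dot product per window.
    n = series.shape[0] - token.shape[0] + 1
    return np.array([np.dot(series[i:i + token.shape[0]], token)
                     for i in range(n)])

token = np.array([1.0, -1.0])
series = np.array([0.0, 1.0, 0.0, -1.0, 0.0])

products = slidingDotProducts(token, series)
summary_max = products.max()        # what findToken keeps
summary_integral = products.sum()   # the alternative mentioned above
```

Note that the two summaries can disagree sharply: here positive and negative matches cancel in the integral while the maximum still registers a strong local match.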

I have read about dynamic time warping and I’m hoping to discuss it in a future post (though I’m not sure exactly when.) DTW is, in fact, very geometric, or at least topological. Compared to wavelets, it’s a better measure of the large scale, overall behavior of a time series. Whereas, wavelets try to encode the different short-term behaviors of a time series.

Excellent, thanks! I think I’d enjoy reading your blog more with a little bit more rigor. For me, the hardest part of this stuff has always been getting from a fuzzy understanding to a mildly mature one. Really absorbing definitions to understand all the work that goes into crafting them and such. But I really appreciate the code snippets. Great work!
