In mathematics, the term “duality” usually refers to a relationship between two objects, or types of objects, that is somehow symmetric. (Note that this is different from some other uses of the term outside of mathematics.) For example, if we start with a cube and draw a dot in the center of each of its eight faces, these dots will form the corners of an octahedron (a three-dimensional shape with eight faces and six corners) inside the cube, as on the left in the Figure below. But if we do the same thing with an octahedron, drawing a dot in the center of each of its faces, then forming a shape from them, we will get back a cube, as on the right in the Figure. So, the relationship that gets us from the cube to the octahedron is the same as the relationship that gets us from the octahedron to the cube.
Well, it turns out there is a form of duality that is useful for thinking about data sets. For example, consider market basket data, which an online retailer might use to keep track of all of the items that each of its customers has bought. We can think of this as a data set (in the form of a matrix) in which each customer is a data point (a row) and each item is a feature (a column). So, if customer 10 buys item 15 then we put the number 1 in the box in row 10, column 15. We could use this form of the data, for example, to try to predict which customers will buy a particular item, since regression allows us to predict the value of an unknown feature (the item in question) on a new data point (the customer).
But it would be just as natural to arrange this data set so that the items are the data points (the rows) and the customers are features (the columns). Running regression on the data set in this form would allow us to predict what items a particular customer is likely to buy next, since the unknown feature is now a customer, and the new data points are the items.
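To make the two arrangements concrete, here is a minimal sketch using NumPy; the customer and item counts, and the purchases themselves, are made up purely for illustration:

```python
import numpy as np

# Toy market-basket matrix (made-up data): 4 customers (rows) x 3 items
# (columns); a 1 in row i, column j means customer i bought item j.
purchases = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])

# Customer view: each customer is a data point, each item a feature.
print(purchases.shape)   # (4, 3)

# Item view: transpose, so each item is a data point and each
# customer a feature.
by_item = purchases.T
print(by_item.shape)     # (3, 4)
```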
In the end, though, these two different perspectives boil down to the same thing, and we can think of it as a form of duality: The customer sees the list of store items as things that they have or have not bought. The store item sees the list of customers as people who either have or have not bought the item.
Switching between these two (dual) perspectives amounts to switching the rows and columns of the matrix holding the data. The technical term for this is that after the switch, the new matrix is the transpose of the original matrix. Notice that if we do this twice – switching rows and columns once, then switching them a second time – we’ll get back to the original matrix. So, this is indeed a duality relationship like the one between the cube and the octahedron above.
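The fact that transposing twice returns the original matrix is easy to check directly, for example with NumPy:

```python
import numpy as np

M = np.array([[1, 0, 1],
              [1, 1, 0]])

# Transposing swaps rows and columns...
Mt = M.T                 # shape (3, 2)

# ...and transposing again returns the original matrix, just as
# dualizing the cube twice returns the cube.
assert np.array_equal(Mt.T, M)
```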
More importantly, the take-away from this is that we can often think of the data points and features of a data set as being two different classes of ideas that are on par with each other. While this sort of relationship isn’t always as obvious in other types of data sets, we can certainly always take the transpose of our data matrix (switching rows and columns) and treat this new matrix like we would treat any other data set. In fact, a number of the techniques that we’ve seen previously come very close to doing this. For example, Principal Component Analysis involves finding the eigenvectors of a matrix derived from the columns of a data matrix (namely the covariance matrix) while Spectral Clustering involves finding eigenvectors of a matrix derived from the rows of the data matrix.
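As a rough illustration of that contrast (this is a schematic, not a full implementation of either algorithm): the covariance matrix that PCA works with is feature-by-feature, while a similarity matrix of the kind spectral clustering works with is data-point-by-data-point:

```python
import numpy as np

# 6 data points (rows), 3 features (columns) -- the values are arbitrary.
X = np.array([
    [1.0, 2.0, 0.0],
    [0.5, 1.0, 1.0],
    [2.0, 0.0, 1.5],
    [1.5, 1.0, 0.5],
    [0.0, 2.5, 1.0],
    [1.0, 0.5, 2.0],
])

# PCA starts from a features-by-features matrix built from the columns...
cov = np.cov(X, rowvar=False)   # shape (3, 3)

# ...while spectral clustering starts from a points-by-points matrix
# built from the rows (here, a simple dot-product similarity).
sim = X @ X.T                   # shape (6, 6)

print(cov.shape, sim.shape)
```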
If we think of the features and data points as being on par with each other, then we can think of the data matrix as expressing a mutual relationship between pairs of elements from each of these classes: The 1 in row 10, column 15 expresses the relationship defined by a purchase, and a zero would represent no such relationship. For other data sets, we might have to work harder to explicitly express this relationship, or we could just believe that it’s there without trying to define it. For data sets with values in a wider range than just zeros and ones, you can think of the numbers as expressing how strong this relationship is, though this gets a little dicey if you allow negative values. So we do have to be careful about exactly how we use it (just like any of the other algorithms I’ve written about!)
Where this perspective gets interesting is when we recall that there’s another data structure specifically designed for recording relationships between different things: graphs/networks. And indeed, we can represent our market basket data set as a graph as follows: For each customer of the online retailer, we will include a vertex in the graph. Plus, for each item that the retailer sells, we will include a vertex as well. Then, for each item that a customer has bought, we’ll draw an edge from the customer’s vertex to the item’s vertex.
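Assuming the networkx library is available, the construction might look like the sketch below; the customer and item names are hypothetical:

```python
import networkx as nx

# Hypothetical purchase records as (customer, item) pairs.
purchases = [("alice", "book"), ("alice", "lamp"),
             ("bob", "lamp"), ("carol", "book"), ("carol", "mug")]

G = nx.Graph()
# One vertex per customer and one per item.
G.add_nodes_from({c for c, _ in purchases}, bipartite=0)
G.add_nodes_from({i for _, i in purchases}, bipartite=1)
# One edge per purchase, always joining a customer to an item.
G.add_edges_from(purchases)

print(nx.is_bipartite(G))   # True
```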
Notice that this graph has two different types of vertices: customers and items. Let’s pretend that the customer vertices are all colored blue and the item vertices are red. Each edge in the graph then joins two vertices of different types/colors – one blue (the customer who bought the item) and one red (the item that was bought). A graph that can be colored in this way, such as the one shown to the right, is called a bipartite graph.
What’s interesting about this approach is that, unlike the matrix form, in which we had to decide whether the data points would be the customers or the items, the graph doesn’t discriminate between the two. It explicitly treats the two types of vertices as equals, on par with each other.
One area where this turns out to be particularly important is clustering. Note that we can run clustering on either form of the original data set: If we run a clustering algorithm on the form in which the customers are data points, then the clusters will tell us groups of customers that have bought many items in common. If we run it on the form in which the items are the data points, we’ll get groups of items that were bought by many customers in common.
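Here is a sketch of both runs using scikit-learn’s KMeans on a toy matrix; the data is contrived so that the clusters are obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

# Contrived matrix: customers 0-1 buy items 0-1, customers 2-3 buy items 2-3.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Customers as data points: groups customers who bought similar items.
cust = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Items as data points: the same algorithm on the transpose groups
# items that were bought by similar sets of customers.
item = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)

print(cust, item)
```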
But what happens if we run a graph clustering algorithm (such as spectral clustering or my TILO/PRC algorithm) on our combined (bipartite) graph? In this case, each of the resulting clusters will contain both blue vertices corresponding to customers and red vertices corresponding to items. So each cluster tells us not only which customers have bought similar items, but what those similar items are. In other words, this algorithm not only gives us clusters like we would get from using the original data set, but it also gives us (essentially) an explanation for why it picked out those particular clusters.
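One way to sketch this, assuming scikit-learn is available, is to assemble the full adjacency matrix of the bipartite graph (customers and items together) and hand it to a spectral clustering routine as a precomputed affinity:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Customers-by-items matrix (contrived data with two obvious groups).
B = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
n, m = B.shape

# Adjacency matrix of the combined bipartite graph: the first n
# rows/columns are customer vertices, the last m are item vertices.
A = np.zeros((n + m, n + m))
A[:n, n:] = B
A[n:, :n] = B.T

# Cluster the whole graph at once; each resulting cluster mixes
# customer vertices (indices 0..n-1) with item vertices (n..n+m-1).
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```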
In general, an algorithm that picks out clusters of both data points and features is called a co-clustering or biclustering algorithm, for hopefully obvious reasons. Forming a bipartite graph and running a standard clustering algorithm, like we did here, is a common approach to co-clustering, though by no means the only approach. While we could get similar information by running a standard clustering algorithm and then carefully analyzing the feature values in each of the resulting clusters, co-clustering can in many cases find better clusters. Moreover, co-clustering gives you direct evidence of which features were most important in determining the clusters, rather than having to infer this after the fact. So, while co-clustering won’t necessarily make sense in all situations, it can be a powerful tool, particularly for things like market-basket data, where there is a strong sense of data point/feature duality.
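scikit-learn ships a ready-made spectral co-clustering implementation (`SpectralCoclustering`); a minimal sketch on the same kind of contrived data:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Contrived customers-by-items matrix with two obvious blocks.
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)

# row_labels_ groups the customers and column_labels_ groups the items;
# matching label values pair each customer group with its item group.
print(model.row_labels_)
print(model.column_labels_)
```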