In my last two posts, I wrote about model interpretability, with the goal of trying to understanding what it means and how to measure it. In the first post, I described the disconnect between our mental models and algorithmic models, and how interpretability could potentially reduce it. In the second post, I laid out four things that a model interpretation should allow us to do – mitigate bias, account for context, extract knowledge and generalize. In this post, I want to discuss a number of desirable properties that have been suggested for model interpretations, and that might be used to judge whether and how much a model or explanation is interpretable.
As I noted previously, while there are many papers in the research literature that describe interpretable models, or ways of adding a layer of interpretability on top of existing models, most of these papers are not explicit about what they mean by interpretability. Two exceptions that I’ve found (though there are probably others) are Ribeiro, Singh and Guestrin’s paper on Locally Interpretable Model-agnostic Explanations (LIME) and Kulesza, Wong, Burnett and Stumpf’s paper on Principles of Explanatory Debugging (PED). The properties I’ll propose below are taken from these two papers, with different names and a few modifications.
It’s also worth mentioning a distinction that Zack Lipton makes in his Mythos of Model Interpretability, between transparency – explaining the way the whole model works – and post-hoc interpretability – explaining how individual predictions were made, without necessarily describing the process for calculating them. Even though most of what I’ve written previously on this blog has focused on transparency, this post and the rest of this series of posts will focus on post-hoc interpretability.
So, here are the seven properties. A post-hoc model interpretation should be:
Concise. Both the LIME paper and the PED paper point out that an explanation should not overwhelm the user with details about how a prediction was made. A central goal of machine learning is to deal with the complexity that we can’t or don’t want to incorporate into our mental models. If an interpretation forces the user to consider all the features and coefficients that went into the model, that defeats the purpose. An explanation should minimize the cognitive load required of the user.
For example, if a prediction can be explained by indicating 100 or 1000 factors that contributed to the value, it would be more concise to have the explanation algorithm pick out the top 5 according to some notion of importance, and only present those. Of course, if you make an explanation too concise, it may cease to be useful. So you’ll always want to balance making an explanation concise against the the other desirable properties, particularly:
Faithful. The explanation should accurately describe the way the model made the prediction. For example, you might try explaining the predictions from a neural network by separately training a linear model, then presenting the features that contributed to the linear model’s prediction (which are relatively easy to determine) as the features that probably had the most impact on the neural net’s prediction. This explanation would not be faithful, since it does not describe the actual model. The authors of PED use the term sound for this property, and the LIME paper calls it local fidelity.
Complete. An explanation is complete if it explains all the factors and elements that went into a prediction. In order to be concise, an explanation should never be absolutely complete, but you can measure how complete (how close to complete?) it is. The authors of the PED paper have an earlier paper, Too Much, Too Little or Just Right? that examines the tradeoff between a model being concise, faithful (sound) and complete.
Comparable. An explanation should help you compare different models to each other by examining how they handle individual examples. The LIME authors call this model agnostic. This is relatively straightforward for explanation algorithms that can be applied to multiple types of models, and defines the same form of explanation for all of them. It’s trickier for interpretable models that provide an explanation in a form that’s specific to that type of model. But if the explanation gives enough insight into how the prediction was made, it could still be comparable.
Global. The explanation should indicate how each individual prediction fits into the overall structure of the model. The LIME authors call this a global perspective. It’s related to the notion transparency I mentioned above – understanding how the whole model works. However, a global explanation only has to indicate enough of the model’s structure to provide reasonable context for each individual prediction. Is the predicted value especially high or low? Were there a lot of nearby points in the training set? Is the variance between these data points high? Is the prediction based on extrapolating from lots of unrelated data points (and thus less reliable)?
Consistent. Users will typically interact with a model over and over again. Ideally, they should learn more about the model over time and make better decisions as a result. This is a major theme in the PED paper, since it focuses on debugging models through repeated explanations. An explanation algorithm is consistent if each successive explanation helps the user to better understand later predictions. Perhaps more importantly, the user should never perceive a contradiction between different explanations. For example, if a feature increases a risk prediction for one data point but decreases it for another data point, the explanations should include enough information for the user to understand why.
Engaging. The PED paper notes that an explanation should encourage a user to pay attention to the important details, and they point to research showing (not surprisingly) that users who pay more attention to explanations become familiar with the model faster and ultimately make better decisions. Some of this is covered by the earlier properties: If the explanations aren’t concise, the users’ eyes will glaze over. If they aren’t faithful and consistent, the users will be distracted by the inconsistencies. But there’s more to being engaging than just those properties, and it should be considered separately.
This last property has a lot to do with how the explanation is presented rather than the explanation itself, so you could argue that it’s more of a User Experience (UX) issue. But in case you didn’t notice, this whole post has been secretly about UX – how users interact with an interface that happens to have an ML model behind it. Interpretability is as much a UX problem as it is a machine learning problem, which is part of what makes it both so interesting and so difficult.