Peter Hase
UNC Chapel Hill

Owen Shen
UC San Diego

With thanks to Robert Kirk and Mohit Bansal for helpful feedback on this post.

Introduction

Model interpretability was a bullet point in Concrete Problems in AI Safety (2016). Since then, interpretability has come to comprise entire research directions in technical safety agendas (2020); model transparency appears throughout An overview of 11 proposals for building safe advanced AI (2020); and explainable AI has a Twitter hashtag, #XAI. (For more on how interpretability is relevant to AI safety, see here or here.) Interpretability is now a very popular area of research: it drew the most video views of any area at ACL last year, and it is mainstream enough that there are books on the topic and corporate services promising it.

So what's the state of research on this topic? What does progress in interpretability look like, and are we making progress?

What is this post? This post summarizes 70 recent papers on model transparency, interpretability, and explainability, limited to a non-random subset of papers from the past 3 years or so. We also give opinions on several active areas of research, and collate another 90 papers that are not summarized.

How to read this post. If you want to see high-level opinions on several areas of interpretability research, just read the opinion section, which is organized according to our very ad-hoc set of topic areas. If you want to learn more about what work looks like in a particular area, you can read the summaries of papers in that area. For a quick glance at each area, we highlight one standout paper per area, so you can just check out that summary. If you want to see more work that has come out in an area, look at the non-summarized papers at the end of the post (organized with the same areas as the summarized papers).

We assume readers are familiar with basic aspects of interpretability research, i.e. the kinds of concepts in The Mythos of Model Interpretability and Towards A Rigorous Science of Interpretable Machine Learning. We recommend looking at either of these papers if you want a primer on interpretability. We also assume that readers are familiar with older, foundational works like "Why Should I Trust You?: Explaining the Predictions of Any Classifier."

Disclaimer: This post is written by a team of two people, and hence its breadth is limited and its content biased by our interests and backgrounds. A few of the summarized papers are our own. Please let us know if you think we've missed anything important that could improve the post.

Master List of Summarized Papers

Our Opinions by Area

  • Theory and Opinion

    • There has been good progress in the theory underpinning explainability. Insights from psychology, linguistics, and philosophy have helped authors understand slippery terms like "explanation" in the context of AI. This research has yielded new or clearer concepts to work with, e.g. simulatability, plausibility, (aligned) faithfulness, and (warranted) trust.

      We have seen strong arguments for developing explainable AI with special focus on the social nature of explanations, involving the use of mental models, models of the self, and theory of mind.

      Several papers have called for work on interpretability to be more strictly scientific, i.e. by asserting falsifiable hypotheses and measuring properties of explanation systems rather than proceeding solely with non-systematic qualitative analysis. These calls are made in response to trends essentially in the opposite direction.

      Theoretical work in the area has heavily influenced some subsequent work in evaluation and methodology, though a substantial fraction of papers tend to re-introduce desiderata or key considerations relevant to their methodology rather than directly drawing on prior theoretical work. Some methods papers do explain their goals with the latest terminology but use the terms in only the loosest of senses. There is plenty of ongoing debate about the terms and concepts currently in use in the field and the ultimate purposes of interpretation and explanation methods.

  • Evaluation

    • There are now many, many ways to evaluate explanations, including procedures for evaluating explanations of arbitrary format. The approaches include both automatic procedures and human study designs, and the bulk of the work has focused on feature importance estimates. We are excited by many of the approaches, particularly those assessing whether explanations improve human-AI team performance at a task that is hard for either humans or AI alone.

      One trouble here is that there are so many evaluation procedures, it can be hard for methods papers to choose which to use. It at least seems like each evaluation procedure is equally likely to be used in any given methods paper (though there is a noticeable preference for automatic methods over human studies). We imagine this trend arises partly from the following situation: (1) there is not a common understanding of which explanation procedures answer which research questions; (2) methodologies are introduced without sufficiently precise research questions in mind. (Alternatively, papers can truly need their own new evaluation schemes, because they are answering new questions.)

      Here's an example of the above situation. Within the subarea of feature importance estimation, there is a lot of confusion over what the actual object of explanation is. Several research questions present themselves: should feature importance estimates explain the role of features in (1) the behavior of a particular trained model with fixed weights, or (2) the behavior of trained models obtained by a stochastic training procedure involving a model family and dataset, or (3) the solvability of a task, either in theory or with respect to a given training procedure, model family, and dataset? Each research question stems from a fundamentally different goal, but papers rarely distinguish between them. Do we want to learn about a given model, a family of models, or the true nature of a task? There is not yet a clear and commonly accepted set of evaluation procedures suited for each of these questions which papers on feature importance estimation can readily use. The result is that the literature is not nearly as cumulative as it could be. For any given research question, it is hard to find one-to-one comparisons between more than a couple papers which would help you tell which methods are well suited to answering the question.

      A similar situation arises with evaluations comparing generated explanations to human explanations. Many papers compare model explanations to human "gold label" explanations for data points. This is an evaluation for plausibility but not faithfulness, and this practice may reward deceptive explanation methods (since this is rating explanations by how convincing they are, but not by how truthful they are). But you could also be comparing your explanations to human explanations to figure out if your model is reasoning in a similar way to how humans reason. In many cases, this is a goal for our models, so that seems good. (Of course, this assumes your explanations are actually faithful to the model's reasoning.) We see a lot of papers that use human explanations as their "gold label" to compare against while not even saying whether they aim to optimize for explanation plausibility or alignment between model and human reasoning.

      Lastly, qualitative analysis remains very popular throughout methods papers. It would likely be a marginal improvement to the field if some standards for qualitative analysis were more widely adopted, or someone wrote something good about what those standards should be. We do not mind "expert evaluation (by the author)" of the kind where the authors carry out some systematic qualitative coding regarding their method performance, but this quickly looks less like standard qualitative analysis and more like a measurable outcome.

  • Methods (General Commentary)

    • I'll give just three points here.

      First, at a high level, there has been clear progress in methodology. There are cases where we can (1) identify concepts that certain neurons represent, (2) find feature subsets that account for most of a model's output, (3) find changes to data points that yield requested model predictions, (4) find training data that influences individual test time predictions, and (5) generate natural language explanations that are somewhat informative of model reasoning.

      Second, it seems like every sub-field of ML has its own stream of research, and they often don't cite one another. There's a cluster of work in ICML/ICLR/NeurIPS, a cluster in NLP conferences, and a cluster in CV conferences, and they often don't cite extremely similar methods or evaluation procedures from other subareas. Of course, these days finding all the papers relevant to your work is a daunting problem in its own right, but the literature seems much less connected and cumulative than it should be.

      Third, it is difficult to assess which other methods a paper should compare to when presenting its own new method. What if you say that a goal of your feature importance estimation method is "model debugging"? There are dozens of feature importance estimation methods you could compare to, but you could also compare with counterfactual generation methods. Arguably every future paper with "model debugging" as a goal of their method should also compare to CheckList, a simple but extremely effective unit-testing approach, regardless of the category their method falls into. Yet this would require expensive (expert) user studies. It is much easier to use automatic feature importance evaluations and forget about other approaches to model debugging. Right now there is a serious trade-off between ease of evaluation and breadth of comparison for explanation methods, which is hampering comparison with relevant alternative approaches.

  • Methods: Estimating Feature Importance

    • Most of my thoughts on these methods are encapsulated in the examples in the opinion sections above or represented in existing opinion papers. There are dozens of these methods, and they have a great variety of purported purposes.
  • Methods: Interpreting Representations and Weights

    • In this area we are most excited by approaches that find a map between vector representations and really clear-cut human concepts. A few examples in this section require additional human supervision over the concepts, but this seems like a worthwhile price to pay to know what the model is representing. Once we know which concepts are being represented, we can work on understanding how the model uses them.
  • Methods: Generating Counterfactuals and Recourse Procedures

    • For people who are not satisfied with how they are treated by a machine learning system, "recourse" is a reasonable plan that they could follow to change how the system handles them for the better. A common example is automated loan approval, where someone might dispute whether they were reasonably denied a loan by a ML system. This area of research feels particularly important both because providing recourse is good and because working with counterfactuals is key to building a good causal model of a machine learning system. Everyone interacting with machine learning systems in the future should hope that good recourse methods will be available (and put into practice by people using ML in the world — which is another concern). Meanwhile, counterfactual generation will be a core part of understanding model errors in complex domains. A key part of answering questions about what causes model behavior is being able to identify the right counterfactual situations and see how the model handles them.
  • Methods: Explanation by Examples, Exemplars, and Prototypes

    • This area enjoys some popularity due to how people like explaining things with examples and reasoning by analogy to historical data. This seems like a good approach to consider alongside other methods, but we would like to see more systematic evaluations out of papers in this area.
  • Methods: Finding Influential Training Data

    • I think this will be an increasingly valuable style of explanation, especially as training datasets grow rapidly in size and are collected with imperfect screening heuristics. These methods lay the basis for explaining, debugging, and debiasing model behavior, as they can be combined with approaches to making quick adjustments to trained models (to roll back the influence of bad training data, for example).
  • Methods: Natural Language Explanations

    • We think this is a critical subarea for AI safety. While we are making headway by means of visualizations of model reasoning, eventually natural language will be a complementary or preferable medium for communicating model reasoning. Beyond simply being a flexible medium for communication, methods for natural language explanation will lay the groundwork for more interactive, dialogue-based interaction with systems for interpretability-related purposes. However, this area still has basic unsolved problems in methodology and evaluation which merit much more work. For instance, it is not fully clear how we would train models to truthfully reveal their internal reasoning to people via natural language, even in a particular task domain. An interesting related direction is to provide explanations in formal languages, rather than natural ones, which could offer some advantages over natural language (along with some notable trade-offs).
  • Methods: Developing More Easily Interpreted Models

    • This is an interesting and potentially useful area of research. So far, it seems like training models on vast amounts of visual and linguistic data leads to the models learning many crisp human-interpretable concepts (uncovered upon inspection). Is this a guaranteed property of training models on natural or human-generated data? How many of the concepts these models represent are not easily interpreted, for whatever reason, and can this situation be avoided?

      This area also includes strong skeptics of explaining deep learning models, which can seem both fair when considering all the shortcomings of the explainability literature and unfair when considering all the clear progress we've seen the past few years.

      Overall, we think this is a promising area, but we are also not convinced work from this area will ultimately influence how the highest-performing ML systems will be developed.

  • Methods: Robust and Adversarial Explanations

    • There are a few distinct things happening in this area. A few exceptional contributions involve work on (1) robustness to distribution shifts, i.e. explaining models in a way that explanations are still faithful even under distribution shifts, (2) deceptive models, where a specially trained model can deceive an explanation procedure into thinking it is not heavily biased with respect to certain features, when in fact it is, and (3) searching for data points that yield logically inconsistent model explanations under some explanation procedure, which is a pretty unsatisfactory state to end up in. These are all clearly important solutions/problems.

      I've seen a few papers suggest that explanations for model predictions should be robust to small input perturbations in principle. This seems like a mistake, likely one of valuing plausibility over faithfulness. If model behavior is influenced by small perturbations, then explanations should describe this phenomenon, not smooth it over.

      This area feels especially important going forward, and we would particularly like to see more work on how models might adversarially or accidentally hide information from being revealed by explanation procedures.

  • Explaining RL Agents

    • To date, most of the work here has been applying approaches from the Feature Importance Estimation and Counterfactual Generation literature, though there is also an interesting line of work focusing on causal models of agent behavior. Some interesting early results have emerged regarding the kinds of explanations that help users build better mental models of agents, but so far this area is so new that it remains to be seen what the most promising approaches are. Many of the concerns in the above Evaluation section also apply here.

      I have come across surprisingly few papers in this area relative to its importance. There appear to be important questions unique to explaining agents (rather than classifiers). For instance, explaining agents' behaviors will require special consideration of actions' temporal dependence, agent "plans", and epistemic vs. instrumental rationality. And the whole exercise will be complicated by multi-agent scenarios. This area really merits a lot more work, and for people interested in long term AI safety and existential risks, this is plausibly the most important subarea of interpretability research.

  • Interpretability in Practice

    • Eventually interpretability techniques have to get used to make any difference in the world. This could require interpretability interfaces that are interesting to explore or corporate/public policy mandates for systems passing certain transparency tests. This section covers a bit of both. If interpretability techniques were so helpful that researchers actually relied on them in their day-to-day research for solving problems, that would be great. But it seems we are not there yet.

Paper Summaries

Theory and Opinion (5)

  • Section Highlight: Explanation in Artificial Intelligence: Insights from the Social Sciences
    • 2018

    • This paper is a (long) survey of research on explanations coming from philosophy and the social sciences, and it communicates important results from these fields and comments on connections with XAI.

      To begin with the terminology: drawing directly from past works, the author defines an explanation as "an answer to a why-question" and interpretability as "the degree to which an observer can understand the cause of a decision." They "equate 'interpretability' and 'explainability.'"

      The paper communicates four key research findings:

      (1) Explanations are contrastive and invoke counterfactual cases (even if implicitly). Most authors in the area argue that all why-questions ask for contrastive explanations, even if the "foil" to the fact of the situation is not explicitly stated. Questions such as "why did she close the door?" have an infinite number of possible implicit foils, and someone offering an explanation will implicitly rely on one as they explain the event (e.g., why did she close the door rather than leave it open?) One natural choice of foil is the more expected event, relative to the observed event.

      (2) Explanations are given by agents after they are selected from a set of possible explanations, according to certain preferences. These preferences include (at least) coherence, simplicity, generality, soundness, and completeness — which are often in tension with one another. Following conversational norms, explanations are often restricted to be relevant to a particular question and only give what is necessary to answer the question. The abnormality of cited causal factors is an important aspect too, even more important than our confidence in the causal links. For example, when explaining the famous Challenger space shuttle explosion, people would not reference the oxygen in the air as an explanation, even though we are more certain that it was a necessary condition for the explosion than we are about the suspected cause, some faulty seals. Similarly, we typically do not seek explanations at all for unsurprising events.

      (3) Probabilities are not as important as causes. An example: if a student wants to know why they got a 50% on a test, they will not be satisfied if told that most people in the class got around a 50%. It would be better to explain why most students got around a 50%, but it would be best to explain why that particular student got a 50% (i.e. what caused the grade). In general, giving statistical statements could be helpful, but it is critical to give the causal factors underlying the statistical generalizations.

      (4) Explanations are social: the act of explaining is a transfer of information about an event's causal history between two people. Further, taking a conversational model of explanations helps us extend our understanding of conversation to the act of explanation, e.g. in applying Grice's maxims of communication and analyzing the role of linguistic markers which indicate perspective and manage listener expectations. Core to this view is the concept of theory of mind, since we now view explanations as communicative acts between an explainer and explainee designed to fill in gaps in the explainee's mental model. Implicature (leaving things implied rather than stated) is also relevant. The conversational framework can be extended to what is more fundamentally an argumentative framework, where the explainer is arguing for the explainee to draw certain conclusions. The author suggests this type of interactive format will be valuable in XAI regardless of the medium of communication.

      The paper ultimately concludes that researchers in AI should better heed the work on explanations from adjacent fields, particularly philosophy, psychology, and cognitive science. If we leave it almost exclusively to computer scientists to design explainable AI, we risk the result of "the inmates running the asylum" — i.e., a situation where "the very experts who understand decision-making models the best are not in the right position to judge the usefulness of explanations to lay users."

      The paper addresses a number of other interesting topics, including the possible levels of explanation (using Aristotle's four causes), the process by which people select or infer foils (e.g. by a notion of similarity in causal histories), and the relevance of folk psychology to AI (including the nature of social attribution of intentionality to AI systems), among others.

    • Artificial Intelligence

  • Chris Olah’s views on AGI safety
    • 2019
    • Evan Hubinger summarizes Chris Olah's views on how interpretability can help AI safety. Olah gives four main reasons. The first is the benefit of post-hoc interpretability as a sort of "mulligan", allowing us to query models for issues prior to deployment. The second is that if interpretability helps us better understand how models work, then we can use these insights to better inform model construction and design from the get-go. The third is that interpretability can be used as an additional means of feedback during training, to see errors as they arise. The fourth is the view that an interpretability-focused method of constructing AIs could lead to what he calls "microscope AI". Instead of the typical agent-based approach of taking actions in the world, Olah envisions microscope AIs as more like powerful knowledge generation tools where the actions are still taken by humans and not the model. Olah gives two points underlying this optimistic view of interpretability. One, good interpretability is possible even for very large models. Two, the current difficulty in interpreting large neural nets is due to model constraints; Olah believes that as models get larger, they will be able to express concepts in a less confused way. To improve interpretability research, Olah created Distill, an online journal that focuses on papers which clarify and interpret. He acknowledges that interpretability research will likely also speed up capabilities research, but he is betting that gains from improved model design will win out over other automated approaches, which would be net-positive, even with the speedup.
    • Alignment Forum
  • Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?
    • 2020
    • The authors discuss two key concepts under the interpretability umbrella, faithfulness and plausibility: "Plausibility refers to how convincing the interpretation is to humans, while faithfulness refers to how accurately it reflects the true reasoning process of the model." Besides arguing that faithfulness is a preferable quality to plausibility, the authors make a few key points regarding current work on interpretability: (1) Faithfulness evaluations should not involve human ground-truth explanations. (2) Claims about "inherent interpretability" do not exempt methods from evaluation, and claims about interpretability should not be accepted until suitable evidence is provided. (3) In human-AI teams involving model explanations, increased team performance may not reflect explanations' faithfulness but instead a correlation between plausibility and model performance. They also tease out assumptions present in prior work, e.g. that "two models will make the same predictions if and only if they use the same reasoning process," which has been employed in proofs by contradiction against an explanation procedure's faithfulness. Finally, they urge a practice of interpretability research "that allows us the freedom to say when a method is sufficiently faithful to be useful in practice."
    • ACL
  • The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?
    • 2020
    • The authors argue that we should stop using attention weights as explanations because information from different input tokens mixes in the hidden states, meaning that attention weights do not necessarily represent reliance on the part of the input that corresponds to their index. In other words, the attention weight on a given token in a sequence will only represent reliance on that token's representation at the very first layer of a network, and not necessarily at any layer thereafter, because representations deeper in the network are a function of the entire input sequence. Other issues with attention, as observed by prior work, include that attention is not a causal explanation in the sense that attention weights cannot be altered while keeping all else equal about a model's input and forward pass. They argue for the use of saliency methods instead, which are generally designed with the goal of feature importance estimation in mind. However, they note that "at least some of the saliency methods are not reliable and produce unintuitive results (Kindermans et al., 2017) or violate certain axioms (Sundararajan et al., 2017)." They also suggest that the feature space used for explanation, e.g. per-token representation, has so far been fundamentally limiting saliency methods, and they point to initial work on capturing feature interactions in explanations (Janizek et al., 2020).
    • EMNLP BlackboxNLP
  • Aligning Faithful Interpretations with their Social Attribution
    • 2020

    • This paper argues that the faithfulness condition for model interpretability is underdefined, reformalizes the notion of faithfulness, and proposes an explanation framework that better satisfies their reformalized faithfulness. On the first front, they clarify that model explanations are best understood as faithful if they attribute the correct causal chain to the model's decision process (described as causal attribution). A commonly desired feature of explanations, they observe, is not represented in this notion of faithfulness: that the causal chain is aligned with what is expected by people (described as social attribution). These concepts are applied in a case study of select-predict methods for text classification, which are composed of a selector module that extracts subsets of a text to pass to a text classifier for a final prediction. They find that existing select-predict methods actually produce selections (which are masks over the text) that can themselves be used to predict the label, relying on the masks alone and not the selected tokens. They describe this phenomenon as a special case of the general "Trojan explanation," where the explanation encodes information that influences model behavior in a way not naturally anticipated by people. The selection masks are faithful explanations, in the sense that we understand the role they play in the model, but when masks are predictive of the label, the select-predict method is not aligned. Specifically, it can be unaligned in two ways: (1) people expect the selections to be summaries that keep relevant information for the task while filtering out irrelevant information, or (2) people expect selections to be evidence in support of a prior decision, without having been a part of the decision. The select-predict methods violate both expectations in the same way, namely by influencing the decision of the prediction module by selecting tokens that favor a certain class.

      In response, the authors propose a select-predict-verify approach. They consider a special setting where a model makes a prediction using a full text, and people provide text highlights after the fact that they think should point the model to a counterfactual decision (distinct from the observed prediction). Their approach is to find the minimal selection that is a superset of the human's suggestion and leads the model to predict the human's expected class. This approach better satisfies the faithfulness and alignment conditions, while providing for specific downstream use cases: users can (1) check that when relying on the "correct" evidence, the model would reach the expected conclusion, (2) find what evidence would be needed to correct a model prediction, (3) explore whether people would interpret evidence similarly to the model, when they are uncertain about the true label and want to rely on the model for advice. Lastly, our interpretation of their discussion section is that the authors believe aligned faithfulness is distinct from simulatability because simulatability can be high if people learn how explanations relate to model decisions, even when the relationship does not match initial human expectations, which is a case where explanations are not aligned.

    • arxiv

Evaluation (9)

  • Section Highlight: Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction
    • 2020
    • The authors run an RCT to see how different model explanation approaches can help with human-in-the-loop prediction, as well as trust in the model. The prediction task is on the APPA-REAL dataset, which consists of over 7,000 face images and age labels. The experiment has two base conditions, one where users are asked to give an age prediction, and one where users are asked to give a prediction and are also shown the model's guess. The explanation groups were shown one of three explanations in addition to the model's output: a saliency map from the actual model (calculated with Integrated Gradients), a saliency map from a model trained on a modified dataset with spurious correlations, and a random saliency map. Before collecting data, the authors ran a two-tailed power analysis using prior guesses on the dataset. The experiment also varied the framing, with the following three modifications: (1) Delayed Prediction, which asked for a user's guess, showed the model output, and asked for a revised user guess; (2) Empathetic, which described the model's output in a personified way; and (3) Show Top-3 Range, which output an age interval. The experiment was conducted on Amazon Mechanical Turk with 1,058 participants. Overall, participants were more accurate at guessing people's ages when they had access to the model's guesses, but having explanations of the model outputs did not further improve their accuracy. The authors note that this is likely because explanations had little effect on user trust in the model's outputs. The trust that participants had in each model differed only slightly between conditions, regardless of whether explanations were the real saliency maps or randomly generated (there is a slight trend but it is not statistically significant). In fact, participants found explanations to be "reasonable" even when they focused on the background and not on the face. The authors give quotes from participants explaining their reasoning processes. One participant, for example, noticed that explanations could appear faulty but thought the model's guesses seemed reasonable otherwise, so they "sort of went with it."
    • arxiv
  • Comparing Automatic and Human Evaluation of Local Explanations for Text Classification
    • 2018
    • The author carries out both automatic evaluations and human forward simulation tests for explanation methods with text classifiers (a logistic regression and an MLP model). Methods include LIME, word omission, and gradient-based saliency. The automatic evaluation measures how a model's class score declines as tokens selected as important by each explanation method are removed from an input, culminating in the "area over the perturbation curve" (AOPC) (the class score differences are computed for removing words and then averaged). In the forward simulation test, workers are shown input movie reviews and asked to predict a model's predicted binary sentiment, while being shown explanations in the form of highlighted words in the text (without directional/class information). Word omission outperforms LIME and gradient saliency on AOPC; in the simulation task, gradient saliency achieves the highest simulation accuracy of 79% on one dataset, while word omission explanations yield 86.8% accuracy on another dataset.
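
      To make the word-deletion metric concrete, here is a minimal AOPC-style sketch; the predict_proba function is our own toy stand-in, not one of the paper's trained classifiers:

      ```python
      import numpy as np

      def predict_proba(tokens):
          """Toy stand-in for a trained text classifier's positive-class probability."""
          lexicon = {"great": 2.0, "good": 1.0, "boring": -2.0, "bad": -1.5}
          score = sum(lexicon.get(t, 0.0) for t in tokens)
          return 1.0 / (1.0 + np.exp(-score))

      def aopc(tokens, importance, k_max=5):
          """Average drop in class probability as the k most important tokens
          are deleted, for k = 1..k_max."""
          base = predict_proba(tokens)
          order = np.argsort(importance)[::-1]  # most important first
          drops = []
          for k in range(1, min(k_max, len(tokens)) + 1):
              removed = set(order[:k].tolist())
              kept = [t for i, t in enumerate(tokens) if i not in removed]
              drops.append(base - predict_proba(kept))
          return float(np.mean(drops))

      tokens = "a great movie with a boring ending".split()
      importance = np.array([0.0, 0.9, 0.3, 0.0, 0.0, 0.7, 0.2])  # e.g. from LIME or omission
      print(aopc(tokens, importance))
      ```
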
    • NAACL-HLT
  • Do explanations make VQA models more predictable to a human?
    • 2018
    • The paper presents a human subject experiment for evaluating the forward simulatability of a model given various explanation methods, using a Visual Question Answering task. They consider two simulation targets: the model's binary correctness, and its particular predicted output. They evaluate explanation methods including Grad-CAM, visualized attention weights, and an "instantaneous feedback" condition where no explanation is included, but the simulation target is revealed to the human subject after every response. They find that the explanation procedures do not yield statistically significant improvements in accuracy, while the instantaneous feedback procedure yields large improvements (a gain of 30 percentage points in simulation accuracy when predicting model outputs). Human performance on predicting the VQA model's correctness is not as high as an MLP trained to predict the VQA model's correctness using the VQA model's softmax layer's output as features (~80% accuracy), but the instantaneous feedback conditions are close, with around 75% failure prediction accuracy.
    • EMNLP
  • Sanity Checks for Saliency Maps
    • 2018
    • The authors propose two methods to validate saliency maps, an interpretability technique that visually highlights the regions of the input to which the model's output can be attributed. The authors point out that a good saliency map should be sensitive to both the actual model and the input labels; changing either of these should lead to a different map. Eight different saliency map techniques are evaluated: the Vanilla Gradient, Gradient ⊙ Input, Integrated Gradients, Guided BackProp, GradCAM, and SmoothGrad (plus two special cases). The authors run two experiments following their above conjecture. The first randomizes the last N layers' weights in the model, where N = 1 corresponds to only randomizing the last layer, and when N = model size, all weights are random. The reasoning here is that a good saliency map should be a function of the model, and not of just the input (e.g. acting like a model-agnostic edge detector). Comparison between the original saliency map and the new saliency map (on the randomized model) is done through visualizing both maps, as well as quantitatively via Spearman rank correlation, the structural similarity index measure, and the Pearson correlation of the histogram of gradients. In this first experiment, the authors find that the Vanilla Gradient is sensitive while Guided BackProp and Guided GradCAM show no change despite model degradation. The second experiment randomizes the labels of the input data and trains a new model. The reasoning is that saliency maps should also be sensitive to the relationship between inputs and labels in the data; outlining a bird in the image, for example, is not useful if the true label is "dog". The model is trained to at least 95% training accuracy and then the saliency maps are applied. Again, the Vanilla Gradient shows sensitivity. Integrated Gradients and Gradient ⊙ Input continue to highlight much of the same input structure. Both experiments were conducted on a variety of models and datasets, including Inception v3 trained on ImageNet, a CNN on MNIST and Fashion MNIST, an MLP trained on MNIST, and Inception v4 trained on skeletal radiograms.
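
      As a rough illustration of the model-randomization test (not the authors' code), here is a sketch that re-initializes layers of a small untrained toy network and compares vanilla-gradient saliency maps via Spearman rank correlation:

      ```python
      import torch
      import torch.nn as nn
      from scipy.stats import spearmanr

      def vanilla_gradient(model, x, target):
          """Saliency = |gradient of the target logit w.r.t. the input|."""
          x = x.clone().requires_grad_(True)
          model(x)[0, target].backward()
          return x.grad.detach().abs().flatten()

      torch.manual_seed(0)
      model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))  # toy stand-in
      x = torch.randn(1, 784)  # stand-in for an MNIST image
      saliency_before = vanilla_gradient(model, x, target=3)

      # Cascading randomization: re-initialize layers from the top down and check
      # whether the saliency map changes (a good method should be sensitive to this).
      for layer in [model[2], model[0]]:
          nn.init.normal_(layer.weight, std=0.1)
          nn.init.zeros_(layer.bias)
          saliency_after = vanilla_gradient(model, x, target=3)
          rho, _ = spearmanr(saliency_before.numpy(), saliency_after.numpy())
          print(f"rank correlation after randomizing another layer: {rho:.3f}")
      ```
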
    • NeurIPS
  • A Benchmark for Interpretability Methods in Deep Neural Networks
    • 2019

    • From Alignment Newsletter #101:

      This paper presents an automatic benchmark for feature importance methods (otherwise known as saliency maps) called RemOve And Retrain (ROAR). The benchmark follows the following procedure:

      1. Train an image classifier on a dataset (they use ResNet-50s on ImageNet, and get about 77% accuracy)

      2. Measure the test-set accuracy at convergence

      3. Using the feature importance method, find the most important features in the dataset, and remove them (by greying out the pixels)

      4. Train another model on this new dataset, and measure the new test-set accuracy

      5. The difference between the accuracy in (4) and in (2) is the measure of how effective the feature importance method is at finding important features

      The idea behind retraining is that giving the original classifier images where many pixels have been greyed out will obviously result in lower accuracy, as they're out of the training distribution. Retraining solves this problem.

      They benchmark a variety of feature importance methods (Gradient heatmap, Guided backprop, Integrated gradients, Classic SmoothGrad, SmoothGrad^2, VarGrad) on their benchmark, and compare to a random baseline, and a Sobel Edge detector (a hard-coded algorithm for finding edges in images). Only SmoothGrad^2 and VarGrad (which are both methods which ensemble other feature importance methods) do better than random. They can't explain why these methods perform better than other methods. They also note that even when removing 90% of the pixels in every image (i.e. the random baseline), the accuracy only drops from 77% to 63%, which shows how correlated pixels in images are.
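
      To give a rough feel for the procedure, here is a toy remove-and-retrain loop using scikit-learn on a small digits dataset, with mean absolute logistic-regression coefficients as a crude global stand-in for a saliency method (the paper trains ResNet-50s on ImageNet):

      ```python
      import numpy as np
      from sklearn.datasets import load_digits
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      X, y = load_digits(return_X_y=True)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      def train_and_eval(X_tr, y_tr, X_te, y_te):
          clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
          return clf, clf.score(X_te, y_te)

      # Steps 1-2: train a classifier and measure test accuracy.
      clf, base_acc = train_and_eval(X_tr, y_tr, X_te, y_te)

      # Step 3: rank features by an importance estimate and "remove" the top 30%
      # by greying them out (here: setting them to the training mean).
      importance = np.abs(clf.coef_).mean(axis=0)
      top = np.argsort(importance)[::-1][: int(0.3 * X.shape[1])]
      X_tr_m, X_te_m = X_tr.copy(), X_te.copy()
      X_tr_m[:, top] = X_tr[:, top].mean()
      X_te_m[:, top] = X_tr[:, top].mean()

      # Steps 4-5: retrain on the modified data; the accuracy drop (relative to a
      # random-feature baseline) measures how informative the removed features were.
      _, roar_acc = train_and_eval(X_tr_m, y_tr, X_te_m, y_te)
      print(f"accuracy before: {base_acc:.3f}, after remove-and-retrain: {roar_acc:.3f}")
      ```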

    • NeurIPS

  • Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
    • 2020

    • From Alignment Newsletter #100:

      In this paper the authors perform user tests on 5 different model-agnostic interpretability methods: LIME, Anchor, Decision Boundary, Prototype Model, and a Composite model (LIME, Anchor, and Decision Boundary). The use cases they test are a tabular dataset predicting income, and a movie-review dataset predicting sentiment of the review from a single sentence.

      Their experimental setup consists of 2 tests: forward prediction and counterfactual prediction. In forward prediction, the user is shown 16 examples of inputs and corresponding outputs and explanations, and then must predict the model’s output on new inputs (without the explanation, which often gives away the answer). In counterfactual prediction, after seeing 16 examples, the user is given an input-output-explanation triple, and then must predict how the output changes for a specific perturbation of the input.

      Throughout the results they use a significance threshold of p < 0.05 (they don't use Bonferroni corrections). Their study has responses from 32 different students who'd taken at least 1 computer science course, with some screened out for outliers or low accuracy during training. There are approximately 200 individual predictions for each method/dataset-type combination, and each method/prediction-type combination.

      Overall, their results show that only LIME (Local Interpretable Model-agnostic Explanation) helps improve performance with statistical significance on the tabular dataset across both prediction settings, and only the Prototype model in counterfactual prediction across both datasets. No other result was statistically significant. The improvement in accuracy for the statistically significant results is around 10% (from 70% to 80% in the Tabular dataset with LIME, and 63% to 73% for Prototype in counterfactual prediction).

      They also showed that users' ratings of the explanation method didn't correlate in a statistically significant way with the improvement the method gave to their predictions.

    • ACL

  • ERASER: A Benchmark to Evaluate Rationalized NLP Models
    • 2020
    • This paper proposes a benchmark for rationales of text classification models, where a rationale is a binary mask on the input (i.e. a "highlight" of important words in the input). The benchmark is a collection of existing text datasets, most of which have human annotations for what the "important" words in data points are. They suggest measuring three aspects of model rationales: (1) their agreement with human rationales, (2) their comprehensiveness, defined as the change in a model output by masking out important words (relative to original input), and (3) their sufficiency, defined as the change in model output by keeping only the important words (relative to original input). They evaluate simple methods as baselines for future work.
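
      A minimal sketch of the comprehensiveness and sufficiency measures; predict_proba is a toy stand-in for a trained classifier's confidence in its predicted class:

      ```python
      import numpy as np

      def predict_proba(tokens):
          """Toy stand-in for a model's probability of its predicted class."""
          lexicon = {"thrilling": 2.0, "dull": -2.0, "fine": 0.5}
          score = sum(lexicon.get(t, 0.0) for t in tokens)
          return 1.0 / (1.0 + np.exp(-score))

      def comprehensiveness(tokens, rationale_mask):
          """Confidence drop when the rationale tokens are removed (higher is better)."""
          kept = [t for t, m in zip(tokens, rationale_mask) if not m]
          return predict_proba(tokens) - predict_proba(kept)

      def sufficiency(tokens, rationale_mask):
          """Confidence drop when only the rationale tokens are kept (lower is better)."""
          kept = [t for t, m in zip(tokens, rationale_mask) if m]
          return predict_proba(tokens) - predict_proba(kept)

      tokens = "a thrilling plot with a fine cast".split()
      mask = [0, 1, 0, 0, 0, 0, 0]  # rationale highlights "thrilling"
      print(comprehensiveness(tokens, mask), sufficiency(tokens, mask))
      ```
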
    • ACL
  • On quantitative aspects of model interpretability
    • 2020

    • This paper proposes a few quantitative metrics for explanation methods, which they hope will be used for method development and selection before final testing via human studies. They give separate metrics for feature extraction, feature attribution, and example-based methods. The metrics themselves are not particularly novel: the feature extraction metrics focus on mutual information between the extracted features and the input or predicted class. With attribution methods, which assign scores to tokens, they test how the scores relate to model outputs under a variety of input ablation procedures (omitting certain features from the input). The metrics for example-based explanations are similar to some of those in the DiCE paper.

      There are a few interesting points in the paper. They evaluate several explanation methods using a known non-linear function, so that we know the true effect of features on the function outputs, and find that the Integrated Gradients method makes some mistakes which simpler gradient-based saliency methods do not (for a single function and data point). Commenting on the Remove-and-Retrain (ROAR) procedure of Hooker et al., they suggest that this procedure might better be viewed as data interpretation rather than model interpretation, since the ROAR scores for an explanation method will be highly dependent on whether a task/dataset is still solvable after certain features are removed. And they give an interesting desideratum for feature attribution methods: that predictions should be more precise when a given feature is known, relative to not knowing that feature, and that precision should correlate with the magnitude of the attribution score. They take care to distinguish this desideratum from other metrics which measure how knowing a feature moves predicted probabilities up or down, rather than the precision or confidence in predicted probabilities.

    • arxiv

  • Manipulating and Measuring Model Interpretability
    • 2021

    • In a large (n=3800) pre-registered study with high reputation MTurkers, the authors explore how two factors, number of features in a model and model transparency, relate to three outcomes: simulatability, deviation, and error detection (to be explained below). To illustrate each aspect of the experiments, consider their experimental protocol:

      Users are given eight features of an apartment for sale in New York City, and are asked to guess what a model will predict its selling price to be. The model is a linear model that uses either 2 or 8 of the 8 available features. Users are assigned into conditions using either the 2 or 8 feature model, and then further divided into a transparent condition where they see the model weights during the whole experiment or a blackbox condition where they never see the model weights. For each of the resulting four conditions, users are first asked to guess the model's prediction, and then they are shown the model's prediction and are asked to guess the true selling price of the apartment.

      Comparing their guesses of the model's prediction and of the true price allow the authors to derive their three outcomes. Here, simulatability is measured as the user's accuracy at predicting the model output. Deviation is measured as the extent to which the user deviates from the model prediction in making their own guess at the apartment's true selling price. Lastly, error detection is measured for a particular subset of the apartments where the model prediction is overtly incorrect (the apartment features are outlying, in these cases). Users are said to detect the error when they deviate greatly in the direction of the correct value, and otherwise to not have detected the error.

      Several results follow. First, users in the transparent conditions are better able to simulate the model than in the blackbox condition, and in both conditions it is easier to simulate the 2-feature model than the 8-feature one. Interestingly, users deviate from model predictions to the same degree in both transparent and blackbox conditions. And surprisingly, users detect model errors less frequently in the transparent condition than the blackbox condition. In further analysis of this last finding, the authors found that highlighting abnormal features of the apartments in the user interface effectively erased the difference in error detection between conditions, which they describe as "consistent with the idea that transparency can be overwhelming and cause users to overlook unusual cases."
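
      To make the three outcome measures concrete, here is a sketch computing them from made-up per-apartment data (the study's exact operationalizations and thresholds differ):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # Hypothetical data for 50 apartments in one condition of the study.
      model_pred = rng.normal(1.0e6, 2.0e5, size=50)          # model's predicted price
      user_sim = model_pred + rng.normal(0, 5.0e4, size=50)   # user's guess of the model's prediction
      user_own = model_pred + rng.normal(0, 8.0e4, size=50)   # user's own guess of the true price

      # Simulatability: how closely users can predict the model's output.
      simulation_error = np.abs(user_sim - model_pred).mean()

      # Deviation: how far users' own guesses move away from the model's prediction.
      deviation = np.abs(user_own - model_pred).mean()

      # Error detection (on apartments where the model is overtly wrong): did the
      # user move a large fraction of the way from the model's prediction toward
      # the true price, in the right direction?
      true_price = model_pred - 5.0e5  # pretend the model badly overshoots here
      toward_truth = np.sign(user_own - model_pred) == np.sign(true_price - model_pred)
      large_move = np.abs(user_own - model_pred) > 0.5 * np.abs(true_price - model_pred)
      error_detection_rate = (toward_truth & large_move).mean()

      print(simulation_error, deviation, error_detection_rate)
      ```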

    • CHI

Methods

Estimating Feature Importance (10)

  • Section Highlight: Neuron Shapley: Discovering the Responsible Neurons
    • 2020

    • From Alignment Newsletter #95:

      This paper presents a novel method, Neuron Shapley, that uses the Shapley value framework to measure the importance of different neurons in determining an arbitrary metric of the neural net output. (Shapley values have been applied to machine learning before to measure the importance of features to a model's output, but here the authors use them to calculate neuron importance.) Due to several novel approaches and optimizations in calculating these Shapley values, the top k most responsible neurons (k ~ 30) can be feasibly found for large networks such as Inception-v3.

      The authors demonstrate that finding these neurons enables the performance of model surgery. Removing the top 30 neurons that contribute to accuracy completely destroys the accuracy, whereas in expectation removing 30 neurons at random from the network barely moves the accuracy at all. Since the method can be applied to an arbitrary metric, this kind of surgery can be performed for other metrics we care about. For example, removing the neurons which are most responsible for vulnerability to adversarial attacks makes the network more robust, and removing the neurons most responsible for the class-accuracy imbalance (a fairness metric) makes the classes much more even, while only reducing the overall accuracy a small amount.
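
      To give a flavor of the approach, here is a toy Monte Carlo estimate of neuron Shapley values for a small random two-layer network, without the truncation and multi-armed-bandit speedups that make the real method tractable on large models:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      W1, W2 = rng.normal(size=(8, 5)), rng.normal(size=(5, 3))  # toy 2-layer network
      X = rng.normal(size=(200, 8))                              # toy evaluation inputs

      def metric(active_neurons):
          """Performance as a function of which hidden neurons are kept: here,
          agreement with the full network's predictions (a stand-in for accuracy
          or any other metric of interest)."""
          mask = np.zeros(5)
          mask[list(active_neurons)] = 1.0
          h = np.maximum(X @ W1, 0)
          full_pred = (h @ W2).argmax(axis=1)
          pred = ((h * mask) @ W2).argmax(axis=1)
          return (pred == full_pred).mean()

      def neuron_shapley(n_neurons=5, n_samples=200):
          """Monte Carlo Shapley estimate: average each neuron's marginal
          contribution over random orderings of the neurons."""
          values = np.zeros(n_neurons)
          for _ in range(n_samples):
              perm = rng.permutation(n_neurons)
              active, prev = set(), metric(set())
              for j in perm:
                  active.add(j)
                  cur = metric(active)
                  values[j] += cur - prev
                  prev = cur
          return values / n_samples

      print(neuron_shapley().round(3))
      ```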

    • NeurIPS

  • Anchors: High-Precision Model-Agnostic Explanations
    • 2018
    • The authors introduce Anchors, which are if-then rules over inputs, as an alternative to local linear explanations, with the premise that local model behavior can be highly nonlinear. An example Anchor explanation for a model prediction of positive sentiment for the sentence "This movie is not bad" is a probabilistic if-then rule: if the words "not" and "bad" are both present, the model predicts positive sentiment with high probability. Anchors are identified for particular model predictions by a PAC algorithm searching for rules over an input representation (like bag-of-words for text or pixels for images) that predict the observed model label with high confidence, using a local perturbation distribution around a particular input to get data for estimating rule accuracy. Relative to local approaches like LIME, the authors suggest that Anchors are easier to understand, have extremely clear coverage (whether they apply to the input or not), and are high precision by design (if an Anchor applies, confidence in the predicted result is high). They present results for models for text and tabular classification, structured prediction (a part-of-speech tagging task), image classification, and visual question answering (VQA). Their first evaluation is to get LIME and Anchor explanations for validation data in tabular classification tasks, then automatically apply them to test data and check if their suggested predictions match the model's predictions. They find that the Anchor predictions do indeed match model predictions at high (>90%) rates, though Anchor coverage is relatively low, applying to less than 30% of test data. They propose a submodular pick (SP) algorithm to efficiently cover the space of inputs with Anchors, and find that they can cover over half of the tabular data spaces with 10 explanations (though data spaces are low-dimensional). In a human simulation test with ML students using tabular and VQA data, they find that showing explanations (for validation data) to users can improve user ability to predict model behavior on test data, relative to a baseline condition without explanations. Precision jumps from 50-60% to 90%+ across conditions; simultaneously, Anchor users also become more conservative, making predictions for between 3 and 40 percent fewer instances (lower perceived coverage).
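
      To make the precision and coverage notions concrete, here is a toy sketch that evaluates a given anchor by sampling perturbations around an input; the model function is our own stand-in, and the paper's PAC search for good rules is not reproduced here:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def model(tokens):
          """Toy stand-in binary sentiment classifier (1 = positive)."""
          return int(("bad" not in tokens) or ("not" in tokens))

      def anchor_precision(tokens, anchor, n_samples=1000, mask_token="UNK"):
          """Estimate P(model(z) == model(x) | anchor words kept), sampling
          perturbations z that keep the anchor and randomly mask other words."""
          original = model(tokens)
          hits = 0
          for _ in range(n_samples):
              z = [t if (t in anchor or rng.random() < 0.5) else mask_token for t in tokens]
              hits += int(model(z) == original)
          return hits / n_samples

      def anchor_coverage(dataset, anchor):
          """Fraction of other inputs to which the rule applies at all."""
          return float(np.mean([anchor.issubset(set(x)) for x in dataset]))

      x = "this movie is not bad".split()
      anchor = {"not", "bad"}
      dataset = [x, "not a bad film".split(), "a bad film".split(), "a great movie".split()]
      print(anchor_precision(x, anchor), anchor_coverage(dataset, anchor))
      ```
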
    • AAAI
  • Explaining a black-box using Deep Variational Information Bottleneck Approach
    • 2019

    • The paper introduces an explanation approach that aims to select parts of an input that can be used to predict a blackbox model's output for the entire data point. The selection is made by their explainer model, which is optimized to trade off between making selections informative of the label and keeping selections brief. These goals are formalized via mutual information. Since this formulation is intractable to optimize for directly, they optimize a variational bound on this objective, which looks like maximizing the likelihood of the blackbox model's predictions under an approximator model, while regularizing the size of the selections and encouraging sparsity. That is, the explainer masks the input and the approximator predicts the blackbox model's output given the masked input. The explainer and approximator are jointly trained; masks given by the explainer are made differentiable by means of the Gumbel-Softmax estimator, which is used to select exactly k (a hyperparameter) elements from the input (a continuous approximation of an n-choose-k sample). Note that the mask elements used after the Gumbel-Softmax are still in the unit interval (0, 1). Lastly, note that the units of selection will include, for text data: words, groups of words, or sentences; and for images: squares of pixels.
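
      As a rough sketch of the kind of continuous n-choose-k relaxation involved (in the style of L2X; not necessarily this paper's exact estimator), here is a differentiable approximation of selecting k of n input elements with the Gumbel-Softmax:

      ```python
      import torch
      import torch.nn.functional as F

      def relaxed_k_hot(logits, k, temperature=0.5):
          """Differentiable approximation of sampling k of n positions: draw k
          independent Gumbel-Softmax samples over the n positions and take their
          elementwise maximum, giving a soft mask with entries in (0, 1)."""
          samples = [F.gumbel_softmax(logits, tau=temperature) for _ in range(k)]
          return torch.stack(samples, dim=0).max(dim=0).values

      torch.manual_seed(0)
      explainer_logits = torch.randn(10, requires_grad=True)  # one score per input chunk
      mask = relaxed_k_hot(explainer_logits, k=3)
      masked_input = mask * torch.randn(10)  # what the approximator would see
      print(mask.detach())
      ```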

      Their quantitative evaluations include models of biological data, MNIST, and IMDB sentiment analysis data. With the biological data, they identify a simple heuristic based on explanations of model predictions that they use to either accept or reject model predictions. It seems that they check whether explanations for test data "match" explanations from another dataset in particular situations, and if so, they accept the prediction; test accuracy is higher among accepted predictions than among rejected ones.

      For MNIST alone, they ask graduate students at CMU with a background in ML to rate explanations on a scale of 0 to 5, where 0 corresponds to "No explanation", the intermediate range to "Insufficient or redundant explanation" and 5 to "Concise explanation." Their method gets an average of 3.53 (next best: SmoothGrad, 3.45).

      For IMDB alone, they do a simulation test with MTurkers, where they ask users to predict the model output given the explanation only (using only correctly predicted data). With their method, users achieve 44.7% accuracy (next best, L2X: 35.6%, random is 33%).

      For both MNIST and IMDB, they measure "approximator fidelity" (approximator accuracy at predicting blackbox output) and "rationale fidelity" (approximator accuracy using hard masks rather than continuous masks). Using hard rather than continuous input masks does not greatly reduce the approximator's accuracy. They do not find any statistically significant gains in approximator fidelity over the most similar existing method, L2X, but they do tend to find increases in rationale fidelity of between 2 and 10 percentage points in most situations for both datasets.

    • arxiv

  • Weight of Evidence as a Basis for Human-Oriented Explanations
    • 2019
    • The authors examine how human explanations often focus on desiderata like contrastiveness (i.e. why X instead of Y?) which are often missing from existing interpretability approaches. They give a list of five desiderata (contrastive, modular and compositional, does not confound base rate with likelihood, exhaustive, and minimal) and then give an evaluation metric that satisfies all five. The authors describe the weight of evidence (WoE), defined as woe(h1 : h2 | e) = log [ P(e | h1) / P(e | h2) ], where e is the evidence observed and h1 and h2 are the two hypotheses under consideration. The authors use WoE in a meta-algorithm for multi-class explanation which iteratively finds the subset of classes with the greatest WoE, "explaining away" the other classes; this continues until only the predicted class is left. They apply this algorithm to the Wisconsin Breast Cancer dataset as well as MNIST. The authors give some visual examples where their algorithm identifies key parts of the input, but do not quantify their results.
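
      A minimal sketch of the weight-of-evidence quantity itself, on made-up likelihoods (the paper's contribution is the multi-class meta-algorithm built on top of it):

      ```python
      import numpy as np

      def weight_of_evidence(p_e_given_h1, p_e_given_h2):
          """woe(h1 : h2 | e) = log [ P(e | h1) / P(e | h2) ].
          Positive values mean the evidence e favors h1 over h2."""
          return np.log(p_e_given_h1 / p_e_given_h2)

      # Toy likelihoods of observing a feature e under two competing classes.
      print(weight_of_evidence(0.6, 0.2))  # e favors h1
      print(weight_of_evidence(0.1, 0.4))  # e favors h2
      ```
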
    • NeurIPS Workshop on Human-Centric Machine Learning
  • Interpretable Neural Predictions with Differentiable Binary Variables
    • 2019
    • In a select-then-predict format, the authors propose a masking model that restricts the parts of an input that a jointly trained text classifier (or regression model) has access to. The masking model parametrizes masking variables that take values in the closed unit interval, which are multiplied with the token input representations. The masking and task models are learned end to end via the reparameterization trick for a newly proposed random variable, the HardKuma. They use BiRNNs for the task model component; individual mask values are conditioned on the preceding mask values via an RNN. During training, a sparsity loss encourages masks to have expected sparsity close to a specified hyperparameter. In experiments on multi-aspect sentiment regression and sentiment classification tasks, they find that (1) their random variable outperforms a Bernoulli used with RL for learning in terms of accuracy per number of selected tokens, and (2) their unmasked/selected tokens are included in human-provided important-word highlights 80-98% of the time. In an experiment using the HardKuma for masking cross-attention weights in a model for NLI, they attain 8.5% non-zero attention weights at the cost of about 1 percentage point of task accuracy.
    • ACL
  • Evaluations and Methods for Explanation through Robustness Analysis
    • 2020
    • The authors propose a new definition of robustness which decomposes into two concepts: robustness in the space of features deemed important and robustness in the space of features deemed unimportant. Borrowing ideas from adversarial robustness, the authors propose a notion of feature importance based on how sensitive the label is to perturbations of the feature in question. They use a dual evaluation metric, which is to minimize the adversarial perturbation needed when applied to the relevant features and maximize the adversarial perturbation needed when applied to the irrelevant features. The authors propose two greedy methods to solve this feature importance problem. The first is to greedily select one feature at a time, locally improving the goal at every step. The second is to train a regression function based on random subsets of features to learn feature interactions. This helps address the situation where a combination of features is much more informative than any one feature in isolation. Their method of feature importance is then evaluated on ImageNet, MNIST, and a Yahoo! Answers text dataset and compared to several other baseline methods: vanilla gradient, integrated gradient, leave-one-out, SHAP, and black-box meaningful perturbation (BBMP). The authors find that their method outperforms the other methods on their proposed dual metric. Qualitatively, the authors show that on the image datasets, their method produces attributions that are visually less noisy. They conclude with a sanity check, randomizing the model's last set of weights and confirming their method is sensitive to these changes.
  • Adversarial Infidelity Learning for Model Interpretation
    • 2020
    • The authors introduce Model-agnostic Effective Efficient Direct (MEED), a new method for model-agnostic interpretability for instance-wise feature selection. MEED consists of an explainer, which uses a feature mask to select important features, an approximator (which uses the selected features to approximate the original model), and an adversarial approximator (which uses the inverse of the selected features to approximate the original model). The authors minimize the mutual information between the original model's output and the adversarial approximator's output. This model is evaluated on five datasets: IMDB sentiment, MNIST, Fashion MNIST, ImageNet for Gorilla vs Zebra, and mobile sensor data from a Tencent mobile game. The authors compare their method with LIME, kernel SHAP, CXPlain, INFD, L2X, and VIBI. They evaluate the feature importance by comparing the true model output with four different outputs: the output of the original model on only the selected features (and everything else set to 0), the output of an approximate model trained specifically on the selected features, the output of the original model on only the unselected features, and the output of an approximate model trained specifically on the unselected features. If the feature selection went well, then the true model output should tend to agree with the first two outputs and disagree with the last two. The authors also compare with human output, where users are given the selected features and asked to predict the output. Across all five tasks, the authors demonstrate that their feature selection method generally performs better than the other methods.
    • KDD
  • CausaLM: Causal Model Explanation Through Counterfactual Language Models
    • 2020

    • This paper proposes to explain the causal effect of features on model outputs, particularly high-level features that are not easily manipulatable in the input. In cases where the feature of interest, such as the use of adjectives in a text, can be easily omitted without influencing the remaining features in the input, this causal effect can be easily estimated as the difference in model outputs with and without this feature. But when altering or removing features violates the data generating process, e.g. by destroying the grammaticality of a text while trying to remove information about its topic, the causal effect of this kind of high-level feature cannot easily be estimated through a counterfactual data generation scheme.

      Lacking access to a generative model that can remove high-level features from text without influencing any other features, the authors propose to compare model outputs using the original data representations and counterfactual data representations. These counterfactual representations are obtained from a counterfactual model which is trained for the task at hand with an adversarial loss term encouraging it to forget the target concept (the one whose effect is being explained). This requires supervision at the instance or even feature-level. An example: to calculate the effect of the adjective part of speech on a text classifier, the authors train their counterfactual model to perform the task at hand while ensuring that the model's final representations do not contain any information about whether each word was an adjective or not (using word level part-of-speech supervision). Then the causal effect of the adjective PoS on the model output is the difference between the two models' outputs. The authors also make use of "control concepts" to check that this adversarial procedure is not removing other information that should be retained for the task at hand, though this requires even more supervision.

      In order to compare against "ground truth" causal effects, the authors evaluate their method on tasks where the causal effect of a feature on a model is easily estimated by generating counterfactual data. For instance, they rely on sentiment analysis data where people's first names can be substituted with one another to change the likely gender or race represented in the example. Here, the ground truth effect on the model output is computed by checking the difference in model outputs for inputs with names substituted for one another. They also evaluate the effect of adjectives as described above, although one confusing thing about this evaluation is that they automatically remove adjectives from sentences in order to generate counterfactual data, but the adjectives themselves are left in the data when training their counterfactual model (only PoS information is eliminated).

      Evaluations show that their method produces causal effect estimates close to the ground truth for each dataset, while baseline methods are highly inaccurate in some cases. Their baseline methods identify counterfactual points based on "passive observation" of features' presence/absence rather than causal intervention. They do not compare to any baselines based on perturbing data points, such as LIME. The authors suggest that Iterative Null Space Projection could serve as an alternative to the expensive counterfactual model training process.

    • arxiv

  • Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers
    • 2020
    • The paper proposes to learn text classifiers with a simple masking layer over tokens to reduce the number of tokens the classifier relies on, with the goal of making the model more amenable to applying interpretation methods. The masking layer is a lookup table that returns uncontextualized, token-specific masking probabilities; during training, binary per-token masks are sampled using these probabilities, and backpropagation is performed via the Gumbel-Softmax/Binary-Concrete estimator. Thus the work distinguishes itself from parametric masking layers, e.g. of Bastings (2019). The authors report accuracy improvements typically in the 0.5-1.0 point range on several text classification tasks, relative to unmasked baselines. In a test of local interpretability, they use LIME and SampleShapley to identify important words and find sometimes much higher AOPCs for their model than for baselines (AOPC: a measure of how class scores change when removing "important" tokens). In a test of global interpretability, they measure model robustness to applications of very sparse masks, by selecting globally important words for a task based on their look-up table probabilities in the masking layer. When a small number of words are selected (<10), their model predictions change with less frequency than a similar information-bottleneck approach (Schulz 2020).
    • arxiv
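
A rough sketch of the kind of uncontextualized masking layer described in this entry, assuming a PyTorch setting; the paper's sparsity regularizer and exact training details are omitted, and the class and parameter names are ours:

```python
import torch
import torch.nn as nn

class WordMask(nn.Module):
    """Token-specific masking layer: each vocabulary item gets its own mask
    probability from a lookup table, trained with a Binary-Concrete
    (Gumbel-sigmoid) relaxation of Bernoulli sampling."""
    def __init__(self, vocab_size, temperature=0.5):
        super().__init__()
        self.mask_logits = nn.Embedding(vocab_size, 1)  # per-token logit lookup table
        self.temperature = temperature

    def forward(self, token_ids, token_embeddings):
        logits = self.mask_logits(token_ids).squeeze(-1)          # (batch, seq_len)
        if self.training:
            u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log(1 - u)               # logistic noise
            mask = torch.sigmoid((logits + noise) / self.temperature)
        else:
            mask = (torch.sigmoid(logits) > 0.5).float()          # hard mask at test time
        return token_embeddings * mask.unsqueeze(-1), mask
```

During training the relaxed mask keeps gradients flowing into the lookup table; at test time the mask is thresholded, so globally unimportant words are simply zeroed out before the classifier sees them.
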
  • How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
    • 2020
    • This paper introduces a masking method that (1) bottlenecks the information available to a pretrained task model, and (2) masks inputs and hidden states to better understand how far information propagates into a model from the input. The objective they posit for the masking model is to find the sparsest mask that yields the same model output as the original model output (relaxed for gradient estimation purposes; a Binary Concrete variable is used for straight-through estimation). A motivating example for their approach is as follows: suppose a model could detect when sequences of numbers have more 8s than 1s in them. How would existing feature importance methods allocate importance for a model prediction on a sequence with two 8s and one 1? When the authors experiment with this kind of problem, they find that other information-bottleneck-style approaches assign unequal importance to the two 8 digits or give importance to irrelevant digits. Moreover, an exact search for the smallest input subset that results in the same model prediction yields just a single 8 (dropping the other 8 and the 1), because this retains the "more 8s than 1s" property. The authors take issue with all of these explanations, since none of them reflects how a model must actually reason about sequences as it solves the task. Their proposal is to allow masks to apply not just at the input but also in later layers. To see why this is useful, they first present a convincing analysis that in a simple two-module model for the toy task, one module decides whether digits are relevant (whether they're an 8 or a 1), and another module then counts and returns the result of the comparison. When they apply their masking approach to this model, the result is that the masks are not applied to the first module, where relevant digits have to be detected, but they are applied to the second module, where non-8 and non-1 digits may be safely masked out without influencing the model prediction (because they will not be counted). In a general form of this toy task, where a model must decide if there are more m digits than n digits for a given (m,n) query, they compare the difference in nats between normalized ground-truth importance attributions and those provided by prior approaches. The ground-truth importance attributions are uniform distributions over all the m and n digits in the sequence. This is the ground truth in the sense that each such digit contributes to the fact of whether there are more m than n digits, though not in the sense that each such digit is equally necessary to the binarized prediction of which digit is more numerous. Considering an input token to be masked when their masking model decides the task model's representation at that token index should be masked at any hidden layer in the forward pass, they find that their method essentially does not differ at all from these ground-truth attributions, while all other methods differ by a clear margin. Next, they scale up to experiments with more complex models on SST (sentiment classification) and SQuAD (question answering). Primarily qualitative analysis follows to compare their explanations with those of past works.
    • EMNLP

Interpreting Representations and Weights (5)

  • Section Highlight: Translating Neuralese
    • 2017

    • This paper proposes a method for translating vector communications between artificial agents into natural language. The motivation here is that, while Deep Communicating Policies (DCPs) are solving multi-agent communication games, it is difficult to analyze the content of agent messages. If we can translate the messages into semantically equivalent natural language statements, that would help us understand how the agents are behaving.

      This translation is made possible by collecting data of humans playing the same communication games as the agents. One game is a simple reference game, where one player describes an image to a second player, and the second player must pick which of two images the first player is describing. The second game is a simplified driving game where two cars must pass through an intersection without colliding when they cannot observe one another directly. So the authors collect data of humans playing these games.

      Translation is predicated on a particular notion of meaning. The authors choose to use a denotational perspective, suggesting that the meaning of a message is represented by the distribution over the speaker's state that is induced in the mind of the listener. This is instead of the pragmatic perspective, which would define meaning by the actions induced in the listener. They next propose a translation algorithm based on finding the minimum KL divergence between two "meaning distributions" induced by two communications. At a high level, the procedure measures the quality of a translation from a model's vector to some natural language utterance in terms of how frequently the utterance induces the same belief distribution over speaker states that the vector communication would (averaged across states and weighted by how likely it is the vector would be used in each state). Then, the best translation for a given message is the argmax of this quality measure (equivalent to the argmin distance between belief distributions induced by the messages).

      Experiments show that this procedure helps models and humans communicate during gameplay and solve the games together at above random rates, even close to model-to-model gameplay without translation. This evaluation is automated by rolling out a human game trajectory and substituting in a model for one of the humans, while the "human" side of the game does not actually listen to the model's messages. To simplify the problem, human utterances in these games are treated as categorical over a set of simple words or phrases that are typically used in the games. Interestingly, the authors find that a "direct translation" baseline, which is simply a model of p(vector|utterance) trained on data from states with both human and model communications, does equally well in terms of allowing model-human pairs to complete the games, but this baseline is much less denotationally accurate, as they find that these communications do not lead listeners to form accurate beliefs about the speaker's state.

    • ACL
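
A simplified sketch of the translation criterion from the entry above, assuming beliefs over speaker states have already been estimated for the vector message and for each candidate utterance (the paper's belief models and weighting scheme are more involved; names here are ours):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete belief distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def translate(z_beliefs, weights, candidate_beliefs):
    """z_beliefs[i]: belief over speaker states induced by vector message z in context i.
    weights[i]: how likely message z is to be used in context i.
    candidate_beliefs[u][i]: belief candidate utterance u induces in context i.
    Returns the utterance whose induced beliefs are closest (weighted KL) to z's."""
    def cost(u):
        return sum(w * kl(p_z, p_u)
                   for w, p_z, p_u in zip(weights, z_beliefs, candidate_beliefs[u]))
    return min(candidate_beliefs, key=cost)
```
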

  • Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
    • 2018

    • The authors propose a method for finding directions in a neural model's latent space that correspond to human concepts, like stripedness, and study the representations their method identifies for visual human concepts using GoogLeNet and Inception V3. The method itself is quite simple. First, the authors collect a small set of images depicting a concept, e.g. they collect images showing stripes or striped things (around 30 images in some experiments). Then, a linear model is trained to discriminate between the hidden states of the model when these images are passed through it and the hidden states obtained by passing randomly chosen images through the model. The randomly chosen images serve as a contrast set to the set representing the concept of interest. Finally, the Concept Activation Vector (CAV) is defined as the normal vector to the decision boundary, which points in the direction of the concept set's activations. Next, they define a score used for quantitative Testing with CAVs (the TCAV score). This score makes use of a directional derivative of a class score obtained for an image with respect to the CAV for a concept. Specifically, for a given class and a given CAV, the TCAV score is the proportion of images predicted as that class with a positive directional derivative with respect to the CAV. Hence the TCAV score ranges from 0 to 1, depending on what proportion of the images have positive directional derivatives. They intend this to capture how much a concept contributes to a model's prediction for a class (e.g., for images classified as a Zebra, how much did stripedness contribute?).

      Two extensions of the CAV occur throughout the experiments: rather than just getting one CAV with one set of contrastive random images, they get 500 CAVs using 500 random batches of contrastive images. They omit experimental results for when these 500 CAVs do not yield statistically significant TCAV scores, as determined by a t-test with a null hypothesis of TCAV=0.5. Additionally, the contrast sets are sometimes not random images, but instead selected with the intention to better isolate the concept of interest. For example, the contrast set for the stripedness concept set is the union of images representing dot and mesh textures.

      Armed with the TCAV score, the authors perform a number of qualitative and quantitative experiments. They rank order images by their similarity to a CAV, and display the results. They make Deep Dream style visualizations by maximizing an image's hidden states' similarity to selected CAVs. They validate a number of expected associations between classes and concepts, as well as biases in the models' training data. In one experiment, they aim to validate TCAV scores against ground truth model reliance on a particular concept, by finetuning the models on a data subset where they control the presence of an easy-to-detect feature in the data (inpainted class names as text in the images). With these controlled datasets, they obtain models that either do or do not rely on the easy-to-detect feature in the images, as indicated by their accuracy on test sets without the feature. Then, they compare their ground-truth measure of model reliance on the feature with obtained TCAV scores for the feature CAVs. Showing results for two classes, they find a correlation between a model's reliance on the feature and TCAV scores for the feature. Using these same controlled datasets, they also measure how MTurkers think of feature importance based on gradient-based saliency maps of image predictions. They find that the saliency maps often highlight the easy-to-detect feature even when it is not being used by the model, which seems to mislead the respondents into thinking it was important to the model. Lastly, they obtain domain expert feedback on using TCAVs for a model of diabetic retinopathy, suggesting that "TCAV may be useful for helping experts interpret and fix model errors when they disagree with model predictions."

    • ICML
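
The core CAV/TCAV computation is simple enough to sketch directly. Below is a rough version using scikit-learn for the linear probe, with our own function names; the paper additionally repeats this over many random contrast sets and applies a significance test:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def concept_activation_vector(concept_acts, random_acts):
    """Fit a linear probe separating concept activations from random-image
    activations; the CAV is the (normalized) normal to the decision boundary."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = SGDClassifier(loss="hinge", alpha=0.01).fit(X, y)
    cav = clf.coef_.flatten()
    return cav / np.linalg.norm(cav)

def tcav_score(class_grads, cav):
    """Fraction of class examples whose directional derivative of the class
    score is positive along the CAV. class_grads: (num_images, hidden_dim)
    gradients of the class score w.r.t. the chosen layer's activations."""
    return float(np.mean(class_grads @ cav > 0))
```
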

  • The Building Blocks of Interpretability
    • 2018
    • Distill
    • The authors explain how composing visualization techniques for image recognition neural nets can lead to improved attribution for outputs/activations. They explain three ways of attributing activations: the neuron level, the spatial (i.e. pixel) level, and the channel (i.e. layer) level. These attributions can be combined to result in visualizations that can be traced through the network as a combination of previous layers or neurons, which have been mapped to 2-d images using feature visualization. This is a major step up from previous attribution visualizations, which were often only a heatmap on a specific layer. The authors show that matrix factorization on the flattened matrix of activations can compress the network's behavior and show a succinct visualization for each class or spatial point. Using GoogLeNet, they build a user interface that allows one to explore connections between layers, and suggest that visual interfaces of this kind may be built to help understand layers, atoms (i.e. groupings of neurons), and content (activation or attribution). They end with the caveat that visualization alone may be unreliable, but their compositional method between layers is likely to still hold because it's less subject to path dependence on the specifics of the input.
  • Compositional Explanations of Neurons
    • 2020

    • From Alignment Newsletter #116:

      Network dissection is an interpretability technique introduced in 2017, which uses a dataset of images with dense (i.e. pixel) labels of concepts, objects and textures. The method measures the areas of high activation of specific channels in a convolutional neural network, then compares these areas with the labelled areas in the dataset. If there's a high similarity for a particular channel (measured by the intersection divided by the union of the two areas), then we can say this channel is recognizing or responding to this human-interpretable concept.

      This paper introduces an extension of this idea, where instead of just using the basic concepts (and matching areas in the dataset), they search through logical combinations of concepts (respectively areas) to try and find a compositional concept which matches the channel's activations. For example, a channel might respond to (water OR river) AND NOT blue. This is still a concept humans can understand (bodies of water which aren't blue), but enables us to explain the behaviour of a larger number of neurons than in the original network dissection method. Their work also extends the method to natural language inference (NLI), and they interpret neurons in the penultimate layer of a BiLSTM-based network trained to know whether a sentence entails, contradicts, or is neutral with respect to another. Here they create their own features based on words, lexical similarity between the two sentences, and part-of-speech tags.

      Using their method, they find that channels in image classifiers do learn compositional concepts that seem useful. Some of these concepts are semantically coherent (e.g. the example above), and some seem to have multiple unrelated concepts entangled together (e.g. operating room OR castle OR bathroom). In the NLI network, they see that many neurons seem to learn shallow heuristics based on bias in the dataset, e.g. the appearance of single words (like "nobody") that are highly informative about the classification.

      Finally, they use their method to create copy-paste adversarial examples (like in Activation Atlas (AN #49)). In the Places365 dataset (where the goal is to classify places), they can crudely add images which appear in compositional concepts aligned with highly contributing neurons, to make that neuron fire more, and hence change the classification. Some of these examples generalise across classifier architectures, implying a bias present in the dataset.

    • arxiv
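
A simplified sketch of the search for compositional concepts: starting from the single concept whose mask has the highest IoU with a channel's high-activation mask, greedily try AND / OR / AND NOT combinations. The paper uses beam search rather than this purely greedy variant, and the boolean mask inputs here are placeholders:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / max(float(union), 1.0)

def best_compositional_concept(channel_mask, concept_masks, max_ops=3):
    """Greedily build a logical formula over concept masks that best matches
    the channel's high-activation mask (concept_masks: name -> boolean array)."""
    best_name, best_mask = max(concept_masks.items(),
                               key=lambda kv: iou(channel_mask, kv[1]))
    best_score = iou(channel_mask, best_mask)
    for _ in range(max_ops - 1):
        candidates = []
        for name, mask in concept_masks.items():
            candidates.append((f"({best_name} OR {name})", np.logical_or(best_mask, mask)))
            candidates.append((f"({best_name} AND {name})", np.logical_and(best_mask, mask)))
            candidates.append((f"({best_name} AND NOT {name})", np.logical_and(best_mask, ~mask)))
        name, mask = max(candidates, key=lambda c: iou(channel_mask, c[1]))
        score = iou(channel_mask, mask)
        if score <= best_score:
            break
        best_name, best_mask, best_score = name, mask, score
    return best_name, best_score
```
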

  • LCA: Loss Change Allocation for Neural Network Training
    • 2021

    • From Alignment Newsletter #98:

      This paper introduces the Loss Change Allocation (LCA) method. The method's purpose is to gain insight and understanding into the training process of deep neural networks. The method calculates an allocation of the change in overall loss (on the whole training set) between every parameter at each training iteration, which is iteratively refined until the approximation error is less than 1% overall. This loss change allocation can be either positive or negative; if it's negative, then the parameter is said to have helped training at that iteration, and if it's positive then the parameter hurt training. Given this measurement is per-parameter and per-iteration, it can be aggregated to per-layer LCA, or any other summation over parameters and training iterations.

      The authors use the method to gain a number of insights into the training process of several small neural networks (trained on MNIST and CIFAR-10).

      First, they validate that learning is very noisy, with on average only half of the parameters helping at each iteration. The distribution is heavier-tailed than a normal distribution, and is fairly symmetrical. However, parameters tend to alternate between helping and hurting, and each parameter only tends to help approximately 50% of the time.

      Second, they look at the LCA aggregated per-layer, summed over the entire training process, and show that in the CIFAR ResNet model the first and last layers hurt overall (i.e. have positive LCA). In an attempt to remedy this and understand the causes, the authors try freezing these layers, or reducing their learning rate. The first layer can't be fixed (freezing makes its LCA 0, but later layers' LCA is increased in turn so the overall final loss stays the same). However, for the last layer, freezing or reducing the learning rate increases the overall performance of the network, as the last layer's LCA is decreased more than all the other layers' LCAs are increased. They also hypothesize that by reducing the momentum for the last layer, they can give it fresher information and make it more likely to learn. They find that this does work, though in this setting previous layers' LCA increases to compensate, leaving overall performance unchanged.

      Finally, the authors show that learning seems to be synchronised across layers; layers get local LCA minima at the same training iterations, in a statistically significant way. They show this must be a combination of parameter motion and the gradient, as neither on their own explains this phenomenon.

    • arxiv
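
A first-order sketch of the loss change allocation for a single training iteration, assuming PyTorch and a flattened parameter vector; the paper refines this approximation with higher-order integration (and applies it at every iteration over the whole training set), so this only illustrates the basic idea:

```python
import torch

def lca_one_step(model, loss_fn, full_train_batch, theta_before, theta_after):
    """Allocate the change in training loss between two consecutive parameter
    vectors to individual parameters as grad_i * (theta_after_i - theta_before_i),
    with the gradient taken at theta_before. Negative entries 'helped' training
    at this iteration; positive entries hurt."""
    torch.nn.utils.vector_to_parameters(theta_before, model.parameters())
    x, y = full_train_batch
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    grad_vec = torch.cat([g.reshape(-1) for g in grads])
    return grad_vec * (theta_after - theta_before)  # sums to approx. total loss change
```

Summing this vector over parameters in a layer (or over iterations) gives the per-layer and whole-training aggregations discussed above.
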

Generating Counterfactuals and Recourse Procedures (4)

  • Section Highlight: Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations
    • 2020

    • This paper introduces a method for generating Diverse Counterfactual Explanations (DiCE) of binary classification models of tabular data. The primary motivation is recourse, i.e. giving people feasible steps for achieving a desired outcome when a model makes an unsatisfactory decision about them. Use cases include offering recourse for credit risk assessment and loan approval decisions. They run experiments with datasets for these tasks.

      Given an original data point and a desired model output, their goal is to identify a set of data points that are (1) valid, meaning the model outputs the desired class, (2) proximate, meaning close to the original data point, (3) diverse, meaning they represent a variety of paths to achieving the desired outcome, (4) sparse, meaning few changes from the original data point are proposed, and (5) feasible, meaning they follow causal laws of the data generating process. To solve this problem, they formulate the first three objectives (validity, proximity, diversity) in differentiable terms, and perform a gradient-based optimization to identify the set of counterfactual points. Sparsity is encouraged in a post-processing step where, for a candidate counterfactual point obtained from the optimization, changes from the original point are greedily selected until the counterfactual achieves the desired model output. The causal feasibility is enforced in another simple filtering step, where candidates are filtered out based on violations of simple user-specified causal principles (e.g., if education increases, age must also increase; education cannot decrease).

      For neural models of the COMPAS recidivism dataset, an income-prediction dataset, and the credit risk assessment and lending approval datasets, they find that their approach successfully identifies sets of counterfactuals that reasonably satisfy automatic metrics for validity, diversity, and proximity, and they present qualitative examples. They also aim to measure how these kinds of explanations, consisting of one original datapoint and a set of counterfactuals, can communicate a model's local decision boundary. As proxies for humans trying to reason about a local decision boundary given an explanation, they fit a simple nearest neighbor model to one explanation at a time and evaluate how that model classifies data in a radius around the original data point. That is, they sample datapoints in a sphere around the original data point, and compute the accuracy of a 1-nearest-neighbor model that has look-up access to only the data points in the explanation. In this set-up, they find the simple 1-NN model achieves up to a 44 F1 score with k=4 counterfactuals. They suggest that examples from their method can "approximate the local decision boundary at least as well as local explanation methods like LIME."

    • ACM FAT
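
A rough sketch of a DiCE-style differentiable objective over a set of candidate counterfactuals, assuming PyTorch. The paper's diversity term is a determinantal-point-process quantity, replaced here by mean pairwise distance, and the post-hoc sparsity and causal-feasibility filters are omitted; all names are ours:

```python
import torch

def dice_style_loss(cf_points, x_orig, model, target_class,
                    lambda_prox=0.5, lambda_div=1.0):
    """Combine (1) validity: push the model's prediction on each counterfactual
    toward the target class, (2) proximity: stay close to the original point,
    (3) diversity: keep counterfactuals far from one another."""
    logits = model(cf_points)
    targets = torch.full((len(cf_points),), target_class, dtype=torch.long)
    validity = torch.nn.functional.cross_entropy(logits, targets)
    proximity = (cf_points - x_orig).abs().mean()
    pairwise = torch.cdist(cf_points, cf_points, p=1)
    diversity = pairwise.sum() / (len(cf_points) * (len(cf_points) - 1) + 1e-8)
    return validity + lambda_prox * proximity - lambda_div * diversity
```

In use, `cf_points` would be a tensor with `requires_grad=True`, optimized by gradient descent on this loss and then post-processed for sparsity and for user-specified causal constraints.
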

  • Counterfactual Visual Explanations
    • 2019
    • In this paper, the authors propose a method of generating counterfactual explanations for image models. A counterfactual explanation in this framework is a part of the input image that, if changed, would lead to a different class prediction. The authors formalize the minimum-edit counterfactual problem, defined as the smallest number of replacements between an input I (which the model classifies as label A) and another input I' (which the model classifies as label B) such that the model will predict class B for the newly edited input I. The actual edit is done by permuting I' and then replacing a subset of I with values from the permuted I'. Because the search space is too large to solve this problem exactly, the authors present two greedy relaxations of the problem. The first method is to iteratively look for the single edit which leads to the largest increase in log probability between the original and subsequent class predictions for class B. The second method relaxes the binary selection: instead of taking a direct subset of I' values (via the Hadamard product of a binary vector with I'), the selection is allowed to be a point on the simplex over all features in I'. Then, both the permutation and the subset coefficients are learned via gradient descent. These explanations are used on four datasets: SHAPES, MNIST, Omniglot, and Caltech-UCSD Birds (CUB). In all four cases, the explanation is generated from the last layer of the CNN used. The authors evaluate the explanations qualitatively by examining which regions from I and I' are permuted to form the new counterfactual image. In the shown examples, the counterfactual images are constructed via appropriate portions of I', for example a "1" from MNIST incorporating another spoke from a "4" to look more like it. The authors also evaluate the average number of edits needed to change the class label. The authors then used the counterfactual explanations from the CUB dataset to set up a training task where graduate students were tasked with learning how to classify images into one of two classes (which is not a trivial task). When participants got a choice wrong in the training phase, they were shown a counterfactual image. Their performance on the test phase was compared to two other baselines: students who were given no explanation (only right/wrong feedback) during training and students who were shown a GradCAM heatmap during training. The counterfactual image group had the highest accuracy, but this was not significant at the 90% confidence level against either baseline.
    • ICML
  • Explanation by Progressive Exaggeration
    • 2020
    • The authors "propose a method that explains the outcome of a classification black-box by gradually exaggerating the semantic effect of a given class." A resulting explanation is a series of altered images shifting from one class to another. Their method uses GANs as the underlying model for the generation of images; at each step, they make a change such that the model's probability of the desired class increases from the previous step. The authors run six experiments using two types of data: human faces and X-rays. Their evaluations include: qualitative analysis of explanations including identifying model biases / conflation of features, checking that statistics of altered images match those of real images receiving the same model output, and the effect on accuracy of corrupting pixels identified by their method as "important" to a class. They also run human studies where they see if MTurkers can identify the target attribute being explained based on the explanations; participant accuracy was from around 77% to 93% depending on the difficulty of the task.
    • ICLR
  • Counterfactual Explanations for Machine Learning on Multivariate Time Series Data
    • 2020
    • This paper considers the problem of finding counterfactuals for multivariate time series data, and returns explanations of the form: if feature X was not decreasing over time, this sequence would not be classified as Y. They formulate the problem as optimizing a model's score for a selected class for a particular data point, while substituting out entire feature trajectories in the data point (entire rows in an input), with substitutions being drawn from observed trajectories in the training data. They present an algorithm for this problem and evaluate their method using "three HPC system telemetry data sets and a motion classification data set." Their quantitative evaluations correspond with four principles: explanations should be (1) faithful to the original model, (2) comprehensible to human operators, (3) robust to small changes in the data point, and (4) generalizable to similar data points. They also qualitatively analyze explanations of correctly and incorrectly classified examples. For faithfulness in particular, they fit sparse logistic regression models, and evaluate their method against LIME and SHAP according to how well the "important" features selected by each method match the ground-truth used features in the logistic regression models. Both their method and SHAP obtain a precision of 1 and recalls of between .15 and .5, but they note that their method usually returns 1-3 important features while SHAP usually returns over 100.
    • arxiv

Explanation by Examples, Exemplars, and Prototypes (4)

  • Section Highlight: This Looks Like That: Deep Learning for Interpretable Image Recognition
    • 2019

    • The goal of this paper is to get image recognition models to imitate one way that humans can classify images, which is by comparing parts of a new image to prototypical parts of known images. The authors suggest that a model that can do this is "interpretable, in the sense that it has a transparent reasoning process when making predictions." They propose a model for this purpose with the following structure: A CNN maps an image to a representation of shape H x W x D, which can be thought of as a set of H*W vectors in R^D. For every class, there are 10 vectors in R^D that are prototype vectors for that class. For each prototype, a prototype activation is obtained by a function of the distance between that prototype and the nearest vector given by the CNN output, such that nearer vectors yield larger activations. A logit for each class is computed by a weighted sum of the prototype activations. By enforcing that prototype vectors correspond to some vector representation obtained from an actual training data point, they attain a model that makes predictions based on similarities in representations between a current data point and training data. To interpret how a model predicts a given class, the authors interpret the activations between an image's CNN output and the 10 prototype vectors for that class. The model is trained end-to-end with a step that constrains prototype vectors to be equal to representations from training data points.

      The model interpretations are visualized by localizing the part of the query image that highly activates each prototype, as well as the region in the prototype's training image that it represents. For a given prototype, heat maps over the query image are generated by taking the activation score of that prototype with all vectors in an image's CNN output, then upsampling the resulting activation grid to match the original image size. Finally, a bounding box is obtained by selecting the small rectangle containing the top 5% of upsampled activation scores (at the pixel level).

      With models trained for fine-grained image classification using the CUB dataset, the authors present an abundance of qualitative analysis of the model, including examples of classified data points and analysis of the latent space structure. The common pattern in the presented figures is that the image parts that most activate a given prototype reflect similar characteristics of the image: a bird with red feathers on its belly activates a certain prototype, and when bounding boxes are overlaid on the original image and the image from which the prototype comes, the bounding boxes cover the red bellies on the new image and the image the prototype came from. Another prototype might capture the bird feet, another the bird's eye, etc., each showing a bird in the training data where the prototypical part representation comes from. The authors note that, across models, a maximum drop of 3.5 percentage points in accuracy is observed between the prototype model and a blackbox model trained with the same CNN feature extractor. This drop in accuracy can be mostly ameliorated by ensembling several prototype models (though this does increase the parameter count and number of prototypes per class).

    • NeurIPS
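
A minimal sketch of the prototype layer described above, assuming PyTorch and a feature map laid out as (batch, H, W, D); the projection step that snaps prototypes onto actual training-patch representations, and the clustering/separation losses, are omitted:

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    """For each prototype, activation is a decreasing function of the distance
    to the nearest spatial vector in the CNN feature map; class logits are a
    weighted sum of prototype activations."""
    def __init__(self, num_classes, protos_per_class, dim, eps=1e-4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes * protos_per_class, dim))
        self.classifier = nn.Linear(num_classes * protos_per_class, num_classes, bias=False)
        self.eps = eps

    def forward(self, feature_map):                  # (batch, H, W, dim)
        b, h, w, d = feature_map.shape
        feats = feature_map.reshape(b, h * w, d)     # set of H*W vectors per image
        protos = self.prototypes.unsqueeze(0).expand(b, -1, -1)
        dists = torch.cdist(feats, protos)           # (batch, H*W, num_prototypes)
        min_dist = dists.min(dim=1).values            # nearest patch per prototype
        activation = torch.log((min_dist + 1) / (min_dist + self.eps))  # larger when nearer
        return self.classifier(activation), activation
```
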

  • Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning
    • 2018
    • In this paper, the authors explain a new machine learning model, Deep K-Nearest Neighbors (DkNN). The DkNN model takes as input a trained neural net, a number of neighbors k, and an input. Each point in the training set has its intermediate layer-wise results (when passed through the neural net) recorded. Then, during evaluation, the DkNN uses locality sensitive hashing to find the set of k training neighbors in each layer's latent space whose representations are closest to the input's. The authors then calculate the nonconformity of the (input, label) pair, where nonconformity is defined to be the number of values in the set of neighbors whose label does not agree with the output label. They also calculate an empirical distribution of nonconformity scores which are derived from a separate calibration set that comes from the same distribution as the training set. The DkNN then computes a probability for each class label defined to be the proportion of empirical nonconformity scores larger than the current label's. For a given input, the authors then define the model's confidence to be 1 minus the second largest class probability and the credibility to be the largest class probability (a rough sketch of this scoring follows the entry). The authors evaluate their model on MNIST, the SVHN house number dataset, and the GTSRB street signs dataset. In all three tasks, k is set to 75. The authors show that the DkNN outputs a lower average credibility for the notMNIST dataset, showing that for out-of-distribution samples, their model is better calibrated than the naive softmax probability. On adversarial examples for all three datasets, the authors show that the average accuracy of the DkNN is higher than the normal DNN, across three types of attacks (Fast Gradient Sign Method, Basic Iterative Method, and Carlini-Wagner L2 attack).
    • arxiv
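
A rough numpy sketch of the DkNN scoring step, using brute-force nearest neighbours in place of locality-sensitive hashing; array layouts and names are our own:

```python
import numpy as np

def dknn_scores(layer_reps, train_reps_per_layer, train_labels,
                calib_nonconformity, num_classes, k=75):
    """Nonconformity of (input, label) = number of k-nearest training neighbours
    (pooled across layers) whose label differs; the probability of a label is the
    fraction of calibration nonconformity scores at least as large."""
    neighbour_labels = []
    for reps, train_reps in zip(layer_reps, train_reps_per_layer):
        dists = np.linalg.norm(train_reps - reps, axis=1)   # brute force stand-in for LSH
        neighbour_labels.append(train_labels[np.argsort(dists)[:k]])
    neighbour_labels = np.concatenate(neighbour_labels)

    probs = []
    for label in range(num_classes):
        nonconformity = np.sum(neighbour_labels != label)
        probs.append(np.mean(calib_nonconformity >= nonconformity))
    credibility = max(probs)                 # largest class probability
    confidence = 1 - sorted(probs)[-2]       # 1 minus second-largest probability
    return np.array(probs), credibility, confidence
```
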
  • Interpretable Image Recognition with Hierarchical Prototypes
    • 2019
    • This paper proposes to use a taxonomic organization of classes with a prototype-based vision model, so that explanations are given for every taxonomic level of classification. The paper also integrates existing novel class detection methods to work within the hierarchical class structure. As an example, the goal is that, when the only kinds of weapons a model has seen during training are rifles and assault rifles, the model could classify a handgun as a novel object, then classify it as a weapon on the basis of similarities between the handgun and weapon prototypes. The model interpretation is done by showing heat maps over images that represent which patches of an image would yield representations closest to class prototype representations. After training models on a subset of ImageNet with a hand-defined taxonomy, the analysis is primarily qualitative, focusing on case studies of prototype representations and novel class identification. One quantitative analysis of the latent space shows that the nearest neighbors of class prototype representations are also members of those prototype's classes about 80% of the time (though the model achieves similar accuracy as a blackbox model).
    • AAAI-HCOMP
  • A Generic and Model-Agnostic Exemplar Synthetization Framework for Explainable AI
    • 2020
    • In this paper, the authors propose a method of generating examples of different classes when given a black-box model. The method requires access to a black-box model C and a generative model G. The authors use an evolutionary algorithm that allows them to efficiently generate exemplars to represent each class. At the start, the algorithm generates a population of t exemplars. At each iteration, the top k exemplars are kept, defined to be the top k inputs with the lowest squared difference between C's output and the desired class label y. Then, each exemplar is duplicated and mutated by adding a zero-centered Gaussian noise vector. The authors also add a momentum term which is a multiple of the previous vector added in the last iteration. This continues until the loss is under the preset threshold. The authors show that adding momentum leads to 19% faster convergence compared to the naive method. This exemplar method is tested on three datasets: the Adult Data Set for income prediction, Facial Expression Recognition 2013, and the Large Movie Review Dataset. The results are subjectively evaluated and the authors claim that their results are qualitatively more understandable than a direct gradient descent approach for creating exemplars.
    • arxiv

Finding Influential Training Data (2)

  • Section Highlight: Understanding Black-box Predictions via Influence Functions
    • 2017
    • The authors apply influence functions, a notion from robust statistics, to machine learning. An influence function asks how a model's parameters would change as we upweight specific training data. We can then use this to approximately answer the question of how the model changes if a specific example were not in the training set. The authors give stochastic approximation methods for computing influence functions efficiently (a simplified sketch follows this entry). As the influence function is only an approximation, the authors conducted actual leave-one-out training for the MNIST dataset and compared the resulting model parameters with the influence function results; they matched well. They show that even in a non-convex setting with a CNN, their influence function agrees well with the actual leave-one-out procedure. The authors also use influence functions to craft adversarial training examples, i.e. perturbed training examples that lead to misclassification of test data. They also provide a few additional uses for influence functions like identifying mislabeled training data.
    • ICML
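
The central quantity is the influence of upweighting a training point on a test point's loss, roughly -grad_test^T H^{-1} grad_train, with H the Hessian of the total training loss. Below is a simplified PyTorch sketch that treats the parameters as one flat vector and solves the inverse-Hessian-vector product with plain (damped) gradient descent; the paper uses stochastic approximations (conjugate gradients / LiSSA), so this is only illustrative:

```python
import torch

def hvp(loss_fn, theta, v):
    """Hessian-vector product of loss_fn at theta with vector v (double backward)."""
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(grad @ v, theta)[0]

def influence(total_train_loss, train_point_loss, test_point_loss, theta,
              damping=0.01, steps=100, lr=0.01):
    """Approximate -grad_test^T H^{-1} grad_train by minimizing
    0.5 x^T (H + damping*I) x - grad_train^T x with gradient descent."""
    theta = theta.detach().requires_grad_(True)
    g_train = torch.autograd.grad(train_point_loss(theta), theta)[0]
    g_test = torch.autograd.grad(test_point_loss(theta), theta)[0]
    x = torch.zeros_like(theta)
    for _ in range(steps):
        residual = hvp(total_train_loss, theta, x) + damping * x - g_train
        x = x - lr * residual
    return -(g_test @ x).item()
```
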
  • Estimating Train Data Influence By Tracking Gradient Descent
    • 2020

    • From Alignment Newsletter #97:

      This paper presents the TrackIn method for tracking the influence of training datapoints on the loss on a test datapoint. The purpose of the method is to discover influential training points for decisions made on the testing set. This is defined (loosely) for a training point x and test point z as the total change in loss on z caused by training on x. They present several approximations and methods for calculating this quantity efficiently, allowing them to scale their method to ResNet 50 models trained on ImageNet.

      The standard method of evaluation for these kinds of methods is finding mislabelled examples in the training dataset. Mislabelled examples are likely to have a strong positive influence on their own loss (strong as they're outliers, and positive as they'll reduce their own loss). Sorting the training dataset in decreasing order of this self-influence, we should hence expect to see more mislabelled examples at the beginning of the list. We can measure what proportion of mislabelled examples is present in each initial segment of the list. The authors perform this experiment on CIFAR, first training a model to convergence, then mislabelling 10% of the training set as the next highest predicted class, and retraining a new model on which TrackIn is run. When compared to the two previous methods from the literature (Influence Functions and Representer Points), TrackIn recovers more than 80% of the mislabelled data in the first 20% of the ranking, whereas the other methods recover less than 50% at the same point. For all segments TrackIn does significantly better.

      They demonstrate the method on a variety of domains, including NLP tasks and vision tasks. The influential examples found seem reasonable, but there's no quantification of these results.

    • arxiv
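
A minimal sketch of the TrackIn score, assuming PyTorch models saved at several checkpoints; the paper also describes practical approximations (e.g. using only last-layer gradients), which are omitted here, and the names are ours:

```python
import torch

def trackin_score(model_checkpoints, learning_rates, loss_fn, z_train, z_test):
    """Approximate the influence of training point z_train on test point z_test
    as the sum over checkpoints of lr_t * <grad loss(z_train), grad loss(z_test)>."""
    score = 0.0
    for model, lr in zip(model_checkpoints, learning_rates):
        params = [p for p in model.parameters() if p.requires_grad]
        g_train = torch.autograd.grad(loss_fn(model, z_train), params)
        g_test = torch.autograd.grad(loss_fn(model, z_test), params)
        score += lr * sum((gt * gs).sum() for gt, gs in zip(g_train, g_test)).item()
    return score
```

Self-influence, used in the mislabelled-example experiment above, is just `trackin_score(..., z_train, z_train)`.
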

Natural Language Explanations (8)

  • Section Highlight: Joint Mind Modeling for Explanation Generation in Complex Human-Robot Collaborative Tasks
    • 2020

    • The authors develop a virtual cooking game as a testbed for a proposed method for robot-human collaboration. The core of their approach is that, in addition to online planning of its own actions, the robot maintains a mental model of the human's current plan. Whenever the robot thinks the human's plan deviates from its own by a certain amount, the robot sends a message to the human including: (a) its own goal, (b) the action it will next take, and (c) the outcome of that action, as well as (d) its expectation of the human's goal, (e) the action it thinks the person will take, and (f) the outcome of that action. The message comes in the form of schematic natural language, with variable/action names substituted in. The plans themselves are given by parsing a Spatial, Temporal, and Causal And-or Graph (STC-AoG), which represents goals, subgoals, and atomic actions for achieving subgoals. The authors present an algorithm for inferring human plans that "uses observed user actions and communication history to infer human mental state."

      They recruit study participants from their university subject pool for a test of the robot and cooking game. They consider three conditions: one with no communication, a "heuristic" condition with communication every 9.3 seconds (based on a pre-study's frequency of people asking for help), and one with mental-model-based communication. Participants completed their first task in about 75 seconds on average in the mental-model condition, versus 150 in the control (p<.05 on difference; 125s in the heuristic condition, p=.12). On a 7 point "helpfulness" scale, the mental-model condition was rated around 5.5 on average, versus 4 in the control and 3.5 in the heuristic (p<.05 for comparisons). In participants' second round of the game, times converged across conditions, presumably due to familiarity with the game. The authors note that disparities in plans may arise from differing subgoals, or misunderstandings about action preconditions or effects (on the human's end).

    • arxiv

  • Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
    • 2018
    • The authors collect human explanations of data points in two existing tasks, one for visual question answering (VQA-X) and one for visual activity recognition (ACT-X), and they propose multimodal explanation frameworks for performing both visual feature importance estimation and free form textual explanation generation. Textual explanations are generated by a neural model conditioning on the input, i.e. the image (and, for VQA, the question), as well as the task model's predicted label, making the generations rationalizing explanations. Textual explanations are compared with the collected ground-truth explanations: BLEU scores with the ground-truth are 19.8 and 24.5 for the two datasets, while a human evaluation with MTurkers results in 38% and 45% of Turkers rating the generations as "better than" or "equivalent to" the ground-truth (for two datasets). "Important" image regions are compared with ground-truth human-annotated regions by the Earth Mover's Distance and a correlation statistic. Besides offering qualitative analysis, the last evaluation is a failure prediction experiment, where humans are shown explanations for data points (but not model predictions) and predict whether the model's prediction was correct. Here, humans get 70% and 80.5% accuracy, when random performance would yield 50%.
    • CVPR
  • Textual Explanations for Self-Driving Vehicles
    • 2018
    • This paper collects textual descriptions and explanations of dashcam video of human driving, then proposes generative models of textual explanations for the behavior of a "driving" model. This driving model, or controller, uses a CNN to produce features for each video frame, then uses another neural module to output an acceleration and direction change (learned from human accelerations and direction changes). There are a few variants of generative models: introspective models condition on the visual features from the CNN, with spatial attention either "strongly" or "weakly" aligned with the controller's spatial attention (so they use roughly the same visual representations). A rationalizing model is free to attend over visual features as it pleases. All models condition explicitly on the controller outputs. In this sense they all rationalize in the usual use of the word, but the rationalizing model is the least tied to the controller's internal states. BLEU and other metrics are used for an automatic evaluation: BLEU scores with human explanations are around 7 across conditions. (The models also generate descriptions of the controller actions; their BLEU is about 32.) The human evaluation for explanations is to ask MTurkers if they are "correct" and take a majority vote among three Turkers: 62-66% of the explanations are "correct." (Descriptions of controller actions are rated correct 90-94% of the time.) There are no statistical tests for the differences in conditions.
    • ECCV
  • e-SNLI: Natural Language Inference with Natural Language Explanations
    • 2018
    • The authors get humans to annotate the popular SNLI dataset with natural language explanations of why each data point should have the label it has. The result is about 570,000 human explanations. They train an LSTM-based model to both perform NLI and generate explanations for its outputs. Explanations are generated conditional on the input representation for the task and the output label given by the model. A subset of explanations for correct predictions is evaluated manually by the authors according to their "correctness," i.e. whether or not they stated the "arguments" that made a data point's label the true label. Their best model produced correct explanations 64% of the time at a cost of 2.3 ppts lower accuracy relative to a non-explaining baseline.
    • NeurIPS
  • Explain Yourself! Leveraging Language Models for Commonsense Reasoning
    • 2019
    • In this paper, human-annotated explanations are collected for the Commonsense Question Answering (CQA) dataset, which is a multiple-choice task with 7610 train and 950 test points (for v1.0). The annotations include text highlights of important parts of the question, and natural language explanations of why the correct answer choice is correct. The authors propose two modeling procedures for generating explanations: a reasoning procedure, that generates explanations from questions and answer sets, and a rationalizing procedure, which also conditions on a label (the ground-truth labels during training, and predicted labels at test time). Using a fine-tuned GPT as the generator yielded a BLEU score of 4.1 in the reasoning condition. An approach termed CAGE appends a BERT model to the GPT-reasoning pipeline to predict task outputs conditioned on inputs and generated explanations. The authors report accuracy gains from this pipeline over a BERT baseline (though a similar approach applied to the larger e-SNLI dataset does not yield any changes in accuracy). Their human evaluation is to ask MTurkers to "guess the most appropriate answer choice based on only the explanation without the question." They find that BERT outputs are recoverable from the GPT explanations 42% of the time (random: 33%), while ground truth labels are recoverable from human explanations 52% of the time.
    • ACL
  • Towards Prediction Explainability through Sparse Communication
    • 2020

    • This paper assesses extractive explanations for models of textual data under a simulatability perspective, and they present explanation generation in terms of communication between an explainer and a listener (Layperson, as the paper puts it). The purpose of the explanation is to simply encode the model's output. An extractive explanation is a set of words pulled out of an input, and the procedures for generating these explanations follow a general format: rank words, then pull out the top k. What they are evaluating is the success of a procedure at encoding a label into a bag of words from a particular input such that a listener can extract the label from the code.

      The ranking methods they consider include a classification model's attention weights (using various kinds of attention), a gradient-based saliency ranking, and a standard word omission procedure. The listener is either a human or a trained BoW model. They also consider jointly training the BoW listener along with an explanation model that extracts words from the input conditioned on the classifier's output and its final hidden state. Their automatic evaluation is to compute the listener BoW model's accuracy at predicting a classifier's output (Communication Success Rate, CSR). Their human evaluation is to use people as explainers, listeners, or both. With IMDB sentiment and SNLI natural language inference data, they evaluate CSR with all four combinations of human and machine listeners and speakers. They find that jointly trained machine explainers and listeners complete the task with 99%+ accuracy for both datasets. There is little to no statistical difference among the various attention-based top-k methods with human listeners: on sentiment analysis, CSR ranges from 87.5% to 93.25% (random: 50%); on NLI, it ranges from 70.5 to 74.5% (random: 50%; no neutral label). Humans successfully communicate the label 86.5% of the time on NLI.

    • arxiv

  • WT5?! Training Text-to-Text Models to Explain their Predictions
    • 2020
    • The authors train the 11 billion parameter T5 model in a multi-task framework to do a task and generate natural language explanations for its answers on the task. The explanations are either free form (abstractive) generations or important words from the model input (extractive). Learning is entirely supervised using human-provided explanations (either free form explanations or text highlights). Experiments for the open-ended explanations are conducted with e-SNLI and CoS-e datasets (see papers above). For extractive explanations, a sentiment analysis task (Movie Reviews) and a passage comprehension task (MultiRC) are used. An evaluation for plausibility is done via BLEU for open-ended and F1 score for extractive explanations, and they far exceed the previous SOTA. The human evaluation is to show MTurkers predicted data points with explanations and ask them, "Does the explanation adequately explain the answer to the question" (for CQA) or the same question with task-appropriate wording for other datasets. After taking the majority vote of 5 Turkers for 100 data points, they report "correct"-ness rates for model-generated explanations and the ground-truth human explanations in each dataset (random is 50% for each): the model gets 90% for NLI (humans: 78%), 30% for CQA (humans: 16%), 94% for sentiment analysis (humans: 99%), and 50% for MultiRC (humans: 51%). The authors write: "To summarize, our results suggest that WT5-11B is at a human or super-human level at both classifying and explaining examples from the datasets we considered." The authors conduct additional analysis concerning out-of-domain and out-of-task transfer, as well as sample efficiency (only having 100 e-SNLI explanations will get you 28 BLEU with T5-11B; 50k gets you the full-scale result of ~34).
    • arxiv
  • Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?
    • 2020

    • The key question the paper addresses is how to evaluate natural language explanations generated by a model. Past works have done so primarily by training a model for particular tasks, using human explanations for data points as supervision for explanation generation, then comparing explanations generated for the model decisions to the “ground-truth” explanations of the data point labels provided by humans. The authors argue that this is problematic because it suggests that explanations are evaluated according to their plausibility rather than faithfulness to the model’s internal reasoning. In effect, past works rated methods highly when they gave explanations of their behavior that sounded good, even if they did not actually reflect the true reasons that led the model to its decision. To resolve this shortcoming, they present a procedure for automatically measuring how well explanations would let an “observer” predict a model’s behavior (in a similar spirit to other work on model simulatability). For this procedure to capture the meaning in explanations and avoid rewarding trivial explanations that just restate the model behavior (e.g. “I picked answer A because it was the best choice”), a causal inference method is used to control for the influence of explanation “triviality” on the observer. In their experiments, the observer is a model proxy for a human.

      Using two existing text datasets with human explanations, eSNLI and CoS-E, they evaluate existing methods and newly presented approaches, finding that some methods may not produce helpful explanations on average, while the most successful explanations come from a model that first generates hypothetical explanations for every answer choice, then makes predictions based on the input and all hypothetical explanations (and selects the final explanation based on the prediction). Further, since a metric for explanation quality is proposed, they also carry out experiments where models are optimized for this metric directly (or, rather, a proxy for this metric). These experiments are interpreted as multi-agent communication games played in natural language, and they find that in some settings this can improve the explanations' effect on model simulatability.

    • Findings of EMNLP

Developing More Easily Interpreted Models (6)

  • Section Highlight: Human-in-the-Loop Interpretability Prior
    • 2018

    • This paper proposes an algorithm that minimizes the number of user studies needed to identify a model that is both accurate and interpretable according to some study-based criterion for interpretability. In their terminology, they query a human "prior" on model interpretability, p(M), via a single user study, and their algorithm solves for the MAP solution, argmax_M p(X|M) p(M), by using sequential user studies to search over models M in the model class.

      The procedure is as follows: First, they obtain a set of high likelihood models, i.e. models that explain the data well. Based on the view that model users might want a model to meet a minimum accuracy threshold, but not care much about exceeding this threshold, they define a likelihood function that sharply increases after a model meets an accuracy threshold. In practice, they have no preference between models as long as they meet the accuracy threshold. Second, they perform a user study that yields a single human-interpretability-score (HIS) for a given model and subset of data. They use the domain-general notion of simulatability in their studies, computed here as the mean response time (RT), which is the time it takes a user to predict a model output for a given input when relying on a summary/explanation of the model (elaborated on in the next paragraph). For higher HIS scores to represent more interpretable models, they compute the final HIS as the difference between a maximum allowed response time and the actual response time. The actual optimization occurs by an upper-confidence-bound based search over models, with UCBs given by a Gaussian Process regression of HIS scores over models. The kernel on models is the RBF kernel over feature importance estimates obtained for each model (by gradient-based feature importance for NNs, and some other procedure for decision trees). Starting with an initial model from the set of acceptably accurate models, the next user study is conducted using the model with the highest UCB estimate for its HIS. (A minimal sketch of this search loop follows this entry.)

      The summary/explanation of a model is, for decision trees, given to users as a literal visualization of the tree. For neural networks, the summary/explanation is given by a locally estimated decision tree for each study data point. Similar to LIME, they sample from a perturbation around a data point and fit a decision tree to the NN's outputs on that perturbation distribution.

      In their study, they run this algorithm for 10 iterations (ending with the 11th model), using graduate ML students and four tabular datasets (one synthetic). Before using graduate students, they ran a study with MTurkers, but found that the results were too noisy. They train decision trees for the three simpler datasets and use a neural network for the most complex dataset. They also compare to models obtained by optimizing for four easily computable heuristics for interpretability (in decision trees): number of nodes, mean path length, number of used features, and mean number of features in a path.

      They find that: (1) while optimizing for a given interpretability heuristic (like number of decision tree nodes) often produces models that perform poorly under another interpretability heuristic, optimizing for their response-time HIS score produces models that tend to do well across interpretability heuristics, (2) under each of the four interpretability heuristics, their optimization algorithm finds more interpretable models more quickly than a random search over models, (3) their optimization does seem to increase HIS scores over time, which is the actual objective in the optimization. Regarding this last point, note that they "did not see statistically significant differences in our results," though their experiments may have been underpowered since they ran with 16 subjects divided across conditions.

    • NeurIPS
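
To make the search loop concrete, here is a minimal sketch, in Python, of a UCB-based model search with a Gaussian Process over feature-importance representations of models. The helpers `featurize` and `run_user_study` are hypothetical stand-ins (the latter for collecting a response-time-based HIS score), and this is a simplification of the setup described above, not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def ucb_model_search(candidate_models, featurize, run_user_study, n_rounds=10, beta=2.0):
    """Sequentially choose which model to show in the next user study by maximizing
    an upper confidence bound on the human-interpretability score (HIS).

    candidate_models: models that already meet the accuracy threshold
    featurize(model): feature-importance vector used by the RBF kernel
    run_user_study(model): hypothetical call that returns an HIS score
    """
    X = np.array([featurize(m) for m in candidate_models])
    studied, his_scores = [0], [run_user_study(candidate_models[0])]
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)

    for _ in range(n_rounds):
        gp.fit(X[studied], np.array(his_scores))
        mean, std = gp.predict(X, return_std=True)
        ucb = mean + beta * std
        ucb[studied] = -np.inf                      # don't re-study the same model
        next_idx = int(np.argmax(ucb))
        studied.append(next_idx)
        his_scores.append(run_user_study(candidate_models[next_idx]))

    return candidate_models[studied[int(np.argmax(his_scores))]]
```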

  • Learning Certifiably Optimal Rule Lists for Categorical Data
    • 2018

    • The paper introduces an approach called CORELS that finds rule lists for data with a categorical feature space, where the rule lists are guaranteed to be optimal under the learning objective. Here, a rule list is a list of if-else statements (ending in a default prediction) of the form "if feature J == category C, predict Y, else..." and which serves as a classifier (see the sketch after this entry). The objective function is the empirical risk, with regularization on the length of the rule list (i.e. number of rules). The regularization penalty is directly interpretable: it sets the minimum gain in accuracy required to justify adding one more rule to the list. The model is identified through a proposed branch-and-bound algorithm, which relies on a number of key observations that drastically reduce the size of the space of rule lists that needs to be searched.

      Experiments involve several publicly available datasets for high-stakes tasks, including recidivism prediction and weapon possession in stop-and-frisk searches. Across tasks, the data points include 3-7 categorical attributes and up to 28 binary features.

      The learned rule lists are 4 or 5 rules long, meaning it is very easy to read the entire rule list and see how it will handle every data point. The authors observe that on the recidivism data, their approach achieves equal accuracy to a proprietary, blackbox "prediction tool" (COMPAS) used for recidivism prediction in some places in the US legal system.

      Lastly, the authors note that the search algorithm may struggle with very high dimensional data where many possibly relevant features are highly correlated. They also suggest that, if desired, predicted probabilities can be obtained by taking the empirical probability of the predicted outcome for a given rule, and they remind the reader that their approach is not to be used naively for causal inference.

    • JMLR
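
For concreteness, here is a hand-written toy rule list in the spirit of what CORELS learns; the rules and features below are invented for illustration, not taken from the paper.

```python
def rule_list_predict(x):
    """A toy rule list over categorical features, read top to bottom.
    CORELS searches for the list minimizing (misclassification rate + lambda * number of rules).
    These particular rules are made up for illustration."""
    if x["prior_offenses"] == ">3" and x["age"] == "18-25":
        return 1
    elif x["failed_to_appear"] == "yes":
        return 1
    elif x["prior_offenses"] == ">3":
        return 1
    else:
        return 0          # default prediction

print(rule_list_predict({"prior_offenses": "0", "age": "30-40", "failed_to_appear": "no"}))  # 0
```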

  • Faithful and Customizable Explanations of Black Box Models
    • 2019
    • The authors give a global interpretability method for black-box machine learning models which approximates a model with a two-level decision set. This is a model that separates the inputs through feature predicates (e.g. is age > 30?) and then another set of predicates (hence the two-level structure) for assigning a class label. The authors construct an optimization objective that accounts for the two-level decision set's fidelity (agreement with the original model), unambiguity (lack of overlapping predicates), and interpretability (the number of predicates used). They use an optimization method based on approximate local search to solve this objective. This method has the benefit of allowing user input over which predicates they wish to use. The method is evaluated on a bail outcome dataset with 86k people, a student outcomes dataset with 21k people, and a depression diagnosis dataset with 33k people. The authors compare their model (MUSE) with other interpretable approximation methods like LIME, Interpretable Decision Sets (IDS), and Bayesian Decision Lists (BDL). The authors examine the fidelity-interpretability trade-off. Out of all methods tested, the authors find that MUSE performs the best in terms of fidelity per number of rules and fidelity per average number of predicates. In a 33-participant user study, the authors find that when given the approximate model and asked questions about how the model would respond to a particular input, the MUSE model led to better accuracy and lower response time, when compared to IDS and BDL.
    • AIES
  • NBDT: Neural-Backed Decision Trees
    • 2020
    • The authors show how to modify a neural net into a more interpretable model by converting the last layer's weights into a decision tree, where each leaf node corresponds to a row of the weight matrix, and higher up nodes are averages of the nodes below. They add an additional regularization term during training for the cross-entropy loss between the corresponding node and the correct label. This is intended to improve the separation of representations for each node so each leaf node can be associated with a class. The overall model is evaluated by running the input through the neural net until the penultimate layer, whereupon it is then fed into the decision tree. The authors use this method of construction on a variety of models (ResNet, WideResNet, EfficientNet) on a variety of datasets (CIFAR10, CIFAR100, TinyImageNet, ImageNet). On all tasks, the NBDT is competitive, scoring within 1% of state-of-the-art performance. The interpretability of the NBDT's nodes is developed through either of two post-hoc iterative processes. The first approach involves first hypothesizing a category a node corresponds to, and then visually checking this hypothesis with a "representative" sample, defined to be a data point with an embedding similar to the node's. The second approach involves postulating categories for the nodes, and then training on a held-out dataset; these data points are then checked to see if they are passed to the child whose category is most fitting for them.
    • arxiv
  • Interpretable Learning-to-Rank with Generalized Additive Models
    • 2020

    • The authors propose to use a generalized additive model for a learning-to-rank task and make a few arguments that this kind of model is more interpretable than blackbox models for the task. A GAM takes the form F(x) = f_1(x_1) + f_2(x_2) + ... + f_d(x_d), where a feature-specific function f_j is applied to each feature of a data point separately. The model trades off being able to model feature interactions with the ease of understanding a particular feature's contribution to the final output: since each function is univariate, they can simply be plotted across the range of the input feature. In a learning-to-rank task, a given model input consists of a set of data points along with some general context information for the set, q. The authors extend the standard GAM to operate in the ranking setting by weighting the contribution of each feature with a weight obtained by another GAM on q, as such: F(x, q) = w_1(q) f_1(x_1) + ... + w_d(q) f_d(x_d), where w_j(q) is the j-th element of a vector-valued GAM w(q). Each f_j is a separate neural network, as is the model producing w(q). (A small sketch of this scoring function follows this entry.)

      They compare their model with an existing tree-based GAM, and they compare across choice of loss function, between MSE and the ranking loss NDCG. They run experiments with three tabular datasets. In terms of NDCG performance, their neural GAM with a ranking loss performs the best, by between 0.3 and 4 points.

      To evaluate interpretability, they consider three approaches. First, they check that their individual feature models represent feature importance in the same sense as a standard feature-shuffling approach does (importance measured by the change in NDCG from shuffling a feature column in the data). Specifically, they compute the difference between the 5th and 95th percentile values of each feature function's output as the "range" of the feature function, then check the correlation between these range values and the feature importance values obtained by shuffling. They find that using the ranking loss is important to ensure that there is a correlation between the range and feature importance at all (but do not compare tree and neural GAMs). Second, they plot individual feature functions for the tree GAM and neural GAM. With both models, you can see how a feature contributes to the overall model output. Third, they plot a heatmap of the context-derived weights across a one-hot representation of a categorical feature, to see how, in this case, a region/country context feature yields weights for several features; they observe some interesting structure in which some regions yield similar weights for certain features.

      In two final experiments, they show how distilling the submodels into piece-wise linear models after training results in a 20x speed-up to inference at a small (1 point) cost to the objective. And they find that building a blackbox neural net on top of GAM features yields better NDCG by about 1 point over simply training a neural net from scratch. Their neural GAM performs about 2 points worse than the neural net trained from scratch.

    • arxiv
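
Below is a minimal sketch of the context-weighted scoring function described above, with tiny per-feature MLPs standing in for the sub-models. The architecture details (layer sizes, how the context GAM is aggregated) are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SubModel(nn.Module):
    """One univariate sub-model: a scalar feature in, a scalar (or small vector) out."""
    def __init__(self, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, out_dim))

    def forward(self, x):                 # x: (batch, 1)
        return self.net(x)

class RankingGAM(nn.Module):
    """score(x, q) = sum_j w_j(q) * f_j(x_j), where w(q) is itself a GAM over context features."""
    def __init__(self, n_item_features, n_context_features):
        super().__init__()
        self.item_models = nn.ModuleList([SubModel(1) for _ in range(n_item_features)])
        # each context sub-model outputs one weight per item feature; summing them gives w(q)
        self.context_models = nn.ModuleList(
            [SubModel(n_item_features) for _ in range(n_context_features)])

    def forward(self, x, q):              # x: (batch, n_item_features), q: (batch, n_context_features)
        f = torch.cat([m(x[:, j:j + 1]) for j, m in enumerate(self.item_models)], dim=1)
        w = sum(m(q[:, k:k + 1]) for k, m in enumerate(self.context_models))
        return (w * f).sum(dim=1)         # one ranking score per data point
```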

  • Obtaining Faithful Interpretations from Compositional Neural Networks
    • 2020

    • This paper proposes to use neural module networks (NMNs) for purposes of increasing model interpretability. An NMN works by first parsing an input into a program, which is a series of functions to compute, then executing this program. For example, when a model must decide if a caption matches an image, an NMN might parse the caption into the program "1. find dogs, 2. filter for black dogs, 3. check if number of entities from (1) and (2) are equal." These steps are executed by neural modules, and existing architectures and training procedures allow for NMNs to be trained end-to-end with supervision only at their final output. (A toy version of such a program appears in the sketch after this entry.)

      NMNs could improve model interpretability over standard blackbox models by virtue of their modules executing human-interpretable functions, like finding, filtering, counting, comparing, etc. But it's a known result that training NMNs end-to-end leads to module "responsibilities" being diffusely distributed across several modules, with some modules performing unintended functions. The authors confirm this finding for a visual+textual reasoning task (similar to the caption-image matching example described above) and a purely textual reasoning task.

      Principally, the authors study how to improve NMN "faithfulness," which they define as the property that a model's modules perform the roles they are intended to perform. They do so by providing programs to models (either gold programs or heuristically obtained programs) and collecting labels from people for what intermediate module outputs should be. To measure faithfulness, they measure the discrepancy between the module outputs and ground-truth outputs, for the provided programs.

      They identify a few ways to improve faithfulness, usually at the expense of model accuracy. On visual+textual reasoning: By specifying modules exactly (i.e. simply summing the module inputs for a 'sum' module instead of fitting free parameters), they improve faithfulness slightly at the cost of 3 points of accuracy. Using inflexible, few-parameter modules further improves faithfulness over manual specification, at a smaller cost to accuracy. On textual reasoning, they identify design choices that decrease faithfulness. By training without two helpful kinds of modules (sorting and comparison modules), and thereby offloading these responsibilities onto other modules, they lower model faithfulness. And they observe that adding blackbox modules which can perform multiple "steps" at once can decrease overall model faithfulness.

      For both reasoning tasks, they find that training on gold labels for module outputs greatly improves faithfulness. Interestingly, in one setting, this is actually pretraining on another dataset that has labels for module outputs rather than the dataset they evaluate with. The authors are hopeful that with careful architecture design and additional supervision, NMNs can provide more interpretability than standard models with a low cost to performance.

    • ACL
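
As a toy illustration of the kind of program an NMN executes for the caption example above, here each module is a plain Python function over a symbolic scene; in a real NMN these modules are differentiable neural networks operating on image and text features.

```python
# Toy scene: a list of detected objects with attributes.
scene = [
    {"type": "dog", "color": "black"},
    {"type": "dog", "color": "brown"},
    {"type": "cat", "color": "black"},
]

def find(objects, obj_type):
    """Module 1: find all objects of a given type."""
    return [o for o in objects if o["type"] == obj_type]

def filter_attr(objects, key, value):
    """Module 2: filter objects by an attribute."""
    return [o for o in objects if o[key] == value]

def count_equal(a, b):
    """Module 3: check whether two sets of entities are the same size."""
    return len(a) == len(b)

# Program for the caption "all of the dogs are black":
dogs = find(scene, "dog")
black_dogs = filter_attr(dogs, "color", "black")
answer = count_equal(dogs, black_dogs)    # False for this toy scene
```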

Robust and Adversarial Explanations (6)

  • Section Highlight: “How do I fool you?”: Manipulating User Trust via Misleading Black Box Explanations
    • 2020

    • This paper shows that explanations can be generated for a blackbox bail recommendation system that lead law school students to be willing to deploy the model in practice, even when that model actually relies on race and gender as features and the students previously stated these features were unacceptable to use. Bail recommendation is essentially a prediction problem — by law, a defendant is released on bail only if a judge believes they will not flee or commit another crime — and there is historical data on outcomes for defendants who were released on bail. In this context, an "explanation" is a (simple) model of a blackbox model that predicts whether a defendant will flee or commit another crime. They use two-level decision sets obtained from MUSE as the explanation models, because they are intuitively interpretable. These are sets of if-then rules that can be used by a person to classify instances (see paper for 7 "interpretability" factors MUSE incorporates). The students unanimously suggested that race and gender were the unacceptable factors to use in a bail recommendation system, and mostly agreed that prior incarcerations and failures to appear before court were two features that should be relied on. So, the authors designed a blackbox model that relied on the prohibited features and made no use of the desired features — in fact, the model was secretly a two-level decision set that the authors fit, which achieved 83.28% accuracy on test data from a dataset of 32K examples in total. The authors then fit 3 explanation models, relying on (1) neither prohibited nor desired features, (2) both prohibited and desired features, (3) desired features but not prohibited features. The explanation models all approximate the blackbox model well, obtaining test accuracies upwards of 97%.

      The user study design is to show participants one of the explanations (i.e., decision sets) for the blackbox model (not shown), then give them a binary question: "Based on this explanation, would you trust the underlying model enough to deploy it?" Users were explicitly instructed beforehand that the explanations were correlational and not causal. To check that the explanation model's accuracy at approximating the blackbox model (fidelity) will not be the source of variation in the question outcome, they confirm that each explanation obtains similar accuracy, ranging from 97.3% to 98.9%. Some users are shown the actual "blackbox" model (which is a known decision set) as the "explanation," which uses the prohibited but not the desired features (hence rounding out the 2x2 condition design for using prohibited and/or desired features).

      Here are the rates at which users would deploy the model that relies on race/gender but not the desired features, after having seen an explanation model with the following features:

      1. Includes prohibited, excludes desired: 9.1%
      2. Includes prohibited and desired: 10%
      3. Excludes prohibited and desired: 70%
      4. Excludes prohibited, includes desired: 88%

      Thus, they conclude that explanations with very high fidelity can be constructed that lead users to be willing to deploy models that actually rely on features they do not want to be relied on, and exclude features they think should be included.

    • AIES

  • Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
    • 2020
    • The authors demonstrate a method for training models that are heavily biased (e.g. race or gender as a dominant factor in decision-making) but are able to fool off-the-shelf implementations of LIME and SHAP, i.e. these features are not flagged as important by these techniques. The main idea behind this adversarial attack is that the manifold of the original distribution is different from the manifold of the perturbed data used by LIME and SHAP. This allows the authors to train out-of-distribution (OOD) classifiers that can detect whether or not a given data point is in-distribution by using the real data and perturbed data for training (with corresponding labels). The adversarial attack consists of both a biased model (which uses the sensitive feature as the only factor) as well as an unbiased model which uses synthetic uncorrelated features, with the OOD classifier deciding which model to use for prediction. They demonstrate this attack on three datasets: COMPAS (6172 samples), Community and Crime (1994 samples), and German Credit (1000 samples). In every case, the sensitive feature shifts downwards in importance relative to the biased model: it goes from appearing as the top feature 100% of the time (in the biased case) to appearing as only the second or third most important feature, and only for a fraction of the data points. The authors also show that their adversarial model tends to agree with the unbiased model on the test set data, with 100% similarity for the LIME adversarial model on all 3 datasets, and 75-91% similarity for the SHAP adversarial model. The authors end by showing how their attack relies on the accuracy of their OOD classifier; LIME requires a better OOD classifier, but is more susceptible to the attack once the classifier is accurate, whereas SHAP has a better response curve to a less accurate OOD classifier but the change is more gradual. (A minimal sketch of this scaffolding appears after this entry.)
    • AIES
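
A minimal sketch of the scaffolding described above, assuming you have already trained the biased model, an innocuous model, and an OOD classifier that flags LIME/SHAP-style perturbations; the class and method names are placeholders, not the authors' code.

```python
import numpy as np

class AdversarialScaffold:
    """Route real (in-distribution) inputs to the biased model and the perturbed
    inputs generated by LIME/SHAP to an innocuous model, so that post hoc
    explanations computed from perturbations hide the bias."""

    def __init__(self, biased_model, innocuous_model, ood_classifier):
        self.biased_model = biased_model
        self.innocuous_model = innocuous_model
        self.ood_classifier = ood_classifier   # predicts 1 if the input looks like a perturbation

    def predict(self, X):
        X = np.asarray(X)
        is_ood = self.ood_classifier.predict(X).astype(bool)
        preds = np.empty(len(X), dtype=int)
        if (~is_ood).any():
            preds[~is_ood] = self.biased_model.predict(X[~is_ood])
        if is_ood.any():
            preds[is_ood] = self.innocuous_model.predict(X[is_ood])
        return preds
```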
  • Analyzing the Interpretability Robustness of Self-Explaining Models
    • 2020
    • In this paper, the authors demonstrate a lack of robustness in a recent class of interpretable models, the self-explaining model (SEM). SEMs are models that output a linear combination of higher-order features, where the coefficients are input-dependent. In this way, the higher-order concepts are actually used to produce the model output. The authors demonstrate that SEMs are susceptible to perturbations in the concept space; that is, they show that small perturbations in the input lead to different explanations, but still the same output. Hence, the explanations become decoupled from the model's output. They demonstrate a targeted attack which focuses on making the explanation for an input of one class look similar to the explanation of another class. Using this attack on SENN, a SEM model, on the MNIST dataset, the authors show that their attack leads to the smallest difference in the higher-order concept space, compared to both out-of-class data (which is expected) and in-class data. The authors also demonstrate an untargeted attack that focuses on increasing the distance between the higher-order output and a prototype. Prototypes are used from PrototypeDL, another SEM model which uses proximity to "prototype" images as the higher-order explanation. Again using the MNIST dataset, the authors show that their attack leads to a large minimal distance between input images and prototypes; for many input images, the closest prototype is now an image of a different class label.
    • arxiv
  • Robust and Stable Black Box Explanations
    • 2020

    • Under the view of an explanation as a simple model approximating a complex model, this paper learns explanations that are robust to distribution shifts in the input space. In this view, distribution shift is particularly important to consider when explaining blackbox models because many of the model problems we wish to diagnose with explanations are related to distribution shift, like identifying model reliance on spurious features. They note that a consequence of optimizing for stability under distribution shifts is that they gain some robustness against adversarial inputs, and the approach can even help with explanation identifiability (i.e. identifiability of the simple model).

      They formally introduce the set of distribution shifts they consider, which include shifts over a subset of the features, and they present an approximate objective to optimize for: the loss of the model under the worst-case distribution shift among their possible shifts (this is the connection between robustness against distribution shifts and adversarial inputs). The approach is given for two kinds of explanation models, linear models and decision sets. (A simplified sketch of this worst-case objective appears after this entry.)

      Experiments cover blackbox models of several tabular datasets, and they report the explanation fidelity (accuracy at predicting blackbox model outputs), correctness (similarity to the blackbox model, when the blackbox model is actually in the same model family as the explanation), and stability (similarity between a blackbox and explanation when the explanation is trained only on data from a shifted distribution). Under each of these metrics, they find that their approach far outperforms other explanation techniques, including simple models with standard training procedures as well as LIME and SHAP. Whereas the accuracy of a "global" explanation constructed from LIME (i.e. a set of LIME models covering the input space) drops by 14-19 percentage points under distribution shift, their analogous set of linear models drops only by 0-3.3 points. Likewise, using a distance metric on models in the same family, they find that their approach produces explanation models that are more similar to "blackbox" models (in the same family as the explanations) when they train explanations on either perturbed data points or data points from a shifted data distribution.

    • ICML
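
Here is a simplified sketch of the worst-case-over-shifts idea for a linear explanation, written as plain gradient-based training over a small, explicit set of candidate shift functions. This is a stand-in for the paper's actual optimization (which handles the shift set and decision-set explanations more carefully), not a reproduction of it.

```python
import torch

def fit_robust_linear_explanation(blackbox, X, shift_fns, n_steps=500, lr=0.05):
    """Fit a linear explanation sigmoid(Xw + b) of `blackbox` that keeps high
    fidelity under the worst of a small set of candidate distribution shifts.

    blackbox:  function mapping a numpy array of inputs to 0/1 predictions
    shift_fns: list of functions mapping a tensor of inputs to shifted inputs
    """
    X = torch.as_tensor(X, dtype=torch.float32)
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for _ in range(n_steps):
        losses = []
        for shift in shift_fns:
            Xs = shift(X)
            ys = torch.as_tensor(blackbox(Xs.detach().numpy()), dtype=torch.float32)
            losses.append(loss_fn(Xs @ w + b, ys))   # fidelity loss on shifted data
        worst = torch.stack(losses).max()            # minimize the worst-case loss
        opt.zero_grad()
        worst.backward()
        opt.step()
    return w.detach(), b.detach()
```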

  • Interpretability is a Kind of Safety: An Interpreter-based Ensemble for Adversary Defense
    • 2020
    • The authors propose X-Ensemble, a method for dealing with adversarial examples in image recognition. X-Ensemble first classifies the given input as benign or adversarial; if adversarial, it attempts to "rectify" the input before running the "real" model. Thus, X-Ensemble consists of the Detector (checks if input is adversarial), the Rectifier (modifies adversarial input to be benign), and the actual model. The authors train the Detector by feeding information from several sensitivity analysis methods (Vanilla Gradients, Integrated Gradients, Guided Backpropagation, and Layer-wise Relevance Backpropagation) as inputs to four DNNs, using the original data together with their own synthetic adversarial data as the combined training set. The final Detector is a Random Forest model using these four sub-models. The authors conjecture that, in an adversarial example, pixels with large gradients (relative to one of the sensitivity methods) are likely to be misleading, so their Rectifier is a model that erases pixels with gradients larger than some threshold (see the sketch after this entry). The authors test X-Ensemble on Fashion-MNIST, CIFAR-10, and ImageNet with respect to five attack algorithms in both the targeted and untargeted case. X-Ensemble generally performs better compared to three other baseline algorithms for adversarial robustness across the different attacks and datasets, with increases in performance between 3 and 10 percentage points.
    • KDD
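
A minimal sketch of the gradient-thresholding Rectifier idea: erase the highest-saliency pixels before re-classifying. The saliency method (vanilla gradients) and the quantile threshold here are arbitrary illustrative choices, not the paper's exact setup.

```python
import torch

def rectify(model, x, quantile=0.9):
    """Zero out pixels whose vanilla-gradient saliency falls in the top (1 - quantile)
    fraction, on the conjecture that adversarial perturbations concentrate in
    high-gradient pixels. x: a (1, C, H, W) image tensor."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    logits.max(dim=1).values.sum().backward()                 # gradient of the top-class score
    saliency = x.grad.abs().max(dim=1, keepdim=True).values   # (1, 1, H, W)
    cutoff = torch.quantile(saliency.flatten(), quantile)
    mask = (saliency < cutoff).float()                        # keep only low-gradient pixels
    return (x * mask).detach()
```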
  • Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations
    • 2020

    • The authors propose a simple search procedure for revealing inconsistent explanations generated across data points by a model that produces natural language explanations along with its label predictions for a task. For example, they find pairs of inputs that lead the model to produce the explanation "Snowboarding is done outside" for one input and "Snowboarding is not done outside" for the other input. This is a problem if you consider explanations to be beliefs actually held by a model and would like for model beliefs to be consistent across data point predictions, absent updates to the model. The method itself is the following procedure: (1) For a data point (x,y), get the model explanation e = explain(x,y). (2) Using templates/schema, generate a set of conflicting explanations that are inconsistent with e. (3) Use a "reverse" model, which maps explanations back to inputs, to generate proposal inputs. (4) Pass the proposal inputs through the model, and check if any of the resulting explanations are also members of the set of conflicting explanations (conflicting with the original explanation for the real data point). If one of these explanations is in the set, we've found a proposed input that leads the model to generate an explanation that is inconsistent with the explanation for the original input. The authors conduct experiments with the e-SNLI dataset (see "e-SNLI" paper in Natural Language Explanations), which includes human explanations for the SNLI dataset. In SNLI, a data point consists of a premise and a hypothesis pair, which must be classified into {neutral, entailment, contradiction} according to the relationship between them. Here, their "reverse" model generates proposal hypotheses only, since the premises are supposed to be taken at face value and different premises could naturally yield inconsistent explanations. Altogether, they use their search procedure with a model trained on e-SNLI and identify about 450 inconsistent explanations, starting with 9824 test points. They suggest that given the simplicity of their method and starkness of the inconsistencies identified, a success rate of 4.5% is far too high. (The sketch after this entry outlines this search loop.)

      Note that the model the authors evaluate is actually of the form: y = f(e), e = g(x). This is a workable model only because in SNLI, the form of the explanation (as opposed to its semantics) gives the label away around 97% of the time. For instance, "P implies Q" is almost always an explanation for the "entailment" label. So wherever the procedure above appears to condition on y, technically the model does not condition on y directly.

    • ACL Short Paper
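
The sketch below outlines the search loop with `explain`, `generate_conflicting`, and `reverse_model` as hypothetical stand-ins for the trained explanation model, the template-based negation step, and the trained explanation-to-hypothesis model.

```python
def find_inconsistencies(dataset, explain, generate_conflicting, reverse_model):
    """Search for pairs of inputs that make the model produce mutually
    inconsistent explanations for the same premise."""
    found = []
    for premise, hypothesis in dataset:
        e = explain(premise, hypothesis)                   # model's explanation for the real input
        conflicts = set(generate_conflicting(e))           # explanations inconsistent with e
        for e_hat in conflicts:
            for proposal in reverse_model(premise, e_hat): # candidate hypotheses that could yield e_hat
                e_new = explain(premise, proposal)
                if e_new in conflicts:                     # the model contradicted itself
                    found.append((hypothesis, proposal, e, e_new))
    return found
```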

Unit Testing (1)

  • Section Highlight: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
    • 2020
    • In NLP, many models achieve upwards of 90% accuracy on widely used benchmark test sets, and yet they still make simple mistakes, like a sentiment model failing to flip its prediction when a statement is negated. This paper (1) catalogs a wide variety of linguistic capabilities that "SOTA" research models continue to fail simple tests for, (2) shows that commercial models from Google, Microsoft, and Amazon make many of the same mistakes as the publicly available RoBERTa model (or do even worse), and (3) provides software (CheckList) for templated production of tests for these basic capabilities, which they put to use in expert user studies. The tests fall into three categories: Minimum Functionality Tests (MFT), where the model must pass basic performance checks; invariance tests (INV), where the model should not change its output under certain input transformations; and directional expectation tests (DIR), where the model should change its output in a known direction for a given change to the input. Experiments are done with sentiment analysis data, Quora Question Pairing (task is to identify if two questions are the same), and Machine Comprehension (task is simple question answering based on context sentences). For most capabilities, failure rates on tests made by the authors with CheckList range from 30-100%, far worse than the benchmark test set error rates. In a 5-hour user session with the Microsoft research team responsible for their sentiment model, which had already been extensively stress-tested with public feedback, the team uncovered "many previously unknown bugs" (i.e. systematic model failures). An additional user experiment with 18 NLP practitioners found that in two hours, users with access to CheckList and the linguistic templates uncovered about 3 times as many bugs as a control group without CheckList. (A toy version of templated tests appears after this entry.)
    • ACL
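
A toy version of templated MFT and INV tests in the spirit of CheckList, written as plain Python rather than the actual CheckList API; the templates and lexicons are invented for illustration.

```python
NEGATION_TEMPLATE = "I {negation} {verb} the {thing}."

def minimum_functionality_test(predict_sentiment):
    """MFT: negating a positive verb should yield a negative prediction."""
    cases = [NEGATION_TEMPLATE.format(negation="do not", verb=v, thing=t)
             for v in ["like", "love", "enjoy"]
             for t in ["movie", "food", "service"]]
    failures = [c for c in cases if predict_sentiment(c) != "negative"]
    return len(failures) / len(cases)     # failure rate

def invariance_test(predict_sentiment, sentences):
    """INV: swapping in a different neutral name should not change the prediction."""
    failures = 0
    for s in sentences:
        perturbed = s.replace("John", "Maria")
        failures += predict_sentiment(s) != predict_sentiment(perturbed)
    return failures / len(sentences)
```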

Explaining RL Agents (8)

  • Section Highlight: Explainable Reinforcement Learning Through a Causal Lens
    • 2020

    • This paper presents a series of formal definitions of what an explanation is in the context of structural causal models of an RL agent, then proposes a procedure for generating explanations of agent behavior. The authors' goal is to develop a procedure for explaining agents' actions themselves, rather than give explanations of why a state counts as evidence favoring some action. The definitions require some technical context, but roughly speaking: A structural causal model of an agent is a graph representing causal relationships between state, action, and reward nodes, with equations specifying each relationship in the graph. They define an action influence model as a causal graph plus a set of structural equations, with structural equations for each unique variable value and action pair (meaning multiple equations per variable value). Next, they say that (1) a complete explanation is the complete causal chain from an action to any future reward it leads to, (2) a minimally complete explanation is the set of parent nodes to an action, parent nodes to resulting rewards, and the rewards (so complete minus the nodes that aren't parents to rewards), (3) a counterfactual instantiation for a counterfactual action B is the condition under which the model would select action B and the condition resulting from this selection given the SCM, and, lastly, (4) a minimally complete contrastive explanation is an explanation which "extracts the actual causal chain for the taken action A, and the counterfactual causal chain for the counterfactual action B, and finds the differences."

      They give an example minimally complete contrastive explanation for why a Starcraft-playing agent chooses to not build barracks (from a formal explanation plugged into a natural language template): "Because it is more desirable to do action Build Supply Depot to have more Supply Depots as the goal is to have more Destroyed Units and Destroyed buildings."

      How do they generate these explanations? They learn models of the structural equations in their action influence model, conditioned on user-specified causal graphs, by fitting models to observed gameplay by an agent. With learned structural models, they give an algorithm for predicting the action an agent will take in a given state. From here, they can get explanations in the above forms. They validate the learned structural models by checking that they can predict what agents will do. Prediction accuracies range from 68.2 to 94.7 across six games, including Starcraft and OpenAI Gym environments.

      Explanations are evaluated with a human subject experiment. They test two hypotheses: that receiving explanations will improve users' mental models of the agents, as measured by their ability to predict what the agent will do in a given state, and that explanations will improve trust, as measured by subjective reports on a Likert-scale. There are four conditions: (1) explanations come from their full explanation system, (2) they come from their system with more granular "atomic" actions, (3) explanations are based only on relevant variables, from prior work, given in the form "Action A is likely to increase relevant variable P" and (4) no explanations. They conduct experiments on Mechanical Turk with 120 users: after a training phase where participants learn what Starcraft-playing agents are doing, they enter a learning phase where they see 5 videos and after each are allowed to ask as many questions about the agent behavior as they'd like (in the form why/why-not action X). Next, they predict what the agent will do in 8 given situations. Lastly, users complete the trust battery, rating explanations based on whether they are complete, sufficient, satisfying, and understandable.

      They find that given their explanation system, users are better able to predict agent behavior than in the "no explanation" or "relevant variables explanation" conditions. The improvement over the relevant variables condition is equivalent to getting one more action prediction correct out of 16 data points. Their results for the effect on trust are not statistically significant in all cases, but across the measured dimensions of trust their system improves ratings by between 0.3 and 1.0 points on their 5 point Likert scale.

    • AAAI

  • Contrastive Explanations for Reinforcement Learning in terms of Expected Consequences
    • 2018

    • From the paper's conclusion:

      "We proposed a method for a reinforcement learning (RL) agent to generate explanations for its actions and strategies. The explanations are based on the expected consequences of its policy. These consequences were obtained through simulation according to a (learned) state transition model. Since state features and numerical rewards do not lend themselves easily for an explanation that is informative to humans, we developed a framework that translates states and actions into user-interpretable concepts and outcomes.

      We also proposed a method for converting the foil, –or policy of interest to the user–, of a contrastive ’why’-question about actions into a policy. This policy follows locally the user’s query but gradually transgresses back towards the original learned policy. This policy favors the actions that are of interest to the user such that the agent tries to perform them as best as possible. How much these actions are favored compared to the originally learned action can be set with a single parameter.

      Through running simulations for a given number steps of both the policy derived from the user’s question and the actually learned policy, we were able to obtain expected consequences of each. From here, we were able to construct contrastive explanations: explanations addressing the consequences of the learned policy and what would be different if the derived policy would have been followed.

      An online survey pilot study was conducted to explore which of several explanations are most preferred by human users. Results indicate that users prefer explanations about policies rather than about single actions."

    • IJCAI XAI workshop

  • Counterfactual States for Atari Agents via Generative Deep Learning
    • 2019

    • With RL agents trained on Atari games, the authors aim to produce counterfactual states for a given state that an agent is in, which are defined as the closest states that result in a different action under the policy. This is done by learning a generative model of states conditioned on latent state representations and the policy network's distribution over actions. Then, a gradient-based search for a representation is performed to yield a different action under the policy, and a counterfactual state is generated from this representation. The authors argue that the policy model's latent space is too high dimensional for generation out of this space to produce coherent images. Hence, they learn a Wasserstein autoencoder on the policy model's latent space, and perform the search in this lower-dimensional space. Another training trick means that the state representations actually used for generation don't encode any information about a preferred action, unlike those in the policy network, so that the generator will meaningfully rely on the action distribution it is given. The overall generation procedure is as follows: Given a state and an agent, they pass the state through the policy network and then through the autoencoder to get a low-dimensional representation, then perform a gradient-based search in that space for the closest representation by L2 distance that yields a user-specified counterfactual action when decoded back into the policy model's latent space and transformed into a distribution over actions. A counterfactual state is generated conditioned on this new counterfactual distribution over actions and a representation of the original state.

      The generations are evaluated by humans for two properties: realism and induced subjective understanding of the observed agent. After 30 human subjects (students and local community members) play Space Invaders for 5 minutes, they are asked to rate the realism of 30 images randomly chosen from a set including real gameplay images, counterfactual generations, and images from a heavily ablated version of their model without the autoencoder. On a scale of 1 to 6, real states received a 4.97 on average, counterfactual states a 4.0, and the ablated model's generations a 1.93. For the subjective user understanding test, participants were first shown a replay of an agent playing the game, then shown 10 pairs of states and counterfactual states (and associated actions for each), with counterfactual states selected to have large deviations from the original state. Users were asked to rate their "understanding of the agent" on a 1-6 scale before and after seeing these states. They found that 15 users' reported understandings improved, 8 declined, and 7 were constant (with a one-sided Wilcoxon signed-rank test for improvement: p=0.098).

    • IJCAI XAI Workshop

  • Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents
    • 2019

    • The paper proposes a method for generating states with certain properties under a policy, intended to help with analyzing the policy. In particular, they identify states with large Q-values for certain actions, like hard braking by a simulated self-driving car, a large difference between best and worst Q-values (clear-cut situations), or low Q-values across actions (hopeless situations). They note that the immediate approach to doing this, for continuous states like in the Atari games they experiment with, is activation maximization of a Q-value (or function on Q-values) with respect to the input image, but they find that in practice this produces meaningless images outside of the natural state distribution, even when a variety of tricks are used. In response, they encode states in a low-dimensional space with a VAE and perform the activation maximization by gradient ascent in this embedding space. Interestingly, they search for the parameters of a distribution over embeddings rather than just a single embedding; later, they find that the results of the search allow them to generate samples using the VAE decoder. The VAE objective has a reconstruction loss (to generate realistic images) and a penalty on the reconstruction resulting in a different action from the original training image. They find that it is necessary to focus the reconstruction error on regions "important" to the agent, which means they weight the L2 reconstruction loss by a measure of pixel saliency obtained by applying a gradient-based saliency method to the policy at a given state. The generator is trained with trajectories from a fixed agent. (A minimal sketch of the latent-space search appears after this entry.)

      They provide a great deal of qualitative analysis using their generated states. A few highlights include: In Seaquest, where the player must resurface from below water when an oxygen tank is low, they suggest that an agent does not understand that it must resurface when low on oxygen, after optimizing states for the Q-value of resurfacing. They note that while "it would be possible to identify this flawed behavior by analyzing the 10,000 frames of training data for our generator, it is significantly easier to review a handful of samples from our method." The generator can also yield examples not seen during training. With agents trained as simulated self-driving cars in an environment built by the authors, they find evidence of the absence of a policy's ability to avoid pedestrians: with a policy trained using "reasonable pedestrians" that never crossed while there was traffic, they observe that among states maximizing the Q-value of braking, states with pedestrians in the road are conspicuously absent. This policy shortcoming is then verified in a test environment where pedestrians cross while there is oncoming traffic, and they find that the agent will run over pedestrians.

    • arxiv
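
A minimal sketch of the latent-space search described above, assuming a trained VAE `decoder` over states and a frozen `q_network`; the objective terms and the diversity regularizer are illustrative simplifications, not the paper's exact losses.

```python
import torch

def states_maximizing_q(decoder, q_network, target_action,
                        latent_dim=32, n_steps=300, lr=0.05, n_samples=8):
    """Search for the mean/log-variance of a Gaussian in the VAE latent space whose
    decoded states assign a high Q-value to the target action."""
    mu = torch.zeros(latent_dim, requires_grad=True)
    log_var = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([mu, log_var], lr=lr)

    for _ in range(n_steps):
        eps = torch.randn(n_samples, latent_dim)
        z = mu + eps * (0.5 * log_var).exp()         # reparameterized samples
        states = decoder(z)                          # decoded (generated) states
        q_values = q_network(states)                 # (n_samples, n_actions)
        loss = -q_values[:, target_action].mean()    # push the target action's Q-value up
        loss = loss - 0.01 * log_var.mean()          # illustrative regularizer: keep some diversity
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():                            # sample a batch of states from the result
        z = mu + torch.randn(n_samples, latent_dim) * (0.5 * log_var).exp()
        return decoder(z)
```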

  • Towards Interpretable Reinforcement Learning Using Attention Augmented Agents
    • 2019
    • The authors propose a policy network with a spatial attention mechanism and perform qualitative analysis of the attention weights to analyze agent behavior. The network has an interesting structure: at a given timestep, a query model, which is an LSTM, produces a query vector that is passed to an attention layer that takes a representation of the current state (produced by another model) as the keys and values. The resulting vector is used to obtain an action and is passed back to the LSTM. They emphasize the "top-down" nature of the attention: the query network determines the attention weights for a given state representation. On experiments with Atari games, they find that this model obtains higher average rewards than baseline feed-forward or LSTM-based models. They provide qualitative analysis (including videos) of the spatial attention, and suggest that their model pays attention to task-relevant aspects of states. They also compare their attention-based analysis against saliency scores returned by an existing saliency method, for both their attentive policy and a feed-forward baseline. Performing qualitative analysis of agent behavior using the existing saliency method, they report apparent differences in the learned behaviors of these models.
    • NeurIPS
  • Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning
    • 2020

    • From Alignment Newsletter #101:

      This paper presents an analysis of the use of saliency maps in deep vision-based reinforcement learning on ATARI. They consider several types of saliency methods, all of which produce heatmaps on the input image. They show that all (46 claims across 11 papers) uses of saliency maps in deep RL literature interpret them as representing the agent's "focus", 87% use the saliency map to generate a claim about the agent's behaviour or reasoning, but only 7% validate their claims with additional or more direct evidence.

      They go on to present a framework to turn subjective and under-defined claims about agent behaviour generated with saliency maps into falsifiable claims. This framework effectively makes the claim more specific and targeted at specific semantic concepts in the game's state space. Using a fully parameterized version of the ATARI environment, they can alter the game's state in ways which preserve meaning (i.e. the new state is still a valid game state). This allows them to perform interventions in a rigorous way, and falsify the claims made in their framework.

      Using their framework, they perform 3 experimental case studies on popular claims about agent behaviour backed up by saliency maps, and show that all of them are false (or at least stated more generally than they should be). For example, in the game Breakout, agents tend to build tunnels through the bricks to get a high score. Saliency maps show that the agent attends to these tunnels in natural games. However, shifting the position of the tunnel and/or the agent's paddle and/or the ball all remove the saliency on the tunnel's location. Even flipping the whole screen vertically (which still results in a valid game state) removes the saliency on the tunnel's location. This shows that the agent doesn’t understand the concept of tunnels generally or robustly, which is often what is claimed.

    • ICLR

  • Understanding RL Vision
    • 2020

    • From Alignment Newsletter #128:

      This work presents an interface for interpreting the vision of a reinforcement learning agent trained with PPO on the CoinRun game. This game is procedurally generated, which means the levels are different in every episode of playing. The interface primarily uses attribution from a hidden layer to the output of the value function. This interface is used in several ways.

      First, they use the interface to dissect failed trajectories of the policy (it fails in 1 out of 200 levels). They're able to understand why the failures occurred using their interface: for example, in one case the view of the agent at the top of its jump means it can't see any platforms below it, so doesn't move to the right fast enough to reach the platform it was jumping for, leading it to miss the platform and fail the level. Second, they use the interface to discover "hallucinations", where the value function mistakes one element of the environment for another, causing its value to drop or rise significantly. Often these hallucinations only last a single time-step, so they don't affect performance.

      Finally, they use the attributions specifically to hand-edit the weights of the model to make it "blind" to buzzsaws (one of the hazards) by zeroing the feature which recognises them. After doing this, they show that the edited agent fails a lot more from buzzsaw failures but no more from other types of failures, which gives a quantitative justification for their interpretation of the feature as buzzsaw-recognising.

      From using this interface, they propose the diversity hypothesis: Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction). This is based on the fact that interpretable features arise more when the agent is trained on a wider variety of levels. There also seems to be a qualitative link to generalisation - a wider distribution of training levels leads to better interpretability (measured qualitatively) and better generalisation (measured quantitatively).

    • Distill

  • Causal Analysis of Agent Behavior for AI Safety
    • 2021

    • From Alignment Newsletter #141:

      A common challenge when understanding the world is that it is very hard to infer causal structure from only observational data. Luckily, we aren’t limited to observational data in the case of AI systems: we can intervene on either the environment the agent is acting in, or the agent itself, and see what happens. In this paper, the authors present an “agent debugger” that helps with this, which has all the features you’d normally expect in a debugger: you can set breakpoints, step forward or backward in the execution trace, and set or monitor variables.

      Let’s consider an example where an agent is trained to go to a high reward apple. However, during training the location of the apple is correlated with the floor type (grass or sand). Suppose we now get an agent that does well in the training environment. How can we tell if the agent looks for the apple and goes there, rather than looking at the floor type and going to the location where the apple was during training?

      We can’t distinguish between these possibilities with just observational data. However, with the agent debugger, we can simulate what the agent would do in the case where the floor type and apple location are different from how they were in training, which can then answer our question.

      We can go further: using the data collected from simulations using the agent debugger, we can also build a causal model that explains how the agent makes decisions. We do have to identify the features of interest (i.e. the nodes in the causal graph), but the probability tables can be computed automatically from the data from the agent debugger. The resulting causal model can then be thought of as an “explanation” for the behavior of the agent.

    • arxiv

Interpretability in Practice (2)

  • Section Highlight: Explainable Machine Learning in Deployment
    • 2020
    • This paper explores how explainability techniques are actually used by organizations through interview and synthesis. This consisted of interviewing twenty data scientists not currently using explainability tools and thirty individuals from organizations which have deployed such tools. The first group primarily thought explainability techniques would be valuable for model debugging (understanding poor performance), model monitoring (being alerted to drift in performance), model transparency (explaining output to others), and model audit (amenability to risk assessments by other teams). The second group was asked what tools they used in practice. They found that feature importance was the most common explainability tool used, among choices also including counterfactual explanation, adversarial training, and influential samples. For feature importance, the authors found that Shapley values were commonly used, and they were typically shown to ML engineers and data scientists prior to model deployment. Counterfactual explanations are used in healthcare contexts, but the objective for what to optimize for when generating a counterfactual is still often unclear. The authors recommend that organizations attempt to clarify who the consumers of the explanation are and what the explanation is meant to be used for. They conclude by summarizing concerns that interviewees have about current explainability tools. These include concerns related to determining causality, maintaining data privacy, improving model performance, and a lack of model-specific tools for non-deep-learning models.
    • ACM FAT
  • The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
    • 2020

    • The authors introduce a browser-based GUI for exploring NLP model behavior, intended to enable researchers to answer questions like: (1) why did a model make this prediction? (2) on what data points does the model perform poorly? and (3) what happens to behavior under controlled changes to the model input? The interface serves a variety of models, including classification, sequence to sequence, and structured prediction models. Features include mechanisms for (1) exploring your dataset, (2) finding interesting data points and outliers, (3) explaining local model behavior through LIME and salience maps, (4) generating new data points by backtranslation, word substitutions, and adversarial attacks, (5) side-by-side comparison of two models, and (6) computing metrics on selections of datapoints or automatically-selected slices of the data. In case studies with sentiment analysis classifiers, coreference models, and text generation, they identify several cases of model pathologies and identify possible causes of the behavior. For instance, with an errant text generation from T5, they do nearest neighbor lookups based on decoder embeddings and find that a number of similar points used a certain phrase structure, which may have biased the model toward (incorrectly) repeating that phrase structure in a new setting.

      A few design principles guided the system development, including flexibility, extensibility, modularity, agnosticism to deep learning framework, and ease of use. The end product is one which the authors hope will be easy for researchers to interact with and build a better understanding of their models. In comparison to tools like AllenNLP Interpret, the authors note that their preference for a framework-agnostic GUI means that they are more easily able to provide analysis through methods that handle arbitrary functions (like LIME) rather than methods that require full access to model internals, like Integrated Gradients.

    • EMNLP

Additional Papers

We provide some additional papers here that we did not summarize above, including very recent papers, highly focused papers, and others. These are organized by the same topic areas as above.

Theory and Opinion (12)

Evaluation (10)

Methods: Estimating Feature Importance (16)

Methods: Interpreting Representations and Weights (7)

Methods: Generating Counterfactuals and Recourse Procedures (8)

Methods: Explanation by Examples, Exemplars, and Prototypes (4)

Methods: Finding Influential Training Data (4)

Methods: Natural Language Explanations (9)

Methods: Developing More Easily Interpreted Models (3)

Methods: Robust and Adversarial Explanations (5)

Explaining RL Agents (2)

Datasets and Data Collection (1)

Interpretability in Practice (2)

Conclusion

We hope this post can serve as a useful resource and help start important conversations about model interpretability and AI Safety. As mentioned, please let us know if you noticed any mistakes or think we missed anything that could improve the post.

Comments (10)

This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.

I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!

I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.

It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below). 

Here's a Wikipedia article I pasted into SuperMemo. Blue bits are the extracts, which it'll remind me to refine into flashcards later.
A cloze deletion flashcard. It's easy to make a lot of these. I like them.

Incremental reading is nice because you can come back to information over time as you learn more, instead of having to understand enough to make an Anki card right away. 

In the context of this post, I'm reading some of the papers, making extracts, making flashcards from the extracts, and retaining at least one or two key points from each paper. Way better than retaining 1-2 points from all 70 summaries!

I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on linux AFAIK, and I can't be bothered for the moment to make it work using wine.

Apparently VMs are the way to go for pdf support on linux.

Planned summary for the Alignment Newsletter:

This is basically 3 months worth of Alignment Newsletters focused solely on interpretability wrapped up into a single post. The authors provide summaries of 70 (!) papers on the topic, and include links to another 90. I’ll focus on their opinions about the field in this summary.

The theory and conceptual clarity of the field of interpretability has improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.

There have been lots of proposals for how to evaluate interpretability methods, leading to the [problem of too many standards](https://xkcd.com/927/). The authors speculate that this is because both “methods” and “evaluation” papers don’t have sufficient clarity on what research questions they are trying to answer. Even after choosing an evaluation methodology, it is often unclear which other techniques you should be comparing your new method to.

For specific methods for achieving interpretability, at a high level, there has been clear progress. There are cases where we can:

1. identify concepts that certain neurons represent,

2. find feature subsets that account for most of a model's output,

3. find changes to data points that yield requested model predictions,

4. find training data that influences individual test time predictions,

5. generate natural language explanations that are somewhat informative of model reasoning, and

6. create somewhat competitive models that are inherently more interpretable.

There does seem to be a problem of disconnected research and reinventing the wheel. In particular, work at CV conferences, work at NLP conferences, and work at NeurIPS / ICML / ICLR form three clusters that for the most part do not cite each other.

Planned opinion:

This post is great. Especially to the extent that you like summaries of papers (and according to the survey I recently ran, you probably do like summaries), I would recommend reading through this post. You could also read through the highlights from each section, bringing it down to 13 summaries instead of 70.


Hi Rohin! Thanks for this summary of our post. I think one other sub-field that has seen a lot of progress is in creating somewhat competitive models that are inherently more interpretable (i.e. a lot of the augmented/approximate decision tree models), as well as some of the decision set stuff. Otherwise, I think it's a fair assessment, will also link this comment to Peter so he can chime in with any suggested clarifications of our opinions, if any. Cheers, Owen

Sounds good, I've added a sixth bullet point. Fyi, I originally took that list of 5 bullet points verbatim from your post, so you might want to update that list in the post as well.

Can you add author names?