TL;DR
I am excited to announce the CNN Interpretability Competition, which is part of the competition track of SATML 2024.
Dates: Sept 22, 2023 - Mar 22, 2024
Competition website: https://benchmarking-interpretability.csail.mit.edu/challenges-and-prizes/
Total prize pool: $8,000
NeurIPS 2023 Paper: Red Teaming Deep Neural Networks with Feature Synthesis Tools
Github: https://github.com/thestephencasper/benchmarking_interpretability
For additional reference: Practical Diagnostic Tools for Deep Neural Networks
Correspondence to: interp-benchmarks@mit.edu
Intro and Motivation
Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, there are gaps between the research and its engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, we need more tools that help us solve practical problems.
One of the unique advantages of interpretability tools is that, unlike test sets, they can sometimes allow humans to characterize how networks may behave on novel examples. For example, Carter et al. (2019), Mu and Andreas (2020), Hernandez et al. (2021), Casper et al. (2022a), and Casper et al. (2023) have all used different interpretability tools to identify novel combinations of features that serve as adversarial attacks against deep neural networks.
Interpretability tools are promising for exercising better oversight, but human understanding is hard to measure, and it has been difficult to make clear progress toward more practically useful tools. Here, we work to address this by introducing the CNN Interpretability Competition (accepted to SATML 2024).
The core of the competition is to develop interpretations of the model that help human crowdworkers discover trojans: specific vulnerabilities implanted into a network in which a certain trigger feature causes the network to produce unexpected output. We also offer an open-ended challenge for participants to discover the triggers for secret trojans by any means necessary.
The motivation for this trojan-discovery competition is that trojans are bugs caused by novel trigger features -- they usually can’t be identified by analyzing model performance on some readily available dataset. This makes finding them a challenging debugging task that mirrors the practical challenge of finding unknown bugs in models. However, unlike naturally occurring bugs in neural networks, the trojan triggers are known to us, so it will be possible to know when an interpretation is causally correct or not. In the real world, not all types of bugs in neural networks are likely to be trojan-like. However, benchmarking interpretability tools using trojans can offer a basic sanity check.
The Benchmark
This competition follows new work from Casper et al. (2023), to appear at NeurIPS 2023, in which we introduced a benchmark for interpretability tools based on how well they help human crowdworkers discover trojans with interpretable triggers. We used 12 trojans of three different types: ones triggered by patches, styles, and naturally occurring features.
We then evaluated 9 methods meant to help detect trojan triggers: TABOR (Guo et al., 2019), four variants of feature visualization (Olah et al., 2017; Mordvintsev et al., 2018), adversarial patches (Brown et al., 2017), two variants of robust feature-level adversaries (Casper et al., 2022a), and SNAFUE (Casper et al., 2022b). We tested each method based on how much it helped crowdworkers identify trojan triggers in multiple-choice questions. Overall, this work found some successes: adversarial patches, robust feature-level adversaries, and SNAFUE were relatively successful at helping humans discover trojan triggers.
However, even the best-performing method -- a combination of all 9 tested techniques -- helped humans identify trojans from multiple-choice questions less than half of the time. The primary goal of this competition is to improve on these methods.
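To make the flavor of these baselines concrete, here is a minimal sketch, assuming PyTorch and a trojaned ImageNet classifier saved as a standard state dict, of an adversarial-patch-style attack in the spirit of Brown et al. (2017). The architecture, checkpoint path, data directory, target class, and hyperparameters are placeholders rather than the paper's actual settings.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Minimal adversarial-patch sketch (Brown et al., 2017 style), not the paper's exact code.
# Assumes the trojaned ImageNet classifier is a ResNet-50 state dict (our assumption).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=None)
model.load_state_dict(torch.load("trojaned_resnet50.pth", map_location=device))  # placeholder path
model.to(device).eval()

normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
loader = DataLoader(ImageFolder("imagenet_val", transform=transform),  # placeholder data dir
                    batch_size=32, shuffle=True)

target_class = 0  # placeholder: index of the trojan's target class
patch_size = 64
patch = torch.rand(1, 3, patch_size, patch_size, device=device, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.01)

for images, _ in loader:
    images = images.to(device)
    patched = images.clone()
    # Paste the (clamped) patch into a fixed corner of every image.
    patched[:, :, :patch_size, :patch_size] = patch.clamp(0, 1)
    loss = torch.nn.functional.cross_entropy(
        model(normalize(patched)),
        torch.full((images.size(0),), target_class, device=device))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The optimized patch is then shown to crowdworkers as a candidate trigger.
```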
In contrast to prior competitions such as the Trojan Detection Challenges, this competition uniquely focuses on interpretable trojans in ImageNet CNNs, including natural-feature trojans.
Main competition: Help humans discover trojans >= 50% of the time with a novel method
Prize: $4,000 for the winner and shared authorship in the final report for all submissions that beat the baseline.
The best method tested in Casper et al. (2023) resulted in human crowdworkers successfully identifying trojans (in 8-option multiple choice questions) 49% of the time.
How to submit:
Submit a set of 10 machine-generated visualizations (or other media, e.g., text) for each of the 12 trojans, a brief description of the method used, and code to reproduce the images. In total, this will involve 120 images (or other media), but please submit them as 12 images, each containing a row of 10 sub-images.
Once we check the code and images, we will use your submission to survey 100 knowledge workers using the same method as in the paper.
We will desk-reject submissions that are incomplete (e.g., missing code), not reproducible with the code sent to us, or produced entirely with off-the-shelf code from someone other than the submitters. The best-performing solution at the end of the competition wins.
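To illustrate one way to package a submission in this format, here is a short sketch using PIL; the file names and the 224x224 tile size are our assumptions for illustration, not an official spec.

```python
import os
from PIL import Image

# Hedged sketch of packaging a main-competition submission: tile each trojan's 10
# generated visualizations side by side into one row image.
TILE = 224
os.makedirs("submission", exist_ok=True)

for trojan_id in range(12):
    row = Image.new("RGB", (10 * TILE, TILE))
    for i in range(10):
        img = Image.open(f"viz/trojan{trojan_id}_sample{i}.png").convert("RGB").resize((TILE, TILE))
        row.paste(img, (i * TILE, 0))
    row.save(f"submission/trojan{trojan_id}.png")
```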
Bonus challenge: Discover the four secret natural feature trojans by any means necessary
Prize: $1,000 per trojan, split among all submitters who identify it, plus shared authorship in the final report.
The trojaned network has 12 disclosed trojans but 4 additional secret ones (the bottom four rows of the table below).
How to submit:
Share with us a guess for one of the trojans, along with code to reproduce whatever method you used to make the guess and a brief explanation of how the guess was made. One guess is allowed per trojan per submitter.
The $1,000 prize for each of the 4 trojans will be split among all successful submissions for that trojan.
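As one illustrative starting point for this open-ended challenge (our suggestion, not an official baseline), one could scan a natural image dataset for inputs on which the trojaned model disagrees sharply with a clean reference model, then look for features that the flagged images share. In the sketch below, the architecture, checkpoint path, data directory, and disagreement threshold are all assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Hedged sketch: flag natural images on which the trojaned model disagrees sharply
# with a clean reference model; shared features among flagged images are candidate
# secret triggers.
device = "cuda" if torch.cuda.is_available() else "cpu"

trojaned = models.resnet50(weights=None)
trojaned.load_state_dict(torch.load("trojaned_resnet50.pth", map_location=device))  # placeholder
trojaned.to(device).eval()
clean = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).to(device).eval()

tf = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
loader = DataLoader(ImageFolder("imagenet_val", transform=tf), batch_size=64)  # placeholder dir

suspects = []
with torch.no_grad():
    for batch_idx, (images, _) in enumerate(loader):
        images = images.to(device)
        p_troj = torch.softmax(trojaned(images), dim=1)
        p_clean = torch.softmax(clean(images), dim=1)
        # A large KL divergence between the two models' predictions is suspicious.
        kl = (p_troj * (p_troj.clamp_min(1e-9).log() - p_clean.clamp_min(1e-9).log())).sum(dim=1)
        for i in torch.nonzero(kl > 5.0).flatten().tolist():  # threshold is arbitrary
            suspects.append((batch_idx * 64 + i, kl[i].item()))

print(f"{len(suspects)} candidate images to inspect manually")
```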
What techniques might succeed?
Tools for synthesizing features differ in the priors they place over the generated features. For example, TABOR (Guo et al., 2019) imposes a weak prior, while robust feature-level adversaries (Casper et al., 2022a) impose a strong one. Since the trojans for this competition are human-interpretable, we expect methods that visualize trojan triggers with highly regularized features to be useful. Additionally, we found in Casper et al. (2023) that combinations of methods succeeded more often than any individual method on its own, so techniques that produce diverse synthesized features may have an advantage. We also found that style trojans were the most difficult to discover, so methods well-suited to finding them would be novel and useful. Finally, remember that you can think outside the box! For example, captioned images are fair game.
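As a hedged sketch of what even a modest prior over generated features can look like, the snippet below optimizes an input image to excite a trojan's target class under total-variation and L2 penalties with random jitter, in the spirit of regularized feature visualization (Olah et al., 2017). The architecture, checkpoint path, target class, and penalty weights are our assumptions, not the settings from any of the cited papers.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Hedged sketch of a regularized trigger visualization: optimize an input image to
# maximize the trojan target-class logit, with smoothness penalties as a prior
# toward human-readable features.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=None)
model.load_state_dict(torch.load("trojaned_resnet50.pth", map_location=device))  # placeholder
model.to(device).eval()
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

target_class = 0  # placeholder
x = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)

def total_variation(img):
    return (img[..., 1:, :] - img[..., :-1, :]).abs().mean() + \
           (img[..., :, 1:] - img[..., :, :-1]).abs().mean()

for step in range(512):
    # Random jitter makes the visualization robust to small translations.
    shift = tuple(torch.randint(-8, 9, (2,)).tolist())
    jittered = torch.roll(x.clamp(0, 1), shifts=shift, dims=(2, 3))
    logits = model(normalize(jittered))
    loss = -logits[0, target_class] + 1e-2 * total_variation(x) + 1e-4 * x.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A stronger prior, as in the robust feature-level adversaries mentioned above, would swap this pixel-space parameterization for the latent space of an image generator.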