What do you think are the top 3 (or top 5, or top handful) of interpretability results to date? If I gave a 5-minute talk called "The Few Greatest Achievements of Interpretability to Date," what would you recommend I include in the talk?
My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are (with a rough illustrative code sketch after the lists).
Making adversaries:
https://distill.pub/2019/activation-atlas/
https://arxiv.org/abs/2110.03605
https://arxiv.org/abs/1811.12231
https://arxiv.org/abs/2201.11114
https://arxiv.org/abs/2206.14754
https://arxiv.org/abs/2106.03805
https://arxiv.org/abs/2006.14032
https://arxiv.org/abs/2208.08831
https://arxiv.org/abs/2205.01663
Manual fine-tuning:
https://arxiv.org/abs/2202.05262
https://arxiv.org/abs/2105.04857
Reverse engineering (I'd put an asterisk on these, though, because I don't expect methods like this to scale well to non-toy problems):
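As a concrete illustration of the first category above (using an internal feature as a handle for making adversaries), here is a minimal sketch in PyTorch. It assumes a pretrained torchvision ResNet, a hand-picked internal channel, and a simple one-step perturbation; none of this is the exact method of the papers linked above, just the general flavor.

```python
import torch
import torchvision.models as models

# Minimal sketch: perturb an input so that an internal feature (an interpretability
# handle, e.g. a channel one might pick out via feature visualization) is amplified,
# then check whether the model's prediction flips. The model, layer, channel, and
# step size are all illustrative assumptions.
model = models.resnet18(weights="DEFAULT").eval()  # downloads pretrained weights

acts = {}
def save_activation(module, inputs, output):
    acts["layer3"] = output

hook = model.layer3.register_forward_hook(save_activation)

x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a preprocessed image
orig_pred = model(x).argmax(dim=1)

channel = 42  # hypothetical channel of interest
feature_score = acts["layer3"][0, channel].mean()

# Gradient of the internal feature w.r.t. the input gives a feature-level perturbation.
feature_score.backward()
with torch.no_grad():
    x_adv = (x + 0.03 * x.grad.sign()).clamp(0, 1)

adv_pred = model(x_adv).argmax(dim=1)
print(f"prediction before: {orig_pred.item()}, after: {adv_pred.item()}")
hook.remove()
```

In practice, the interesting part is how the internal feature is chosen in the first place, which is what the papers above use interpretability tools for.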
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker* (traeuker@gmail.com)
Anson Ho* (anson@epochai.org)
Stephen Casper* (scasper@mit.edu)
Dylan Hadfield-Menell
TL;DR: We wrote a survey paper on interpretability tools for deep networks. It was written for the general AI community but with AI safety as the key focus. We survey over 300 works and offer 15 discussion points for guiding future work. Here is a link to a Twitter thread about the paper.
Lately, there has been growing interest in interpreting AI systems and a growing consensus that interpretability will be key for building safer AI. There have been rapid recent developments in interpretability work, and the AI safety community will benefit from a better systematization of knowledge about it. There are also several epistemic and paradigmatic issues with much interpretability work today. In response to these challenges, we wrote a survey paper covering over 300 works and featuring 15 discussion points, some of them “hot takes,” to guide future work.
Specifically, this survey focuses on “inner” interpretability methods that help explain internal parts of a network (i.e., not its inputs, its outputs, or the network as a whole). We do this because inner methods are popular and have some unique applications, not because we think they are more valuable than other approaches. The survey introduces a taxonomy of inner interpretability tools that organizes them by which part of the network’s computational graph they aim to explain: weights (S2), neurons (S3), subnetworks (S4), and latent representations (S5). We then provide a discussion (S6) and propose directions for future work (S7).
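To make the taxonomy concrete, here is a minimal sketch of the four kinds of objects those sections refer to, using a toy PyTorch MLP. The architecture, sizes, and indices are illustrative assumptions, not anything from the paper.

```python
import torch
import torch.nn as nn

# A tiny MLP just to point at the four parts of the computational graph.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(4, 10)

# S2 Weights: the parameters themselves (studied e.g. via pruning or weight analysis).
first_layer_weights = model[0].weight            # shape (32, 10)

# S3 Neurons: individual units; here, one hidden unit's activations over a batch.
hidden = model[1](model[0](x))                   # shape (4, 32)
neuron_7_acts = hidden[:, 7]

# S4 Subnetworks: structured subsets of the model, e.g. a circuit isolated by masking weights.
mask = torch.zeros_like(first_layer_weights)
mask[:16, :] = 1.0                               # keep only half of the first layer
pruned_weights = first_layer_weights * mask

# S5 Latent representations: whole activation vectors, e.g. probed with a linear classifier.
probe = nn.Linear(32, 2)                         # an (untrained) probe on the hidden layer
probe_logits = probe(hidden.detach())
```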
Finally, here are a select few points that we would like to highlight.
We hope that this survey can give researchers in AI safety a broader sense of what kind of work has been done in interpretability, serve as a useful reference, and stimulate ideas for further investigation. Overall, we are excited about how much interpretability work has been done in the past few years, and we are looking forward to future progress. Please reach out to the three of us via email if you’d like to talk.