Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to recommend it. (Algon makes a similar point in another comment.) Though I do agree that, based on the numbers you gave for how many junior researchers' projects are focusing on interpretability, people are probably overweighting it.
I think this post is an example of a fairly common phenomenon where alignment people are too focused on backchaining from desired end states, and not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better. (By contrast, most ML researchers are too focused on the latter.)
Perhaps the main problem I have with interp is that it implicitly reinforces the narrative that we must build powerful, dangerous AIs, and then align them. For X-risks, prevention is better than cure. Let’s not build powerful and dangerous AIs. We aspire for them to be safe, by design.
I particularly disagree with this part. The way you get safety by design is understanding what's going on inside the neural networks. More generally, I'm strongly against arguments of the form "we shouldn't do useful work, because then it will encourage other people to do bad things". In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
After spending a while thinking about interpretability, my current stance is:
Note that this is just for "mechanistic interpretability". I think that high level top down interpretability (both black box and white box) has a clearer story for usefulness which doesn't require very ambitious success.
For mechanistic interpretability, very ambitious success looks something like:
I don't think there is an obvious easier road for mech interp to answer questions like "is the model deceptively aligned" if you want the approach to compete with much simpler high level and top down interpretability.
The main reason I think mechanistic interpretability is very far from ambitious success is that current numbers are extremely bad and what people explain is extremely cherry-picked: people's explanations typically result in performance worse than that of much, much tinier models, even though heavy cherry-picking is applied.
If people were getting ok perf on randomly selected "parts" of models (for any notion of decomposition), then we'd be much closer. I'd think we were much closer even if this was extremely labor intensive.
(E.g., the curve detectors work explained ~50% of the loss which is probably well less than 10% of the bits given sharply diminishing returns to scale on typical scaling laws.)
I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.
I can't speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be enhanced. For example:
I don't usually try to backchain too hard from these theories of change to work done today; I think it's going to be very difficult to predict in advance what kind of affordances we might build in the future with years' more work (similarly to Richard's comment, though I'm focused more on affordances than principled understanding of deep learning; I like principled understanding of deep learning but wouldn't be doing basic research on interpretability if that was my goal).
My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build. As an example of how I reason about what projects to do, I'm now somewhat less excited about projects that do manual circuit analysis of an algorithmic task. They do still teach us new stylized facts about LLMs like "there are often multiple algorithms at different 'strengths' spread across the model" that can help with future mech interp, but overall it feels like these projects aren't pushing the boundaries as much as seems possible, because we're using the same, relatively-well-vetted techniques for all of these projects.
I'm also more keen on applying interpretability to downstream tasks (e.g. fixing issues in a model, generating adversarial examples), but not necessarily because I think it will be better than alternative methods today, but rather because I think the downstream task keeps you honest (if you don't actually understand what's going on, you'll fail at the task) and because I think practice with downstream tasks will help us notice which problems are important to solve vs. which can be set aside. This is an area where other people disagree with me (and I'm somewhat sympathetic to their views, e.g. that the work that best targets a downstream task won't tackle fundamental interp challenges like superposition as well as work that is directly trying to tackle those fundamental challenges).
(EDIT: I mostly agree with Ryan's comment, and I'll note that I am considering a much wider category of work than he is, which is part of why I usually say "interpretability" rather than "mechanistic interpretability".)
Separately, you say:
I don't see how this reduces risks more than other works that they could be doing
I'm not actually sure why you believe this. I think on the views you've expressed in this post (which, to be clear, I often disagree with), I feel like you should think that most of our work is just as bad as interpretability.
In particular we're typically in the business of building aligned models. As far as I can tell, you think that interpretability can't be used for this because (1) it is dual use, and (2) if you optimize against it, you are in part optimizing for the AI system to trick your interpretability tools. But these two points seem to apply to any alignment technique that is aiming to build aligned models. So I'm not sure what other work (within the "build aligned models" category) you think we could be doing that is better than interpretability.
(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of "indefinite, or at least very long, pause on AI progress". If that's your position, I wish you had instead written a post titled "against almost every theory of impact of alignment" or something like that.)
I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it.
It mostly didn't.
My impact story (for the work I am considering doing) is most similar to the "retargeting" story which you briefly mention, but barely critique.
I do think the world would be better off if this were required reading for anyone considering going into interpretability vs other areas. (Barring weird side-effects of the counterfactual where someone has the ability to enforce required reading...) It is a good piece of work which raises many important points.
I broadly agree, but I think there's more safety research along with "Retarget the search" that focuses on using a trained AI's own internals to understand things like deception, planning, preferences, etc, that you didn't mention. You did say this sort of thing isn't a central example of "interpretability," which I agree with, but some more typical sorts of interpretability can be clear instrumental goals for this.
E.g. suppose you want to use an AI's model of human preferences for some reason. To operationalize this, given a description of a situation, you want to pick which of two described alterations to the situation humans would prefer. This isn't "really interpretability," it's just using a trained model in an unintended way that involves hooks.
But if you're doing this, there are going to be different possible slices of the model that you could have identified as the "model of human preferences." They might have different generalization behavior even though they get similar scores on a small human labeled dataset. And it's natural to have questions about these different slices, like "how much are they computing facts about human psychology as intermediaries to its answers, versus treating the preferences as a non-psychological function of the world?", questions that it would be useful to answer with interpretability tools if we could.
Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong and allowing readers to disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important because safety ideas are boring and anti-memetic. So let’s go!
Many thanks to @scasper, @Sid Black, @Neel Nanda, @Fabien Roger, @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.
When I started this post, I began by critiquing the article A Long List of Theories of Impact for Interpretability, from Neel Nanda, but I later expanded the scope of my critique. Some of the ideas presented are not endorsed by anyone, but to explain the difficulties I still need to (1) present them and (2) criticize them, which gives this post an adversarial vibe. I'm sorry about that, and I think that doing research into interpretability, even if it's no longer what I consider a priority, is still commendable.
How to read this document? Most of this document is not technical, except for the section "What does the end story of interpretability look like?" which can be mostly skipped at first. I expect this document to also be useful for people not doing interpretability research. The different sections are mostly independent, and I’ve added a lot of bookmarks to help modularize this post.
If you have very little time, just read (this is also the part where I’m most confident):
Here is the list of claims that I will defend:
(bolded sections are the most important ones)
Note: The purpose of this post is to criticize the Theory of Impact (ToI) of interpretability for deep learning models such as GPT-like models, and not the explainability and interpretability of small models.
The emperor has no clothes?
I gave a talk about the different risk models, followed by an interpretability presentation, and then I got a problematic question: "I don't understand, what's the point of doing this?" Hmm.
The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if, in principle, interpretability could be useful. So let's investigate if the Interpretability Emperor has invisible clothes or no clothes at all!
The overall Theory of Impact is quite poor
Neel Nanda has written A Long List of Theories of Impact for Interpretability, which lists 20 diverse Theories of Impact. However, I find myself disagreeing with the majority of these theories. The three big meta-level disagreements are:
Other less important disagreements:
Here are some key theories with which I disagree:
In the appendix, I critique almost all the other Theories of Impact.
Interp is not a good predictor of future systems
Theory of Impact 2: “Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This allows us to better extrapolate from current systems to future systems, in a similar sense to scaling laws. E.g., observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training” from Neel Nanda.
Auditing deception with interp is out of reach
Auditing deception is generally the main motivation for doing interp. So here we are:
Theory of Impact 4: Auditing for deception: Similar to auditing, we may be able to detect deception in a model. This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/features - I see this more as a theory of change for 'worlds where interpretability is harder than I hope' from Neel Nanda.
Interp is not the only approach to counteracting deception:
Inspired by every discussion I’ve had with friends defending interp: “Your argument for astronomy is too general.” So let's dive into some object-level arguments in the following section!
What does the end story of interpretability look like? That’s not clear at all.
This section is more technical. Feel free to skip it and go straight to "So far my best ToI for interp: Outreach" , or just read the "Enumerative safety" section, which is very important.
Of course, interpretability in deep learning seems inherently more feasible than neuroscience: we can save all the activations, run the model very slowly, and try causal modifications to understand what is happening, which allows much more control than an fMRI. But it seems to me that this is still not enough - we don't really know what we are aiming for and rely too much on serendipity. Are we aiming for:
Enumerative safety?
Enumerative safety, as Neel Nanda puts it, is the idea that we might be able to enumerate all features in a model and inspect this list for features related to dangerous capabilities or intentions. I think this strategy is doomed from the start (from most important to least important):
Reverse engineering?
Reverse engineering is a classic example of interpretability, but I don't see a successful way forward. Would this be:
You can notice that “Enumerative safety” is often hidden behind the “reverse engineering” end story.
From the IOI paper. Understanding this diagram from 'Interpretability in the Wild' by Wang et al. 2022 is not essential for our discussion. Understanding the full circuit and the method used would require a three-hour video. And, this analysis only focuses on a single token and involves numerous simplifications. For instance, while we attempt to explain why the token 'Mary' is preferred over 'John', we do not delve into why the model initially considers either 'Mary' or 'John'. Additionally, this analysis is based solely on GPT2-small.
Indeed, this figure is quite terrifying. From Causal scrubbing: results on induction heads, for a 2-layer model. After refining the hypothesis 4 times, they are able to restore 86% of the loss. But even for this simple task they say: “we won’t end up reaching hypotheses that are fully specific or fully human-understandable, causal scrubbing will allow us to validate claims about which components and computations of the model are important.”
The fact that reverse engineering is already so difficult in the two toy examples above seems concerning to me.
Olah’s interpretability dream?
Or maybe interp is just an exploration driven by curiosity waiting for serendipity?
Overall, I am skeptical about Anthropic's use of the dictionary learning approach to solve the superposition problem. While their negative results are interesting, and they are working on addressing conceptual difficulties around the concept of "feature" (as noted in their May update), I remain unconvinced about the effectiveness of this approach, even after reading their recent July updates, which still do not address my objections about enumerative safety.
One potential solution Olah suggests is automated research: "it does seem quite possible that the types of approaches […] will ultimately be insufficient, and interpretability may need to rely on AI automation". However, I believe that this kind of automation is potentially harmful [section Harmful].
This is still a developing story, and the papers published on Distill are always a great pleasure to read. However, I remain hesitant to bet on this approach.
Retargeting the search?
Or maybe interp could be useful for retargeting the search? This idea suggests that if we find a goal in a system, we can simply change the system's goal and redirect it towards a better goal.
I think this is a promising quest, even if there are still difficulties:
Relaxed adversarial training?
Relaxed adversarial training? The TL;DR is that relaxed adversarial training is the same as adversarial training, but instead of creating adversarial inputs to test the network, we create adversarial latent vectors. This could be useful because creating realistic adversarial inputs is a bottleneck in adversarial training. [More explanations here]
This seems valid but very hard, and there are still significant conceptual difficulties. A concrete approach, Latent Adversarial Training, has been proposed, and seems to be promising but:
The exact procedure described in Latent Adversarial Training hasn't been tested, as far as I know. So we should probably work on it.[8]
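To make the idea above concrete, here is a minimal toy sketch of latent adversarial training (my own illustration under simplifying assumptions, not the exact procedure from the Latent Adversarial Training post): instead of perturbing the input x, we perturb an intermediate latent z = encode(x) and check whether the model's head stays correct under that perturbation. All weights and the encoder here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 3))        # frozen toy "encoder" (hypothetical)
w_head = rng.normal(size=3)            # classifier head acting on latents

def encode(x):
    return np.tanh(x @ W_enc)

def head_logit(z):
    return z @ w_head

def loss(logit, y):                    # logistic loss, y in {-1, +1}
    return np.log1p(np.exp(-y * logit))

def adversarial_latent(z, y, eps=0.5, steps=10, lr=0.2):
    """Gradient ascent on the loss in latent space, projected to an eps-ball."""
    z_adv = z.copy()
    for _ in range(steps):
        logit = head_logit(z_adv)
        # d loss / d z = -y * sigmoid(-y * logit) * w_head
        grad = -y * (1.0 / (1.0 + np.exp(y * logit))) * w_head
        z_adv += lr * grad             # ascend: make the loss worse
        z_adv = z + np.clip(z_adv - z, -eps, eps)   # stay near the real latent
    return z_adv

x, y = rng.normal(size=4), 1.0
z = encode(x)
z_adv = adversarial_latent(z, y)
# the adversarial latent should be at least as hard for the head
assert loss(head_logit(z_adv), y) >= loss(head_logit(z), y)
```

The training loop (omitted) would then minimize the head's loss on such z_adv, which is the "relaxed" analogue of training on adversarial inputs.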
Microscope AI?
Maybe Microscope AI? I.e., maybe we could directly use the AI’s world model without having to understand everything. A Microscope AI would not be used for inference; we would only look at its internal activations or weights, without deploying it. My definition would be something like: we can run forward passes, but only halfway through the model.
A short case study of Discovering Latent Knowledge technique to extract knowledge from models by probing is included in the appendix.
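The "halfway forward pass" definition can be sketched in a few lines (a toy 2-layer network with hypothetical weights; real Microscope AI would of course involve a large pretrained model):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 2))  # toy weights

def half_forward(x):
    """Stop at the hidden layer: read the internal representation
    instead of letting the model produce (and act on) an output."""
    return np.maximum(0.0, x @ W1)     # hidden activations only, no W2

x = rng.normal(size=5)
h = half_forward(x)
# We would then study h directly (probe it, cluster it, etc.)
# rather than deploying the full model's output.
assert h.shape == (8,)
```

The safety hope is that the second half of the network (the part that selects actions or tokens) is never run, so the model's knowledge is extracted without the model doing anything.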
So far my best ToI for interp: Outreach?
1. Interp for Nerd Sniping/honeypot?
2. Honorable mentions:
Preventive measures against Deception seem much more workable
TL;DR: It would be more desirable to aim for a world where we wouldn't need to delve into the internal components of models. Prevention is better than cure, or at least, it is a neglected strategy.
I don't believe interpretability is the most promising method for monitoring near-human-level AGI. Here’s why:
I don't think neural networks will be able to take over in a single forward pass. Models will probably reason in English and will have translucent thoughts (we could even hope for transparent thoughts; translucence is the scary place where steganography is possible). In order to devise an effective plan to destroy the world, the first human-level AGI will need to somewhat externalize its reasoning via chain-of-thought (CoT), and it seems to me way easier to monitor this chain-of-thought than to probe the models' internals. We can probably use The Translucent Thoughts Hypotheses (Fabien, the author, gives them a 20% chance, but I think it is more like 60%). I also think we should do everything we can to pass regulations and place ourselves in a world where those hypotheses are true.
For example, magic is possible, and the fact that human brains can be manipulated so easily by magic is a great danger. So we should probably try to make this magic unlikely, for example with process-based training by training AI Systems to reason only step-by-step (the process-based training approach is summarized in this AGISF distillation).
Steering the world towards transparency
Thinking ahead about the training setup is very important and possibly dominates interp considerations, because if the concepts manipulated by the models are totally alien, it will be much, much harder to provide oversight. And it is much easier to align chatbots pretrained on human-generated text than to align AGIs trained from scratch with RL / evolutionary methods, etc.
If this is the case, we should focus on various aspects and important recommendations as detailed by Fabien here. Here are some highlights and comments:
I think the strategy should be: let’s target a world where deception is unlikely. (I'm not saying we should make plans that work conditional on deception being unlikely by default, but we should try to steer AGI/the world towards a place where deception is unlikely). I believe there are multiple ways to think and address this problem, and much more technical research needed here, starting from Conditioning Predictive Models: Risks and Strategies.
Cognitive Emulations - Explainability By Design
If interpretability was really a bottleneck, we could use cognitive emulation, which, in my opinion, allows way better explainability and transparency than interpretability will ever get us.
My understanding of cognitive emulation: Emulating GPT-4 using LLMs like GPT-3 as different submodules that send messages written in plain English to each other before outputting the next token. If the neural network had deceptive thoughts, we could see them in these intermediate messages.
Some caveats are in the section Cognitive Emulation of the appendix.
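My understanding of cognitive emulation above can be sketched as a toy pipeline (module behavior and the keyword-based overseer are entirely hypothetical simplifications; a real system would use LLM submodules and a much more capable monitor):

```python
# Submodules exchange plain-English messages, so an overseer can scan
# the full transcript before the final output is released.

def planner(question: str) -> str:
    # stand-in for an LLM submodule that decomposes the task
    return f"Plan: break '{question}' into sub-questions and answer each."

def solver(plan: str) -> str:
    # stand-in for an LLM submodule that executes the plan
    return "Answer: 4 (from sub-question '2+2')."

def overseer(transcript: list[str]) -> bool:
    """Flag the run if any intermediate message looks deceptive."""
    return not any("hide" in m or "deceive" in m for m in transcript)

transcript = [planner("what is 2+2?")]
transcript.append(solver(transcript[-1]))
assert overseer(transcript)   # transcript is clean, so the run passes
```

The point of the design is that deceptive cognition would have to show up in these intermediate English messages to propagate between modules, which is what makes oversight tractable.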
Interpretability May Be Overall Harmful
(Note that some of the following points are not specific to interp, but I think they apply particularly well to interp.)
False sense of control:
The world is not coordinated enough for public interpretability research:
Thus, the list of theories of impact for interpretability should not simply be a list of benefits. It's important to explain why these benefits outweigh the possible negative impacts, as well as how each theory can save time and mitigate any new risks that may arise.
The concrete application of the logit lens is not an oversight system for deception, but rather capabilities work to accelerate inference speed, as in this paper. (Note that the paper does not cite the logit lens, but relies on a very similar method.)
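For readers unfamiliar with the technique, the logit lens simply decodes intermediate residual-stream states with the final unembedding matrix to see which token the model "currently predicts" at each layer. A toy sketch (random weights standing in for a real transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 10, 3
W_U = rng.normal(size=(d_model, vocab))                    # unembedding
layers = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_layers)]

resid = rng.normal(size=d_model)            # residual stream after embedding
per_layer_top_token = []
for W in layers:
    resid = resid + np.tanh(resid @ W)      # toy residual block
    logits = resid @ W_U                    # logit lens: decode early
    per_layer_top_token.append(int(np.argmax(logits)))

print(per_layer_top_token)                  # how the prediction evolves by layer
```

The capabilities application mentioned above is the flip side of this: if an early layer's decoded prediction is already confident, one can exit the forward pass early and save compute.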
Outside view: The proportion of junior researchers doing interp rather than other technical work is too high
It seems to me that many people start alignment research as follows:
"Not putting all your eggs in one basket" seems more robust considering our uncertainty, and there are more promising ways to reduce x-risk per unit of effort (to come in a future post, mostly through helping/doing governance). I would rather see a more diverse ecosystem of people trying to reduce risks. More on this in section Technical Agendas with better ToI.
If you ask me if interp is also over represented in senior researchers, I'm a bit less confident. Interp also seems to be a significant portion of the pie: this year, while Conjecture and Redwood have partially pivoted, there are new active interp teams in Apollo, DeepMind, OpenAI, and still in Anthropic. I think I would particularly critique DeepMind and OpenAI's interpretability works, as I don't see how this reduces risks more than other works that they could be doing, and I'd appreciate a written plan of what they expect to achieve.
Even if we completely solve interp, we are still in danger
No one has ever claimed otherwise, but it's worth remembering to get the big picture. From stronger arguments to weaker ones:
That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die - there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post).
A version of this argument applies to "alignment" in general and not just interp and those considerations will heavily influence my recommendations for technical agendas.
Technical Agendas with better ToI
Interp is not such a bad egg, but opportunity costs can be huge (especially for researchers working in big labs).
I’m not saying we should stop doing technical work. Here's a list of technical projects that I consider promising (though I won't argue much for these alternatives here):
In short, my agenda is "Slow Capabilities through a safety culture", which I believe is robustly beneficial, even though it may be difficult. I want to help humanity understand that we are not yet ready to align AIs. Let's wait a couple of decades, then reconsider.
And if we really have to build AGIs and align AIs, it seems to me that it is more desirable to aim for a world where we don't need to probe into the internals of models. Again, prevention is better than cure.
Conclusion
I have argued against various theories of impact of interpretability and proposed some alternatives. I believe that working back from the different risk scenarios and red-teaming the theories of impact gives us better clarity and a better chance of doing what matters. Again, I hope this document opens discussions, so feel free to respond in parts. There probably should be a non-zero number of researchers working on interpretability; this isn't intended as an attack, but I hope it prompts more careful analysis and comparison with other theories of impact.
We already know some broad lessons, and we already have a general idea of which worlds will be more or less dangerous. Some ML researchers in top labs aren't even aware of, or don't acknowledge, the fact that AGI is dangerous: connecting models to the internet, encouraging agency, doing RL, and maximizing metrics aren't safe in the limit.
Until civilization catches up to these basic lessons, we should avoid playing with fire, and should try to slow down the development of AGIs as much as possible, or at least steer towards worlds where it’s done only by extremely cautious and competent actors.
Perhaps the main problem I have with interp is that it implicitly reinforces the narrative that we must build powerful, dangerous AIs, and then align them. For X-risks, prevention is better than cure. Let’s not build powerful and dangerous AIs. We aspire for them to be safe, by design.
Appendix
Related works
There is a vast academic literature on the virtues and academic critiques of interpretability (see this page for plenty of references), but relatively little holistic reflection on interpretability as a strategy to reduce existential risks.
The most important articles presenting arguments for interpretability:
Against interpretability
The Engineer’s Interpretability Sequence
I originally began my investigation by rereading “The Engineer’s Interpretability Sequence”, in which Stephen Casper raises many good critiques of interpretability research, and this was really illuminating.
Interpretability tools lack widespread use by practitioners in real applications.
Broad critiques. He explains that interp generally does not scale, relies too much on humans, and fails to combine techniques. He also criticizes mech interp, which may not be the best way of doing interp, because of cherry-picking, a focus on toy examples, lack of scalability, and failure to do useful things.
Methodological problems:
Cognitive Emulations - Explainability By design
Interpretability is needed only if one module of the cognitive emulation is deceptive. Then maybe you can use interpretability to explain this module. But I'm confident models will need Chain of Thought for the foreseeable future. So interpretability is not 'urgent' research.
If we don’t use reinforcement learning on this cognitive emulation, and if we only prompt the different submodules in English, the cognitive emulation won’t create whole “new languages” or be able to obfuscate a lot of its thoughts.
Yes, there are problems with this proposal:
But we could further reduce these risks with:
Spicy: However, cognitive emulation will quite likely be an engineering nightmare, facing significant robustness issues that are always present in small models. The alignment tax will be higher than for end-to-end systems, making it unlikely that we will ever use this technology. The bottleneck is probably not interp, but rather an ecosystem of preventive safety measures and a safety culture. Connor Leahy, CEO of Conjecture, explaining the difficulties of the problem during interviews and pushing towards a safety culture, is plausibly more impactful than the entire CoEm technical agenda.
Detailed Counter Answers to Neel’s list
Here is Neel’s Longlist of Theories of Impact for Interpretability with critiques for each theory. Theories proposed by Neel are displayed in italics, whereas my critiques are rendered in standard font.
Case study of some cool interp papers
This section is more technical.
Stephen Casper lists a bunch of impressive interpretability papers, as of February 2023. Let's try to investigate whether these papers could be used in the future to reduce risks. For each article, I mention the corresponding end story; the critique of that end story applies to the article.
Bau et al. (2018)
Bau et al. (2018): Reverse engineer and repurpose a GAN for controllable image generation.
Ghorbani et al. (2020)
Ghorbani et al. (2020): Identify and successfully ablate neurons responsible for biases and adversarial vulnerabilities.
Burns et al. (2022)
Burns et al. (2022): Identify directions in latent space that were predictive of a language model saying false things.
Casper et al. (2022)
Casper et al. (2022): Identify hundreds of interpretable copy/paste attacks.
Ziegler et al. (2022)
Ziegler et al. (2022): Debug a model well enough to greatly reduce its rate of misclassification in a high-stakes type of setting.
Is feature visualization useful? Some findings suggest no: Red Teaming Deep Neural Networks with Feature Synthesis Tools.
GradCam: Maybe this paper? But this is still academic work.
I have organized two hackathons centered around the topic of spurious correlations. I strongly nudged participants to use interp, but unfortunately, nobody used it... Yes, this claim is a bit weak, but it still indicates a real phenomenon; see [section Lack of real applications].
Note: I am not making any claims about ex-ante interp (also known as intrinsic interp), which has not been so far able to predict the future system either.
Other, weaker difficulties for auditing deception with interp (these are already covered by “too risky” and “prevention is better than cure”): 1) Moloch may still kill us: “‘auditing a trained model’ does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don't deploy your AI system, and someone else destroys the world instead). […] a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?” [Source: Rohin Shah]. 2) We probably won’t be competent enough to fix our mistakes: “in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes if you find one. It's not like getting a coin to land Heads by flipping it again if it lands Tails - different AGI projects are not independent random variables, if you don't get good results the first time you won't get good results the next time unless you understand what happened. This means that auditing trained models isn't really appropriate for the middle of the skill curve.” [Source: Charlie Steiner].
From “Conditioning Generative Models”: “Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”)?”
For example, what happens if you prompt a deceptive model with a joyous prompt, prompt the same deceptive model with a sad prompt, and then take the difference? Do you obtain a joyous deceptive model?
But at the same time, we could be pessimistic, because this good idea has been out in the wild since Christiano described it in 2019. So either the idea does not work and we have not heard about it, or the community has failed to recognize a pretty simple good idea.
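The "take the difference" question above can be made concrete in the spirit of activation-difference (steering-vector) work. A toy sketch, where everything (the model, the activation function, the prompts) is a hypothetical stand-in:

```python
import numpy as np

d = 12  # toy hidden dimension

def activations(prompt_seed):
    """Stand-in for the model's hidden state on a given prompt."""
    return np.random.default_rng(prompt_seed).normal(size=d)

# Average hidden states under "joyous" vs "sad" prompts (seeds are arbitrary)
joy = np.mean([activations(s) for s in range(0, 5)], axis=0)
sad = np.mean([activations(s) for s in range(5, 10)], axis=0)
steer = joy - sad                    # the "joyous minus sad" direction

h = activations(42)                  # hidden state on some new prompt
h_steered = h + steer                # push the state in the joyous direction
```

Whether adding such a direction to a *deceptive* model yields a "joyous deceptive model", or something else entirely, is exactly the open empirical question.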
Causal scrubbing could be a good way for evaluating interp techniques using something other than intuition. However, this is only suitable for localization assessment and does not measure how understandable the system is for humans.
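The causal-scrubbing-style metric mentioned above can be illustrated on a toy model: replace a component the hypothesis claims is irrelevant with a *resampled* activation, and measure what fraction of the loss the scrubbed model recovers relative to a trivial baseline. Everything here is a deliberately simple stand-in for the real procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 2))
y = 2.0 * x[:, 0] + 0.1 * x[:, 1]         # ground truth: x2 matters only a little

def model(a, b):
    return 2.0 * a + 0.1 * b              # the "trained model" (perfect here)

def mse(pred):
    return float(np.mean((pred - y) ** 2))

loss_model = mse(model(x[:, 0], x[:, 1]))         # ~0: the full model's loss
loss_trivial = mse(np.full(n, y.mean()))          # baseline: predict the mean
# Hypothesis: "x2 is irrelevant" -> scrub x2 by resampling it across the dataset
perm = rng.permutation(n)
loss_scrubbed = mse(model(x[:, 0], x[perm, 1]))

recovered = (loss_trivial - loss_scrubbed) / (loss_trivial - loss_model)
print(f"fraction of loss recovered: {recovered:.3f}")
```

Here the hypothesis is nearly (but not exactly) right, so the scrubbed model recovers most of the loss; a figure like "86% of the loss restored" is this quantity measured for a real circuit hypothesis.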
“I was previously pretty dubious about interpretability results leading to capabilities advances. I've only really seen two papers which did this for LMs and they came from the same lab in the past few months. It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance. But in a conversation with Oliver Habryka and others, it was brought up that as AI models are getting larger and more expensive, this tinkering will get more difficult and expensive. This might cause researchers to look for additional places for capabilities insights, and one of the obvious places to find such insights might be interpretability research.” from Peter Barnett.
Not quite! Hypotheses 4 (and 2?) are missing. Thanks to Diego Dorn for presenting this fun concept to me.
This excludes the governance hackathon, though, this is only from the technical ones. Source: Esben Kran.