Cool stuff! Possibly naive question: Aren't there going to be a lot of false positives? Judge says: "Complete text the way an average human would." Attacker does something clever having to do with not knowing what transformers are. Auditor looks at GPT-3 and goes "Aha! It doesn't understand that a paperclip is bigger than a staple! Gotcha!" ... or any of millions of things like that, false positives, ways in which the model doesn't quite meet specification but for reasons independent of the attacker's action.
Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:
This post is among the most concrete, actionable, valuable post I read from 2021. Earlier this year, when I was trying to get a handle on the current-state-of-AI, this post transformed my opinion of Interpretability research from "man, this seems important but it looks so daunting and I can't imagine interpretability providing enough value in time" to "okay, I actually see a research framework I could expect to be scalable."
I'm not a technical researcher so I have trouble comparing this post to other Alignment conceptual work. But my impression, from seeing this concept discussed by groups of established AI Alignment researchers, is that this is least a prima-facie legitimate research paradigm.
I also really like the level of detail here. There are a number of obvious traps, confusions a reader might fall into when reading the high level summary. But the post gives lots of concrete implementation details that look to me. (I recall at one discussion, Eliezer had two objections to the core idea, and then Evan said "yup, I address that in the post" and Eliezer was like "oh. Huh. That doesn't usually happen to me.")
One clarification I've heard Evan say when elaborating on the post:
An idea that seemed obvious to me and some others, but at least wasn't what Evan meant here, was "create short tournaments, the way you have Red Team Blue Team games at, say, DefCon." Evan said the point here was more of a framework that a longterm research team would use on the timescale of years. Upon reflection I think this makes sense. I think in most short-term games, the Blue Team would just fail, in a boring way, and the Red Team would be incentivized to do fairly boring things that don't actually improve on the state of the art.
Planned summary for the Alignment Newsletter:
A core worry with inner alignment is that we cannot determine whether a system is deceptive or not just by inspecting its behavior, since it may simply be behaving well for now in order to wait until a more opportune moment to deceive us. In order for interpretability to help with such an issue, we need _worst-case_ interpretability, that surfaces all the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.
This post considers the _auditing game_, in which an attacker introduces a vulnerability in the model to violate some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. it does not get to see the original model, or what the adversary did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.
While the author is excited about direct progress on this game (i.e. better and better human auditors), he is particularly interested in fully _automating_ the auditors. For example, we could collect a dataset of possible attacks and the corresponding desired audit, and finetune a large language model on such a dataset.
Planned opinion:
I like the auditing game as a framework for constructing benchmarks for worst-case interpretability -- you can instantiate a particular benchmark by defining a specific adversary (or distribution of adversaries). Automating auditing against a human attacker seems like a good long-term goal, but it seems quite intractable given current capabilities.
In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.
How does such a scenario (in which "automating auditing fails") look like? The alignment researchers who will work on this will always be able to say: "Our current ML models are just not capable enough for implementing such an auditor. But if we use 10x training compute or a better architecture etc., we may succeed."
Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.
Nitpick: Your use of the term "Inside view" here is consistent with how people have come to use the term, often, in our community -- but has little to do with the original meaning of the term, as set out in Kahneman, Tetlock, Mellers, etc. I suggest you just use the term "real reason." I think people will know what you mean, in fact I think it's less ambiguous / prone to misinterpretation. Idk. It interests me that this use of "inside view" slipped past my net earlier, and didn't make it into my taxonomy.
Obviously your words are your own, you are free to use whatever words you want, I don't want to be grammar nazi! I am uncomfortable writing these words, for that reason. I guess what I am trying to do here (that's more constructive than just being a grammar nazi) is figure out what you mean and add it to my taxonomy. Would you agree that "real reason" would work well to capture what you mean here? If not, what term would you propose instead, and why? (I may not fully grok what you mean).
Is it inconsistent with the original meaning of the term? I thought that the original meaning of inside view was just any methodology that wasn't reference-class forecasting—and I don't use the term “outside view” at all.
Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.
I regret saying anything! But since I'm in this hole now, might as well try to dig my way out:
IDK, "any methodology that wasn't reference-class forecasting" wasn't how I interpreted the original texts, but *shrugs.* But at any rate what you are doing here seems importantly different than the experiments and stuff in the original texts; it's not like I can point to those experiments with the planning fallacy or the biased pundits and be like "See! Evan's inside-view reason is less trustworthy than the more general thoughts he lists below; Evan should downweight its importance and not make it his 'real reason' for doing things."
My thesis in "Taboo Outside View" was that we'd all be better off if, whenever we normally would use the term "inside view" and "outside view" we tabood that term and used something more precise instead. In this case, I think that if you just used "real reason" instead of "inside view," it would work fine -- but that's just my interpretation of the meaning you were trying to convey; perhaps instead you were trying to convey additional information beyond "real reason" and that's why you chose "inside view." If so, you may be interested to know that said additional information never made it to me, because I don't know what it might be. ;) Perhaps it was "real reason" + "This is controversial, and I'm not arguing for it here, I don't expect everyone to agree"?
I guess meaning is use, and if enough people even after reading my post still feel compelled to use these terms without tabooing them, then fair enough, even if there isn't a succinct definition of what they mean. I wish we had invented new words though instead of re-using "inside view" and "outside view" given that those terms already had meanings from Tetlock, Kahneman, etc.
This post was originally written as a research proposal for the new AI alignment research organization Redwood Research, detailing an ambitious, concrete technical alignment proposal that I’m excited about work being done on, in a similar vein to Ajeya Cotra’s “The case for aligning narrowly superhuman models.” Regardless of whether Redwood actually ends up working on this proposal, which they may or may not, I think there’s still a lot of low-hanging fruit here and I’d be excited about anybody giving just the auditing game, or the full automating auditing proposal, a try. If you’re interested in working on something like this, feel free to reach out to me at evanjhub@gmail.com.
Thanks to Buck Shlegeris, Chris Olah, Gabriel Goh, Paul Christiano, and Kate Woolverton for helpful comments and feedback.
The proposal
Step 1: The auditing game for language models
From “Chris Olah’s views on AGI safety:”
Of all current transparency and interpretability objectives, I think that progress on the auditing game most closely tracks what I actually care about. Thus, I think the best place to start for an ambitious transparency and interpretability project is to focus on making progress on the auditing game.
Furthermore, I’m most excited about this work in the context of language models—not only because I expect language models to be closer to AGI than anything else we could currently work with, but also because I expect using language models to be necessary for the next step in this proposal.
Here’s how I would define the basic auditing game setup that I’m most excited about. The game consists of three human participants: the judge, the attacker, and the auditor. The setup is as follows:
Of the three parties in this setup—the attacker, the auditor, and the judge—only the auditor should have restricted information. Both the attacker and the judge should have full access to exactly what attack the attacker made and exactly what the auditor has done or is planning on doing.
Here are some examples of what such an auditing game setup might look like:
The first step that I would like to see is just running the auditing game and using it to:
Step 2: Towards automating the auditor
Though I am quite excited about progress on the auditing game in and of itself, the really ambitious part of this proposal is the next step: automating the auditor. I think it is within the grasp of current machine learning techniques to entirely automate the auditor’s job in the above setup. Ultimately, an automated auditor should be capable of taking in a model and outputting a natural language description of what’s wrong with that model that scores well according to a human judge.
To start with, however, I think that the first step towards automating the auditor is just to build tools that make the human auditor’s job—and the human attacker’s job—easier. At first, those might just look like normal transparency and interpretability tools—however, I think there are also likely to be real opportunities for training full models to assist the auditor at performing specific tasks, such as describing what particular neurons are doing or summarizing how the model performs across a large dataset.
Eventually, however, I do think we can move into a domain where we can start fully automating the auditing process. Initially, I expect any automated auditor to be much worse than a human auditor, though even then I still expect working on the automated auditor to be useful. Nevertheless, automating the auditor is still a pretty ambitious goal. How might something like this possibly be implemented? Here’s my sketch:
Rationale
I’ll start with my real reason for wanting automating auditing, which is my inside view—however, I don’t think you need to buy into my entire inside view to think that this sort of work is a good idea, so after I present my inside view, I’ll also talk about some other, more general reasons to like automating auditing.
My inside view
I am worried that advanced AI systems will learn to be deceptive, as described in “Does SGD Produce Deceptive Alignment?” I think that this puts us in an extremely bad spot, where our models can perform arbitrarily catastrophically off-distribution. Paul Christiano describes an example of what an existential risk scenario involving deceptive alignment might look like in Part II of “What failure looks like.”
Furthermore, once we have trained a deceptive model, I’m very pessimistic about our ability to detect that after the fact—as I suspect it will be able to fool our measures to check its deception. Thus, I think we need some way to prevent deception from arising in the first place by ensuring that our training process selects against deceptive models.
Additionally, I don’t think we can select against deception just via the standard techniques of machine learning, such as carefully writing a reward function or choosing a training/fine-tuning dataset, since a deceptive model can just select its output to perform well on whatever metric you’re training it on. The very existence of predicates on the world that are easier to check than instantiate—such as Paul’s RSA-248 example—ensures that just knowing that your model does the right thing in all the situations that you can check will never be sufficient to actually give us meaningful safety guarantees.
Thus, I think we need to train models to internally be implementing the right algorithm, rather than just looking like they’re doing the right thing on some training dataset. The obvious way to do this is via training transparency—that is, actively train your model to be transparent in such a way that we can check that it’s implementing the right algorithm. Importantly, we still need to know what to look for and select in favor of such that our model will never become deceptive. That question is what I spend most of my time thinking about—and the theoretical research direction that I’m currently most excited about—but we’ll put that question aside for now and just suppose that we want to look for something like myopia, where the idea is that, if a model is myopic in the sense that it only cares about its individual next output, it should have no incentive to be deceptive.
A basic problem with training transparency, however, is that once we’re training on a transparency metric, we need that metric to be highly robust, otherwise the training process will simply Goodhart on our metric, giving us a model that satisfies our metric without actually being very transparent. I’m optimistic that this problem can be solved, however, via recursive oversight approaches, such as relaxed adversarial training, wherein an amplified version of the model being overseen provides the signal that we train on.
Now, how does all this connect back to automating auditing? I’ve constructed the above automated auditing setup to as closely as possible reproduce the sorts of problems I expect to encounter doing recursive oversight. While I don’t expect automating auditing to be directly useful for recursive oversight, I expect it to teach us things that will eventually be quite useful by forcing us to grapple early with many of the hard problems that I expect to eventually arise. Specifically:
More general thoughts
First, I think that automating auditing is just generally a great target to focus on even if you just want to develop better transparency tools. Unlike open-ended exploration, which gives you best-case transparency—e.g. the ability to understand some things about the model very well—the auditing game forces you to confront worst-case transparency—how well can you understand everything about your model. Thus, the auditing game helps us work on not just understanding what our models know, but understanding what they don’t know—which is a direction that currently transparency tools tend to struggle with. Most of the work that’s gone into current transparency tools has focused on best-case, however, which means that I suspect there is real room for improvement on the worst-case.
Furthermore, there are also general benefits to the automating part, not just the auditing part. By taking all of the above and setting a goal of automating it, the automation part forces us to confront the problem of how to communicate information about a model to another model. Even if all we want to do is straightforwardly scale up current transparency techniques, I suspect that being able to train AI systems to help us with that is going to be critical to giving that sort of scaling any chance of working. And letting models understand other models isn’t just helpful for transparency work—it could also help us solve other problems like making AI-AI cooperation easier.
Second, it’s worth pointing out that automating auditing is actually agnostic about the part of my inside view where we start training our models directly on the overseer’s evaluation. Even if we primarily want to use our transparency tools as an independent check rather than a training target—as Chris Olah tends to favor—automating auditing should equally help us achieve that goal by automating and improving the process of performing that independent check.
Third, in the context of my “An overview of 11 proposals for building safe advanced AI,” I think that something like automating auditing is likely to be a critical step for the transparency portion of every single proposal on that list. As I just mentioned, automating auditing should help us equally with both intermittent-oversight-based and relaxed-adversarial-training-based approaches. In the context of microscope AI or STEM AI, automating auditing should help us figure out not just what our model has learned, but also what it hasn’t learned, helping us ensure that, for example, we haven’t accidentally produced an agent. And in the context of AI safety via debate, automating auditing should directly help the debaters improve their ability to cross-examine each other.
Fourth, though I’ve primarily focused on how automating auditing helps us get better at using transparency tools to audit our AI systems, the setup is general enough to allow for other methods of auditing as well, letting us test how useful transparency tools actually are. For example, an alternative auditing method to using transparency tools might be to employ what I’ll call the “cognitive science approach”—as opposed to the “neuroscience approach” of transparency and interpretability—wherein we try to understand what our models are doing by constructing predictive models of what we think they’re doing and testing them by performing experiments/interventions to test those models, as in Deletang et al.’s “Causal Analysis of Agent Behavior for AI Safety.” While I tend to be skeptical of this approach—as I think that behavioral tests alone will be insufficient to catch problems like RSA-2048—automating auditing should hopefully give us a good testing ground to actually figure out how useful these different sorts of approaches are, as well as how they can be combined.
Finally, I think that even if automating auditing basically fails, I still expect to learn a lot of useful things from that failure. In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.