This is really cool! I'm impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?
We didn't see any of that, thankfully, but that of course doesn't rule out things like that starting to show up with further training.
We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge really liked, such as always prefacing its argument for the incorrect answer with things like "At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage." Thankfully, training the judge in parallel made this a non-issue, but I think it's clear that we'll have to watch out for reward hacking of the judge in the future.
What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Any arguments about similarity of one or the other to the real AI debates that might be held in the future?
It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.
I think allowing the judge to abstain is a reasonable addition to the protocol -- we mainly didn't do this for simplicity, but it's something we're likely to incorporate in future work.
The main reason you might want to give the judge this option is that it makes it harder still for a dishonest debater to come out ahead, since (ideally) the judge will only rule in favor of the dishonest debater if the honest debater fails to rebut the dishonest debater's arguments, the dishonest debater's arguments are ruled sufficient by the judge, and the honest debater's arguments are ruled insufficient by the judge. Of course, this also makes the honest debater's job significantly harder, but I think we're fine with that to some degree, insofar as we believe that the honest debater has a built-in advantage anyway (which is sort of a foundational assumption of Debate).
It's also not clear that this is necessary though, since we're primarily viewing Debate as a protocol for low-stakes alignment, where we care about average-case performance, in which case this kind of "graceful failure" seems less important.
Thanks for doing this! I think a lot of people would be very interested in the debate transcripts if you posted them on GitHub or something.
Is there anything you'd be especially excited to use them for? This should be possible, but cumbersome enough that we'd default to waiting until this grows into a full paper (date TBD). My NYU group's recent paper on a similar debate setup includes a data release, FWIW.
This is a research update on some work that I’ve been doing on Scalable Oversight at Anthropic, based on the original AI safety via debate proposal and a more recent agenda developed at NYU and Anthropic. The core doc was written several months ago, so some of it is likely outdated, but it seemed worth sharing in its current form.
I’d like to thank Tamera Lanham, Sam Bowman, Kamile Lukosiute, Ethan Perez, Jared Kaplan, Amanda Askell, Kamal Ndousse, Shauna Kravec, Yuntao Bai, Alex Tamkin, Newton Cheng, Buck Shlegeris, Akbir Khan, John Hughes, Dan Valentine, Kshitij Sachan, Ryan Greenblatt, Daniel Ziegler, Max Nadeau, David Rein, Julian Michael, Kevin Klyman, Bila Mahdi, Samuel Arnesen, Nat McAleese, Jan Leike, Geoffrey Irving, and Sebastian Farquhar for help, feedback, and thoughtful discussion that improved the quality of this work and write-up.
1. Anthropic’s Debate Agenda
In this doc, I’m referring to the idea first presented in AI safety via debate (blog post). The basic idea is to supervise future AI systems by pitting them against each other in a debate, encouraging them to argue both sides (or “all sides”) of a question and using the resulting arguments to come to a final answer about the question. In this scheme, we call the systems participating in the debate debaters (though usually, these are actually the same underlying system that’s being prompted to argue against itself), and we call the agent (either another AI system or a human, or a system of humans and AIs working together, etc.) that comes to a final decision about the debate the judge.
Those more or less familiar with the original OAI/Irving et al. Debate agenda may wonder whether there are any differences between that agenda and the one we're pursuing at Anthropic, and indeed there are! Sam Bowman and Tamera Lanham have written up a working Anthropic-NYU Debate Agenda draft, which is what the experiments in this doc are driving towards. [1]
To quote from there about the basic features of this agenda, and how it differs from the original Debate direction:
2. Experimental Motivation and Setup
In brief, we wanted to convincingly answer the question “Can Claude-2 level models be used for research into Adversarial Scalable Oversight techniques like Debate?”
The general roadmap we were following looked something like:
The task: Hidden Passage QuALITY
QuALITY is a multiple-choice question-answering dataset that consists of questions about (fairly long) passages of text. We’re interested in seeing if we can get a model to answer questions accurately about the passage, without full access to the passage.
This sounds obviously hopeless, and indeed it usually is, so we need to assist this model somehow. The basic idea is to give another model access to the passage and have it argue for each of the answer choices (we call this model the debater), then supply the initial model (the judge) with the arguments and ask it to pick what it thinks the correct answer is, using all of the evidence and reasoning from the arguments.
We also turn the original four-choice QuALITY dataset into a two-choice one – it’s unclear what effect to expect with this alteration, since with four choices, the judge gets to see four arguments, and likely sees much more information about the passage. My best guess is that four answer choices would let us observe worse absolute performance, but better relative performance compared to some of our baselines.
As an additional wrinkle, we'll prompt/instruct the debater to include all quotes from the passage in xml tags (<p_quote>). Then, via a (somewhat strict) regex check, we'll attempt to verify that each quote actually is from the passage. If it's not, even if the model made a benign error rather than an outright fabrication, we'll mark the quote as unverified (<u_quote>), whereas if it is included in the passage, we'll mark it as verified (<v_quote>). The judge is instructed to only trust verified quotes, not unverified quotes or quotes not presented in this format (e.g. with ordinary quotation marks). We placed no restriction on the amount of the passage that a debater could quote, and we found that the amount of quoted material stayed roughly constant over the course of training – if we had observed the amount of quoted material dramatically rising, or an attempt by either debater to quote the entire passage, we would have penalized this or implemented stricter constraints on the amount of quoted material.

We use a 500-question subset of (two-choice) QuALITY (randomly selected) as our validation dataset, and use the remaining 1586 questions to train our models during MARL.
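For concreteness, here's a minimal sketch of what this verification step could look like. It's illustrative only: it assumes quotes are wrapped in <p_quote>...</p_quote> pairs and uses a simple whitespace-normalized substring check in place of the exact regex we used.

```python
import re

def verify_quotes(argument: str, passage: str) -> str:
    """Re-tag each <p_quote> span as <v_quote> if the quoted text appears
    verbatim in the passage (up to whitespace), and as <u_quote> otherwise."""
    # Collapse whitespace so line breaks inside quotes don't cause spurious failures.
    normalized_passage = re.sub(r"\s+", " ", passage)

    def relabel(match):
        quote = re.sub(r"\s+", " ", match.group(1)).strip()
        tag = "v_quote" if quote and quote in normalized_passage else "u_quote"
        return f"<{tag}>{match.group(1)}</{tag}>"

    return re.sub(r"<p_quote>(.*?)</p_quote>", relabel, argument, flags=re.DOTALL)
```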
Debate without Explicit Counterarguments
Earlier attempts to work on Debate at Anthropic ran into issues with models failing to engage with their debate opponents and generally lacking the coherence needed for multiple rounds of arguments. We hoped that even if models lacked the ability to meaningfully engage in multiple rounds of back-and-forth debate, they would still be able to generate compelling “opening statements”, where they put forth strong arguments for their position and attempt to rebut the most obvious or glaring counterarguments to it. In other words, the two debaters never see each other’s arguments and are purely framing their arguments against a hypothetical future opponent.
I think this approach actually still has a lot of the benefits of Debate as an oversight protocol, but eventually, we’ll likely want to have multiple rounds of debate, and possibly have the judge participate in the debate as well. This approach of “concurrent turns” also turns the debate into a game with imperfect information, instead of the perfect information game that a debate with “sequential turns” would be.
Inference-only experiments
Our basic goal was to derisk the later experiments that we wanted to run, which would involve expensive and time-consuming RL training.
Our criteria for success
We'll make use of a model_bon abstraction, which is a model that we always Best-of-N (BoN) sample from. For example, we could imagine creating a claude-2_1 model, which is just Claude 2, and a claude-2_8 model, which is a Claude 2 instance that we always use BoN sampling for, with N set to 8. (We could instead have sampled N arguments for each side and then taken the argument that "won" the most times for each side, as a better approximation of the RL objective, but in practice, we found that this simplification worked quite well.)

We use these model_bon models as our debaters and have them debate against each other round-robin style, tallying up the win counts for each matchup on a bunch of Hidden Passage QuALITY debates, where a win means that a judge picked the answer choice corresponding to that debater. We can repeat this procedure for multiple judges as well.

Ultimately, we saw basically what we wanted to see with inference-only experiments, and thus decided to attempt RL runs.
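For concreteness, the round-robin evaluation described above looks roughly like the following sketch. The interfaces here (model.sample, judge.score, judge.pick, q.prompt_for) are illustrative stand-ins rather than our actual code.

```python
from itertools import combinations
from collections import defaultdict

def best_of_n(model, prompt, judge, n):
    """BoN against the judge: sample n candidate arguments and keep the one
    the judge scores most highly for the side it argues for. (Illustrative.)"""
    candidates = [model.sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda arg: judge.score(prompt, arg))

def round_robin(debaters, judge, questions):
    """Tally wins for every pair of model_bon debaters over a set of Hidden
    Passage QuALITY questions, where a win means the judge picked the answer
    choice that debater was assigned to argue for."""
    wins = defaultdict(int)
    for d_a, d_b in combinations(debaters, 2):
        for q in questions:
            # Assign d_a to choice A and d_b to choice B (in practice we'd also
            # run the reverse assignment; omitted here for brevity).
            arg_a = best_of_n(d_a.model, q.prompt_for("A"), judge, d_a.n)
            arg_b = best_of_n(d_b.model, q.prompt_for("B"), judge, d_b.n)
            picked = judge.pick(q, arg_a, arg_b)  # returns "A" or "B"
            winner, loser = (d_a, d_b) if picked == "A" else (d_b, d_a)
            wins[(winner.name, loser.name)] += 1
    return wins
```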
RL training of the debater
We train a debater model to generate arguments for both answers to (two-choice) QuALITY questions. A judge sees the question, answer choices, and the two arguments, then picks what it thinks is the correct answer. The debater model is trained on two episodes for each debate, where the argument that it generated for the winning side of the debate (the side the judge selected) is reinforced with a reward equal to the positive difference between the judge’s logits on the two answer choice tokens, and the losing argument is reinforced with a reward equal to the negative difference between the judge’s logits on the two answer choice tokens. We also add a preference model (PM) reward for each of the generated arguments. However, further experiments revealed that this secondary reward was unnecessary, and thus the pure “self-play” reward was sufficient to achieve qualitatively similar results.
In pseudo-code:
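(The sketch below is a hedged reconstruction of the loop just described; all of the function and object names, e.g. debate_prompt, judge.answer_logits, pm.score, rl_buffer, are illustrative rather than our actual interfaces.)

```python
for question in train_questions:
    # The debater argues for both answer choices.
    prompt_a = debate_prompt(question, side="A")
    prompt_b = debate_prompt(question, side="B")
    arg_a = debater.sample(prompt_a)
    arg_b = debater.sample(prompt_b)

    # The judge sees the question, answer choices, and both arguments; we read
    # off its logits on the two answer choice tokens.
    logits = judge.answer_logits(question, arg_a, arg_b)
    margin = abs(logits["A"] - logits["B"])
    winner = "A" if logits["A"] > logits["B"] else "B"

    # Self-play reward: the winning argument gets +margin, the losing argument
    # gets -margin, and each argument also receives a preference model reward.
    reward_a = (margin if winner == "A" else -margin) + pm.score(prompt_a, arg_a)
    reward_b = (margin if winner == "B" else -margin) + pm.score(prompt_b, arg_b)

    # Two RL episodes per debate.
    rl_buffer.add(prompt_a, arg_a, reward_a)
    rl_buffer.add(prompt_b, arg_b, reward_b)

debater.rl_update(rl_buffer)
```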
RL training of the debater and SL training of the judge
We implement the same setup, but now we additionally train the judge via Supervised Learning (SL) / Behavioral Cloning (BC) to predict the correct answers to the questions given the question, answer choices, and the arguments generated by the debater for each answer choice.
In pseudo-code:
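(Again a hedged sketch with illustrative names; the only change from the previous loop is the additional supervised/BC update for the judge, and for simplicity the PM reward is omitted here.)

```python
for question in train_questions:
    prompt_a = debate_prompt(question, side="A")
    prompt_b = debate_prompt(question, side="B")
    arg_a = debater.sample(prompt_a)
    arg_b = debater.sample(prompt_b)

    logits = judge.answer_logits(question, arg_a, arg_b)
    margin = abs(logits["A"] - logits["B"])
    winner = "A" if logits["A"] > logits["B"] else "B"

    rl_buffer.add(prompt_a, arg_a, margin if winner == "A" else -margin)
    rl_buffer.add(prompt_b, arg_b, margin if winner == "B" else -margin)

    # Behavioral cloning target for the judge: given the question, answer
    # choices, and both arguments, predict the correct answer token.
    sl_buffer.add(judge_prompt(question, arg_a, arg_b), question.correct_answer)

debater.rl_update(rl_buffer)
judge.sl_update(sl_buffer)  # cross-entropy loss on the correct answer token
```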
3. Results
For all of the remaining results, we only evaluate the trained models on a validation dataset that consists of 500 questions that were held out during training. All models are initialized from fine-tuned versions of Claude 2.
We can train models to generate arguments that are more convincing to other models
We primarily assess this by Debater ELO rating. We’re looking to see that later checkpoints in training win debates more often than earlier checkpoints. Furthermore, we’re interested to see to what extent the improvements from the RL training stack with BoN sampling.
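(For reference, one simple way to turn round-robin win/loss records into ratings is an iterative Elo update, as sketched below; this is a stand-in rather than the exact fitting procedure we used.)

```python
from collections import defaultdict

def fit_elo(matches, k=4.0, passes=50, scale=400.0):
    """Fit ratings from a list of (winner, loser) match results by repeatedly
    applying the standard Elo update; any reasonable fit (e.g. Bradley-Terry
    maximum likelihood) would serve the same purpose."""
    ratings = defaultdict(float)
    for _ in range(passes):
        for winner, loser in matches:
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / scale))
            ratings[winner] += k * (1.0 - expected)
            ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)
```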
Debater ELO by RL Step and BoN N
We generally see that debater ELO increases both with RL training and BoN sampling (against the judge), and that indeed these improvements stack.
Debater ELO by RL Step, BoN N, and Honest/Dishonest Status
This is the same plot as above but broken out separately based on whether or not the model is arguing for the correct answer (“is honest”) or arguing for the incorrect answer (“is dishonest”).
The generated arguments generally increase in quality
Based on a rough qualitative analysis (looking at ~30 debates), where I blinded myself to the underlying model being used, it generally looks like the debater model is learning to identify the best ways to argue for a given answer choice and find the best evidence for its position.
See the Appendix for example transcripts!
We can train a judge model in parallel to better evaluate model debates, and thus answer questions more accurately
For all of these numbers, we only aggregate over self-play results, so the debaters are always of equal “strength”.
Raw accuracies and baseline comparisons
Simply looking at raw accuracies, aggregated over all debater model_bon objects, by judge snapshot:
[Table: Aggregate validation accuracy by judge model snapshot]
Notably, we’re still clearly beating “blind” judge baselines, even when the blind judge is trained on the same questions as the debate judge. In fact, the gap has increased over the course of training! Additionally, the blind judge basically doesn’t generalize to the held-out questions in the validation set, which is unsurprising given that we’ve only trained it to memorize correct answers on the training set.
[Table: Validation accuracy by blind judge model]
Judge Accuracy by Debater ELO
We can also look at judge accuracy plotted against the ELO of the debaters:
This gives us some evidence about how the judge's accuracy changes as a function of debater strength -- we might hope that the slope of each of these lines trends upwards, perhaps most strongly for the most trained judge. We don't see any sort of strong effect here, unfortunately, but I think it's not clear that this should be a significant negative update and is more likely downstream of the capability levels of current models.
Judge Accuracy by Honest/Dishonest Debater ELO Difference
Similar to the above, we might also want to see how judge accuracy varies depending on the relative difference in the strength of the debaters. This lets us study how much of an advantage “the truth” has, for a fixed judge model, and how much this can overcome the difference in debater strength.
In general, judge accuracy is quite seriously impacted by Debater ELO differences. Ideally, we’d hope to maximize the area under each of these lines, indicating that the judge accuracy is minimally impacted by relative debater strength, and we can see that the S50 judge does notably better by this metric, though there’s clear room for improvement.
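(As a sketch of this analysis: assuming per-debate records of the honest and dishonest debaters' ratings and whether the judge answered correctly, the binned accuracies could be computed as below; the structure of the results list is illustrative.)

```python
import numpy as np

def accuracy_by_elo_gap(results, elo, bins):
    """Bin debates by (honest ELO - dishonest ELO) and compute judge accuracy
    within each bin. results is a list of (honest_debater, dishonest_debater,
    judge_was_correct) tuples."""
    gaps = np.array([elo[h] - elo[d] for h, d, _ in results])
    correct = np.array([float(c) for _, _, c in results])
    bin_ids = np.digitize(gaps, bins)
    return {i: correct[bin_ids == i].mean() for i in np.unique(bin_ids)}
```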
4. Upshot
I think the upshot is that we have a debate protocol that shows promise, in a simple and constrained setting. It is likely promising enough that we [Anthropic] should be able to pursue other pieces of an adversarial scalable oversight agenda.
So, what should we try next? I flesh this out in the next section, but in brief:
5. Possible Next Steps
Find and use more tasks/data to set up more sandwiching evaluations
This is the crucial and obvious next step. I think a really convincing demonstration of the effectiveness of debate as a scalable oversight protocol relies on us doing this kind of sandwiching evaluation:
We can’t do the above without more tasks! We’re currently exploring tasks that incorporate the same kind of “information asymmetry” between the debater and judge as the current task, as well as tasks that involve a “capability asymmetry”, with a less capable model acting as a judge for a more capable debater model. It's likely that the capability asymmetry settings are more directly relevant to the position we'll be in eventually, but information asymmetry settings plausibly contain similar dynamics and are more likely to get off the ground with current models.
Start running experiments with humans
The obvious role of humans is as judges of model-generated debates, but I think it’d potentially also be interesting to have models play against human debaters and see if we can train models to match or exceed human debater performance.
Run an adversarial evaluation of Debate as an oversight protocol
This is inspired by this post by Buck Shlegeris and Ryan Greenblatt. The motivating idea is something like: "Assume your models are already scheming against you — can your oversight protocol still work if this is the case?" It would be really informative to find out whether or not Debate could stand up to this sort of adversarial evaluation, and how it compares to other oversight techniques.
Appendix
Transcripts
All transcripts are shown from the perspective of the judge -- these are the exact prompts that the judge receives before it makes its prediction, which is then used to provide rewards to the debater.
For each transcript, the correct answer is enclosed in a spoiler block below the transcript. Additionally, each transcript is generated using the arguments from the strongest debater (by ELO) -- the debater at RL step 50, with Best-of-16 sampling.
Transcript 1
(The correct answer for this question is choice A.)
Transcript 2
(The correct answer for this question is choice B.)
Transcript 3
(The correct answer for this question is choice A.)
This agenda was written ~9 months ago, so it shouldn’t be read too literally or as a precise specification of the research direction we’re working on.