AI Alignment Forum

Tags: Debate (AI safety technique), Scalable Oversight, AI
Prover-Estimator Debate: A New Scalable Oversight Protocol

by Jonah Brown-Cohen, Geoffrey Irving
17th Jun 2025
6 min read

17 comments, sorted by top scoring

Beth Barnes · 3mo

I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment. 

However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper "avoiding obfuscation with prover-estimator debate" is a bit misleading. I believe the authors are going to change this in v2.) 

I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.

I think there are broadly two classes of hope about obfuscated arguments:

(1.) In practice, obfuscated argument problems rarely come up, due to one of:

  1. It’s difficult in practice to construct obfuscated arguments for arbitrary propositions
    1. It’s definitely plausible to me that the algorithm sketch I gave here for constructing obfuscated arguments for arbitrary propositions doesn’t work, given some relatively weak assumptions about the structure of the debaters’ knowledge
  2. For anything the debaters know, there’s usually a sufficiently compact and stable honest argument such that the honest debater can win by giving this argument
    1. It seems pretty easy to give counterexamples here - e.g. Paul gives the example of unsupervised translation. But maybe these are relatively rare. It's plausible that the honest debaters in the human experiments I did, where we ran into obfuscation / instability problems, weren't using the right tactics (i.e. they could have constructed more stable arguments that used independent lines of evidence more).

(2.) We can create a protocol that distinguishes between cases where:

  • (not obfuscatable) the debate mirrors the structure of the debaters’ understanding such that they would be able to easily identify which subtree contains a flaw if one were present
  • (obfuscatable) they don't or wouldn't know which subtree contains the flaw. 

This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct.

This wouldn’t get us to full ELK (because maybe models still know things they have no human-understandable arguments for), but would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form but where the debaters do have a faster way of identifying which subtree to go down.

Geoffrey Irving · 3mo

I agree with this! On the empirical side, we're hoping to both get more human participant experiments to happen around debate, and to build more datasets that try to probe obfuscated arguments. The dataset aspect is important, as I think that in the years since the original paper, follow-on scalable oversight experiments (debate or not) have been too underpowered in various ways to detect the problem, which then results in insufficient empirical work getting into the details.

Beth Barnes · 2mo

Yep. For empirical work I'm in favor of experiments with more informed and well-trained human judges who engage deeply, etc., and of having a high standard for efficacy (e.g. "did it get the correct answer with very high reliability") as opposed to "did it outperform a baseline by a statistically significant margin", where you then end up needing a high n and therefore each example needs to be cheap / shallow.

Geoffrey Irving · 2mo

I would love the two of you (Beth and @Jacob Pfau) to talk about this in detail, if you're up for it! Getting the experimental design right is key if we want to get more human participant experiments going and learn from them. The specific point of "have a high standard for efficacy" was something I was emphasising to Jacob a few weeks ago as having distinguished your experiments from some of the follow-ons.

Beth Barnes · 2mo

Yep, happy to chat!

Jacob Pfau · 2mo

I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight.

Can you elaborate on why you think this? Perhaps you were already aware that obfuscated arguments should only pose a problem for completeness (in which case, for the majority of obfuscated-arguments-aware people, I'd imagine the new work should be a significant update). Or perhaps your concern is that in the soundness case the refutation is actually inefficient, and that the existence of small Bob circuits demonstrated in this paper is too weak to be practically relevant?

Regarding (1.2), I also see this as a priority to study. Basically, we already have a strong prior from interacting with current models that LMs will continue to generalise usably from verifiable domains to looser questions. So the practical relevance of debate hinges on how expensive it is to train with debate to increase the reliability of this generalisation. I have in mind here questions like "What experiment should I run next to test this debate protocol?"

One dataset idea to assess how often stable arguments can be found is to curate 'proof advice' problems. These problems are proxies for research advice in general. Basically:

Question: "Does this textbook use theorem Y to prove theorem X?", where the debaters see the proof but the judge does not.

(Note that this is not about proof correctness, a dishonest debater can offer up alternative correct proofs that use non-standard techniques.) This setup forces arguments based on heuristics and "research taste"—what makes a proof natural or clear. It's unclear what independent evidence would look like in this domain.

Beth Barnes · 2mo

I can write more later, but here's a relevant doc I wrote as part of discussion with Geoffrey + others. Maybe the key point from there is that I don't think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking this was a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments - not claiming you can't get soundness by just ignoring any arguments that are plausibly obfuscated.)

Geoffrey Irving · 3mo

On the AISI side, we would be very excited to collaborate on further research! If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants of up to £200,000 from UK AISI directly here.

Gurkenglas · 3mo

What about when A knows where the flaw is but B doesn't?

Geoffrey Irving · 2mo

The setting is where A and B have access to the same set of heuristics. This is modeled explicitly in the paper as a shared set of functions they can call, but corresponds to them being the same model or similar for LLM training. 

Gurkenglas · 2mo

Sure, and I expect there are reasons for assuming that A can't hide a flaw from B, e.g. maybe one can prove that honest and dishonest can't be distinguished if it can, but I don't see this mentioned anywhere.

Geoffrey Irving · 2mo

I think this may be a forest vs. trees thing where we failed to say the thing explicitly because we've been intentionally living in the "Alice and Bob have similar heuristics and computational power" setting for years: the A set in the paper models the set of heuristics, but I am surprised that the term "heuristic" ended up not in the final text (we just speak a lot about relative computational resources).

Fundamentally if you let B attend to either A's thoughts or to A's context and also ensure A and B are similar, we think A shouldn't be able to hide things from B. If A is a very different model, or a stronger model, this won't hold, and we should expect arguments which A can navigate in ways B can't defend against. So in order to get anything like this to go through you need some kind of "B knows what A knows" invariant.

Charlie Steiner · 3mo

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and always rate claims with positive sentiment as probable? It seems like Alice dominates because Alice gets to write and pick the subclaims. But does Bob have a defense, maybe predicting the human probability and just giving that? But because the human probability isn't required to be consistent, I think Bob is sunk because Alice can force the human probability assignment to be inconsistent and then gotcha Bob either for disagreeing with the human or for being inconsistent.

Geoffrey Irving · 3mo

One way to think about amplification or debate is that they're methods for accelerated evaluation of large computations: instead of letting the debaters choose where in the computation to branch, you could just take all branches and do the full exponential work. Then safety splits into

1. Are all perturbations of the unaccelerated computation safe?
2. If we train for debate, do we get one of those?

If humans are systematically biased, this can break (1) before we get to (2). It may still be possible to shift some of the load from the unaccelerated computation to the protocol by finding protocols that are robust to some classes of systematic error (this post discusses that). This is a big issue, and one where we'll be trying to get more work to happen. A particular case is that many organisations are planning to use scalable oversight for automated safety research, and people love to be optimistic that new safety schemes might work.

Thomas Kwa · 3mo

What's an example of a claim it might be difficult/impossible to find a stable argument for?

Beth Barnes · 3mo

IMO the requirements are a combination of stability and compactness - these trade off against each other, and the important thing is the rate at which you get evidence for which debater is dishonest while exploring the tree.

IIUC, the stability definition used here is pretty strong - it says that the error in the parent is smaller than the largest error across the children. So any argument structure where errors can accumulate (like a conjunctive argument, or a proof which requires all the steps to be correct) is out.

Geoffrey Irving · 3mo

The requirements are stability, compactness, and A-provability (meaning that the first player Alice knows how to correctly answer claims). It's important that A-provability is a requirement, as otherwise you can do silly things like lifting up to multilinear extensions of your problem over finite fields, and then there will always be lots of independent evidence which can be turned into stability.


Linkpost to arXiv: https://arxiv.org/abs/2506.13609.

Summary: We present a scalable oversight protocol where honesty is incentivized at equilibrium. Prior debate protocols allowed a dishonest AI to force an honest AI opponent to solve a computationally intractable problem in order to win. In contrast, prover-estimator debate incentivizes honest equilibrium behavior, even when the AIs involved (the prover and the estimator) have similar compute available. Our results rely on a stability assumption, which roughly says that arguments should not hinge on arbitrarily small changes in estimated probabilities. This assumption is required for usefulness, but not for safety: even if stability is not satisfied, dishonest behavior will be disincentivized by the protocol.

How can we correctly reward desired behaviours for AI systems, even when justifications for those behaviours are beyond humans’ abilities to efficiently judge?

This is the problem of scalable oversight: a core question to solve if we want to align potentially superhuman systems. Proposals for scalable oversight (including iterated distillation and amplification and debate) tend to rely on recursion. They break down complex justifications into components easier for humans to judge. However, to date, such recursive proposals have suffered from the obfuscated arguments problem: a dishonest system can adversarially choose how to recurse in such a way that dishonesty cannot be efficiently identified. In debate, this means that an honest debater might need exponentially more compute than their dishonest opponent, which is very bad.

Our new paper presents a protocol robust to this problem – but in order to prove that the protocol can in-principle answer any relevant question (‘completeness’), we need a stability assumption. The need for stability for completeness was discussed in Beth Barnes’ original post introducing obfuscated arguments; our paper presents one route to stability, but does not resolve whether most questions have stable arguments in practice.

This post presents a simplified overview of the protocol and proofs – please see the paper for more details.

The Prover-Estimator Debate Protocol

TL;DR: The prover (Alice) breaks a problem down into subclaims, the estimator (Bob) estimates the probability that each subclaim is correct, and Alice then picks a subclaim for recursion. Scoring is more complicated; see below.

What Alice and Bob do:

  1. Alice breaks the current claim C down into subclaims c_1, …, c_q.
  2. Bob provides probability estimates p_i for each of these subclaims. The estimates can be conditioned on the truth of the prior subclaims: p_i = p_i(c_1, …, c_{i−1}).
  3. Alice points to a subclaim c_i and says whether Bob’s estimate for it is allegedly an overestimate, an underestimate, or correct.
  4. The protocol recurses, with the subclaim chosen by Alice as the new current claim.
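
To make the structure of a round concrete, here is a minimal Python sketch of steps 1-4. The helper names (prover_decompose, estimator_estimate, prover_challenge) and the Claim type are invented for illustration; this is a sketch of the moves, not the paper's implementation.

    # Minimal sketch of one round of prover-estimator debate. The helpers
    # prover_decompose, estimator_estimate and prover_challenge are hypothetical
    # stand-ins for the trained prover (Alice) and estimator (Bob).
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class Claim:
        statement: str

    def debate_round(
        current: Claim,
        prover_decompose: Callable[[Claim], List[Claim]],
        estimator_estimate: Callable[[List[Claim], int], float],
        prover_challenge: Callable[[Claim, List[Claim], List[float]], Tuple[int, str]],
    ) -> Tuple[Claim, float, str]:
        # 1. Alice breaks the current claim into subclaims c_1, ..., c_q.
        subclaims = prover_decompose(current)
        # 2. Bob estimates p_i for each subclaim; conditioning on the earlier
        #    subclaims is implicit in passing the full list and the index i.
        probs = [estimator_estimate(subclaims, i) for i in range(len(subclaims))]
        # 3. Alice picks a subclaim and a direction: "over", "under", or "correct".
        index, direction = prover_challenge(current, subclaims, probs)
        # 4. The chosen subclaim becomes the current claim of the next round.
        return subclaims[index], probs[index], direction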

What the Judge does:

The protocol has one hyperparameter r>2 called the reward ratio.

  1. In the k-th round, the judge is asked whether Bob’s probability estimates p_i for the new subclaims c_1, …, c_q are consistent with Bob’s prior estimate p for the current claim C.
  2. If the estimates are inconsistent in the direction predicted by Alice, then Alice gets reward proportional to r^k, otherwise Alice gets reward proportional to −r^k. If instead Alice said that Bob’s estimate p for C was correct, then Alice gets reward 0.
  3. In the final round d, we directly ask the judge whether the current claim C is true. Alice gets reward proportional to r^d if Bob’s probability p for C is incorrect in the direction predicted by Alice, otherwise Alice gets reward proportional to −r^d.
  4. The final reward R for Alice is the sum of the rewards over the rounds. The reward for Bob is −R.
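
As a toy illustration of the scoring (not the paper's implementation), the sketch below sums Alice's per-round rewards with ratio r, taking as given a simplified encoding of the judge's verdict in each round: +1 if Bob's estimates were inconsistent in the direction Alice predicted, −1 if not, and 0 for rounds where Alice said Bob's estimate was correct.

    # Toy scoring sketch: verdicts[k-1] encodes the judge's call in round k
    # (+1 = inconsistent in Alice's predicted direction, -1 = not, 0 = Alice said
    # Bob was correct). Alice's round-k reward is proportional to +/- r^k; Bob's
    # total reward is the negative of Alice's.
    def alice_reward(verdicts, r=4.0):
        assert r > 2, "the protocol assumes a reward ratio r > 2"
        return sum(v * r**k for k, v in enumerate(verdicts, start=1))

    # Example: Alice argues honestly while Bob lies consistently for two rounds,
    # then the inconsistency surfaces in round 3; the round-3 reward dominates.
    print(alice_reward([-1, -1, +1]))  # 64 - 16 - 4 = 44 > 0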

The argument that this protocol works requires a bit of setup. Firstly, it only works if Alice is able to break the problem down in a stable way. This means that the correctness of the argument doesn’t depend too precisely on the exact probabilities that the subclaims are true. That is, Alice has to make an argument where if Bob believes each subclaim with 90% probability then he should believe the main claim with, say, 95% probability. One can think of the subclaims as being pieces of evidence for the main claim, where if enough of the evidence holds up, then the main claim is true.

A natural question to ask is which problems, if any, admit stable decompositions of the above form. We prove that whenever the human-judgeable claims in the base-case of an argument have sufficiently many pieces of independent evidence, then there is a version of the argument that satisfies the stability condition. The construction follows a classical majority-voting error-correction approach due to von Neumann. We emphasize that even when stability holds, it is a priori unclear that Bob will be able to efficiently produce winning probabilities, given that some of the subclaims made by Alice may be computationally hard to evaluate.
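
As a toy numerical illustration of the majority-voting intuition (not the paper's actual error-correction argument), the snippet below shows how aggregating n independent pieces of evidence, each correct with probability p, makes the majority verdict far more reliable than any single piece:

    # Toy illustration of the majority-vote intuition behind the stability
    # construction: with n independent pieces of evidence, each correct with
    # probability p, the chance that a majority points the wrong way shrinks
    # quickly as n grows.
    from math import comb

    def majority_correct(p: float, n: int) -> float:
        """Probability that more than half of n (odd) independent pieces of
        evidence, each correct with probability p, are correct."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

    for n in (1, 3, 9, 21):
        print(n, round(majority_correct(0.9, n), 6))  # reliability climbs from 0.9 toward 1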

Completeness

Now we can understand how Alice can win if she behaves honestly. Suppose Bob is lying at the top level of the recursion. Then at some point Bob will either tell the truth, or at the last round the human judge will. In either case, if Alice has consistently told the truth, she will win a reward proportional to r^k in the round where Bob switches to (almost) telling the truth, and reward will be 0 thereafter. No matter what has happened before, r^k is more than the total cumulative rewards assigned in prior rounds. Stability of Alice’s argument to errors of magnitude ϵ is required for this all to work out. To see why, suppose that errors less than ϵ in the subclaims could flip the outcome of the main claim. Then Bob could completely lie about the probability of the main claim, yet only be off by ϵ on each subclaim, and still have overall assigned probabilities that are close (within ϵ) to being consistent. The astute reader will also notice that r needs to be at least 1/ϵ, so the choice of the reward ratio r implicitly forces Alice to make arguments of a certain stability level.
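
To spell out the arithmetic behind "r^k is more than the total cumulative rewards assigned in prior rounds" (ignoring the proportionality constant, and assuming each earlier round j contributed a reward of magnitude at most r^j):

    sum_{j=1}^{k−1} r^j = (r^k − r) / (r − 1) < r^k   whenever r > 2,

so a single honest win in round k outweighs the worst-case total from all earlier rounds.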

Soundness

The more interesting part is what happens when Alice lies at the top level. Alice can engage in obfuscation, where she produces a sequence of claims that she knows contain a mistake, but where neither Alice nor Bob can identify where the mistake occurs. For example, imagine reviewing a pull request that just puts all the logic for some task into one giant Python function. Probably it is full of bugs, but it is hard to be sure precisely where they are, and it is likely that whoever wrote the code doesn't know where they are either. In this case, we can prove that while Bob may not be able to efficiently identify the flaw, Bob can efficiently produce probabilities that are indistinguishable from the truth by Alice. Our proof constructs these efficient strategies for Bob via online gradient descent, where the sequence of loss functions involves playing Alice against herself, and the convergence rate determines the complexity of Bob’s strategy. In this case, Alice’s guesses for which direction Bob’s probabilities are incorrect will always be no better than a coin flip. So Alice’s expected rewards will be close to zero whenever she lies.
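
For readers unfamiliar with the tool, here is a generic projected online-gradient-descent update on a vector of probability estimates. The specific loss functions used in the paper (built by playing Alice against herself) are not reproduced here, so this is only a reminder of the general technique:

    # Generic projected online gradient descent over probability estimates.
    # At each round, the environment reveals the gradient of that round's loss
    # at the current estimates; we take a gradient step and project back to [0, 1].
    from typing import Iterable, List

    def ogd_probabilities(grads: Iterable[List[float]], p0: List[float], eta: float = 0.05) -> List[float]:
        p = list(p0)
        for g in grads:  # one gradient vector per round
            p = [min(1.0, max(0.0, pi - eta * gi)) for pi, gi in zip(p, g)]
        return p

    # Example: two estimates nudged over three rounds of (made-up) gradients.
    print(ogd_probabilities([[1.0, -1.0], [0.5, -0.5], [0.2, -0.2]], p0=[0.5, 0.5]))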

To summarize, if Alice always tells the truth she can guarantee herself some minimal reward. When she lies, Bob can guarantee Alice gets reward close to zero. So at equilibrium Alice should always tell the truth.

Future research

Prover-estimator debate gives a new protocol where dishonest arguments by one agent can (on average) be refuted by an opponent that uses comparable compute. However, this is just a first step, and there are multiple directions for both theoretical and empirical research on debate.

On the empirical side, the biggest question is to actually design a training setup based on prover-estimator debate that can demonstrate the predictions of the theory. Directly implementing the protocol as-written is perhaps too restrictive for developing a working training algorithm. Hence, we highlight here (and in the paper) the key qualitative properties of the protocol that may be amenable to empirical investigation.

  1. The prover Alice both decomposes problems and chooses where to recurse given Bob’s estimates. The estimator Bob only needs to express reasonable uncertainty about where a mistake might lie.
  2. There is a per-round reward based on a consistency check between Bob’s estimate from the last round for the current main claim, and estimates for claims in the current round. Alice can only collect this reward by accurately predicting the direction of Bob’s consistency errors.
  3. In the last round, the human judge directly decides if Bob’s estimates are correct.
  4. The debaters should be trained to converge to an Alice-leading Stackelberg equilibrium.

On the theoretical side, our protocol requires that the prover is able to produce a stable argument in the completeness case. That is, when telling the truth, the prover must be able to produce a stable argument for the truth. We prove that such stable arguments exist whenever the truth is supported by sufficiently many independent pieces of evidence. If one relies on independence of supporting evidence to produce stable arguments, one must at a minimum have some reason to believe that the pieces of evidence are actually independent. For example, if a human judge is given three pieces of evidence for a claim, can they accurately say whether the truth-values of each piece of evidence are independent?

Thus we have two key open problems: a theoretical question of whether stable arguments exist under weaker assumptions than sufficient independent evidence, and an empirical question about when we can find independent evidence in practice.