AI Alignment Forum

Tags: Debate (AI safety technique), Scalable Oversight, AI
Prover-Estimator Debate: A New Scalable Oversight Protocol

by Jonah Brown-Cohen, Geoffrey Irving
17th Jun 2025
6 min read

17 comments, sorted by top scoring

Beth Barnes · 3mo

I was really impressed by the technical work in this paper. Getting to a formalization of the problem setup and the protocol that allows you to prove meaningful things is a big accomplishment. 

However, as the authors mention above, I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight. (I think the discussion in this post is fine, but I think the title of the paper "avoiding obfuscation with prover-estimator debate" is a bit misleading. I believe the authors are going to change this in v2.) 

I'm excited about more empirical work on making debate protocols work in practice. I feel a bit less excited about pure theory work, but I think work that mixes experimentation and theory could be helpful.

I think there are broadly two classes of hope about obfuscated arguments:

(1.) In practice, obfuscated argument problems rarely come up, due to one of:

  1. It’s difficult in practice to construct obfuscated arguments for arbitrary propositions
    1. It’s definitely plausible to me that the algorithm sketch I gave here for constructing obfuscated arguments for arbitrary propositions doesn’t work, given some relatively weak assumptions about the structure of the debaters’ knowledge
  2. For anything the debaters know, there’s usually a sufficiently compact and stable honest argument such that the honest debater can win by giving this argument
    1. It seems pretty easy to give counterexamples here - e.g. Paul gives the example of unsupervised translation. But maybe these are relatively rare. It's plausible that the honest debaters in the human experiments I did, where we ran into obfuscation / instability problems, weren't using the right tactics (i.e. they could have constructed more stable arguments that used independent lines of evidence more).

(2.) We can create a protocol that distinguishes between cases where:

  • (not obfuscatable) the debate mirrors the structure of the debaters’ understanding such that they would be able to easily identify which subtree contains a flaw if one were present
  • (obfuscatable) they don't or wouldn't know which subtree contains the flaw. 

This is the way (apart from argument size) in which the primality test example differs from the obfuscated factorization example: the debaters have some high-level mathematical concepts which allow them to quickly determine whether some proposed lemma is correct.

This wouldn’t get us to full ELK (because maybe models still know things they have no human-understandable arguments for), but would at least expand the class of honest arguments that we can trust to include ones that are large + unstable in human-understandable form but where the debaters do have a faster way of identifying which subtree to go down.

Geoffrey Irving · 3mo

I agree with this! On the empirical side, we're hoping to both get more human participant experiments to happen around debate, and to build more datasets that try to probe obfuscated arguments. The dataset aspect is important, as I think that in the years since the original paper, follow-on scalable oversight experiments (debate or not) have been too underpowered in various ways to detect the problem, which then results in insufficient empirical work getting into the details.

Beth Barnes · 2mo

Yep. For empirical work I'm in favor of experiments with more informed and well-trained human judges who engage deeply, etc., and of having a high standard for efficacy (e.g. "did it get the correct answer with very high reliability") as opposed to "did it outperform a baseline by a statistically significant margin", where you then end up needing a high n and therefore each example needs to be cheap / shallow.

Geoffrey Irving · 2mo

I would love the two of you (Beth and @Jacob Pfau) to talk about this in detail, if you're up for it! Getting the experimental design right is key if we want to get more human participant experiments going and learn from them. The specific point of "have a high standard for efficacy" was something I was emphasising to Jacob a few weeks ago as having distinguished your experiments from some of the follow-ons.

Beth Barnes · 2mo

Yep, happy to chat!

Jacob Pfau · 2mo

I don't think this should be a substantial update about whether obfuscated arguments are a problem for recursive decomposition approaches to scalable oversight.

Can you elaborate on why you think this? Perhaps you were already aware that obfuscated arguments should only pose a problem for completeness (in which case, for the majority of obfuscated-arguments-aware people, I'd imagine the new work should be a significant update). Or perhaps your concern is that in the soundness case the refutation is actually inefficient, and that the existence of small Bob circuits demonstrated in this paper is too weak to be practically relevant?

Regarding (1.2), I also see this as a priority to study. Basically, we already have a strong prior from interacting with current models that LMs will continue to generalise usably from verifiable domains to looser questions. So the practical relevance of debate hinges on how expensive it is to train with debate to increase the reliability of this generalisation. I have in mind here questions like "What experiment should I run next to test this debate protocol?"

One dataset idea to assess how often stable arguments can be found is to curate 'proof advice' problems. These problems are proxies for research advice in general. Basically:

Question: "Does this textbook use theorem Y to prove theorem X?", where the debaters see the proof but the judge does not.

(Note that this is not about proof correctness, a dishonest debater can offer up alternative correct proofs that use non-standard techniques.) This setup forces arguments based on heuristics and "research taste"—what makes a proof natural or clear. It's unclear what independent evidence would look like in this domain.

Beth Barnes · 2mo

I can write more later, but here's a relevant doc I wrote as part of discussion with Geoffrey + others. Maybe the key point from there is that I don't think this protocol solves the examples given in the original post describing obfuscated arguments. But yeah, I was always thinking this was a completeness problem (the original post poses the problem as distinguishing a certain class of honest arguments from dishonest obfuscated arguments - not claiming you can't get soundness by just ignoring any arguments that are plausibly obfuscated.)

Geoffrey Irving · 3mo

On the AISI side, we would be very excited to collaborate on further research! If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants of up to £200,000 from UK AISI directly here.

Gurkenglas · 3mo

What about when A knows where the flaw is but B doesn't?

Geoffrey Irving · 2mo

The setting is where A and B have access to the same set of heuristics. This is modeled explicitly in the paper as a shared set of functions they can call, but corresponds to them being the same model or similar for LLM training. 

Gurkenglas · 2mo

Sure, and I expect there are reasons for assuming that A can't hide a flaw from B, e.g. maybe one can prove that honest and dishonest can't be distinguished if it can, but I don't see this mentioned anywhere.

Geoffrey Irving · 2mo

I think this may be a forest vs. trees thing where we failed to say the thing explicitly because we've been intentionally living in the "Alice and Bob have similar heuristics and computational power" setting for years: the A set in the paper models the set of heuristics, but I am surprised that the term "heuristic" ended up not in the final text (we just speak a lot about relative computational resources).

Fundamentally if you let B attend to either A's thoughts or to A's context and also ensure A and B are similar, we think A shouldn't be able to hide things from B. If A is a very different model, or a stronger model, this won't hold, and we should expect arguments which A can navigate in ways B can't defend against. So in order to get anything like this to go through you need some kind of "B knows what A knows" invariant.

Charlie Steiner · 3mo

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and always rate claims with positive sentiment as probable? It seems like Alice dominates because Alice gets to write and pick the subclaims. But does Bob have a defense, maybe predicting the human probability and just giving that? But because the human probability isn't required to be consistent, I think Bob is sunk because Alice can force the human probability assignment to be inconsistent and then gotcha Bob either for disagreeing with the human or for being inconsistent.

Geoffrey Irving · 3mo

One way to think about amplification or debate is that they're methods for accelerated evaluation of large computations: instead of letting the debaters choose where in the computation to branch, you could just take all branches and do the full exponential work. Then safety splits into

1. Are all perturbations of the unaccelerated computation safe?
2. If we train for debate, do we get one of those?

If humans are systematically biased, this can break (1) before we get to (2). It may still be possible to shift some of the load from the unaccelerated computation to the protocol by finding protocols that are robust to some classes of systematic error (this post discusses that). This is a big issue, and one where we'll be trying to get more work to happen. A particular case is that many organisations are planning to use scalable oversight for automated safety research, and people love to be optimistic that new safety schemes might work.

Thomas Kwa · 3mo

What's an example of a claim it might be difficult/impossible to find a stable argument for?

Beth Barnes · 3mo

IMO the requirements are a combination of stability and compactness - these trade off against each other, and the important thing is the rate at which you get evidence for which debater is dishonest while exploring the tree.

IIUC, the stability definition used here is pretty strong - it says that the error in the parent is smaller than the largest error across the children. So any argument structure where errors can accumulate (like a conjunctive argument, or a proof which requires all the steps to be correct) is out.

Geoffrey Irving · 3mo

The requirements are stability, compactness, and A-provability (meaning that the first player Alice knows how to correctly answer claims). It's important that A-provability is a requirement, as otherwise you can do silly things like lifting up to multilinear extensions of your problem over finite fields, and then there will always be lots of independent evidence which can be turned into stability.


Linkpost to arXiv: https://arxiv.org/abs/2506.13609.

Summary: We present a scalable oversight protocol where honesty is incentivized at equilibrium. Prior debate protocols allowed a dishonest AI to force an honest AI opponent to solve a computationally intractable problem in order to win. In contrast, prover-estimator debate incentivizes honest equilibrium behavior, even when the AIs involved (the prover and the estimator) have similar compute available. Our results rely on a stability assumption, which roughly says that arguments should not hinge on arbitrarily small changes in estimated probabilities. This assumption is required for usefulness, but not for safety: even if stability is not satisfied, dishonest behavior will be disincentivized by the protocol.

How can we correctly reward desired behaviours for AI systems, even when justifications for those behaviours are beyond humans’ abilities to efficiently judge?

This is the problem of scalable oversight: a core question to solve if we want to align potentially superhuman systems. Proposals for scalable oversight (including iterated distillation and amplification and debate) tend to rely on recursion. They break down complex justifications into components easier for humans to judge. However, to date, such recursive proposals have suffered from the obfuscated arguments problem: a dishonest system can adversarially choose how to recurse in such a way that dishonesty cannot be efficiently identified. In debate, this means that an honest debater might need exponentially more compute than their dishonest opponent, which is very bad.

Our new paper presents a protocol robust to this problem – but in order to prove that the protocol can in-principle answer any relevant question (‘completeness’), we need a stability assumption. The need for stability for completeness was discussed in Beth Barnes’ original post introducing obfuscated arguments; our paper presents one route to stability, but does not resolve whether most questions have stable arguments in practice.

This post presents a simplified overview of the protocol and proofs – please see the paper for more details.

The Prover-Estimator Debate Protocol

TL;DR: The prover (Alice) breaks a problem down into subclaims, the estimator (Bob) estimates the probability that each subclaim is correct, and Alice then picks a subclaim for recursion. Scoring is more complicated; see below.

What Alice and Bob do:

  1. Alice breaks the current claim C down into subclaims c_1, …, c_q.
  2. Bob provides probability estimates p_i for each of these subclaims. The estimates can be conditioned on the truth of the prior subclaims: p_i = p_i(c_1, …, c_{i−1}).
  3. Alice points to a subclaim c_i and says whether Bob’s estimate for it is allegedly an overestimate, an underestimate, or correct.
  4. The protocol recurses, with the subclaim chosen by Alice as the new current claim.
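
To make the structure of a round concrete, here is a minimal Python sketch of steps 1-4. The helper names (prover_decompose, estimator_estimate, prover_challenge) and the Claim type are invented for illustration; this is a sketch of the moves, not the paper's implementation.

    # Minimal sketch of one round of prover-estimator debate. The helpers
    # prover_decompose, estimator_estimate and prover_challenge are hypothetical
    # stand-ins for the trained prover (Alice) and estimator (Bob).
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class Claim:
        statement: str

    def debate_round(
        current: Claim,
        prover_decompose: Callable[[Claim], List[Claim]],
        estimator_estimate: Callable[[List[Claim], int], float],
        prover_challenge: Callable[[Claim, List[Claim], List[float]], Tuple[int, str]],
    ) -> Tuple[Claim, float, str]:
        # 1. Alice breaks the current claim into subclaims c_1, ..., c_q.
        subclaims = prover_decompose(current)
        # 2. Bob estimates p_i for each subclaim; conditioning on the earlier
        #    subclaims is implicit in passing the full list and the index i.
        probs = [estimator_estimate(subclaims, i) for i in range(len(subclaims))]
        # 3. Alice picks a subclaim and a direction: "over", "under", or "correct".
        index, direction = prover_challenge(current, subclaims, probs)
        # 4. The chosen subclaim becomes the current claim of the next round.
        return subclaims[index], probs[index], direction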

What the Judge does:

The protocol has one hyperparameter r>2 called the reward ratio.

  1. In the k-th round, the judge is asked whether Bob’s probability estimates p_i for the new subclaims c_1, …, c_q are consistent with Bob’s prior estimate p for the current claim C.
  2. If the estimates are inconsistent in the direction predicted by Alice, then Alice gets reward proportional to r^k, otherwise Alice gets reward proportional to −r^k. If instead Alice said that Bob’s estimate p for C was correct, then Alice gets reward 0.
  3. In the final round d, we directly ask the judge whether the current claim C is true. Alice gets reward proportional to r^d if Bob’s probability p for C is incorrect in the direction predicted by Alice, otherwise Alice gets reward proportional to −r^d.
  4. The final reward R for Alice is the sum of the rewards over the rounds. The reward for Bob is −R.
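
As a toy illustration of the scoring (not the paper's implementation), the sketch below sums Alice's per-round rewards with ratio r, taking as given a simplified encoding of the judge's verdict in each round: +1 if Bob's estimates were inconsistent in the direction Alice predicted, −1 if not, and 0 for rounds where Alice said Bob's estimate was correct.

    # Toy scoring sketch: verdicts[k-1] encodes the judge's call in round k
    # (+1 = inconsistent in Alice's predicted direction, -1 = not, 0 = Alice said
    # Bob was correct). Alice's round-k reward is proportional to +/- r^k; Bob's
    # total reward is the negative of Alice's.
    def alice_reward(verdicts, r=4.0):
        assert r > 2, "the protocol assumes a reward ratio r > 2"
        return sum(v * r**k for k, v in enumerate(verdicts, start=1))

    # Example: Alice argues honestly while Bob lies consistently for two rounds,
    # then the inconsistency surfaces in round 3; the round-3 reward dominates.
    print(alice_reward([-1, -1, +1]))  # 64 - 16 - 4 = 44 > 0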

The argument that this protocol works requires a bit of setup. Firstly, it only works if Alice is able to break the problem down in a stable way. This means that the correctness of the argument doesn’t depend too precisely on the exact probabilities that the subclaims are true. That is, Alice has to make an argument where if Bob believes each subclaim with 90% probability then he should believe the main claim with, say, 95% probability. One can think of the subclaims as being pieces of evidence for the main claim, where if enough of the evidence holds up, then the main claim is true.

A natural question to ask is which problems, if any, admit stable decompositions of the above form. We prove that whenever the human-judgeable claims in the base-case of an argument have sufficiently many pieces of independent evidence, then there is a version of the argument that satisfies the stability condition. The construction follows a classical majority-voting error-correction approach due to von Neumann. We emphasize that even when stability holds, it is a priori unclear that Bob will be able to efficiently produce winning probabilities, given that some of the subclaims made by Alice may be computationally hard to evaluate.
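
As a toy numerical illustration of the majority-voting intuition (not the paper's actual error-correction argument), the snippet below shows how aggregating n independent pieces of evidence, each correct with probability p, makes the majority verdict far more reliable than any single piece:

    # Toy illustration of the majority-vote intuition behind the stability
    # construction: with n independent pieces of evidence, each correct with
    # probability p, the chance that a majority points the wrong way shrinks
    # quickly as n grows.
    from math import comb

    def majority_correct(p: float, n: int) -> float:
        """Probability that more than half of n (odd) independent pieces of
        evidence, each correct with probability p, are correct."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

    for n in (1, 3, 9, 21):
        print(n, round(majority_correct(0.9, n), 6))  # reliability climbs from 0.9 toward 1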

Completeness

Now we can understand how Alice can win if she behaves honestly. Suppose Bob is lying at the top level of the recursion. Then at some point Bob will either tell the truth, or at the last round the human judge will. In either case, if Alice has consistently told the truth, she will win a reward proportional to r^k in the round where Bob switches to (almost) telling the truth, and reward will be 0 thereafter. No matter what has happened before, r^k is more than the total cumulative rewards assigned in prior rounds. Stability of Alice’s argument to errors of magnitude ϵ is required for this all to work out. To see why, suppose that errors less than ϵ in the subclaims could flip the outcome of the main claim. Then Bob could completely lie about the probability of the main claim, yet only be off by ϵ on each subclaim, and still have overall assigned probabilities that are close (within ϵ) to being consistent. The astute reader will also notice that r needs to be at least 1/ϵ, so the choice of the reward ratio r implicitly forces Alice to make arguments of a certain stability level.
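
To spell out the arithmetic behind "r^k is more than the total cumulative rewards assigned in prior rounds" (ignoring the proportionality constant, and assuming each earlier round j contributed a reward of magnitude at most r^j):

    sum_{j=1}^{k−1} r^j = (r^k − r) / (r − 1) < r^k   whenever r > 2,

so a single honest win in round k outweighs the worst-case total from all earlier rounds.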

Soundness

The more interesting part is what happens when Alice lies at the top level. Alice can engage in obfuscation, where she produces a sequence of claims that she knows contain a mistake, but where neither Alice nor Bob can identify where the mistake occurs. For example, imagine reviewing a pull request that just puts all the logic for some task into one giant Python function. Probably it is full of bugs, but it is hard to be sure precisely where they are, and it is likely that whoever wrote the code doesn't know where they are either. In this case, we can prove that while Bob may not be able to efficiently identify the flaw, Bob can efficiently produce probabilities that are indistinguishable from the truth by Alice. Our proof constructs these efficient strategies for Bob via online gradient descent, where the sequence of loss functions involves playing Alice against herself, and the convergence rate determines the complexity of Bob’s strategy. In this case, Alice’s guesses for which direction Bob’s probabilities are incorrect will always be no better than a coin flip. So Alice’s expected rewards will be close to zero whenever she lies.
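
For readers unfamiliar with the tool, here is a generic projected online-gradient-descent update on a vector of probability estimates. The specific loss functions used in the paper (built by playing Alice against herself) are not reproduced here, so this is only a reminder of the general technique:

    # Generic projected online gradient descent over probability estimates.
    # At each round, the environment reveals the gradient of that round's loss
    # at the current estimates; we take a gradient step and project back to [0, 1].
    from typing import Iterable, List

    def ogd_probabilities(grads: Iterable[List[float]], p0: List[float], eta: float = 0.05) -> List[float]:
        p = list(p0)
        for g in grads:  # one gradient vector per round
            p = [min(1.0, max(0.0, pi - eta * gi)) for pi, gi in zip(p, g)]
        return p

    # Example: two estimates nudged over three rounds of (made-up) gradients.
    print(ogd_probabilities([[1.0, -1.0], [0.5, -0.5], [0.2, -0.2]], p0=[0.5, 0.5]))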

To summarize, if Alice always tells the truth she can guarantee herself some minimal reward. When she lies, Bob can guarantee Alice gets reward close to zero. So at equilibrium Alice should always tell the truth.

Future research

Prover-estimator debate gives a new protocol where dishonest arguments by one agent can (on average) be refuted by an opponent that uses comparable compute. However, this is just a first step, and there are multiple directions for both theoretical and empirical research on debate.

On the empirical side, the biggest question is to actually design a training setup based on prover-estimator debate that can demonstrate the predictions of the theory. Directly implementing the protocol as-written is perhaps too restrictive for developing a working training algorithm. Hence, we highlight here (and in the paper) the key qualitative properties of the protocol that may be amenable to empirical investigation.

  1. The prover Alice both decomposes problems and chooses where to recurse given Bob’s estimates. The estimator Bob only needs to express reasonable uncertainty about where a mistake might lie.
  2. There is a per-round reward based on a consistency check between Bob’s estimate from the last round for the current main claim, and estimates for claims in the current round. Alice can only collect this reward by accurately predicting the direction of Bob’s consistency errors.
  3. In the last round, the human judge directly decides if Bob’s estimates are correct.
  4. The debaters should be trained to converge to an Alice-leading Stackelberg equilibrium.

On the theoretical side, our protocol requires that the prover is able to produce a stable argument in the completeness case. That is, when telling the truth, the prover must be able to produce a stable argument for the truth. We prove that such stable arguments exist whenever the truth is supported by sufficiently many independent pieces of evidence. If one relies on independence of supporting evidence to produce stable arguments, one must at a minimum have some reason to believe that the pieces of evidence are actually independent. For example, if a human judge is given three pieces of evidence for a claim, can they accurately say whether the truth-values of each piece of evidence are independent?

Thus we have two key open problems: a theoretical question of whether stable arguments exist under weaker assumptions than sufficient independent evidence, and an empirical question about when we can find independent evidence in practice.