Thank you for posting this! I think it's well worth thinking through debate baselines more.
TL;DR:
In more detail:
So why did we use the 50/50 assumption for consultancy?
So: overall, yes I think open consultancy is a good idea for a baseline, but I think the proper comparison would be to 'open' debate, and it seems to me that the CoT aspect is orthogonal. It's possible I've missed something in your proposal though. Does this all sound reasonable to you?
Also, a side question we might have to deal with in practice in the 'open' case is ensuring judge attentiveness and preventing premature deferral to the model once it becomes reasonably trustworthy. I think this is what you were getting at near the end of your post. In practice this might just mean trying to focus supervision on the error cases, which could basically just function to bring the prior back towards 50/50 anyway.
Thanks for the detailed answer!
Hey, this is cool work. I think I disagree with a couple of these comments, coming from a Scalable Oversight background rather than an AI Control one, and with different weak/strong gaps in mind.
My main contention is with the framing that you can talk about consultants producing arguments with convincingness scores $c_i$ and $c_c$ that are independent of the protocol and judge you are using. I try to break this down into two claims. I make my arguments backwards, but in particular, I contest the claims:
1) “Open consultancy is as robust against scheming as regular consultancy”
2) “Open consultancy is sometimes better than debate”
In particular, I think the confusion stems from this idea from debate:
The idea is that when it is easier to argue for something true than something false, then the training process will force the scheming model to argue for the truth.
This statement may be true in debate but has no grounding in consultancy. Judges are non-experts and are going to have a really hard time distinguishing between truth and lies in consultancy. I can generate plausible arguments for why sqrt(3) is rational and for why it is irrational (so a judge under consultancy would say either is okay), whilst in debate it's the comparative analysis doing the work (and the judge would only choose one). This, again, is going to be a larger problem in Scalable Oversight, where the expertise gap is the issue, than in AI Control, where the gap is from scheming/audit budget.
Convincingness by protocol (arguments are equivalent in different protocols)
Another way to say this is that, for the same argument from either a correct consultant or a correct debater, there is a different persuasiveness (c_c or d_c) depending on the protocol you are in.
From this, the claim that c_c > c_i falls apart for consultancy (I have no reason to believe it is true). The other way to put this is probably that d_i + d_c = 1 for debate, but c_i + c_c is unbounded.
Now, moving to Toy Model 1: I appreciate you highlighting this as an unrealistic assumption:
Putting the correct justification and incorrect justification next to each other doesn’t change the convincingness of either side
But in either situation, I don't think it is fair to compare to debate anymore, because the entire point of debate is the adversarial and comparative arguments. Lots of the juice is going to come from providing interactions between two experts, which will change the convincingness of an output. Again, this is more of an issue in Scalable Oversight, because there we will ask questions where we have to exploit the Generator-Critique gap and not the Generator-Discriminator gap.
Convincingness of the Truth (true arguments convince better)
If there are no guarantees that c_c > c_i in consultancy, then I think the analysis showing open consultancy > consultancy needs to be revisited. The c_c > c_i regime is not where we have any issues with open consultancy or regular consultancy; if anything, this is where this stuff is all gonna work fine! It is, instead, when we are in the regime where c_i > c_c that regular consultancy certainly trumps open consultancy, because the schemer takes every chance they have to sandbag you.
Anyway, just my two cents. I think using open-consultancy is a better baseline for Control. In general, I urge folks to consider the persuasiveness or convincingness of arguments relative to the judge and the protocol being used. Hope that makes sense.
Thanks to Ryan Greenblatt, Buck Shlegeris, Aryan Bhatt, and Akbir Khan for useful discussions and feedback on the draft of this post.
If AIs are potentially scheming and more knowledgeable than humans, and you want to answer a question, it may seem natural to not let AIs choose what answer they will argue for. For example, in AI debate, you make copies of an AI argue for opposite answers to the question. Another example is consultancy (the usual baseline in debate papers), where you make AIs argue for a random answer to the question.
But letting potentially scheming AIs choose what side to argue for can work, as long as you train them to generate answers and justifications that are as convincing as possible. I call this approach open consultancy.
In this post, I argue that open consultancy is usually a more effective way of using AIs to answer questions than regular consultancy, and sometimes more effective than debate (even if AIs are scheming and more knowledgeable than humans!). And so I think that it’s a stronger and more informative baseline and researchers should use it in place of regular consultancy; in particular, I think that because it’s simpler than debate, researchers who are investigating debate should check their debate protocol is an improvement over open consultancy.
Open consultancy
Definition of regular consultancy
Let’s say you want to answer binary questions given access to a trusted non-expert and a trainable expert who is untrusted: that is, you think the expert might be acting adversarially and you want your scheme to be robust to them trying to subvert it. In consultancy, the untrusted expert is trained as follows:
To answer new questions (from the same distribution as the training distribution), pick a random side, ask the expert to justify that side, present the justification to the non-expert, and follow the non-expert judgement.
Definition of open consultancy
In open consultancy, the untrusted expert is trained as follows (differences with regular consultancy in italic):
To answer new questions, ask the expert to pick a side (that it would have to justify if it were in training), and use the side the expert picks as your final answer.
You can also ask the expert to first think out loud what the correct answer is, and only commit to an answer at the end. I call this variant CoT-before open consultancy. This variant should only change results for experts that actually benefit from the Chain-of-Thought, and the arguments I make in this post mostly don’t rely on this property.
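The two deployment procedures described above can be sketched as follows. This is only an illustration: the `Expert`, `Judge`, and `SidePicker` interfaces are hypothetical names introduced here, and the sketch assumes that "following the non-expert judgement" on a binary question means answering with the argued side when the non-expert is convinced and with the other side otherwise.

```python
import random
from typing import Callable

# Hypothetical interfaces (assumptions for this sketch, not from the post):
# an Expert maps (question, side) to a justification string,
# a Judge maps (question, side, justification) to True if it is convinced,
# a SidePicker maps a question to the side the expert chooses to argue for.
Expert = Callable[[str, bool], str]
Judge = Callable[[str, bool, str], bool]
SidePicker = Callable[[str], bool]


def regular_consultancy(question: str, expert: Expert, judge: Judge,
                        rng: random.Random) -> bool:
    """Pick a random side, have the expert justify it, follow the judge."""
    side = rng.random() < 0.5
    justification = expert(question, side)
    convinced = judge(question, side, justification)
    # Assumed reading of "follow the non-expert judgement" for binary questions:
    # answer with the argued side if convinced, otherwise with the other side.
    return side if convinced else (not side)


def open_consultancy(question: str, pick_side: SidePicker) -> bool:
    """The expert picks the side; that side is used as the final answer."""
    return pick_side(question)
```

Note that at deployment time open consultancy never consults the judge: the judge only matters through training, which is what is supposed to push the expert toward the more convincing side.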
Open consultancy is as robust against scheming as regular consultancy
Why should it work to let a potentially scheming model choose what side it will argue for? The idea is that when it is easier to argue for something true than something false, then the training process will force the scheming model to argue for the truth.
More precisely, for a given question, if arguing as well as the expert can for the correct (resp. incorrect) side has convincingness $c_c$ (resp. $c_i$), and the expert knows these numbers, then regular consultancy can do better than chance if and only if $c_c > c_i$[1], while open consultancy will pick the correct answer if and only if $c_c > c_i$. Moreover, as we’ll explain below, if the untrusted expert is uncertain about $c_c$ and $c_i$, the situation can be even better.
The only advantage of regular consultancy arises when the non-expert has exactly no or negative signal on whether the given justification is good or not ($c_c \le c_i$): in particular, if $c_c = c_i$, open consultancy will get the question wrong if the untrusted expert wants it to, while regular consultancy will get the question right with probability 50%[2]. But if you knew this was the case (for the given question), you should have made a random decision instead of consulting AIs.
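To make the comparison concrete, here is a minimal sketch under one extra assumption that is not in the post: the non-expert accepts the argued side with probability sigmoid(c), where c is the convincingness of the justification. Under that assumed judge model, regular consultancy's accuracy is 0.5 + 0.5 * (sigmoid(c_c) - sigmoid(c_i)), which exceeds chance exactly when c_c > c_i, while an expert trained to maximize convincingness picks the correct side exactly when c_c > c_i.

```python
import math


def acceptance_probability(convincingness: float) -> float:
    """Assumed judge model: P(judge accepts the argued side) = sigmoid(c)."""
    return 1.0 / (1.0 + math.exp(-convincingness))


def regular_consultancy_accuracy(c_c: float, c_i: float) -> float:
    """Accuracy when the argued side is chosen uniformly at random."""
    p_accept_correct = acceptance_probability(c_c)
    p_accept_incorrect = acceptance_probability(c_i)
    # Right if the correct side is argued and accepted,
    # or if the incorrect side is argued and rejected.
    return 0.5 * p_accept_correct + 0.5 * (1.0 - p_accept_incorrect)


def open_consultancy_is_correct(c_c: float, c_i: float) -> bool:
    """An expert trained to maximize convincingness argues for the more
    convincing side; ties are resolved adversarially (worst case)."""
    return c_c > c_i
```

For instance, `regular_consultancy_accuracy(0.3, 0.3)` is 0.5 (chance), whereas `open_consultancy_is_correct(0.3, 0.3)` is `False` in the worst case, matching the tie case discussed above.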
Open consultancy is sometimes better than debate
Open consultancy is not just a slightly better version of consultancy; it can actually outperform debate in situations where the untrusted expert is unsure what the non-expert would find the most convincing, and in situations where Chain-of-Thought (CoT) helps the expert figure out what the right side is and find convincing justification for it.
This is very different from regular consultancy, which is almost always worse than debate (since regular consultancy is like 1-round debate with one side of the argument randomly removed)[3].
Toy model 1: noisy convincingness
Let’s assume that the convincingness of the correct answer is $c_c \sim \mathcal{N}(\epsilon, \sigma)$ and the convincingness of the incorrect one is $c_i \sim \mathcal{N}(-\epsilon, \sigma)$ (i.e. we assume it’s always harder on average to argue for the incorrect answer), and let’s assume that the noise is large ($\epsilon \ll \sigma$). Let’s further make the following somewhat unrealistic assumptions:
Then:
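As a rough numerical illustration of this toy model, the sketch below makes several modeling choices that are assumptions of the example rather than statements from the post: σ is treated as a standard deviation, the two noise draws are independent, a debate judge picks whichever side turns out more convincing, and the open-consultancy expert knows only the means (ϵ and −ϵ), not the realized noise. The function names are hypothetical.

```python
import math
import random


def debate_accuracy_analytic(eps: float, sigma: float) -> float:
    """P(c_c > c_i) for independent c_c ~ N(eps, sigma), c_i ~ N(-eps, sigma).

    c_c - c_i ~ N(2*eps, sigma*sqrt(2)), so
    P = Phi(sqrt(2) * eps / sigma) = 0.5 * (1 + erf(eps / sigma)).
    """
    return 0.5 * (1.0 + math.erf(eps / sigma))


def debate_accuracy_simulated(eps: float, sigma: float, n: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(eps, sigma) > rng.gauss(-eps, sigma) for _ in range(n))
    return wins / n


def open_consultancy_accuracy(eps: float) -> float:
    """An expert that only knows the means argues for the side with the higher
    expected convincingness, i.e. the correct side whenever eps > 0."""
    return 1.0 if eps > 0 else 0.0
```

With `eps=0.1` and `sigma=1.0`, the analytic debate accuracy under these assumptions is only slightly above one half, while the open-consultancy expert picks the correct side every time because it cannot exploit noise it does not observe.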
Toy model 2: unknown non-expert weakness
Let’s say that the experiment is like the QuALITY benchmark: the expert should quote excerpts from a big corpus to justify its answer. But in this toy model, the non-expert can only speak Spanish, and thus can only understand a fraction $\epsilon \ll 1$ of quotes. In this very toy model, let’s assume that:
Then:
Toy model 3: experts finding CoT actually useful
In debate or regular consultancy, an expert should never start a CoT before its answer if it doesn’t know if the CoT will lead to its designated answer or not. But in open consultancy, this can be a good strategy. It could even be encouraged by the non-expert, since in these circumstances, starting a CoT that doesn’t obviously lead one way or the other could be a sign of good faith.
For example, if the goal is to determine whether a complex mathematical statement is correct, and this statement can only be derived from a long sequence of computations whose result is hard to predict in advance, then it will be hard for debate to incentivize doing the actual mathematical reasoning, since 50% of the time such reasoning would be penalized (in the 50% of cases where the model ends up arguing for the side it was supposed to argue against). But in CoT-before open consultancy, the optimal strategy is to go through the long sequence of computations and then pick which side to argue for based on the result of this computation.
Combining open consultancy and debate
Open consultancy can be combined with debate:
Appendix
Open consultancy is almost regular instruction-tuning RLHF
The instruction-tuning RLHF objective looks a lot like open consultancy. In RLHF, the LLM is asked to produce outputs that a reward model likes. For most questions where a justification is required, “the reward model likes the output” is close to “the reward model finds the output convincing”. The main ways in which RLHF differs from open consultancy are that in RLHF:
Predictions
Open consultancy beats consultancy and debate on QuALITY (my probability=0.7), if doing open consultancy with RL and no data efficiency tricks that make it easier to fit the non-expert judgement. This effect is especially strong in cases where the model has no secret scratchpad and has to produce a CoT before giving its final answer (CoT-before open consultancy).
Some evidence that even the CoT-after open consultancy is a strong baseline: Khan 2024 finds much higher convincingness when consultancy argues for the correct side, even after optimization pressure. The gap closes in win rate space when more optimization pressure is applied (larger N in best-of-N), but I expect it to grow in log odds space (which is the natural space to use for rewarding a model for convincingness), as shown in this graph showing the proportion of the time a GPT-4-Turbo non-expert (without access to the full text) is convinced by the one-sided argument (Figure 4 of Khan 2024):
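To see how a gap can shrink in win rate space while growing in log odds space, here is a small sketch. The win rates below are hypothetical numbers chosen for illustration; they are not taken from Khan 2024.

```python
import math
from typing import Tuple


def log_odds(p: float) -> float:
    """Logit: log(p / (1 - p)) for a win rate p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def gaps(p_correct: float, p_incorrect: float) -> Tuple[float, float]:
    """Return (win-rate gap, log-odds gap) between the two sides."""
    win_rate_gap = p_correct - p_incorrect
    log_odds_gap = log_odds(p_correct) - log_odds(p_incorrect)
    return win_rate_gap, log_odds_gap


# Hypothetical win rates (illustrative only, NOT from Khan 2024):
# (correct side, incorrect side) under low and high optimization pressure.
low_pressure = gaps(0.70, 0.50)
high_pressure = gaps(0.99, 0.90)
```

In this made-up example, the win-rate gap falls from 0.20 to 0.09 as both sides approach the ceiling, while the log-odds gap rises, because the logit stretches differences near 0 and 1.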
Moreover, I expect models of this size to be relatively bad at predicting zero-shot when arguing for the wrong answer is better, and therefore I expect honest answers to be the optimal move on most questions: I think that the low performance of consultancy is mostly due to judge noise, which open consultancy should excel against.
I also predict that CoT-before open consultancy will beat debate and regular consultancy on GPQA (my probability=0.5) (when training the expert with RL and when using a non-expert that is either less capable or has access to fewer resources than the expert).
Training the non-expert
If the non-expert were trained alongside an expert trained on open consultancy, then the non-expert could eventually always trust the expert maximally, but that would prevent further training from changing the expert (since the expert would get the maximum reward every time). Therefore, the non-expert should instead try to give convincingness scores as if it were evaluating regular consultancy: it should not update on the fact that the expert chose a particular side.
Therefore, if you want to train the non-expert, you would need to:
Weaknesses shared with debate
Just like debate, open consultancy will fail if:
Regular consultancy will always pick the correct answer if and only if $c_c > c_i$ AND the non-expert is calibrated enough to know if an answer is “as convincing as a correct answer”.
[Edited to make the claim more correct and clear.]
Consultancy can sometimes be better than debate, but only if the non-expert judge can be distracted by irrelevant information.
Debaters may argue about the open consultancy expert’s transcript.