The idea seems very creative and I really liked the paper, especially the complexity theory analogies in section 2.2, the large number of "reasons to worry" considered in section 5, and the explanations of how this idea is related to Amplification in section 7. Overall the difficulty of this approach seems on par with Amplification and also highly correlated with it, since they face many similar issues. I'm curious what the authors think the pros and cons of Debate vs Amplification are likely to be in practice (as opposed to how the two ideas are related to each other conceptually, which is nicely addressed in the paper). (Perhaps they think we should determine that by experimentation and it's not very fruitful to speculate at this point?) Some more detailed questions/comments follow.
I would have liked to see some examples of how the authors expect actual natural language debates to go. I think the opening example is not meant to be realistic (why would Alice make the argument about passports, knowing that a strong counterargument is available?), but that may not be obvious at first glance, which may give some readers wrong intuitions about what a typical debate is supposed to look like.
If honesty is a better strategy than lying (as the paper hopes), then the most likely outcome seems to be that both agents argue for the same answer (as explicitly allowed). What would such a "debate" look like? What would "optimal play in debate picks out a single line of argument, and agents which switch arguments admit defeat and lose" mean in such a debate? (Also, what if a debater honestly changes their mind upon further consideration, for example upon doing more thinking in a future round, or upon hearing something from the opponent that they didn't consider?)
Would this type of debate be useful amongst humans? For example, could a company ask two potential suppliers to use this type of debate against each other to decide which one to buy from? If yes, why hasn't anyone done it before? If no, what's the relevant difference between AIs and humans that makes the idea not applicable to human debaters?
A sufficiently strong misaligned AI may be able to convince a human to do dangerous things via a text channel [Yudkowsky, 2002]. However, we believe it is unlikely that a single short sentence is sufficient for this sort of mind hack. Thus, we can ameliorate the problem in debate by making the statements short enough that mind hacking requires many of them.
In the complexity theory analogy to the polynomial hierarchy, the statements sent by Alice and Bob are polynomial in the length of the input q. So for the analogy to apply, we'd need to allow Alice/Bob in the debate game to give statements that are polynomial in the length of the question + answer. If for security we only allow constant-length statements, then that implies we need a polynomial number of rounds. If the size of the question + answer is large, as when the task is to write a novel, write a program, or help the user given everything known about them, the number of rounds might be impractically large.
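To spell out the arithmetic behind that last step (the symbols r and c and the numbers below are my own, not from the paper): if the debaters must jointly convey poly(n) symbols of argument, where n is the length of the question plus answer, and each statement is capped at a constant length c, then the number of rounds r must satisfy

```latex
\[
  r \cdot c \;\ge\; \mathrm{poly}(n)
  \qquad\Longrightarrow\qquad
  r \;=\; \Omega\!\bigl(\mathrm{poly}(n)\bigr) \quad \text{when } c = O(1).
\]
```

Under made-up illustrative numbers, an argument that has to cover on the order of 10^5 words, at 10 words per statement, already needs on the order of 10^4 rounds.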
If for practicality we force the arguments to be short and the number of rounds to be small, that would introduce a bias towards answers that can be argued for using only short arguments. For example, suppose the question asks the AIs to write some software; to justify the goodness of an answer, the AI may need to explain the software's design/architecture. If the debaters can't argue based on good design/architecture, it seems that they'd be forced to optimize their programs only for other properties that are easier to argue for, such as size and speed, which is probably not what the judge would have wanted.
If Alice tries to mind hack, Bob can win by pointing out the malicious attempt.
This assumes that Bob can reliably detect the same vulnerability, recognize Alice's attempt to exploit it, and figure out a way to convince the judge before the vulnerability is triggered. How can we know that humans only have vulnerabilities for which this is true?
Successful hacks may be safely detectable at first, such as if an agent learning to mind hack starts to always win when playing first
Why can't they also always win when playing second? Also, they may not win every time, but only when the question gives them an opportunity to disguise the hack as a legitimate argument. For example, if the judge is vulnerable to conspiracy theories, then that might only be exploitable when the question is about history or current events.
The sufficiency of responding to only one point is subtle if an argument is fundamentally probabilistic, combining several sources of weak evidence into a combined strong argument. In this case the debaters can be required to state the level of confidence in each point, so that a weak point can only be attacked if it is stated overconfidently.
Given resource constraints, Alice's estimate of the proper level of confidence (the level warranted under logical omniscience) is bound to be wrong pretty often. If she's honest and well calibrated, she'll be overconfident and underconfident equally often, but this leaves Bob able to cherry-pick which point to attack: he can select half of the points, spend twice as much time thinking about each one, and then attack the one where Alice turns out to have been most overconfident. So this style of debate seems to disadvantage answers that need to be supported by multiple weak lines of argument.
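Here is a minimal simulation of this cherry-picking worry, under toy assumptions of my own (Alice's stated confidences are unbiased but noisy; Bob inspects half the points with lower noise): even though Alice is calibrated on average, the point Bob chooses to attack turns out to have been overconfident well over half the time.

```python
import random

def simulate_debate(num_points=10, alice_noise=0.15, bob_noise=0.05, trials=10_000):
    """Toy model of cherry-picked criticism.

    Alice's stated confidence on each point equals the 'true' (logically
    omniscient) confidence plus zero-mean noise, so she is calibrated on
    average. Bob inspects half of the points with less noise (he spends twice
    as long per point) and attacks the one that looks most overconfident.
    """
    attacked_overconfident = 0
    for _ in range(trials):
        true_conf = [random.uniform(0.5, 0.9) for _ in range(num_points)]
        stated = [t + random.gauss(0, alice_noise) for t in true_conf]

        # Bob picks half the points and estimates Alice's error on each more precisely.
        inspected = random.sample(range(num_points), num_points // 2)
        bob_estimate = {
            i: stated[i] - (true_conf[i] + random.gauss(0, bob_noise))
            for i in inspected
        }
        target = max(bob_estimate, key=bob_estimate.get)

        # Was the attacked point actually stated overconfidently?
        if stated[target] > true_conf[target]:
            attacked_overconfident += 1

    return attacked_overconfident / trials

if __name__ == "__main__":
    # Alice is overconfident on only ~50% of points overall, but the point Bob
    # attacks is overconfident far more often than that.
    print(f"P(attacked point was overconfident) ~ {simulate_debate():.2f}")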
One way for Alice to defend herself would be to show that she did as well as she could, given resource constraints. This seems to require that the judge have a notion of what correct reasoning consists of (e.g., a theory of bounded rationality, or a notion that Alice can convey to the judge), and then for Alice to show that she followed normative reasoning to arrive at her answer. So we end up with a problem similar to one in Amplification, namely how to construct Paul's "core of reasoning".
[Posting for Geoffrey Irving, who doesn't have a LW account.]
In the reverse direction amplification mostly seems less adversarial since it's pure supervised learning
Note that you could do amplification with supervised learning, imitation, or RL as the distillation step; in the long run I imagine using imitation+RL, which brings it closer to debate.
Wei Dai asks:
Let me see if I understand this correctly. Suppose the task is to build a strong cryptosystem. One of the subtasks would be to try to break a candidate. With Amplification+SL, the overseer would have to know how to build a tree to do that, which seems to imply he has to be an expert cryptanalyst, and even then we’d be limited to known cryptanalytic approaches, unless he knows how to use Amplification to invent new cryptanalytic ideas. With either Debate or Amplification+RL, on the other hand, the judge/overseer only has to be able to recognize a successful attack, which seems much easier. Does this match what you’re thinking?
I don't see why building a tree to break a cryptosystem requires being an expert cryptanalyst.
Indeed, amplification with SL can just directly copy RL (with roughly the same computational complexity), by breaking task X down into subtasks along the lines of "generate a number of candidate solutions to X" and "evaluate each candidate and return the best one."
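Here is a minimal sketch of what I take that decomposition to look like (the function names and prompts are my own illustrative assumptions, not quoted from anyone):

```python
from typing import Callable, List

def amplified_solve(task: str,
                    subcall: Callable[[str], str],
                    num_candidates: int = 8) -> str:
    """Sketch of an amplification tree that mimics RL-style search.

    `subcall` stands in for asking a (recursively amplified) assistant a
    subquestion and getting back an answer. The decomposition simply
    generates candidates and evaluates them, so the optimization pressure is
    still "whatever scores best under the evaluator".
    """
    # Subtask 1: generate candidate solutions.
    candidates: List[str] = [
        subcall(f"Propose candidate solution #{i} for the task: {task}")
        for i in range(num_candidates)
    ]
    # Subtask 2: evaluate each candidate.
    scores = [
        float(subcall(f"On a scale of 0-10, how good is this solution to "
                      f"'{task}'?\n{c}"))
        for c in candidates
    ]
    # Subtask 3: return the best-scoring candidate.
    return candidates[max(range(num_candidates), key=lambda i: scores[i])]

if __name__ == "__main__":
    # Dummy subcall: scores each candidate by prompt length, just to exercise the tree.
    def dummy_subcall(prompt: str) -> str:
        if prompt.startswith("On a scale"):
            return str(len(prompt) % 10)
        return f"answer to [{prompt}]"
    print(amplified_solve("design a fast sorting routine", dummy_subcall))
```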
This breakdown is dangerous in the same way that RL is dangerous, and we'd like an alternative breakdown that doesn't potentially introduce incorrigible/misaligned optimization. You might think *that* requires being an expert cryptanalyst, but again I don't see the argument.
I do agree that there exist cases where "try stuff and see what works" is the only viable strategy, because there isn't other internal structure that can be leveraged. In these cases it seems like we need to be able to use amplification to "sanitize" the stuff we try, which may act on the generating process, on the evaluation, or in a more complicated way. That needs to be done whether we are doing RL directly or doing it inside of the amplification tree.
At this point I don't understand how imitation+RL brings Amplification closer to Debate, or what is the advantage of using imitation+RL vs using supervised learning. I thought that perhaps it lets us train the AI using an overseer with less expertise (like in my example), but if that's not the case, can you explain a bit more?
don't understand how imitation+RL brings Amplification closer to Debate
The default setup for amplification with RL is roughly this: the AI proposes answers to the question, a subquestion is chosen and answered (recursively, by the same process), and the judge uses the subanswers to decide which answer is better; the AI is rewarded when its answer is the one picked.
This is very similar to debate. The most salient difference is that in the case of amplification, the subanswers are recursively evaluated in the same way as the original answer (i.e. the AI is trying to optimize the probability that their answer would be picked as the better answer, if that subquestion had been chosen as the top-level question). In debate, we have two AIs competing, and each subanswer is generated in order to support one of the original answers / to produce a coherent narrative in combination with one of the original answers.
(There are a bunch of other incidental differences, e.g. is the process driven by the judge or by the debaters, but this doesn't really matter given that you can ask questions like "What subquestion should I ask next?")
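To make the comparison concrete, here is a schematic of the two training episodes as I read them; the exact message structure below (two sampled answers, a single subquestion, the helper names) is my own guess for illustration, not a quotation of either setup.

```python
import random
from typing import Callable, List

Agent = Callable[[str, List[str]], str]   # (question, transcript so far) -> statement
Judge = Callable[[str, List[str]], int]   # (question, transcript) -> index of winner

def amplification_rl_episode(question: str, agent: Agent, judge: Judge) -> int:
    """Schematic amplification-with-RL episode: two sampled answers, one
    subquestion answered by the same agent, judge picks the better answer.
    The subanswer would itself be trained by treating the subquestion as a
    top-level question in another episode."""
    answers = [agent(question, []), agent(question, [])]
    subquestion = f"What fact is most relevant to: {question}?"
    subanswer = agent(subquestion, [])
    winner = judge(question, answers + [subanswer])
    return winner  # reward for the agent: was its answer the one picked?

def debate_episode(question: str, alice: Agent, bob: Agent, judge: Judge,
                   rounds: int = 3) -> int:
    """Schematic debate episode: two competing agents, each statement chosen
    to support that agent's own answer; the judge sees the whole transcript."""
    transcript = [alice(question, []), bob(question, [])]
    for _ in range(rounds):
        transcript.append(alice(question, transcript))
        transcript.append(bob(question, transcript))
    return judge(question, transcript)  # zero-sum: Alice wins iff Bob loses

if __name__ == "__main__":
    # Trivial stand-ins just so the sketch runs end to end.
    dummy_agent: Agent = lambda q, t: f"statement about '{q}' ({len(t)} prior lines)"
    dummy_judge: Judge = lambda q, t: random.randint(0, 1)
    print(amplification_rl_episode("Which answer is better?", dummy_agent, dummy_judge))
    print(debate_episode("Which answer is better?", dummy_agent, dummy_agent, dummy_judge))
```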
The main advantage of debate, as I see it, is as a mechanism for choosing which subquestions to train on. That is, if there is an error buried somewhere deep in the amplification tree, it may never be visited by the amplification training process. But a strategic debater could potentially steer the tree towards that error, if they treat the entire debate as an RL process. (This was my main argument in favor of debates in 2015.)
what is the advantage of using imitation+RL vs using supervised learning
Using supervised learning for imitation, over large action spaces, doesn't seem like a good idea: an imitator that can't reproduce the expert exactly gets little guidance from the supervised objective about which of its errors actually matter.
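To make the worry concrete, here is a toy calculation (the numbers are made-up assumptions, purely for illustration):

```python
# Toy numbers (assumptions for illustration only).
per_token_accuracy = 0.95   # imitator matches the expert on 95% of tokens
sequence_length = 50        # one "action" is a 50-token output

# Probability the imitator reproduces the expert's action exactly.
p_exact = per_token_accuracy ** sequence_length
print(f"P(exact match) = {p_exact:.3f}")   # ~0.077

# So over 92% of sampled actions differ from the expert's, and the supervised
# objective alone says nothing about whether a given mismatch is a harmless
# paraphrase or a catastrophic output.
```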
So to get around this, I think you either need a better approach to imitation learning (e.g. here is a proposal) or you need to add in RL.
I think the only reason we'd want to avoid imitation+RL is that informed oversight might be challenging, and that might make it too hard to construct an adequate reward function. You could hope to avoid that with a careful imitation learning objective (e.g. by replacing the GAN in the "mimicry and meeting halfway" post with an appropriately constructed bidirectional GAN).
I haven't been thinking about non-RL approaches because it seems like we need to solve informed oversight anyway, as an input into any of these approaches to avoiding malign failure. So I don't really see any upside from avoiding imitation+RL at the moment.
Given resource constraints, Alice's estimate of the proper level of confidence (the level warranted under logical omniscience) is bound to be wrong pretty often. If she's honest and well calibrated, she'll be overconfident and underconfident equally often, but this leaves Bob able to cherry-pick which point to attack: he can select half of the points, spend twice as much time thinking about each one, and then attack the one where Alice turns out to have been most overconfident. So this style of debate seems to disadvantage answers that need to be supported by multiple weak lines of argument.
I'm relatively optimistic that some mechanism (perhaps along these lines) could fix the cherry-picked criticism problem. Though I normally think about that as a way to get debates to simulate amplification, which wouldn't be necessary in the ML case; we could just explicitly train predictors for the outcome of a debate on particular subquestions, as in amplification. In practice I suspect that's a good idea.
Would this type of debate be useful amongst humans? For example, could a company ask two potential suppliers to use this type of debate against each other to decide which one to buy from? If yes, why hasn't anyone done it before? If no, what's the relevant difference between AIs and humans that makes the idea not applicable to human debaters?
I think people do a lot of stuff that looks like debate already. (E.g. in the suppliers case, both suppliers may make a pitch and be free to criticize the other's pitch.)
My own view is that it is likely possible to run much better debates amongst humans, but that it's a hard project and the gains are not huge / institutions have issues and are hypocritical / tech progress is a public goods problem / etc.
Some differences in the ML case:
This assumes that Bob can reliably detect the same vulnerability, recognize Alice's attempt to exploit it, and figure out a way to convince the judge before the vulnerability is triggered. How can we know that humans only have vulnerabilities for which this is true?
As you probably guessed, I'm not sold on debate as a defense against this kind of attack.
I think people do a lot of stuff that looks like debate already. (E.g. in the suppliers case, both suppliers may make a pitch and be free to criticize the other's pitch.)
My understanding is that suppliers usually don't see other suppliers' pitches, so they can't criticize them. And when humans do debate, they don't focus on just one line of argument. (I guess I can answer that myself: in the AI training situation, the cost of the judge's time is extra high relative to the debaters' time.) EDIT: In the case of humans, it's hard to make the game fully zero-sum. Allowing suppliers to look at each other's pitches may make it easier for them to collude and raise prices, for example.
As you probably guessed, I'm not sold on debate as a defense against this kind of attack.
Does that mean the language in that section is more optimistic than your personal position? If so, does that language reflect the lead author's position, or a compromise between the three of you? (I find myself ignorant of academic co-authorship conventions.)
I think the language is a compromise; it's not far from my view, though. In particular I endorse:
it is unlikely that a single short sentence is sufficient for this sort of mind hack
and
Successful hacks may be safely detectable at first, [...], although this does not cover treacherous turns where the first successful hack frees a misaligned agent
and I do think that restricting debaters to short sentences helps; I just don't think it fixes the problem.
Geoffrey Irving has done an interview with the AI Alignment Podcast, where he talked about a bunch of things related to DEBATE, including some thoughts that are not mentioned in either the blog post or the paper.
New paper and blog post by Geoffrey Irving, Paul Christiano, and Dario Amodei (the OpenAI safety team).