The idea seems very creative and I really liked the paper, especially the complexity theory analogies in section 2.2, the large number of "reasons to worry" considered in section 5, and the explanations of how this idea is related to Amplification in section 7. Overall the difficulty of this approach seems on par with Amplification and also highly correlated with it, since they face many similar issues. I'm curious what the authors think the pros and cons of Debate vs Amplification are likely to be in practice (as opposed to how the two ideas are related to each other conceptually, which is nicely addressed in the paper). (Perhaps they think we should determine that by experimentation and it's not very fruitful to speculate at this point?) Some more detailed questions/comments follow.
I would have liked to see some examples of how the authors expect actual natural language debates to go. I think the opening example is not meant to be realistic (why would Alice make the argument about passports, knowing that a strong counterargument is available?), but that may not be obvious at first glance, which may give some readers wrong intuitions about what a typical debate is supposed to look like.
If honesty is a better strategy than lying (as the paper hopes), then the most likely outcome seems to be that both agents argue for the same answer (as explicitly allowed). What would such a "debate" look like? What would "optimal play in debate picks out a single line of argument, and agents which switch arguments admit defeat and lose" mean in such a debate? (Also, what if a debater honestly changes their mind upon further consideration, for example upon doing more thinking in a future round, or upon hearing something from the opponent that they didn't consider?)
Would this type of debate be useful amongst humans? For example, could a company ask two potential suppliers to use this type of debate against each other to decide which one to buy from? If yes, why hasn't anyone done it before? If no, what's the relevant difference between AIs and humans that makes the idea not applicable to human debaters?
A sufficiently strong misaligned AI may be able to convince a human to do dangerous things via a text channel [Yudkowsky, 2002]. However, we believe it is unlikely that a single short sentence is sufficient for this sort of mind hack. Thus, we can ameliorate the problem in debate by making the statements short enough that mind hacking requires many of them.
In the complexity theory analogy to the polynomial hierarchy, the statements sent by Alice and Bob are polynomial in the length of the input q. So for the analogy to apply, we'd need to allow Alice/Bob in the debate game to give statements that are polynomial in the length of the question + answer. If for security we only allow constant-length statements, then that implies we need a polynomial number of rounds. If the size of the question + answer is large, as when the task is to write a novel, write a program, or help the user given everything known about them, the number of rounds might be impractically large.
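To spell out the arithmetic behind that last step (the symbols r and c and the numbers below are my own, not from the paper): if the debaters must jointly convey poly(n) symbols of argument, where n is the length of the question plus answer, and each statement is capped at a constant length c, then the number of rounds r must satisfy

```latex
\[
  r \cdot c \;\ge\; \mathrm{poly}(n)
  \qquad\Longrightarrow\qquad
  r \;=\; \Omega\!\bigl(\mathrm{poly}(n)\bigr) \quad \text{when } c = O(1).
\]
```

Under made-up illustrative numbers, an argument that has to cover on the order of 10^5 words, at 10 words per statement, already needs on the order of 10^4 rounds.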
If for practicality we force the arguments to be short and the number of rounds to be small, that would introduce a bias towards answers that can be argued for using only short arguments. For example, suppose the question asks the AIs to write some software; to justify the goodness of an answer, the AI may need to explain the software's design/architecture. If the debaters can't argue based on good design/architecture, it seems that they'd be forced to optimize their programs only for other properties that are easier to argue for, such as size and speed, which is probably not what the judge would have wanted.
If Alice tries to mind hack, Bob can win by pointing out the malicious attempt.
This assumes that Bob can reliably detect the same vulnerability, recognize Alice's attempt to exploit it, and figure out a way to convince the judge before the vulnerability is triggered. How can we know that humans only have vulnerabilities for which this is true?
Successful hacks may be safely detectable at first, such as if an agent learning to mind hack starts to always win when playing first
Why can't they also always win when playing second? Also, they may not win every time, but only when the question gives them an opportunity to disguise the hack as a legitimate argument. For example, if the judge is vulnerable to conspiracy theories, then that might only be exploitable when the question is about history or current events.
The sufficiency of responding to only one point is subtle if an argument is fundamentally probabilistic, combining several sources of weak evidence into a combined strong argument. In this case the debaters can be required to state the level of confidence in each point, so that a weak point can only be attacked if it is stated overconfidently.
Given resource constraints, Alice's estimate of the proper level of confidence (the level warranted under logical omniscience) is bound to be wrong pretty often. If she's honest and well calibrated, she'll be overconfident and underconfident equally often, but this leaves Bob able to cherry-pick which point to attack: he can select half of the points, spend twice as much time thinking about each one, and then attack the one where Alice turns out to have been most overconfident. So this style of debate seems to disadvantage answers that need to be supported by multiple weak lines of argument.
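Here is a minimal simulation of this cherry-picking worry, under toy assumptions of my own (Alice's stated confidences are unbiased but noisy; Bob inspects half the points with lower noise): even though Alice is calibrated on average, the point Bob chooses to attack turns out to have been overconfident well over half the time.

```python
import random

def simulate_debate(num_points=10, alice_noise=0.15, bob_noise=0.05, trials=10_000):
    """Toy model of cherry-picked criticism.

    Alice's stated confidence on each point equals the 'true' (logically
    omniscient) confidence plus zero-mean noise, so she is calibrated on
    average. Bob inspects half of the points with less noise (he spends twice
    as long per point) and attacks the one that looks most overconfident.
    """
    attacked_overconfident = 0
    for _ in range(trials):
        true_conf = [random.uniform(0.5, 0.9) for _ in range(num_points)]
        stated = [t + random.gauss(0, alice_noise) for t in true_conf]

        # Bob picks half the points and estimates Alice's error on each more precisely.
        inspected = random.sample(range(num_points), num_points // 2)
        bob_estimate = {
            i: stated[i] - (true_conf[i] + random.gauss(0, bob_noise))
            for i in inspected
        }
        target = max(bob_estimate, key=bob_estimate.get)

        # Was the attacked point actually stated overconfidently?
        if stated[target] > true_conf[target]:
            attacked_overconfident += 1

    return attacked_overconfident / trials

if __name__ == "__main__":
    # Alice is overconfident on only ~50% of points overall, but the point Bob
    # attacks is overconfident far more often than that.
    print(f"P(attacked point was overconfident) ~ {simulate_debate():.2f}")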
One way for Alice to defend herself would be to show that she did as well as she could, given resource constraints. This seems to require that the judge have a notion of what correct reasoning consists of (e.g., a theory of bounded rationality, or a notion that Alice can convey to the judge), and then for Alice to show that she followed normative reasoning to arrive at her answer. So we end up with a problem similar to one in Amplification, namely how to construct Paul's "core of reasoning".
[Posting for Geoffrey Irving, who doesn't have a LW account.]
In the reverse direction amplification mostly seems less adversarial since it's pure supervised learning
Note that you could do amplification with supervised learning, imitation, or RL as the distillation step; in the long run I imagine using imitation+RL, which brings it closer to debate.
Wei Dai asks:
Let me see if I understand this correctly. Suppose the task is to build a strong cryptosystem. One of the subtasks would be to try to break a candidate. With Amplification+SL, the overseer would have to know how to build a tree to do that, which seems to imply he has to be an expert cryptanalyst, and even then we’d be limited to known cryptanalytic approaches, unless he knows how to use Amplification to invent new cryptanalytic ideas. With either Debate or Amplification+RL, on the other hand, the judge/overseer only has to be able to recognize a successful attack, which seems much easier. Does this match what you’re thinking?
I don't see why building a tree to break a cryptosystem requires being an expert cryptanalyst.
Indeed, amplification with SL can just directly copy RL (with roughly the same computational complexity), by breaking task X down into subtasks along the lines of "generate a number of candidate solutions to X" and "evaluate each candidate and return the best one."
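Here is a minimal sketch of what I take that decomposition to look like (the function names and prompts are my own illustrative assumptions, not quoted from anyone):

```python
from typing import Callable, List

def amplified_solve(task: str,
                    subcall: Callable[[str], str],
                    num_candidates: int = 8) -> str:
    """Sketch of an amplification tree that mimics RL-style search.

    `subcall` stands in for asking a (recursively amplified) assistant a
    subquestion and getting back an answer. The decomposition simply
    generates candidates and evaluates them, so the optimization pressure is
    still "whatever scores best under the evaluator".
    """
    # Subtask 1: generate candidate solutions.
    candidates: List[str] = [
        subcall(f"Propose candidate solution #{i} for the task: {task}")
        for i in range(num_candidates)
    ]
    # Subtask 2: evaluate each candidate.
    scores = [
        float(subcall(f"On a scale of 0-10, how good is this solution to "
                      f"'{task}'?\n{c}"))
        for c in candidates
    ]
    # Subtask 3: return the best-scoring candidate.
    return candidates[max(range(num_candidates), key=lambda i: scores[i])]

if __name__ == "__main__":
    # Dummy subcall: scores each candidate by prompt length, just to exercise the tree.
    def dummy_subcall(prompt: str) -> str:
        if prompt.startswith("On a scale"):
            return str(len(prompt) % 10)
        return f"answer to [{prompt}]"
    print(amplified_solve("design a fast sorting routine", dummy_subcall))
```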
This breakdown is dangerous in the same way that RL is dangerous, and we'd like an alternative breakdown that doesn't potentially introduce incorrigible/misaligned optimization. You might think *that* requires being an expert cryptanalyst, but again I don't see the argument.
I do agree that there exist cases where "try stuff and see what works" is the only viable strategy, because there isn't other internal structure that can be leveraged. In these cases it seems like we need to be able to use amplification to "sanitize" the stuff we try, which may act on the generating process, on the evaluation, or in a more complicated way. That needs to be done whether we are doing RL directly or doing it inside of the amplification tree.
At this point I don't understand how imitation+RL brings Amplification closer to Debate, or what is the advantage of using imitation+RL vs using supervised learning. I thought that perhaps it lets us train the AI using an overseer with less expertise (like in my example), but if that's not the case, can you explain a bit more?
don't understand how imitation+RL brings Amplification closer to Debate
The default setup for amplification with RL is roughly this: the AI proposes answers to the question, a subquestion is chosen and answered (recursively, by the same process), and the judge uses the subanswers to decide which answer is better; the AI is rewarded when its answer is the one picked.
This is very similar to debate. The most salient difference is that in the case of amplification, the subanswers are recursively evaluated in the same way as the original answer (i.e. the AI is trying to optimize the probability that their answer would be picked as the better answer, if that subquestion had been chosen as the top-level question). In debate, we have two AIs competing, and each subanswer is generated in order to support one of the original answers / to produce a coherent narrative in combination with one of the original answers.
(There are a bunch of other incidental differences, e.g. is the process driven by the judge or by the debaters, but this doesn't really matter given that you can ask questions like "What subquestion should I ask next?")
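To make the comparison concrete, here is a schematic of the two training episodes as I read them; the exact message structure below (two sampled answers, a single subquestion, the helper names) is my own guess for illustration, not a quotation of either setup.

```python
import random
from typing import Callable, List

Agent = Callable[[str, List[str]], str]   # (question, transcript so far) -> statement
Judge = Callable[[str, List[str]], int]   # (question, transcript) -> index of winner

def amplification_rl_episode(question: str, agent: Agent, judge: Judge) -> int:
    """Schematic amplification-with-RL episode: two sampled answers, one
    subquestion answered by the same agent, judge picks the better answer.
    The subanswer would itself be trained by treating the subquestion as a
    top-level question in another episode."""
    answers = [agent(question, []), agent(question, [])]
    subquestion = f"What fact is most relevant to: {question}?"
    subanswer = agent(subquestion, [])
    winner = judge(question, answers + [subanswer])
    return winner  # reward for the agent: was its answer the one picked?

def debate_episode(question: str, alice: Agent, bob: Agent, judge: Judge,
                   rounds: int = 3) -> int:
    """Schematic debate episode: two competing agents, each statement chosen
    to support that agent's own answer; the judge sees the whole transcript."""
    transcript = [alice(question, []), bob(question, [])]
    for _ in range(rounds):
        transcript.append(alice(question, transcript))
        transcript.append(bob(question, transcript))
    return judge(question, transcript)  # zero-sum: Alice wins iff Bob loses

if __name__ == "__main__":
    # Trivial stand-ins just so the sketch runs end to end.
    dummy_agent: Agent = lambda q, t: f"statement about '{q}' ({len(t)} prior lines)"
    dummy_judge: Judge = lambda q, t: random.randint(0, 1)
    print(amplification_rl_episode("Which answer is better?", dummy_agent, dummy_judge))
    print(debate_episode("Which answer is better?", dummy_agent, dummy_agent, dummy_judge))
```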
The main advantage of debate, as I see it, is as a mechanism for choosing which subquestions to train on. That is, if there is an error buried somewhere deep in the amplification tree, it may never be visited by the amplification training process. But a strategic debater could potentially steer the tree towards that error, if they treat the entire debate as an RL process. (This was my main argument in favor of debates in 2015.)
what is the advantage of using imitation+RL vs using supervised learning
Using supervised learning for imitation, over large action spaces, doesn't seem like a good idea: an imitator that can't reproduce the expert exactly gets little guidance from the supervised objective about which of its errors actually matter.
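To make the worry concrete, here is a toy calculation (the numbers are made-up assumptions, purely for illustration):

```python
# Toy numbers (assumptions for illustration only).
per_token_accuracy = 0.95   # imitator matches the expert on 95% of tokens
sequence_length = 50        # one "action" is a 50-token output

# Probability the imitator reproduces the expert's action exactly.
p_exact = per_token_accuracy ** sequence_length
print(f"P(exact match) = {p_exact:.3f}")   # ~0.077

# So over 92% of sampled actions differ from the expert's, and the supervised
# objective alone says nothing about whether a given mismatch is a harmless
# paraphrase or a catastrophic output.
```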
So to get around this, I think you either need a better approach to imitation learning (e.g. here is a proposal) or you need to add in RL.
I think the only reason we'd want to avoid imitation+RL is that informed oversight might be challenging, and that might make it too hard to construct an adequate reward function. You could hope to avoid that with a careful imitation learning objective (e.g. by replacing the GAN in the "mimicry and meeting halfway" post with an appropriately constructed bidirectional GAN).
I haven't been thinking about non-RL approaches because it seems like we need to solve informed oversight anyway, as an input into any of these approaches to avoiding malign failure. So I don't really see any upside from avoiding imitation+RL at the moment.
Given resource constraints, Alice's estimate of the proper level of confidence (the level warranted under logical omniscience) is bound to be wrong pretty often. If she's honest and well calibrated, she'll be overconfident and underconfident equally often, but this leaves Bob able to cherry-pick which point to attack: he can select half of the points, spend twice as much time thinking about each one, and then attack the one where Alice turns out to have been most overconfident. So this style of debate seems to disadvantage answers that need to be supported by multiple weak lines of argument.
I'm relatively optimistic that some mechanism (perhaps along these lines) could fix the cherry-picked criticism problem. Though I normally think about that as a way to get debates to simulate amplification, which wouldn't be necessary in the ML case; we could just explicitly train predictors for the outcome of a debate on particular subquestions, as in amplification. In practice I suspect that's a good idea.
Would this type of debate be useful amongst humans? For example, could a company ask two potential suppliers to use this type of debate against each other to decide which one to buy from? If yes, why hasn't anyone done it before? If no, what's the relevant difference between AIs and humans that makes the idea not applicable to human debaters?
I think people do a lot of stuff that looks like debate already. (E.g. in the suppliers case, both suppliers may make a pitch and be free to criticize the other's pitch.)
My own view is that it is likely possible to run much better debates amongst humans, but that it's a hard project and the gains are not huge / institutions have issues and are hypocritical / tech progress is a public goods problem / etc.
Some differences in the ML case:
This assumes that Bob can reliably detect the same vulnerability, recognize Alice's attempt to exploit it, and figure out a way to convince the judge before the vulnerability is triggered. How can we know that humans only have vulnerabilities for which this is true?
As you probably guessed, I'm not sold on debate as a defense against this kind of attack.
I think people do a lot of stuff that looks like debate already. (E.g. in the suppliers case, both suppliers may make a pitch and be free to criticize the other's pitch.)
My understanding is that suppliers usually don't see other suppliers' pitches, so they can't criticize them. And when humans do debate, they don't focus on just one line of argument. (I guess I can answer that myself: in the AI training situation, the cost of the judge's time is extra high relative to the debaters' time.) EDIT: In the case of humans, it's hard to make the game fully zero-sum. Allowing suppliers to look at each other's pitches may make it easier for them to collude and raise prices, for example.
As you probably guessed, I'm not sold on debate as a defense against this kind of attack.
Does that mean the language in that section is more optimistic than your personal position? If so, does that language reflect the lead author's position, or a compromise between the three of you? (I find myself ignorant of academic co-authorship conventions.)
I think the language is a compromise; it's not far from my view, though. In particular I endorse:
it is unlikely that a single short sentence is sufficient for this sort of mind hack
and
Successful hacks may be safely detectable at first, [...], although this does not cover treacherous turns where the first successful hack frees a misaligned agent
and I do think that restricting debaters to short sentences helps; I just don't think it fixes the problem.
Geoffrey Irving has done an interview with the AI Alignment Podcast, where he talked about a bunch of things related to DEBATE, including some thoughts that are not mentioned in either the blog post or the paper.
New paper and blog post by Geoffrey Irving, Paul Christiano, and Dario Amodei (the OpenAI safety team).