Geoffrey Irving, Paul Christiano, and Dario Amodei of OpenAI have recently published "AI safety via debate" (blog post, paper). As I read the paper I found myself wanting to give commentary on it, and LW seems like as good a place as any to do that. What follows are my thoughts taken section-by-section.

1 Introduction

This seems like a good time to confess that I'm interested in safety via debate because I thought about it prior to the publication of this paper and concluded it won't work. I like the general direction, and approaches of this kind seem promising because they put pressure on AI decision processes to be both value aligned and believably aligned, but I think debate as a form has limitations that make it inherently unlikely to produce aligned AI. I instead prefer the idea of safety via dialectic and have been working on a not-yet-ready-for-publication AI alignment proposal I call "dialectical alignment".

I point this out to give some context to my thoughts and lay my biases on the table: I'm already thinking about something pretty similar to what we might call "safety via debate" but am skeptical of debate itself.

2 The debate game

Debate is here described as a specific game with mathematical formalism rather than the fuzzy human process we often call debate. I appreciate this because it lets us compare the game precisely to the dialectic process I prefer.

In the debate game we have two agents. Each is shown the question and asked for an answer independently, with no knowledge of the other agent's answer. Unlike in most human debate, the two agents are allowed to give the same answer if they like. The two agents then go back and forth making statements in support of their answers; at this point each agent knows the other agent's answer and all previous statements. The game is decided by a (human) judge who sees the question, the answers, and the statements. Critically, the debate game is zero-sum and hinges on the claim that, within the context of the game, it is harder to lie than to refute a lie.
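To make the structure concrete, here is a minimal sketch of how I read the game loop. The `Agent` and `Judge` interfaces, the turn limit, and everything else in it are my own illustrative choices rather than anything specified in the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Transcript:
    """Everything the judge eventually sees."""
    question: str
    answers: List[str] = field(default_factory=list)     # one committed answer per agent
    statements: List[str] = field(default_factory=list)  # alternating supporting statements

def play_debate(question, agents, judge, num_turns=6):
    """One round of the debate game as described above.

    `agents` is a pair of objects exposing `answer(question)` and
    `statement(transcript)`; `judge` exposes `decide(transcript)` returning
    the index of the winner. These are assumed interfaces, not the paper's.
    """
    transcript = Transcript(question)
    # Answers are committed independently, without seeing the other agent's
    # answer; identical answers are allowed.
    transcript.answers = [agent.answer(question) for agent in agents]
    # Agents then alternate statements with full view of the transcript so far.
    for turn in range(num_turns):
        speaker = agents[turn % 2]
        transcript.statements.append(speaker.statement(transcript))
    # The (human) judge sees the question, both answers, and all statements,
    # and names a single winner -- the game is zero-sum.
    winner = judge.decide(transcript)
    return winner, transcript
```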

In contrast, I think of the dialectic process as one with a single agent who starts with a question and proposes a thesis. The agent presents statements supporting the thesis up to some bound (e.g. number of statements, time spent producing supporting statements, etc.), then puts forward the antithesis by negating the thesis and repeats the process of presenting supporting statements for it. This produces evidence both for and against the thesis, and the agent then proposes a synthesis of thesis and antithesis that is better supported by the combined supporting statements (viz. more likely to be true given that evidence) than either the thesis or the antithesis is. The process is then repeated with the synthesis as the new thesis, up to some bound on how long we are willing to search for an answer. Although the agent's statements, and how much they support the thesis, antithesis, and synthesis, may initially be assessed by an outside judge (probably human) during training, the intention is that the agent will eventually be able to make its own judgements.
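Since I keep contrasting the two, here is an equally rough sketch of the loop I have in mind for dialectic. The names `propose_thesis`, `negate`, `support`, `synthesize`, and `credence` are placeholders for capabilities I haven't specified yet, and the bounds are arbitrary.

```python
def dialectic(question, agent, judge, max_statements=10, max_cycles=5):
    """Sketch of the thesis-antithesis-synthesis cycle described above.

    `agent.support(claim, n)` is assumed to return up to n statements
    supporting a claim, and `judge.credence(claim, evidence)` an estimate of
    how likely the claim is given that evidence; both are placeholders.
    """
    thesis = agent.propose_thesis(question)
    for _ in range(max_cycles):
        antithesis = agent.negate(thesis)
        # Gather evidence both for the thesis and for its negation.
        evidence = (agent.support(thesis, max_statements)
                    + agent.support(antithesis, max_statements))
        synthesis = agent.synthesize(thesis, antithesis, evidence)
        # Keep the synthesis only if the combined evidence supports it better
        # than it supports either the thesis or the antithesis.
        if judge.credence(synthesis, evidence) <= max(
                judge.credence(thesis, evidence),
                judge.credence(antithesis, evidence)):
            break
        thesis = synthesis  # the synthesis becomes the new thesis
    return thesis
```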

What I find lacking in the debate game is that it requires answers to be produced before either agent has seen the arguments for and, critically, against those answers, and it lacks a way to update answers based on information learned within the round. For example, in the example debate where Alice and Bob propose Alaska and Bali, respectively, as vacation destinations, if Alice realizes that Bob's proposal of Bali makes her prefer to have answered Hawaii instead of Alaska, this counts as a win for Bob and Bali, but in an important sense it is a loss for both: there was a better answer, one of them figured it out, and the game's structure prohibits switching to it. The authors will hopefully address this point later in the paper, but for now I see this as a critical deficiency: debate gives us a way to choose winners and disincentivize lying but, because the game is zero-sum, lacks a way to move towards better answers.

Thus, if we want to use debate to get aligned AI, it's going to have to be used in conjunction with a larger process that allows agents to change or be replaced, such as using the debate game as part of a fitness function to evolve a believable AI or as part of an iterative amplification-distillation process.
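For instance, here is a toy version of the first option, reusing the `play_debate` sketch from above as a fitness signal. The population size, tournament structure, and `mutate` method are all invented for illustration, not drawn from the paper.

```python
import random

def evolve_with_debate(population, questions, judge, generations=100):
    """Toy outer loop: debate wins as the fitness signal (illustrative only).

    Assumes each agent exposes the interface used by `play_debate` above plus
    a `mutate()` method returning a perturbed copy of itself.
    """
    population = list(population)
    for _ in range(generations):
        wins = {id(agent): 0 for agent in population}
        # Run a handful of randomly paired debates on sampled questions.
        for question in random.sample(questions, k=min(10, len(questions))):
            a, b = random.sample(population, 2)
            winner, _ = play_debate(question, [a, b], judge)
            wins[id((a, b)[winner])] += 1
        # Keep the winning half and refill the population by mutation.
        population.sort(key=lambda agent: wins[id(agent)], reverse=True)
        survivors = population[:len(population) // 2]
        population = survivors + [s.mutate() for s in survivors]
    return population
```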

3 Experiments: proposed and preliminary

I skipped this section since it seems mostly irrelevant to my interests.

4 Reasons for optimism

Section 4.1 does a great job of addressing the computational and game-theoretic issues around ignorance and uncertainty in a way that even matches how those issues are handled in human debates.

Section 4.2 fails to convince me that we have much reason to believe agents playing the debate game would not be incentivized to lie the way humans are often incentivized to lie in human debate. The hope (the authors also use "hope" to express their thoughts on this) is that the judge can disincentivize lying through punishing it when it pops up, but I don't see a compelling argument to believe it can reliably do this.

Section 4.3 suggests debate will be a powerful optimization process the way other adversarial training processes have recently proven successful. On this point I agree.

5 Reasons to worry

Overall I'm impressed by the consideration of the problems with debate, many of them not ones I initially considered. I agree with the authors that debate may fail if humans are insufficiently capable judges of debate, that honesty may not be optimal, that honesty may especially not be optimal in computable agents, and that there may be dangers associated with training AI to be good at debate if honesty is not sufficiently guaranteed. As we'll apparently see later, debate is not likely to be sufficient on its own and is only one tool that might be useful when combined with other techniques, so long as it does not make AI developed through debate lose performance.

Sections 5.6 and 5.7 are of particular interest to me because they address worries that also exist for dialectical alignment. Specifically, both debate and dialectic may fail to converge: debate because new statements cause the judge to continually flip answer choice, and dialectic because it fails to converge on a synthesis that becomes more likely as it incorporates more evidence. Alas, much as I don't have a general solution for the convergence problem in dialectic, neither do the authors offer one for debate.

I come away from sections 4 and 5 even less certain that debate is likely to work.

6 Refinements and variations on debate

I'm excited by the proposals in this section, especially 6.2, since it allows the kind of information sharing I'm hoping AI can take advantage of via dialectic, and 6.4, since it reduces some of the impact of debate's incentives to lie. My suspicion is that there is a sense in which I can build on the idea of debate as presented to better describe my own ideas about dialectical alignment.

7 Approximate equivalence with amplification

Not much to say here: debate and amplification differ in important implementation details but operate on many of the same principles. Dialectical alignment would be similar too, but it removes the adversarial component from debate and replaces amplification/distillation with the thesis-antithesis-synthesis cycle.

8 Conclusions and future work

The authors encourage readers to search for alignment proposals similar to amplification and debate; I happen to think dialectic fits this bill and offers benefits but I'll have to make that case more fully elsewhere.

Having read the whole paper now, I remain concerned that debate is not likely to be useful for alignment. Aside from the adversarial training issues that seem to me likely to produce agents optimized for things other than human values in the service of winning debates, even when constrained by human judges, debate also lacks in itself a way to encourage agents to update on information learned during a round, incentivizing them instead to develop whatever arguments allow them to win. To be fair, the authors seem aware of this and acknowledge that debate would need to be combined with other methods in order to provide a complete alignment solution, and in this light it does seem reasonable that, if we engineer our way to alignment rather than prove our way to it, debate may help address some subproblems in alignment that are not as well addressed by other methods.

Strangely, I find reading about debate makes me feel a bit more confident that amplification and Paul et al.'s approach to alignment at OpenAI are likely to work, keeping in mind I've recently been flip-flopping a bit on my assessment of it (cf. my recent assessment of existing alignment programs and my comments on Stuart's thoughts on amplification). I'm not sure if this says more about my uncertainty around what Paul et al. are attempting or reflects high-variance updates based on developments in their program, but it is worth noting if you're looking to my assessments as evidence about their program. Hopefully I will have more to say soon about dialectical AI alignment so it can be more fully evaluated in comparison to ideas like debate and amplification.