I like the basic idea, but I don't understand the details, so by default won't include it in the newsletter. Some confusions:
The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense. So you're still using self-play to converge on the Nash equilibrium in the situation where you anneal towards debate, and otherwise you're using that self-play RL reward as part of the loss and the supervised amplification loss as the other part.
Are the arguments the same thing as answers?
The arguments should include what each debater thinks the answer to the question should be.
I think yours is aiming at the second and not the first?
Yep.
The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense.
But the answers are generated from pieces that involve humans, and those humans don't behave as though they are in a zero-sum game?
I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through the function before they get their zero-sum reward... but then the equilibrium behavior could very well be different. For example, if you're advising a human on how to play rock-paper-scissors, but they have a bias against paper and when you tell them to play paper they have a 33% chance of playing rock instead, you should now have a 50% chance of recommending paper, 33% chance of scissors, and 17% chance for rock. So I'm not sure that the reasons for optimism for debate transfer over into this setting where you have a human in the mix.
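To spell out the arithmetic (a quick sketch to check the numbers; the 17% here is exactly 1/6 and the 33% is 1/3):

```python
from fractions import Fraction as F

# The advisor's recommendation distribution from the example above
# (1/6 rock, 1/2 paper, 1/3 scissors).
recommend = {"rock": F(1, 6), "paper": F(1, 2), "scissors": F(1, 3)}

# The biased human follows the advice, except that a recommendation of
# paper gets converted into rock with probability 1/3.
play = {
    "rock": recommend["rock"] + F(1, 3) * recommend["paper"],
    "paper": F(2, 3) * recommend["paper"],
    "scissors": recommend["scissors"],
}

# The human's resulting play distribution is uniform, i.e. the
# unexploitable mixed strategy for rock-paper-scissors.
assert all(p == F(1, 3) for p in play.values())
print(play)  # rock, paper, scissors each played with probability 1/3
```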
Maybe you could make an argument that for any H that we trust enough to do amplification / debate in the first place, this isn't a problem, since Amp(H, M) is supposed to be more capable than M. Alternatively you could say that at the very least H is such that Amp(H, M) gives true and useful arguments, though that might conflict with training M to imitate Amp(H, M) (as in the rock-paper-scissors example above).
Yep; that's basically how I'm thinking about this. Since I mostly want this process to limit to amplification rather than debate, I'm not that worried about the debate equilibrium not being exactly the same, though in most cases I expect that in the limit Amp(H, M) = M, such that you can in fact recover the debate equilibrium if you anneal towards debate.
I really love the level of detail in this sketch!
I'm mentally replacing the question that defines continue0 with something more like "should this debate continue?", because I think the setup you describe keeps going until Amp(M) is satisfied with an answer, which might be never for weak M. It's also not obvious to me that the reward system you describe actually teaches agents to debate between odd and even steps. If there's a right answer that the judge might be convinced of, I think M will be trained to give it no matter the step parity, because when that happens it gets rewarded.
Really, it feels like the state of the debate is more like the state of an RNN, and you're going to end up training something that can make use of that state to do a good job of ending debates and making the human's response similar to the model's response.
you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between
How necessary is annealing for this? Could you choose other optimisation procedures? Or do you refer to annealing in a more general sense?
“Annealing” here simply means decaying over time (as in learning rate annealing), in this case decaying the influence of one of the losses to zero.
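Concretely, that could look something like this (a minimal sketch; the linear schedule and the choice of which loss decays are arbitrary):

```python
def debate_weight(step: int, total_steps: int) -> float:
    # Linearly decay the weight on the debate (RL) loss to zero, so that
    # training limits to pure imitative amplification; decay the supervised
    # weight instead to limit to pure debate.
    return max(0.0, 1.0 - step / total_steps)

def combined_loss(supervised_loss: float, rl_loss: float,
                  step: int, total_steps: int) -> float:
    w = debate_weight(step, total_steps)
    return (1.0 - w) * supervised_loss + w * rl_loss
```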
Let H:Q→A be a human.
[...]
Let Amp(H,M)(Q)=H(“What answer would you give to Q given access to M?”).
Nitpick: is H meant to be defined here as a human with access to M?
It shouldn't be, since M is just a function argument here—and I was imagining that including a variable in a question meant it was embedded such that the question-answerer has access to it, but perhaps I should have made that more clear.
I too wanted that to be written differently, like H(Q|M) or something, since the human can imagine having access to the model without having actual access to the model.
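For what it's worth, one way to render the intended reading in code (a sketch; the types and the way M is handed to H are assumptions, not part of the post's definitions):

```python
from typing import Callable

Question = str
Answer = str
Model = Callable[[Question], Answer]
# H takes a question plus a handle to whatever model it is given access to.
Human = Callable[[Question, Model], Answer]

def Amp(H: Human, M: Model) -> Callable[[Question], Answer]:
    # Amp(H, M)(Q) = H("What answer would you give to Q given access to M?")
    def amplified(Q: Question) -> Answer:
        # M is just a function argument here; embedding it in the question
        # is modeled as passing H a handle it can query while answering.
        return H(f"What answer would you give to {Q} given access to M?", M)
    return amplified
```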
Background
One possible way to train an amplification model is to use an auxiliary reinforcement learning objective to help guide its training. This could be done either by training two separate models (an agent and a question-answerer) or by training a single model on a joint objective. For example, from a comment Paul left on “A dilemma for prosaic AI alignment”:
In my last post, I expressed skepticism of such non-imitative amplification approaches, though in this post I want to propose a possible way in which some of my concerns with this style of approach could be addressed by integrating ideas from AI safety via debate. I'll start by describing the basic idea in broad terms, then give a more careful, technical description of the sort of training procedure I have in mind.
The proposal
The basic idea is as follows: debate naturally yields an RL objective, so if you want to add an auxiliary RL objective to amplification, why not use the RL objective from debate? Specifically, the idea is to conduct a debate not between copies of the model M, but between copies of the amplified model Amp(M) (where Amp(M) is a human with access to the model M). That gives you both an RL reward arising from the winner of the debate and a supervised loss from the difference between the model's output and the amplified model's output. That's the basic proposal—now I'll try to give a more careful, technical description of how I think it should go.
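As a rough sketch of the two training signals this yields (the judge, the distance function, the question wording, and the fixed debate length here are all placeholder assumptions, not the procedure defined below):

```python
def debate_training_signals(Q, M, Amp_M, judge, distance, num_arguments=4):
    # Sketch: run one debate between two copies of Amp(M) and return the
    # RL reward for the first debater plus the supervised imitation loss.
    arguments = []
    for i in range(num_arguments):
        # Debaters alternate; both are instances of Amp(M), i.e. a human
        # answering with access to the model M.
        debater = "first" if i % 2 == 0 else "second"
        arguments.append(Amp_M(
            f"As the {debater} debater, what is the next argument in the "
            f"debate over {Q}, given the arguments so far: {arguments}?"))

    # RL reward in the zero-sum debate-game sense: +1 to the winner and -1
    # to the loser, as decided by the judge (which can itself be Amp(M)).
    rl_reward = 1.0 if judge(Q, arguments) == "first" else -1.0

    # Supervised (imitative amplification) loss: the difference between the
    # model's answer and the amplified model's answer.
    supervised_loss = distance(M(Q), Amp_M(Q))

    return rl_reward, supervised_loss
```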
Definitions
We will start with some initial definitions.
Training procedure
Then, consider the following training procedure.
Ways of combining the RL and supervised losses here include simple weighted combinations, whose weights can be annealed over time as discussed in the analysis below.
Analysis
The first thing I want to note about this procedure is that it gives a continuous relaxation between amplification and debate—in fact, there are multiple ways to tune the procedure to limit to either pure amplification or pure debate. First, you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything else in between. Second, the fact that this procedure uses Amp(M) to determine whether or not to stop the debate after every argument provides another way to interpolate between amplification and debate: if the debate is always stopped after the first argument, you just get pure imitative amplification, whereas the longer the debate goes on, the closer you get to pure debate. What's really nice about this is that when pure imitative amplification is giving good results, you'll just get pure amplification—since then continue0 will tend to be false—but when amplification isn't working, you'll get debate—since then continue0 will tend to be true.
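To illustrate the second point, here is a sketch of the stopping behavior (the question wording and the yes/no interface are assumptions): if the very first continue check comes back false, only a single argument is ever produced and the procedure collapses to one amplification step, while the longer the checks keep the debate alive, the more the procedure behaves like ordinary debate.

```python
def run_debate(Q, Amp_M, should_continue, max_arguments=10):
    # Sketch: Amp(M) produces arguments and, after every argument, Amp(M)
    # judges whether the debate should go on (continue0 is the first such check).
    arguments = []
    for i in range(max_arguments):
        arguments.append(Amp_M(
            f"What is the next argument in the debate over {Q}, "
            f"given the arguments so far: {arguments}?"))
        if not should_continue(Q, arguments):
            break
    return arguments

# If continue0 is always false, every debate stops after the first argument
# and training reduces to pure imitative amplification; if the continue
# checks keep coming back true, training looks more and more like pure debate.
```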
As a result, it should be quite possible to enforce that this procedure limit to HCH—either by annealing the losses or by forcing continue0 to tend towards false. Thus, I think this procedure has a good chance of being outer aligned at optimum—or at least, a similar chance to pure imitative amplification. Unlike pure imitative amplification, however, this procedure gets to make use of the capability benefits of having an auxiliary RL objective to help guide training. Furthermore, since the auxiliary RL objective that we're using comes from debate, we get a lot of the benefits of debate as well, including the ability to incentivize the debaters to produce arguments that we wouldn't necessarily have thought of ourselves, as well as the ability to train our debaters to use transparency tools against each other to help catch deception or other catastrophic behavior. That being said, I do think that whether or not something like this is inner aligned is still quite questionable—and is likely to depend highly on the specific transparency tools you have access to—though I do like the approach described here in general and I think it's definitely worth looking into more.
As an example approach for implementing something like Samp, see “A concrete proposal for adversarial IDA.” ↩︎