All of tjbai's Comments + Replies

tjbai

It's not clear to me that you do get stronger guarantees, because the setting and method are so similar to those of classical imitation learning. In both cases, we seek to learn a policy that is aligned with the expert (human). Supervised fine-tuning (behavioral cloning) is problematic because of distribution shift, i.e. the learned policy accumulates error (at a quadratic rate!) and visits states it did not see in training.
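(For concreteness, the classical results, stated roughly and under the assumptions in Ross & Bagnell 2010 and the DAgger paper: with horizon T, per-step imitation loss ε, and u a bound on how much a single mistake can increase cost-to-go,)

```latex
% Behavioral cloning / SFT: error compounds quadratically in the horizon
J(\hat{\pi}) \le J(\pi^*) + T^2 \epsilon
% DAgger-style interactive training: compounding is only linear
J(\hat{\pi}) \le J(\pi^*) + u\, T\, \epsilon + O(1)
```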

You say this failure mode is dangerous because of scheming AI and I say it's dangerous because the policy is OOD, but it appears you agree…

Ryan Greenblatt
(Also, to be clear, thanks for the comment. I strong upvoted it.)
Ryan Greenblatt
I would say that it is dangerous in the case where it is both OOD enough that the AI can discriminate and the AI is scheming. Neither alone would present a serious (i.e. catastrophic) risk in the imitation context we discuss.
Ryan Greenblatt
Hmm, I think I was wrong about DAgger and confused it with a somewhat different approach in my head. I agree that it provides bounds. (Under various assumptions about the learning algorithm that we can't prove for NNs but seem reasonable to assume in practice.) I now agree that the proposed method is basically just a slight tweak of DAgger to make it more sample/cost-efficient in the case where our issue is discrimination by the policy. I agree this is a special case of well-known issues with behavioral cloning - we probably should have made this clearer in the post.
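(For anyone who wants the comparison concrete, here is a minimal sketch of the DAgger loop; the environment/expert/trainer interfaces are placeholders for illustration, not code from the post or the paper.)

```python
# Minimal sketch of the DAgger loop (Ross, Gordon & Bagnell, 2011).
# All interfaces here are placeholders, not anyone's actual code.
from typing import Callable, List, Tuple

State = object
Action = object
Policy = Callable[[State], Action]

def dagger(
    expert: Policy,
    reset: Callable[[], State],
    step: Callable[[State, Action], State],
    train: Callable[[List[Tuple[State, Action]]], Policy],
    n_iters: int,
    horizon: int,
) -> Policy:
    dataset: List[Tuple[State, Action]] = []
    # Iteration 0 rolls out the expert itself; the full algorithm mixes
    # expert and learner with a decaying coefficient, simplified here.
    policy: Policy = expert

    for _ in range(n_iters):
        # Roll out the *current* policy so states come from the learner's
        # own state distribution -- the distribution plain BC never sees.
        state = reset()
        for _ in range(horizon):
            # Query the expert on every visited state and aggregate,
            # regardless of whether the policy's action looks fine.
            dataset.append((state, expert(state)))
            state = step(state, policy(state))

        # Retrain on everything collected so far (dataset aggregation).
        policy = train(dataset)

    return policy
```

The difference I'm pointing at is in how often, and on which states, the expert gets queried - not in the basic loop structure.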
tjbai

How does this differ from DAgger (https://arxiv.org/abs/1011.0686)?

Fabien Roger
I have edited the post to add the relevant disclaimers and links to the papers that describe very similar techniques. Thank you very much for bringing these to my attention!
Ryan Greenblatt
Edit: I now think this is false for how DAgger is presented in the paper; see the discussion below. The method and the motivation are similar, though note that DAgger is effectively an RL scheme trying to maximize performance, while we're trying to avoid a particular failure mode due to misalignment. From my understanding, DAgger just involves correcting errors that humans can recognize, while we're trying to get stronger guarantees.