tjbai

Comments

An issue with training schemers with supervised fine-tuning

It's not clear to me that you do get stronger guarantees, because the setting and method are so similar to those of classical imitation learning. In both cases, we seek to learn a policy that is aligned with the expert (human). Supervised fine-tuning (behavioral cloning) is problematic because of distribution shift: the learned policy's errors compound (the worst-case cost grows quadratically in the horizon!), so it drifts into states it never saw in training.
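To make the quadratic rate concrete, here is a sketch of the standard behavioral-cloning bound, written in the notation of the DAgger paper linked further down this page (Ross, Gordon, and Bagnell). The symbols are assumptions of that analysis, not of the post under discussion: $T$ is the task horizon, $\pi^*$ the expert, $\hat{\pi}$ the learned policy, $d_{\pi^*}$ the expert's state distribution, $\ell$ the 0-1 loss against the expert's action, and $J$ the expected total cost with per-step cost in $[0, 1]$.

```latex
\[
  \mathbb{E}_{s \sim d_{\pi^*}}\!\left[\ell(s, \hat{\pi})\right] = \epsilon
  \quad\Longrightarrow\quad
  J(\hat{\pi}) \;\le\; J(\pi^*) + T^2 \epsilon .
\]
```

Intuitively, one mistake can push the policy off the expert's state distribution, after which it may incur the maximum cost on every remaining step. That is up to $T$ opportunities for a mistake, each costing up to $T$.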

You say this failure mode is dangerous because of scheming AI and I say it's dangerous because the policy is OOD, but it appears you agree that the AI only "recognizes" it's not in training because of distribution shift: "Halfway through the generation, the AI could detect those imitation mistakes..." To me, the differing justifications for why the AI performs poorly or dangerously look like a matter of interpretation, not a fundamental difference.

I also don't think it's fair to describe DAgger as just "correcting errors that humans can recognize", because it actually provides formal bounds on error accumulation (under its no-regret analysis, error grows linearly rather than quadratically in the horizon), which would appear to limit the failure mode you describe here. Admittedly, I'm very new to safety research as a whole, but this feels a bit like a reframing of an old problem.
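For readers unfamiliar with DAgger, the loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's experimental setup: the 1-D environment, the `expert`, `rollout`, and `fit` helpers, and the default action for unseen states are all hypothetical. It uses the simple variant in which the expert controls only the first rollout and the learner controls every later one.

```python
import random

def expert(state):
    """Hypothetical expert: always step back toward the origin."""
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0

def rollout(policy, horizon, rng, noise=0.2):
    """Run `policy` for `horizon` steps and return the states it visited."""
    state, visited = 0, []
    for _ in range(horizon):
        visited.append(state)
        state += policy(state)
        if rng.random() < noise:  # random push off the nominal path
            state += rng.choice([-1, 1])
    return visited

def fit(dataset):
    """Tabular stand-in for supervised learning: memorise the label per state."""
    table = dict(dataset)
    # Unseen states fall back to an arbitrary default action (+1).
    return lambda s: table.get(s, 1)

def dagger(n_iters=5, horizon=50, seed=0):
    rng = random.Random(seed)
    # Iteration 0 is plain behavioral cloning on the expert's own rollout.
    dataset = {s: expert(s) for s in rollout(expert, horizon, rng)}
    bc_states = set(dataset)
    policy = fit(dataset)
    for _ in range(n_iters):
        # Visit states under the learner's own distribution...
        for s in rollout(policy, horizon, rng):
            dataset[s] = expert(s)  # ...and have the expert relabel them.
        policy = fit(dataset)  # retrain on the aggregated dataset
    return policy, bc_states, set(dataset)

policy, bc_states, all_states = dagger()
print(len(bc_states), len(all_states))
```

The point of the sketch is the data-collection step: the expert labels states drawn from the learner's rollouts, so the training set covers the states the learner actually reaches rather than only those the expert would visit.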

An issue with training schemers with supervised fine-tuning

tjbai, 9mo

How does this differ from DAgger (https://arxiv.org/abs/1011.0686)?
