It's not clear to me that you do get stronger guarantees because the setting and method is so similar to that of classical imitation learning. In both cases, we seek to learn a policy that is aligned with the expert (human). Supervised fine-tuning (behavioral cloning) is problematic because of distribution shift, i.e. the learned policy accumulates error (at a quadratic rate!) and visits states it did not see in training.
You say this failure mode is dangerous because of scheming AI and I say it's dangerous because the policy is OOD, but it appears you agree that the AI only "recognizes" it's not in training because of distribution shift—"Halfway through the generation, the AI could detect those imitation mistakes..." To me, it appears the differing justifications for why the AI performs poorly/dangerously is a matter of interpretation, not a fundamental difference.
I also don't think it's fair to describe DAgger as just "correcting errors that humans can recognize" because it actually provides formal bounds on error accumulation, which would appear to limit the failure mode you describe here. Admittedly, I'm very new to safety research as a whole, but this feels a bit like a reframing of an old problem.
It's not clear to me that you do get stronger guarantees because the setting and method is so similar to that of classical imitation learning. In both cases, we seek to learn a policy that is aligned with the expert (human). Supervised fine-tuning (behavioral cloning) is problematic because of distribution shift, i.e. the learned policy accumulates error (at a quadratic rate!) and visits states it did not see in training.
You say this failure mode is dangerous because of scheming AI and I say it's dangerous because the policy is OOD, but it appears you agree that the AI only "recognizes" it's not in training because of distribution shift—"Halfway through the generation, the AI could detect those imitation mistakes..." To me, it appears the differing justifications for why the AI performs poorly/dangerously is a matter of interpretation, not a fundamental difference.
I also don't think it's fair to describe DAgger as just "correcting errors that humans can recognize" because it actually provides formal bounds on error accumulation, which would appear to limit the failure mode you describe here. Admittedly, I'm very new to safety research as a whole, but this feels a bit like a reframing of an old problem.