One potential critique that I have to this (recognizing that I'm not nearly an expert on the subject, and that this may be a very stupid critique), on the other hand, is that being able to stimulate the environment as if it is the real one to test if they're faking alignment seems like a pretty robust way to see if some model is actually aligned or not. Alignment faking seems much bad of an issue if we can see how much it's faking by (I think).
There is what to be said here about how that may be quite bad when the model is "loose," though.
As in, maybe we can just fake the deployment environment and test the AIs behavior in this case?
Quick dialogue:
Bob: Can't we just simulate what the deployment environment looks like upfront and see if the AI does anything bad?
Alice: Well, for one, that might be quite hard to do so realistically if your AI is quite smart and is trying to figure out if it is being tested. You can of course, ongoingly review some of the AI's actual behavior in deployment. (Putting privacy and logistical concerns aside.)
Bob: Yeah, why doesn't that solve our problem.
Alice: Well, the AI can also realize you might do this. So, if it can try to pick a strategy which has the greatest chance of achieving it's misaligned aims. In this case, it could either suddenly or rarely take misaligned actions such that you don't spot it before it is too late or it could take subtly bad actions often that help it achieve it's aims.
Bob: Ok, but probably we can make this pretty hard for the AI to pull off right? It sounds kinda hard for the AI to get away with something.
Alice: Maybe, that is the hope of the AI control agenda. We can generalize these concerns to be "suppose the AI was an adversary trying to subvert our safety measures (i.e. an insider threat), would that cause problems?". I think it's pretty unclear how easy it would be to ensure control, and I currently expect at least some aspects to be reasonably hard. In particular, I expect at the very least it require substantial effort to evaluate that AIs are controlled for AIs capable of automating research. I also expect control to be basically impossible for very super human models[1].
----------------------------------------
1. That said, it does seem pretty plausible that if we did align smart models, those models could control somewhat smarter models. ↩︎
This is very interesting research!
One potential critique that I have to this (recognizing that I'm not nearly an expert on the subject, and that this may be a very stupid critique), on the other hand, is that being able to stimulate the environment as if it is the real one to test if they're faking alignment seems like a pretty robust way to see if some model is actually aligned or not. Alignment faking seems much bad of an issue if we can see how much it's faking by (I think).
There is what to be said here about how that may be quite bad when the model is "loose," though.