It seems like this argument assumes that the model optimizes on the entire 'training process'. Why can't we test (perform inference) using the model on distributions different from the training distribution where SGD can no longer optimize to check if the model was deceptive aligned on the training environment?
Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:
Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same as understanding deceptive alignment in a nearly sufficient capacity to actually execute it. It is possible deceptive alignment descriptions did not occur in the training data of GPT-3/GPT-3.5.
It is possible that being able to describe a capability is sufficient to execute it given a certain level of planning capabilities, situational awareness etc. but that does not seem obvious. I guess this is acknowledged implicitly to some extent later in the post.