well, the fact that I don't have an answer ready is itself a significant component of an answer to my question, isn't it?
A friend on an alignment chat said something to the effect of:
i think they are just sorely underestimating again and again the difference between a cute gang of sincere EA red teamers and the internet. the internet is where [...] lives for god's sake.
And so I figured I'd come here and ask about it. This eval seems super shallow and uncreative - it only checks whether the model, on its own, is trying to destroy the world, and it barely touches on any of the jailbreaks or ways to pressure or trick the model into misbehaving.
My impression is that, with "simulators" and "simulacra", the post is not claiming that the thing being simulated is real physics, but rather that the model learns a general "textphysics engine" (the model itself) which runs textphysics environments. It's essentially just a reframing of the prediction objective to describe deployment time - not a claim that the model actually learns a strong causal simplification of the full variety of real physics.
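A toy version of how I'm reading that reframing (my own illustration, not anything from the post): training fits a next-token conditional distribution, and deployment just rolls that same conditional forward autoregressively; the "textphysics engine" is nothing more than the learned conditional.

```python
# Toy sketch: the same next-token predictor serves as the "engine" at deployment.
# Training fits P(next token | context); deployment rolls that conditional forward.
# Nothing here models real physics - only regularities in the training text.
import random
from collections import defaultdict, Counter

def train(corpus_tokens, order=2):
    """Fit a toy next-token model: counts of token given the last `order` tokens."""
    counts = defaultdict(Counter)
    for i in range(order, len(corpus_tokens)):
        context = tuple(corpus_tokens[i - order:i])
        counts[context][corpus_tokens[i]] += 1
    return counts

def rollout(model, prompt_tokens, steps=10, order=2):
    """Deployment time: run the learned 'textphysics' forward from a prompt."""
    tokens = list(prompt_tokens)
    for _ in range(steps):
        context = tuple(tokens[-order:])
        dist = model.get(context)
        if not dist:  # unseen context: the "environment" simply halts
            break
        candidates, weights = zip(*dist.items())
        tokens.append(random.choices(candidates, weights=weights)[0])
    return tokens

corpus = "the ball falls down . the ball bounces up . the ball falls down .".split()
model = train(corpus)
print(rollout(model, ["the", "ball"]))
```

The training objective and the deployment-time "simulation" are the same object viewed at two different times, which is all I take the post to be saying.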
I think there may have been a miscommunication. It sounded to me like you were making the point that the policy does not have to internalize the reward function, while he was making the point that the training setup does attempt to find a policy that maximizes the reward function as far as it can tell. In other words, he was saying that reward is the optimization target of RL training, and you were saying that reward is not the optimization target of policy inference. Maybe.
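To check that I'm pointing at the same distinction, here's a toy sketch I wrote for this comment (a REINFORCE-style bandit, not anyone's actual training setup): the reward function is consulted only inside the training update, while the deployed policy just samples from whatever distribution it ended up with and never touches reward.

```python
# Illustrative only: reward appears in the *training* update, not at inference.
import math, random

logits = [0.0, 0.0, 0.0]  # tabular policy over 3 actions

def policy_probs():
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(action):
    return 1.0 if action == 2 else 0.0  # the training-time optimization target

def train_step(lr=0.5):
    """Reward shows up here: the update pushes probability toward high-reward actions."""
    probs = policy_probs()
    action = random.choices(range(3), weights=probs)[0]
    r = reward(action)
    for a in range(3):
        grad = (1.0 if a == action else 0.0) - probs[a]  # d log pi / d logit
        logits[a] += lr * r * grad
    return r

def act():
    """Inference: the policy just samples; no reward function is consulted."""
    return random.choices(range(3), weights=policy_probs())[0]

for _ in range(500):
    train_step()
print(policy_probs())  # probability mass should have shifted toward action 2
print(act())
```

So "reward is the optimization target" is a statement about `train_step`, and "the policy doesn't optimize reward" is a statement about `act`; both can be true at once.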