"the untrusted model is stateless between queries and only sees the command history and system state."
What justifies the assumption that untrusted models will be stateless in deployment? In many cases there are clear performance advantages to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you're assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Minor suggestion to link to more info on LLC (or at least write it out as "local learning coefficient") the first time it's mentioned -- I see that you do that further down, but it seems confusing on first read.
It seems like that assumption runs throughout the post, though -- e.g. 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said' and 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'
I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?
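To make the concern concrete, here's a minimal sketch of the kind of scaffold I have in mind (the names `build_prompt`, `run_episode`, `model`, and `environment` are my own illustration, not anything from the post): each individual model call is stateless, but every query carries the accumulated record of prior actions and their results, so in practice the model does get to see what it tried and whether it worked.

```python
def build_prompt(task, history):
    # Each query includes everything the model has done so far in this episode.
    lines = [f"Task: {task}"]
    for i, (action, observation) in enumerate(history, 1):
        lines.append(f"Step {i} action: {action}")
        lines.append(f"Step {i} result: {observation}")
    lines.append("Next action:")
    return "\n".join(lines)

def run_episode(task, model, environment, max_steps=10):
    history = []  # accumulated (action, observation) pairs from earlier steps
    for _ in range(max_steps):
        # The model call itself is stateless, but the prompt is not: it carries
        # the full record of prior attempts and their outcomes.
        action = model(build_prompt(task, history))
        observation = environment(action)
        history.append((action, observation))
    return history
```

Under that (very common) setup, assuming the model won't know what it actually tried only holds if labs deliberately strip this history out of the context, which is exactly the assumption I'd want stated explicitly.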