This is really exciting to see not only as a way to understand better what people at Anthropic are thinking about how to make ASL-4 safety cases but also as a survey of how various safety agendas might play a role in practice!
Here are some thoughts about the details:
Given that the mechinterp safety case involves running a kind of control evaluation, other monitoring methods can largely be substituted into the eval. This could enable cheap testing of other monitors e.g. based on supervised probing or mechanistic anomaly detection.
I like the idea to check whether a feature is causally necessary for defection in your model organisms as this gives extra certainty about generalization recall.
"Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution." (As is probably mentioned by others) This isn’t necessary for control safety cases as long as you’ve done a sufficient job of eliciting performance from your model (e.g. i.i.d. elicitation with human demonstrations of attacks, which as is mentioned might be difficult).
The third kind of safety case is pretty novel to me which makes it exciting but also I’m not convinced by it yet, for reasons that are mostly already outlined in the blogpost.
The evidence becomes very diluted by multiple hypothesis testing when trying to rule out any cluster of deceptive personas being reinforced. We probably wouldn’t be able to get convincing affirmative evidence with this approach without a meta-level eval.
I think exploration hacking shouldn’t be a concern here if you can get the 10-90% with SFT. (EDIT: However you might be concerned that the 10-90% does not contain the particular deceptive personas that get reinforced.)
This is really exciting to see not only as a way to understand better what people at Anthropic are thinking about how to make ASL-4 safety cases but also as a survey of how various safety agendas might play a role in practice!
Here are some thoughts about the details: