Kudos for releasing a concept of a plan! Some thoughts:
Regarding the first safety case:
The amount of progress in mech interp required to make the first safety case suitable seems optimistic to me; I basically think that most of the "limitations" are in fact pretty serious. However, I appreciate the attempt to include concrete requirements.
I believe that getting good results from the following experiments might be particularly unrealistic:
"In order to robustify our evals against sandbagging, we ran an experiment where we steered one or more truth-related features to make the model confident that it is in either testing or unmonitored deployment"
I worry that "truth-related" features are very difficult to isolate, both in principle and in practice: the model's representation of what other people / entities believe will be hard to disentangle from what it itself believes, to the extent that those are even meaningfully separable concepts.
Even if you could identify a feature roughly corresponding to "I believe the current sentence is true", it is not immediately clear to me how this could be used to successfully convince the model of more complex situational information, e.g. it is not being monitored and is in a real deployment setting. For instance, I don't expect artificially activating that feature on the sentence "I am not being monitored and this is a real deployment setting" to just work.
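To make the worry concrete, here is a minimal sketch of what feature steering usually amounts to (all names and shapes are hypothetical, with toy random activations standing in for a real model's residual stream): the intervention adds one scaled direction to every token's activation, which says nothing about the specific proposition "I am unmonitored".

```python
import numpy as np

def steer(residual_stream, feature_direction, scale):
    """Add a scaled unit-norm feature direction to every token's
    residual-stream activation -- the standard steering move."""
    d = feature_direction / np.linalg.norm(feature_direction)
    return residual_stream + scale * d

rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 512))   # (tokens, d_model), toy activations
truth_dir = rng.normal(size=512)    # hypothetical "this is true" feature

steered = steer(acts, truth_dir, scale=4.0)
# Every token is shifted the same way along one axis; nothing here targets
# the propositional content of the sentence being steered, which is why a
# single truth feature seems unlikely to install complex situational beliefs.
assert steered.shape == acts.shape
```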
"In all of these model organisms, deactivating the features of concern was sufficient to prevent the harmful behaviors"
I can't concisely refute this. It is just a "huge if true" kind of claim, and as far as I am aware very much not possible with current techniques. If we could isolate a set of linear directions that were robustly causally responsible for scheming behavior, that would have massive ramifications far beyond this setting.
I worry a little bit about "training on test" here – there may exist some overfit set of classifiers that can stop model organisms from misbehaving in a specific setting without meaningfully generalizing. This is also related to the stated "Realism of the model organisms" concerns.
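For reference, "deactivating a feature" typically means projecting the feature's direction out of the activations (or clamping an SAE latent to zero). A toy numpy version, with hypothetical names, shows how narrow the intervention is, and why a set of directions that robustly ablates scheming would be such a strong result:

```python
import numpy as np

def ablate(residual_stream, feature_direction):
    """Zero-ablate a feature by projecting each activation onto the
    hyperplane orthogonal to the feature's direction."""
    d = feature_direction / np.linalg.norm(feature_direction)
    return residual_stream - np.outer(residual_stream @ d, d)

rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 512))   # toy residual-stream activations
scheme_dir = rng.normal(size=512)   # hypothetical "scheming" feature

clean = ablate(acts, scheme_dir)
# After ablation, activations carry no component along the direction --
# but only along this one axis; an overfit classifier could pass a
# specific model-organism eval without generalizing.
d = scheme_dir / np.linalg.norm(scheme_dir)
assert np.allclose(clean @ d, 0.0)
```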
I strongly approve of this element of the plan: "In order to preserve the integrity of the learned features, we ensured that the learned features were never used as part of the training procedure for the models."
It's kind of hard to do this in full generality, since you need to not only avoid training directly against the probes, but also avoid using the information from probes at all when selecting what to train on.
I'm more excited about the control safety case, but have a few nitpicks:
Developing a good elicitation strategy is indeed challenging.
I think it is pretty crucial to use real human red-teaming, especially if sandbagging (more specifically something like exploration hacking) is a concern.
It may be helpful to give the red team way more affordances than the untrusted model would actually have in deployment, and only remove affordances if defense becomes impossible. Some examples of extra affordances:
Letting the red team know the blue team's exact strategy in order to search for the best response
Giving an untrusted model more unmonitored chain-of-thought / tool use / human assistance than it would actually have in deployment
Allowing human red-teamers to repeatedly attempt to bypass a monitor when in reality a scheming model might have to one-shot it
Figuring out exactly what extra affordances suffice to break the control protocol may also provide useful signal about where the protocol could be strengthened, and what capabilities need to be tested for going forward.
Eliciting capabilities from substantially superhuman models is a much harder problem, and is one of the main reasons I don't expect control to suffice for a safety case well beyond human-level AI.
The space of strategies beyond trusted monitoring alone is pretty large. You mention untrusted monitoring, but there is also a broad space of strategies that include editing outputs with a trusted model, allocating more monitoring/auditing budget in certain situations, etc.
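One way to picture that broader strategy space, as a toy sketch (the thresholding scheme and action names are my invention, not anything from the plan): spend a fixed human-audit budget on the most suspicious outputs and fall back to a cheaper mitigation, such as a trusted model rewriting the output, for the rest.

```python
import numpy as np

def route(suspicion_scores, audit_budget):
    """Send the top `audit_budget` fraction of outputs (by monitor
    suspicion score) to human audit; defer the rest to a trusted model."""
    n_audit = int(len(suspicion_scores) * audit_budget)
    order = np.argsort(suspicion_scores)[::-1]  # most suspicious first
    actions = ["defer_to_trusted"] * len(suspicion_scores)
    for i in order[:n_audit]:
        actions[i] = "human_audit"
    return actions

scores = np.array([0.1, 0.9, 0.4, 0.95, 0.2])
print(route(scores, audit_budget=0.4))  # the two highest scores get audited
```

The point of the sketch is just that monitoring budget is a tunable resource: the blue team can concentrate it on high-risk situations rather than applying one uniform protocol.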
It is probably best to continually produce new honeypots throughout deployment.
For instance, if untrusted models are being used to do lots of automated software engineering and research, likely requiring huge amounts of context, it will become more obvious to the model over time that it is in deployment.
Making new honeypots out of actual deployment settings, for instance by adding exploitable security flaws, may generally make it much harder for the model to tell when it is being tested.
I haven't read the last safety case yet, but may have some thoughts on it later. I am most excited about control at the moment, in part due to concerns that interpretability won't advance far enough to suffice for a safety case by the time we develop transformative AI.