Glad to see public writing on what safety cases could look like!
I liked the section "Common precautions". Points like "wherever possible, run supervised fine-tuning on the eval", "run the eval on multiple model sizes", "ensure that we don’t train against measurements" and "all generated outputs are monitored" are commonly acknowledged as useful/important (even if obvious), so it's nice they were explicitly mentioned here.
In the AI control sketch, there's a part that I find unsatisfactory:
The situations for the game were generated using Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution. (See below for validating the indistinguishability of this distribution.)
It's a strong assumption that we have such a simulation procedure, at least if we are talking about General External Deployment (GED). [Edited later: As Buck says, we actually don't even need this assumption for trusted monitoring. I'm still leaving this up, since untrusted monitoring does rely on such assumptions.] In the Limitations section later, you write
Diversity of simulations. It seems very difficult to produce realistic simulations for a sufficiently diverse set of situations to achieve indistinguishability. Manual construction is infeasible (note that context lengths are likely to be very long). Automatically generating realistic contexts is a challenging research problem.
but I think that even this is still understating the issue: For example, during GED a user might ask a model to proofread a yet-unpublished graduate level textbook on analytic number theory she has been writing. It would be very costly to generate a diverse set of such inputs for the evaluation, and automatic generation faces the challenge of discrimination being easier than generation.
The best solution for this that I'm aware of is to train and use a trusted model to flag inputs that likely weren't in the evaluation set. Now, you mention a related technique in "[P2.5] Optional Steps."
Information hiding. Each input to an oversight model is first fed to a smaller, trusted model trained to check for clear signs that an input is either test or deployment. Such details (such as clear evidence of the date, or other side channels) are explicitly censored in all inputs such that the presence of censorship does not itself carry information about deployment status.
but I think this is far from enough: in the "textbook on analytic number theory" example, it's extremely hard to hide the fact that this input would have been costly to simulate.[1]
My answer would be to have a human oversee the model's actions on inputs flagged by the trusted model, and use such inputs to continually train and evaluate the model. But this is costly and not privacy-preserving. (The cost is acceptable if these inputs are rare enough, though. And if such inputs are common, one needs to put more effort to bridge the evaluation-deployment gap.)
I also worry about multimodality (e.g. native computer usage) making it harder to 1) make realistic simulations and 2) modify inputs in real time to hide information.
I don't think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.
Elaborating, I didn't notice signs of the models thinking that. I don't recall seeing outputs which I'd assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn't take the prompt seriously, and I didn't notice anything like that here.
Of course, such thoughts need not show in the model's completions. I'm unsure how conclusive the absence of evidence is here.
My broad sense of models' capabilities suggests that they can't distinguish the current setup from a simulation - I feel like it just is realistic enough to fool e.g. Opus. The part I'm most worried about with future models is them being suspicious of awfully convenient situations like "you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!"
I'd love to see more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).
I wouldn't read that much to the exact phrasing the model uses. Such confusions are not generally present in the model's completions.
E.g. see the second completion here. (There's also 200 completions more here.)
While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.
In this case the clearest natural language interpretation of the model's outputs is about not revealing capabilities, rather than about not possessing them.
I view this post as providing value in three (related) ways:
Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.
Some months later I read this post and then it clicked.
Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles' exposition skills that's a bit overwhelming.
Another part I liked were the phrases "Just because common English endows “reward” with suggestive pleasurable connotations" and "Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater." One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.
Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view.
I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It's the former view that this post (correctly) argues against. I am sympathetic to pushback of the form "there are arguments that make it reasonable to privilege reward-maximization as a hypothesis" and about this post going a bit too far, but these remarks should not be confused with a rebuttal of the basic point of "cognition-updates are a completely different thing from terminal-goals".
(A part that has bugged me is that the notion of maximizing reward doesn't seem to be even well-defined - there are multiple things you could be referring to when you talk about something maximizing reward. See e.g. footnote 82 in the Scheming AIs paper (page 29). Hence taking it for granted that reward is maximized has made me confused or frustrated.)
Re 3: Many of the classical, conceptual arguments about AI risk talk about maximums of objective functions and how those are dangerous. As a result, it's easy to slide to viewing reinforcement learning policies in terms of maximums of rewards.
I think this is often a mistake. Sure, to first order "trained models get high reward" is a good rule of thumb, and "in the limit of infinite optimization this thing is dangerous" is definitely good to keep in mind. I still think one can do better in terms of descriptive accounts of current models, and I think I've got value out of thinking cognition-updates instead of models that maximize reward as well as they can with their limited capabilities.
There are many similarities between inner alignment and "reward is not the optimization target". Both are sazens, serving as handles for important concepts. (I also like "reward is a cognition-modifier, not terminal-goal", which I use internally.) Another similarity is that they are difficult to explain. Looking back at the post, I felt some amount of "why are you meandering around instead of just saying the Thing?", with the immediate next thought being "well, it's hard to say the Thing". Indeed, I do not know how to say it better.
Nevertheless, this is the post that made me get it, and there are few posts that I refer to as often as this one. I rank it among the top posts of the year.
A local comment to your second point (i.e. irrespective of anything else you have said).
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said "This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren't able to uproot it. Alignment is extremely stable once achieved"
As I understand it, the point here is that your experiment is symmetric to the experiment in the presented work, just flipping good <-> bad / safe <-> unsafe / aligned <-> unaligned. However, I think there is a clear symmetry-breaking feature. For an AI to be good, you need it to be robustly good: you need it to be that in the vast majority of case (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn't aligned.
Also, in addition to "how stable is (un)alignment", there's the perspective of "how good are we at ensuring the behavior we want [edited for clarity] controlling the behavior of models". Both the presented work and your hypothetical experiment are bad news about the latter.
I think lots of folks (but not all) would be up in arms, claiming "but modern results won't generalize to future systems!" And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it's socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I'm being too cynical, but that's my reaction.
(FWIW I think you are being too cynical. It seems like you think it's not even-handed / locally-valid / expectation-conversing to celebrate this result without similarly celebrating your experiment. I think that's wrong, because the situations are not symmetric, see above. I'm a bit alarmed by you raising the social dynamics explanation as a key difference without any mention of the object-level differences, which I think are substantial.)
Not sure I understand what you are saying.
If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many high-stakes we can create and how many the model will see in deployment.) If this is what you are saying, I agree.
If we can construct fake attacks that are indistinguishable - to the untrusted monitor - from real attacks, then that possibly suffices: we could drive the monitor's prior P(deployment | attack) very low. (We have to worry about some other sources of evidence, though.) I guess one could call these honeypots for the monitor, and if this is what you are saying, I agree.
If you are saying that we don't need either of these - i.e. that we can get by with honeypots/attacks that are noticeably fake - then that's news to me, and I don't see how you can do that.