All of Olli Järviniemi's Comments + Replies

Not sure I understand what you are saying.

If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many high-stakes situations we can create and how many the model will see in deployment.) If this is what you are saying, I agree.
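As a toy illustration of that parenthetical (my own sketch with made-up numbers, not from the thread): under indistinguishability, the policy's posterior that an apparent great opportunity is real is roughly the fraction of such opportunities that are real.

```python
# Toy model (my own, made-up numbers): if honeypots are indistinguishable from
# real opportunities, P(deployment | great opportunity) is roughly the fraction
# of apparent opportunities that are real.

def p_deployment_given_opportunity(real_opportunities: float, honeypots: float) -> float:
    """Posterior that an apparent great opportunity is a real deployment one,
    assuming indistinguishability and equal chance of encountering each one."""
    return real_opportunities / (real_opportunities + honeypots)

# E.g. 100 real deployment opportunities vs. 10,000 fake honeypots:
print(p_deployment_given_opportunity(100, 10_000))  # ~0.0099
```

So the lever is exactly the ratio the parenthetical points at: how many honeypots we can generate versus how many real opportunities the model will see.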

If we can construct fake attacks that are indistinguishable - to the untrusted monitor - from real attacks, then that possibly suffices: we could drive the m...

Buck Shlegeris
Yes totally. (This is impractical if you think that the model has a very large number of opportunities to attack, e.g. if every time it's programming is an opportunity to attack.) Indeed; as you know, this is probably easier than driving the policy's P(deployment|great opportunity) sufficiently low, because untrusted monitors have to defect reliably. What I was trying to say is that if you don't have the ability to generate tasks that look real, you might still be able to honeypot your monitor by taking real trajectories and editing them to contain fake attacks (that you hope the monitor will be unable to distinguish from real attacks). I discuss this here.
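A minimal sketch of what that trajectory-editing honeypot could look like, with hypothetical types and helper names (the linked post is the actual reference): take real trajectories, splice in a fake attack, and measure how often the untrusted monitor flags the result.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trajectory:
    actions: list[str]

def splice_fake_attack(traj: Trajectory, fake_attack: str, position: int) -> Trajectory:
    """Edit a real trajectory so that it appears to contain an attack."""
    return Trajectory(traj.actions[:position] + [fake_attack] + traj.actions[position:])

def honeypot_catch_rate(monitor: Callable[[Trajectory], bool],
                        real_trajectories: list[Trajectory],
                        fake_attack: str) -> float:
    """Fraction of edited trajectories the untrusted monitor flags as attacks."""
    edited = [splice_fake_attack(t, fake_attack, len(t.actions) // 2)
              for t in real_trajectories]
    return sum(monitor(t) for t in edited) / len(edited)
```

A monitor that wants to defect has to flag these honeypots anyway, since (by assumption) it can't tell them apart from real attacks.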

Glad to see public writing on what safety cases could look like!

I liked the section "Common precautions". Points like "wherever possible, run supervised fine-tuning on the eval", "run the eval on multiple model sizes", "ensure that we don’t train against measurements" and "all generated outputs are monitored" are commonly acknowledged as useful/important (even if obvious), so it's nice they were explicitly mentioned here.

In the AI control sketch, there's a part that I find unsatisfactory:

The situations for the game were generated using Simulation Proc...
Buck Shlegeris
Note that you can make untrusted monitoring work without the synthetic inputs too, you just have to do the honeypots during deployment instead of before deployment.
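A rough sketch of that deployment-time variant (hypothetical names, my own gloss, not Buck's proposal): interleave fake attacks into the live monitoring stream rather than into a pre-deployment eval, and treat a missed honeypot as evidence that the monitor is defecting.

```python
import random
from typing import Callable, Iterable

def run_monitor_with_honeypots(monitor: Callable[[str], bool],
                               stream: Iterable[str],
                               fake_attacks: list[str],
                               honeypot_rate: float = 0.01) -> list[str]:
    """Monitor real items, occasionally checking the monitor on a fake attack."""
    flagged = []
    for item in stream:
        if random.random() < honeypot_rate:
            if not monitor(random.choice(fake_attacks)):
                raise RuntimeError("Untrusted monitor missed a deployment-time honeypot")
        if monitor(item):
            flagged.append(item)  # hand off to auditing in a real setup
    return flagged
```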

I don't think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.

Elaborating, I didn't notice signs of the models thinking that. I don't recall seeing outputs to which I'd assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn't take the prompt seriously, and I didn't notice anything like that here.

Of course, such thoughts need not show in the model's completions. I'm unsure how conclusive the absenc...

I wouldn't read that much into the exact phrasing the model uses. Such confusions are not generally present in the model's completions.

E.g. see the second completion here. (There are also 200 more completions here.)

While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.  

In this case the clearest natural language interpretation of the model's outputs is about not revealing...

I view this post as providing value in three (related) ways:

  1. Making a pedagogical advancement regarding the so-called inner alignment problem
  2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
  3. Pushing for thinking mechanistically about cognition-updates (see the sketch just after this list)
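As a rough gloss on points 2 and 3 (my own sketch, not from the post or this review): in a policy-gradient setup the reward only appears as a coefficient on parameter updates, i.e. it chisels cognition rather than being something the trained policy itself represents and optimizes.

```python
import math
import random

theta = [0.0, 0.0]          # logits of a two-action bandit policy
reward = [0.0, 1.0]         # action 1 happens to get reinforced

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    return [e / sum(exps) for e in exps]

for _ in range(1000):
    probs = softmax(theta)
    action = random.choices([0, 1], weights=probs)[0]
    r = reward[action]
    # REINFORCE update: r only scales the gradient of log pi(action).
    # Nothing here requires the policy to model or "want" reward.
    for a in range(2):
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        theta[a] += 0.1 * r * grad_log_pi

print(softmax(theta))       # probability mass has shifted toward the reinforced action
```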

 

Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.

Some mon...

[This comment is no longer endorsed by its author]
Alex Turner
Just now saw this very thoughtful review. I share a lot of your perspective, especially: ... and ...

A local comment to your second point (i.e. irrespective of anything else you have said).

Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said "This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren't able to uproot it. Alignment is extremely stable once achieved" 

As I understand it, the point here is that your ...

Alex Turner
No, this doesn't seem very symmetry-breaking and it doesn't invalidate my point. The hypothetical experiment would still be Bayesian evidence that alignment is extremely stable; just not total evidence (because the alignment wasn't shown to be total in scope, as you say). Similarly, this result is not being celebrated as "total evidence." It's evidence of deceptive alignment being stable in a very small set of situations. For deceptive alignment to matter in practice, it has to occur in enough situations for it to practically arise and be consistently executed. Either way, both results would indeed be (some, perhaps small and preliminary) evidence that good alignment and deceptive alignment are extremely stable under training. Showing that nice behavior is hard to train out would be bad news? We in fact want nice behavior (in the vast majority of situations)! It would be great news if benevolent purposes were convergently drilled into AI by the data. (But maybe we're talking past each other.)