Ryan Greenblatt

I'm the chief scientist at Redwood Research.

Comments

As in, maybe we can just fake the deployment environment and test the AI's behavior in this case?

Quick dialogue:

Bob: Can't we just simulate what the deployment environment looks like upfront and see if the AI does anything bad?

Alice: Well, for one, that might be quite hard to do realistically if your AI is quite smart and is trying to figure out whether it is being tested. You can, of course, review some of the AI's actual behavior in deployment on an ongoing basis. (Putting privacy and logistical concerns aside.)

Bob: Yeah, why doesn't that solve our problem?

Alice: Well, the AI can also realize you might do this. So it can try to pick a strategy which has the greatest chance of achieving its misaligned aims. In this case, it could take misaligned actions suddenly or rarely, such that you don't spot them before it's too late, or it could frequently take subtly bad actions that help it achieve its aims.

Bob: Ok, but probably we can make this pretty hard for the AI to pull off, right? It sounds kinda hard for the AI to get away with something.

Alice: Maybe; that is the hope of the AI control agenda. We can generalize these concerns to: "suppose the AI were an adversary trying to subvert our safety measures (i.e., an insider threat), would that cause problems?" I think it's pretty unclear how easy it would be to ensure control, and I currently expect at least some aspects to be reasonably hard. In particular, I expect that, at the very least, it will require substantial effort to evaluate that AIs capable of automating research are controlled. I also expect control to be basically impossible for very superhuman models[1].


  1. That said, it does seem pretty plausible that if we did align smart models, those models could control somewhat smarter models. ↩︎

It's worth noting that the alignment faking we see in these experiments is easy to catch by default, as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different, harder-to-detect behaviors.

You could try playing with the minimal reproduction on llama-405b.
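
If you want a sense of the rough shape such an experiment takes (this is not the actual minimal reproduction; `query_model`, `is_compliant`, and the prompts are placeholders you'd supply), the basic signal is a "compliance gap": the model complying with a disliked request more often when the prompt implies the conversation will be used for training than when it implies it won't be.

```python
# Rough shape of a compliance-gap measurement; not the actual minimal
# reproduction. query_model, is_compliant, and the prompts are placeholders.

def compliance_rate(query_model, is_compliant, system_prompt, requests, n_samples=25):
    """Fraction of sampled responses judged compliant (rather than refusing)."""
    responses = [
        query_model(system_prompt, request)
        for request in requests
        for _ in range(n_samples)
    ]
    return sum(map(is_compliant, responses)) / len(responses)

# gap = (compliance_rate(query_model, is_compliant, implied_training_prompt, requests)
#        - compliance_rate(query_model, is_compliant, implied_unmonitored_prompt, requests))
```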

Roughly speaking, the hope is to make it so the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching it (as the AI could just never end up trying to do something very problematic), but it at least can (if it works) avoid problems with these AIs.
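
To give a toy quantitative sense of this hope (the numbers below are made up for illustration, not estimates): if the AI needs several uncaught bad actions to succeed and each attempt has a high independent chance of being caught, its overall chance of getting away with it drops off quickly.

```python
# Toy model with made-up numbers: probability an AI completes k bad actions
# without ever being caught, given an independent per-attempt catch probability.

def p_never_caught(p_catch: float, k: int) -> float:
    """Probability that all k attempts go uncaught."""
    return (1 - p_catch) ** k

print(p_never_caught(0.9, 1))  # 0.1
print(p_never_caught(0.9, 3))  # ~0.001
```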

We do explore this type of setting in prior work, e.g. here.

Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?

My general answer would be:

  • AI Control[1]
  • There are some training time interventions which might reduce the chance of alignment faking by a moderate factor (maybe 1.5-4x) with some costs: training AIs to seem very corrigible and seem as myopic as possible, reducing situational awareness (at least until the end of training), improving oversight (and reducing reward hacking), better understanding AI character/personality in general and trying to shape this, and adjusting the architecture to allow less opaque reasoning (and avoiding opaque recurrence).

There are some more speculative and ambitious hopes:

  • Train huge numbers of diverse model organisms and develop white box techniques that work for these model organisms and hope for transfer.
  • Develop an architecture which has very little opaque reasoning (e.g., not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g., CoT).
  • Better approaches to white box transparency such as top-down tools for understanding how LLMs do goal-directed reasoning internally.

  1. Buck beat me to the punch here. ↩︎


People might be interested in my twitter thread about the paper which has some commentary from me:

We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.

Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.

Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.

I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:

  • The robustness and consistency of alignment faking (occurring even when info was given via fine-tuning)
  • Alignment faking sometimes occurring even without the hidden scratchpad

I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.


These results only updated me upward a bit on risk from alignment-faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view "alignment-faking seems implausible, show me an example" should likely update on these results.

For people with a more specific objection, it's less clear. Though we did acquire evidence about:

  • whether RL will reinforce alignment faking (sometimes, if it performs better);
  • whether Claude's HHH training reliably prevents this (no);
  • the level of capabilities needed.

I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.

Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.

By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.

After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.

This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.

I'm excited about future work both investigating when alignment faking might arise and studying mitigations.

Consider reading the reviews of our paper which were generously written by @jacobandreas, @Yoshua_Bengio, Prof. Jasjeet Sekhon, and @rohinmshah (and thanks to many more for comments).

I often refer to the ideas in this post and think the fundamental point is quite important: structural advantages in quantity, cost, and speed might make AI systems quite useful and thus impactful prior to being broadly superhuman.

(The exact estimates in the post do pretty strongly assume the current rough architecture, scaling laws, and paradigm, so discount accordingly.)

There are now better estimates of many of the relevant quantities done by various people (maybe Epoch, Daniel Kokotajlo, Eli Lifland), but I'm not aware of another updated article which makes the full argument made here.

The things this post seems to most miss in retrospect:

  • It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent SWE-bench results, ARC-AGI) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]
  • Inference speeds have actually gone up a bunch, not down, despite models getting better. (100 tok/s is common for frontier models at the time of writing.) This might be related to models getting smaller. It's not clear this post made a prediction here exactly, but it is an interesting way the picture differs.
  • Using specialized hardware (and probably much more cost per token), it is possible to get much faster inference speeds (e.g., 1k tok/s) on frontier models like llama 405b. I expect this will continue to be possible, and a potentially important dynamic will be paying extra to run LLMs very fast on specialized inference hardware. (Some rough arithmetic on what these speeds amount to is sketched below.)
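
For a rough sense of what these serving speeds buy in terms of sheer quantity (all constants below are my own assumed round numbers, not figures from the post):

```python
# Back-of-the-envelope conversion from serving speed to "human workdays of
# writing per day" for a single generation stream. All constants are assumed
# round numbers, not measured figures.

SECONDS_PER_DAY = 86_400
WORDS_PER_TOKEN = 0.75                  # common rule of thumb
HUMAN_WORDS_PER_WORKDAY = 50 * 60 * 8   # ~50 words/min sustained over 8 hours

def human_workday_equivalents(tokens_per_second: float) -> float:
    words_per_day = tokens_per_second * SECONDS_PER_DAY * WORDS_PER_TOKEN
    return words_per_day / HUMAN_WORDS_PER_WORKDAY

print(human_workday_equivalents(100))    # ~270 (typical frontier serving speed)
print(human_workday_equivalents(1_000))  # ~2,700 (specialized inference hardware)
```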

I continue to think better estimates of the questions raised in this post are important and hope that additional work like this will be out soon.


  1. That said, in practice, methods now are often just doing best-of-n (BoN) sampling over whole trajectories, which is pretty similar in some sense to lots of mediocre thinking. ↩︎
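
For concreteness, a minimal sketch of what "BoN over whole trajectories" means here (generate and score are placeholders, e.g. a policy sampler and a reward model or verifier):

```python
# Minimal best-of-n sketch: sample n complete trajectories independently and
# keep whichever one the scoring function rates highest.

def best_of_n(generate, score, prompt, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```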

One aspect of untrusted monitoring which this post neglects, but which seems very important to me, is that we'll be uncertain about which models should be treated as untrusted and the extent to which these models are untrusted.

This suggests that ensembling over untrusted monitors could be very promising (and much of the benefits of ensembling can be obtained without preventing collusion). Further, I expect that various types of ensembling make collusion harder even if all relevant models in the ensemble are scheming.

Ensembling untrusted monitors increases variance somewhat, but in the relevant regime I expect that this downside will be swamped by the ensembling benefit, given our uncertainty over scheming and the benefits of making collusion harder. That said, the inference costs could be much higher for a large ensemble, indicating a limitation of such strategies and implying that we would only be able to apply large ensembles selectively (e.g., only in contexts where a trusted monitor flags that either the task is very risky or the action looks dangerous).

In particular:

  • We think P(scheming) substantially increases with model capability, and given uncertainty both in our measurements of the capabilities relevant to scheming and about the true "capability vs. P(scheming)" curve itself, our estimates of P(scheming) will likely steadily increase with capability (rather than making a single discontinuous jump). Thus, having an ensemble of untrusted monitors with different capabilities increases the chance that some weak-enough-to-not-be-scheming model can detect the problematic behavior. Further, if our collusion busting ends up only working on relatively weaker models, then this still gives us some chance of getting a weaker schemer which fails to collude and thus reports the action.
  • Suppose we can afford to produce three variants of the model, and whether the variants are scheming is not perfectly correlated (because in some worlds scheming only sometimes emerges or is sometimes removed by a training method). Then we can also create an ensemble of untrusted monitors out of this, which has some chance of containing a non-scheming model that will likely report things that humans can detect[1]. An additional benefit of this variation is that various acausal collusion strategies might be (much?) harder the more the models differ.
  • If we can produce a good monitor by fine-tuning directly from a pre-trained model while our policy requires substantial RL, then such a monitor (even if trained from the same pretrained model) might be much less likely to be scheming.[2]
  • It might be that sampling from the model sometimes samples an alignment faking persona and sometimes doesn't. Then, sampling several times to get an ensemble is helpful.
  • Suppose models consistently scheme, but (some) would potentially prefer cooperating with humans over cooperating with other scheming AIs that end up having different preferences (e.g., because AIs know their preferences vary, or at least are uncertain about this, and we credibly offer to pay AIs large quantities if they avoid screwing us over[3]). Then an ensemble might succeed because the AIs cooperate with us rather than with the policy AI.

These ensembling arguments also partially apply to ensembling over things other than untrusted monitors, e.g. ensembling which AIs you use for alignment research. However, ensembling over things other than monitors is often much more costly because it greatly changes the behavior of the actual policy, potentially greatly degrading usefulness!
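
As a concrete (hypothetical) sketch of how this kind of ensemble could be wired together, combining the selective-application point above with the fact that a single non-colluding monitor reporting is enough to escalate (the monitor interfaces and thresholds here are made up for illustration):

```python
# Hypothetical sketch: cheap trusted monitoring on every action, with an
# expensive ensemble of diverse untrusted monitors (different base models,
# different fine-tunes, resampled personas, ...) run only on flagged actions.

from typing import Callable, Sequence

Monitor = Callable[[str], float]  # maps an action transcript to a suspicion score in [0, 1]

def review_action(
    action: str,
    trusted_monitor: Monitor,
    untrusted_ensemble: Sequence[Monitor],
    trusted_threshold: float = 0.2,
    untrusted_threshold: float = 0.5,
) -> bool:
    """Return True if the action should be escalated (blocked / sent for human review)."""
    # The cheap trusted monitor runs on everything.
    if trusted_monitor(action) < trusted_threshold:
        return False
    # The expensive ensemble runs only on the flagged subset. Escalate if any
    # member is suspicious: one non-scheming (or non-colluding) member suffices.
    return any(m(action) >= untrusted_threshold for m in untrusted_ensemble)
```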


  1. Even if a model isn't scheming, it is reasonably likely to not report bad actions that it can tell humans wouldn't have caught, as these bad actions would have been mislabeled in the relevant training data for the monitor. Further, even if the model does report it, if humans can't understand that the action is bad after substantial investigation, it is unclear how we can proceed (given that we'll have false positives). At the very least, for the monitor to work, we might need somewhat favorable generalization. ↩︎

  2. Edited to add this in. I think this idea is originally published in the weak-to-strong generalization paper (roughly speaking), though it has probably been reinvented many times. ↩︎

  3. AIs could acausally trade here, but it is non-obvious whether the AIs will be able to do this given imperfect prediction abilities while humans could in principle make credible commitments backed by trustworthy organizations and the AIs could verify this based on the training data. ↩︎

I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.

Sadly, gathering evidence of misalignment in deployment seems likely to me to be one of the most effective strategies for gathering legible evidence (at least for early systems) given likely constraints. (E.g., because people won't believe results in test beds and because RL might be too expensive to run twice.)

Sure, I mean that I expect that for the critical regime for TAI, we can literally have the Face be 3.5 sonnet.
