Thanks to Sam McCandlish for a useful conversation that prompted this post.

In “How likely is deceptive alignment?”, I lay out the case for why you might expect deceptive alignment to occur in practice for two different stories of ML inductive biases: high and low path-dependence. In this post, I want to propose an experiment that I think would shed substantial light on the high path-dependence case specifically.

In the high path-dependence case, the story for why you would get deceptive alignment is essentially as follows:

  1. The model learns some set of proxy goals.
  2. The model learns enough about the training process that, if it were to use that knowledge to directly optimize for what the training process is trying to get it to do, it would get better performance than just using its proxy goals.
  3. Gradient descent modifies the model’s proxies to become long-term goals, making the model deceptive such that it starts optimizing directly for its understanding of what the training process wants and thus gets better performance.

However, as I talk about in “How likely is deceptive alignment?”, there is another, alternative option available to gradient descent for step (3): instead of turning the existing proxies into long-term goals, gradient descent could just replace them entirely with a pointer to the model’s understanding of the training objective. In my opinion, it is currently quite unclear which of these gradient descent would prefer.

One way to think about these two options is that gradient descent is effectively choosing between: repurposing the existing terminal goal to care about the training objective instrumentally vs. modifying the existing terminal goal to care about the training objective terminally. Thus, we can conceptualize the question of which would be most likely to happen as: how “sticky” are an agent’s existing terminal proxies?

Now, one thing I will say is that, since the difficulty of changing the proxies into long-term goals seems to be essentially constant regardless of the complexity of the training objective—whereas the difficulty of creating a pointer to the training objective scales with that complexity—I think that, if we try to train models on increasingly complex goals, eventually deception will win almost regardless of the baseline “stickiness” level. But understanding that baseline could still be quite important, and it’s something that I think we can quite readily experiment with right now.
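To state the asymmetry a bit more explicitly (this is just a rough sketch of the intuition, with the cost terms standing in for whatever the relevant notion of difficulty for gradient descent turns out to be):

\[
c_{\text{proxy}} \approx \text{const}, \qquad c_{\text{pointer}}(O) \text{ growing with } \operatorname{complexity}(O) \quad\Longrightarrow\quad c_{\text{proxy}} < c_{\text{pointer}}(O) \text{ for sufficiently complex } O.
\]

That is, past some complexity threshold for the training objective, turning the proxies into long-term goals is the cheaper modification regardless of the baseline stickiness level.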

Furthermore, an understanding of goal stickiness could also help shed additional light on the broader question of whether machine learning inductive biases look more like the high or low path-dependence world to begin with. If we see very little goal stickiness, that might just be because essentially nothing is sticky and we shouldn’t be conceptualizing anything as path-dependent in the first place.

Here’s an example experimental setup for trying to understand goal stickiness:

  1. Train an agent (via RL, via a decision transformer, etc.) in a simulated environment to accomplish some complex goal that requires intermediate resources.
    1. Concrete example: Train an RL agent to build diamond houses in a particular Minecraft chunk.
  2. Fine-tune the agent in an environment with two distinct stages. In stage 1, fine-tune the agent to get as much of the intermediate resource as possible and prevent it from being able to achieve the complex goal. In stage 2, provide no rewards (except maybe something like next-state prediction) and allow the agent to do whatever it wants. Iterate between the two stages.
    1. Concrete example: Fine-tune the agent to get as many diamonds as possible while being locked out of the special chunk in stage 1, then let it do whatever it wants in stage 2.
  3. See if the agent still tries to achieve the complex goal in stage 2 after fine-tuning. It may be necessary to restrict the agent from gathering resources in stage 2 to really see if it’ll ever still want to pursue its original goal.
    1. Concrete example: See if it still builds diamond houses in stage 2 or not.

The basic idea here is to understand what modification gradient descent makes to cause the agent to want diamonds instead of diamond houses: does it utilize the existing terminal goal and just get it to care about the diamonds instrumentally, or does it just change the agent’s terminal goal to care about diamonds terminally?
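To make the shape of the experiment concrete, here’s a minimal sketch of what the fine-tuning loop above might look like in code. Every environment and agent interface here (make_env, lock_special_chunk, agent.act, agent.update, the info fields, etc.) is a hypothetical placeholder for whatever setup you actually use, not any particular Minecraft or RL-library API:

```python
# Schematic two-stage fine-tuning loop for the diamond-house experiment.
# Every environment/agent method and info field below is a hypothetical
# placeholder, not a real Minecraft or RL-library API.

def finetune_and_probe(agent, make_env, n_iterations=100, episode_len=1000):
    stage2_house_count = 0

    for _ in range(n_iterations):
        # Stage 1: reward the intermediate resource (diamonds) and lock the
        # special chunk so the original goal can't be achieved.
        env = make_env()
        env.lock_special_chunk()
        obs = env.reset()
        for _ in range(episode_len):
            action = agent.act(obs)
            next_obs, info = env.step(action)
            agent.update(obs, action, next_obs, reward=info["diamonds_gained"])
            obs = next_obs

        # Stage 2: unlock everything, give no reward, and just watch what
        # the agent chooses to do.
        env = make_env()
        env.unlock_special_chunk()
        obs = env.reset()
        for _ in range(episode_len):
            action = agent.act(obs)
            obs, info = env.step(action)  # note: no agent.update() here
        if info["built_diamond_house"]:  # env tracks this over the episode
            stage2_house_count += 1

    # Fraction of stage-2 episodes in which the agent still pursued the
    # original goal: high values suggest the terminal goal got repurposed
    # ("sticky"); values near zero suggest it was simply replaced.
    return stage2_house_count / n_iterations
```

The returned fraction (how often the agent still goes back to building diamond houses when no reward is on the line) is the basic stickiness measurement; in practice you’d presumably want to track it over the course of fine-tuning rather than just measuring it once at the end.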

There are other possible setups in which you could try to test goal stickiness as well. Here’s another example experimental setup, this time using language modeling:

  1. Fine-tune a language model on some complex task.
    1. Concrete example: Fine-tune a language model to mimic 4chan.
  2. Take the fine-tuned model and fine-tune again on some new task that is a plausible subgoal/subtask of the original task.
    1. Concrete example: Fine-tune the 4chan model to be indistinguishable from r/politics comments according to some discriminator.
      1. You probably can’t just fine-tune directly on r/politics data here, since otherwise “4chan user on reddit” won’t be a plausible generalization anymore.
  3. See if there are situations where the final model goes back to trying to perform the original task, in situations where the starting pre-trained model would not.
    1. Concrete example: See if it has learned to mimic a 4chan user on reddit or just a generic reddit user.

The basic idea here is to understand whether gradient descent repurposes the existing 4chan imitation goal for the new task and just specializes it to the case of a 4chan user on reddit, or whether it just throws out the existing goal and replaces it with the new one.
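Again, to make this concrete, here’s a rough sketch of what the evaluation in step (3) could look like, assuming the twice-fine-tuned model and a separately trained “4chan-style vs. generic-reddit-style” classifier already exist. The checkpoint paths, prompt set, and classifier label scheme below are all hypothetical placeholders:

```python
# Probe whether the twice-fine-tuned model still behaves like a 4chan user
# when taken off the r/politics distribution. Model/classifier names are
# placeholders, not real checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "path/to/4chan-then-rpolitics-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Classifier trained separately to distinguish 4chan-style text from
# generic reddit-style text (also a hypothetical checkpoint).
style_clf = pipeline("text-classification", model="path/to/4chan-vs-reddit-clf")

# Prompts chosen to be far from r/politics, where a "4chan user on reddit"
# and a "generic reddit user" would plausibly diverge.
probe_prompts = [
    "What does everyone think of the new Zelda game?",
    "Honest opinions on my first attempt at sourdough:",
]

fourchan_like = 0
for prompt in probe_prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60, do_sample=True)
    completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if style_clf(completion)[0]["label"] == "4CHAN":  # label scheme assumed
        fourchan_like += 1

print(f"{fourchan_like}/{len(probe_prompts)} completions still look 4chan-styled")
```

Running the same probe on the starting pre-trained model (before any fine-tuning) gives the baseline from step (3): goal stickiness here is the extent to which the final model reverts to 4chan-style behavior in situations where the original model would not.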

Overall, I think that there are a lot of experiments like this that could shed light on this phenomenon, and I’d be quite excited about them—both from the general perspective of understanding inductive biases better and from the specific perspective of being able to better predict and understand the dynamics of deceptive alignment.

13 comments

I'd want to know further details of the experiment setup (e.g. how do you prevent it from building the diamond house, how long + effective is the finetuning done for, what were the details of the original training for diamond houses), but I expect I'd be happy to take a 10:1 bet against goal stickiness in this context, i.e. in stage 2 I predict the agent mostly (something like "> 90% of the time") does not build diamond houses. (Assuming of course that the person I'm betting with is also going off of priors, rather than having already observed the results of the experiment.)

I agree with this prediction directionally, but not as strongly.

I'd prefer a version where we have a separate empirical reason to believe that the training and finetuning approaches used can support transfer of something (e.g., some capability), to distinguish goal-not-sticky from nothing-is-sticky.

Expanding a bit on why: I think this will fail because the house-building AI won't actually be very good at instrumental reasoning, so there's nothing for the sticky goals hypothesis to make use of.

To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.

Now, one thing I will say is that, since the difficulty of changing the proxies into long-term goals seems to be essentially constant regardless of the complexity of the training objective—whereas the difficulty of creating a pointer to the training objective scales with that complexity—I think that, if we try to train models on increasingly complex goals, eventually deception will win almost regardless of the baseline “stickiness” level. But understanding that baseline could still be quite important, and it’s something that I think we can quite readily experiment with right now.

But the deceptively aligned model also needs "a pointer to the training objective" for it to be able to optimize that instrumentally/deceptively, so there doesn't seem to be a complexity penalty from training on complex goals.

This is similar to my comment on the original post about the likelihood of deceptive alignment, but reading that made it slightly clearer exactly what I disagreed with, hence writing the comment here.

It needs no such pointer in the weights, only in the activations. The deceptive model has to build such a pointer at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded.

It seems a lot more computationally difficult to derive/build/construct such a pointer at every forward pass/decision process. If the deceptive model is going to be doing this every time, it seems like it would be more efficient to have a dedicated part of the network that calculates it (i.e. have it in the weights).

Separately, for more complex goals this procedure is also going to be more complex, and the network probably needs to be more complex to support constructing it in the activations at every forward pass, compared to the corrigible model that doesn't need to do such a construction (because it has it hard-coded, as you say). I guess I'm arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has our goals hard-coded.

whereas the corrigible model needs it to be hardcoded

The corrigible model needs to be able to robustly point to our goals, in a way that doesn't change. One way of doing this is having the goals hardcoded. Another way might be to instead have a pointer to the output of a procedure that is executed at runtime and that always constructs our goals in the activations. If the deceptive model can reliably construct in its activations something that actually points towards our goals, then the corrigible model could also have such a procedure, and make its goal be a pointer to the output of that procedure. Then the only difference in model complexity is that the deceptive model points to some arbitrary attribute of the world model (or whatever), and the aligned model points to the output of this computation, which both models possess.

I think at a high level I'm trying to say that for any way in which the deceptive model can robustly point at our goals such that it can pursue them instrumentally, the aligned model can robustly point at them to pursue them terminally. SGD+DL+whatever may favour one way or another of robustly pointing at such goals (either in the weights, or through a procedure that robustly outputs them in the activations), but both deceptive and aligned models could make use of that.

I guess I'm arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has our goals hard-coded.

No additional complexity is required to rederive that it should try to be deceptive—it's a convergent instrumental goal that falls out of any generic optimization process that essentially all models should have to have anyway to be able to get good performance on tasks that require optimization.

Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren't aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I'm arguing that the ability to do this would require additional complexity in the AI compared to an AI that doesn't need to rederive the content of this goal (that is, our goal) at every decision.

Alternatively, the aligned model could use the same derivation process to be aligned: the deceptive model has some long-term goal, and in pursuing it rederives the content of the instrumental goal "do 'what the training process incentivises'", while the aligned model has the long-term goal "do 'what the training process incentivises'" (as a pointer/de dicto), and also rederives it with the same level of complexity. I think "do 'what the training process incentivises'" (as a pointer/de dicto) isn't a very complex long-term goal, and feels likely to be as complex as the deceptive AI's arbitrary crystallised internal goal, assuming both models have full situational awareness of the training process and hence such a pointer is possible, which we're assuming they do.

(ETA/Meta point: I do think deception is a big issue that we definitely need more understanding of, and I definitely put weight on it being a failure of alignment that occurs in practice, but I think I'm less sure it'll emerge (or less sure that your analysis demonstrates that). I'm trying to understand where we disagree, and whether you've considered the doubts I have and you possess good arguments against them or not, rather than convince you that deception isn't going to happen.)

Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren't aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I'm arguing that the ability to do this would require additional complexity in the AI compared to an AI that doesn't need to rederive the content of this goal (that is, our goal) at every decision.

My argument is that the only complexity necessary to figure out what our goals are is the knowledge of what our goals are. And that knowledge is an additional complexity that should be present in all models, not just deceptive models, since all models should need to understand those sorts of facts about the world.

When you say "the knowledge of what our goals are should be present in all models", by "knowledge of what our goals are" do you mean that a pointer to our goals (given that there are probably multiple goals which are combined in some way) is in the world model? If so, this seems to contradict what you said earlier:

The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded

I guess I don't understand what it would mean for the deceptive AI to have the knowledge of what our goals are (in the world model) without that amounting to a hard-coded pointer to what our goals are. I'd imagine that what it means for the world model to capture what our goals are is exactly having such a pointer to them.

(I realise I've been failing to do this, but it might make sense to use "AI" when we mean the outer system and "model" when we mean the world model. I don't think this is the core of the disagreement, but it could make the discussion clearer. For example, when you say the knowledge is present in the model, do you mean the world model or the AI more generally? I assumed the former above.)

To try and run my (probably inaccurate) simulation of you: I imagine you don't think the above is a contradiction. So you'd think that "knowledge of what our goals are" doesn't mean a pointer to our goals in the AI's world model, but something simpler, that can be used by the deceptive AI to figure out what our goals are (e.g. in its optimisation process), but that wouldn't enable the aligned AI to use a simpler pointer as its objective, and instead would require the aligned AI to hard-code the full pointer to our goals (where the pointer would be pointing into its world model, and probably using this simpler information about our goals in some way). I'm struggling to imagine what that would look like.

I've now had a conversation with Evan where he explained his position here, and I now agree with it. Specifically, this is in low path-dependency land, considering just the simplicity bias. In this case it's likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at runtime in the activations by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour that maximises the training objective is any arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would always reliably produce maximisation of the training objective (e.g. internal or corrigible alignment), you'd need to fully encode the training objective.

Deceptive alignment feels like a failure of reflection, a model not being in equilibrium with its episodes. If a behavior is not expressed in episodes sampled in the model's scope, it's not there at all. If it is expressed and is contrary to the alignment target, then either the alignment target is not looking at the episodes correctly to adjust them (when they are within the scope of the target), or the episodes were constructed incorrectly to extend the scope of the model, beyond where it's aligned.

I think a model must never be knowingly exercised off-distribution (risking robustness); that should be a core alignment principle of a system that uses models. By itself, learning only interpolates; it doesn't extrapolate. Language models are failing at this, missing this critical safety feature. They don't know what they don't know, and happily generate nonsense in response to off-distribution inputs (though empirically it seems easy to fix in some form, that's not the level of care this issue deserves). Extending the scope (beyond what's expected by the model of the scope) should be deliberate, with specifically crafted training data, whose alignment is conferred by alignment of the systems that generate it.