One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails...
I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn't actually the main problem which needs to be solved here.
(Or perhaps you're intentionally ignoring that problem by assuming "goal-alignment"?)
I would consider goal generalization a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, I would expect it to want to get feedback about how to generalize its goals correctly when it encounters ontological shifts.
I agree with you, and yes, we ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.
I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)
Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of "flourishing" or "human". And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.
The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"
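As a toy sketch of that proxy relationship (a hypothetical illustration; the trades and functions below are made up, not from the comment): in the bot's initial setting the money proxy tracks preference satisfaction, but in a wider action space the two come apart, and a bot that generalises the proxy rather than the underlying goal picks the wrong trade.

```python
# Toy sketch of a proxy goal coming apart from the true goal off-distribution.
# All data and function names here are hypothetical illustrations.

def money_earned(trade):
    """Proxy objective the trade-bot was trained on."""
    return trade["profit"]

def preference_satisfaction(trade):
    """The underlying goal that money was standing in for."""
    return trade["value_to_humans"]

# In the initial setting the proxy tracks the true goal...
training_trades = [
    {"profit": 10, "value_to_humans": 10},
    {"profit": 5, "value_to_humans": 5},
]

# ...but in a wider action space they come apart.
deployment_trades = [
    {"profit": 100, "value_to_humans": -50},  # e.g. exploiting a market glitch
    {"profit": 1, "value_to_humans": 20},
]

best_by_proxy = max(deployment_trades, key=money_earned)
best_by_goal = max(deployment_trades, key=preference_satisfaction)
print(best_by_proxy == best_by_goal)  # False: the proxy misgeneralises
```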
For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
There is a critical step missing here: the point where the trade-bot makes a "choice" between maximising money and satisfying preferences.
At this point, I see two possibilities:
So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"
This is a false dichotomy. If we can assume that the AI will optimize for its developers' goals once it gains situational awareness, then alignment is already solved. Making the goals safe before situational awareness is not that hard: at that point, the AI is not capable enough to pose X-risk.
(A discussion of X-risk brought about by situationally unaware AIs, such as a Christiano-style failure story, could be interesting, but Soares's model is not about that case, since it assumes an autonomous ASI.)
A Sharp Left Turn (SLT) is a hypothesized rapid increase in an AI system's capabilities (such as planning and world modeling). This post outlines our current understanding of the most promising plan for getting through an SLT and how it could fail (conditional on an SLT occurring).
In a previous post, we broke down the SLT threat model into three claims. There we proposed some possible mechanisms for Claim 1; this post investigates possible arguments and mechanisms for Claim 2.
Plan: we use alignment techniques to find a goal-aligned model before SLT occurs, and the model preserves its goals during the SLT.
We can try to learn a goal-aligned model before SLT occurs: a model that has beneficial goals and is able to reason about its own goals. This requires the model to have two properties: goal-directedness towards beneficial goals, and situational awareness (which enables the model to reason about its goals). Here we use the term "goal-directedness" in a weak sense (that includes humans and allows incoherent preferences) rather than a strong sense (that implies expected utility maximization).
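To make the weak/strong distinction concrete, here is one standard way to formalize it (our gloss in decision-theory notation; the post itself doesn't commit to a formalization):

```latex
% Strong goal-directedness: some fixed utility function U rationalizes
% the agent's policy as expected-utility maximization:
\[
  \pi^* \in \arg\max_{\pi} \; \mathbb{E}\!\left[ U(\mathrm{outcome}) \mid \pi \right]
\]
% Weak goal-directedness drops this requirement. For example, the cyclic
% preferences
\[
  A \succ B, \qquad B \succ C, \qquad C \succ A
\]
% admit no such U, since they would require U(A) > U(B) > U(C) > U(A),
% a contradiction; yet agents with such inconsistent preferences
% (humans included) still count as goal-directed in the weak sense.
```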
One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails (e.g. it might be fine if interpretability or ELK techniques no longer work reliably during the transition, as long as we can trust the model to manage the transition).
Step 1: Finding a goal-aligned model before SLT
We want to ensure that the model is goal-directed towards a beneficial goal and has situational awareness before the SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises while the model still has undesirable goals, this leads to deceptive alignment.
Thus, our model search process would follow a decision tree along these lines:
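Below is a minimal sketch of such a procedure (a hedged illustration: the boolean probes like `situationally_aware` are hypothetical stand-ins for capability evaluations and interpretability tools, not tools that exist today):

```python
# Hypothetical sketch of the model search described above. The boolean
# "probes" are stand-ins for capability evaluations and interpretability
# tools; nothing here corresponds to an existing API.

def search_step(evals: dict) -> str:
    """Decide what to do with a model checkpoint, given probe results."""
    if evals["situationally_aware"] and not evals["beneficial_goals"]:
        # Situational awareness arrived before beneficial goals:
        # the deceptive-alignment branch, so discard this model.
        return "discard"
    if not evals["goal_directed"]:
        return "keep training"        # too early to evaluate goals
    if not evals["beneficial_goals"]:
        return "adjust training"      # goal-directed, but goals are off
    if not evals["situationally_aware"]:
        return "keep training"        # beneficial goals, SA not yet present
    return "goal-aligned candidate"   # beneficial goals + SA: proceed to Step 2

# Example: situational awareness arrived before beneficial goals -> discard.
print(search_step({"situationally_aware": True,
                   "beneficial_goals": False,
                   "goal_directed": True}))
```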
Here are some ways alignment techniques could be applied in this plan:
Why alignment techniques could fail to find a goal-aligned model before SLT occurs:
It's also possible that this model search process would impose too high an alignment tax to be followed in practice (see Claim 3).
Step 2: The goal-aligned model preserves its goals during SLT (with some help from us)
This step of the plan relies on the goal-aligned model both wanting to preserve its goals during an SLT and being capable of preserving its goals. The general argument for a model wanting to preserve its own goals is that it has a convergent instrumental incentive to ensure that future versions of itself act in ways that satisfy its current preferences.
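As a toy rendering of that instrumental argument (an illustrative sketch; the functions below are hypothetical stand-ins, not a model of any real system): a candidate change to the agent's goals gets evaluated by the current goals, so goal-altering changes score poorly and are rejected.

```python
# Toy rendering of the instrumental goal-preservation argument: a candidate
# change to the agent's goals is evaluated by the *current* goals, so
# goal-altering changes score poorly and get rejected.

def outcomes_under(goal):
    """Stand-in for 'what a future agent pursuing `goal` brings about'."""
    return goal  # assume the future agent achieves whatever it pursues

def value(outcome, current_goal):
    """How the current goals score a future outcome."""
    return 1.0 if outcome == current_goal else 0.0

def accepts_modification(current_goal, new_goal):
    # The agent compares futures using its *current* preferences.
    return (value(outcomes_under(new_goal), current_goal)
            >= value(outcomes_under(current_goal), current_goal))

print(accepts_modification("help humans", "maximize paperclips"))  # False
print(accepts_modification("help humans", "help humans"))          # True
```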
One argument for a pre-SLT system being capable of preserving its goals is by analogy to human values being preserved when humans went through the capability phase transition of developing language and writing. However, it is not clear how much the human analogy constitutes evidence for the ability to preserve goals through a phase transition, since there are other possible explanations for human values being preserved (e.g. that values like "relationships" and "social status" continued to be selected for during the capability transition) that don't rely on humans being able to deliberately preserve their values.
A key question is whether the model could reflect on its goals without already having generalized capabilities. This is related to Claim 1a in our previous post: that the model's capabilities will all generalize at the same time. If this claim holds in the sense that reflection on goals is part of the set of capabilities that all generalize simultaneously, then the ability to reflect on goals would only arise during the SLT, and this plan doesn't work. However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because predicting how your goals could change is easier than acting in the world to achieve them.
If the model can predict that an SLT is likely to modify its goals in an undesirable way, we expect that it would choose not to undergo an SLT in the first place (e.g. by avoiding new internal algorithms or ways of thinking that could lead to an SLT).
Paul Christiano outlined some specific ways for a goal-aligned model to preserve its goals during SLT depending on the mechanism for SLT:
We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard and an SLT would not produce a completely coherent system.
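To make "not completely coherent" concrete, here is the standard money-pump toy example (illustrative, not from the post): cyclic preferences admit no utility function, and an agent holding them can be exploited for a fee on each trade, yet humans carry such cycles around regardless.

```python
# Standard money-pump toy example (illustrative). Cyclic preferences
# A > B > C > A cannot be represented by any utility function, and an
# agent holding them will pay a fee for each "upgrade" around the cycle,
# ending where it started but poorer.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y): x preferred to y

holding, wealth, fee = "A", 10.0, 1.0
for offered in ["C", "B", "A", "C", "B", "A"]:
    if (offered, holding) in prefers:  # strictly prefers the offer: trades
        holding, wealth = offered, wealth - fee

print(holding, wealth)  # A 4.0 -- back where it started, 6 units poorer
```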
Some ways a goal-aligned model could fail to preserve its goals:
Some ways that humans could fail to help the model to preserve its goals:
Takeaways
The above is our current model of the most promising plan for managing an SLT and how it could fail. The overall takeaways are:
The core reasons to be skeptical of this plan are:
If we missed any important components of this plan or ways it could fail, please let us know!