Conditioning Generative Models for Alignment

Thanks!

I suspect that there's a ton of room to get more detailed here, and some of the claims or conclusions you reach in this post feel too tenuous, barring that more detailed work. I will give some feedback that probably contains some mistakes of my own:

Conditioning is hard, and the picture you paint sometimes doesn't distinguish between conditioning on input data (prompting) and conditioning on latent states of the learned model. But the latter is quite tricky, and if it's necessary, then questions about whether or not this whole approach makes the problem easier might depend on details of how it's implemented, and how well it works.
If you condition on a superhuman, superhelpful AI alignment researcher, is it robust or fragile? That is, do most ways of trying the prompt get more or less the same results, or can you use prompts that seem similar a priori to get radically different advice from the AI? This is an important question, and in many crucial ways it's not a technical question; instead, it's a question about what you consider "plausible a priori prompts," and how you measure success - what sort of advice you actually accept as coming from a good advisor.
Failure at this previous bullet point will give you failure modes that look less anthropomorphic than you might think. I think it's too hasty to imagine the output as generated by a functional simulation of a human ("picked out" by the prompt) that we can treat as generalizing to new contexts (where "superhuman alignment researcher" is a very new context) in human-like ways. Generalization failures might imply a breakdown of human thought processes as a useful model of what's generating the text.
Two relatively minor nitpicks: Generative models are more often trained self-supervised than just plain supervised (sentence 1 of section 1). Also, I think that self-fulfilling prophecies are more a problem when the use case more closely reflects the training case, but because of the disconnect I expect you'll see different issues, especially in the category of ways for mesa-objectives to leak into the output stream.

Jozdien (3y)

Thanks for the feedback!

I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.

I think when I talk about conditioning in the post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
That's a very interesting question, and I think it comes down to the specifics of the model itself. For the most part in this post I'm talking about true generative models (or problems associated with trying to train true generative models) in the sense of models that are powerful enough at modelling the world that they can actually be thought of as depending on the physics prior for most practical purposes. In that theoretical limit, I think it would be robust, if prompts that seem similar to us actually represent similar world states.

For more practical models though (especially when we're trying to get some use out of sooner models), I think our best guess would be extrapolating the robustness of current models. From my (admittedly not very large) experience working with GPT-3, my understanding is that LLMs get less fragile with scale - in other words, that they depend less on stuff like phrasing and process the prompts more "object-level" in some sense as they get more powerful.

If the problem you're pointing to is generally that the textual distribution fails in ways that the reality prior wouldn't given a sufficiently strong context switch - then I agree that's possible. My guess is that this wouldn't be a very hard problem though, mainly because of reasons I briefly mention in the Problems with Outer Alignment section: that the divergence can't be strong enough to have a qualitative difference or we'd have noticed it in current models, and that future models would have the requisite "parts" to simulate (at least a good) alignment researcher, so it becomes a prompt engineering problem. That said, I think it's still a potential problem whose depths we could understand with more extraction work.
Re the self-supervised comment - oops yeah, that's right. I've edited the post, thanks for the correction. I wrote that line mainly to contrast it with RL and emphasize the "it's learning to model a distribution", so I didn't pay too close attention - I'll try to going forward.
Re the self-fulfilling prophecies comment - could you elaborate on that? I'm afraid I don't fully get your argument.

Charlie Steiner (3y)

Re: prompting: So when you talk about "simulating a world," or "describing some property of a world," I interpreted that as conditionalizing on a feature of the AI's latent model of the world, rather than just giving it a prompt like "You are a very smart and human-aligned researcher." This latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models.
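To make the distinction concrete, here is a toy sketch. Every name and number in it is a hypothetical assumption chosen for illustration, not something taken from the post. A hidden "author type" z generates text; a prompt only shifts the posterior over z by Bayes' rule, whereas conditioning on the latent fixes z outright.

```python
# Toy generative model: a hidden "author type" z generates text.
# All names and probabilities below are hypothetical illustration values.

# Prior over who is writing the text.
PRIOR = {"aligned_researcher": 0.01, "fiction_writer": 0.99}

# P(text opens with the prompt | z): fiction writers produce such openings too.
PROMPT_LIKELIHOOD = {"aligned_researcher": 0.5, "fiction_writer": 0.05}

# P(continuation is sound advice | z).
SOUND_ADVICE = {"aligned_researcher": 0.9, "fiction_writer": 0.1}


def posterior_given_prompt(prior, likelihood):
    """Bayes' rule: P(z | prompt) is proportional to P(prompt | z) * P(z)."""
    joint = {z: prior[z] * likelihood[z] for z in prior}
    total = sum(joint.values())
    return {z: p / total for z, p in joint.items()}


def p_sound_given_prompt():
    """Conditioning on input data: average over every author the prompt allows."""
    post = posterior_given_prompt(PRIOR, PROMPT_LIKELIHOOD)
    return sum(post[z] * SOUND_ADVICE[z] for z in post)


def p_sound_given_latent(z):
    """Conditioning on the latent: the author type is fixed directly."""
    return SOUND_ADVICE[z]


if __name__ == "__main__":
    print("prompted:", round(p_sound_given_prompt(), 3))
    print("latent:  ", p_sound_given_latent("aligned_researcher"))
```

With these made-up numbers the prompt leaves most of the posterior mass on the fiction writer, so the prompted probability of sound advice is about 0.17 against 0.9 for the latent-conditioned case. The size of the gap depends entirely on the assumed prior and likelihoods.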

Re: prophecies: I mean that your training procedure doesn't give an AI an incentive to make self-fulfilling prophecies. I think you have a picture where an AI with inner alignment failure might choose outputs that are optimal according to the loss function but lead to bad real-world consequences, and that these outputs would look like self-fulfilling prophecies because that's a way to be accurate while still having degrees of freedom about how to affect the world. I'm saying that the training loss just cares about next-word accuracy, not long term accuracy according to the latent model of the world, and so AI with inner alignment failure might choose outputs that are highly probable according to next word accuracy but lead to bad real-world consequences, and that these outputs would not look like self-fulfilling prophecies.
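For concreteness, and assuming the standard autoregressive log-loss (an assumption on my part; the post does not write the objective out), the training signal over a dataset D of token sequences would look like:

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{x \sim D}\left[\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)\right]
```

Each term scores only the next token given the true prefix from the data. Nothing in the sum rewards the model's own sampled continuations for turning out accurate later, which is the sense in which the loss does not directly incentivize self-fulfilling prophecies.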

Jozdien (3y)

Sorry for the (very) late reply!

I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here? If so, I have a comment there about why I think it might not really be qualitatively different.

I think my picture is slightly different for how self-fulfilling prophecies could occur. For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the outcome you describe), but to a case where it's still just a generative model, but needs some way to resolve the problem of predicting in recursive cases (for example, asking GPT to predict whether the price of a stock would rise or fall). Even for just predicting the next token with high accuracy, it'd need to solve this problem at some point. My prediction is that it's more likely for it to just model this via modelling increasingly low-fidelity versions of itself in a stack, but it's also possible for it to do fixed-point reasoning (like in the Predict-O-Matic story).
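As a rough illustration of the two options, here is a toy sketch. The world-response function and its numbers are hypothetical assumptions, and real models would not expose anything this clean. The "stack" version nests a fixed number of cruder self-models, while the fixed-point version solves directly for a forecast that stays true once announced.

```python
def outcome_prob(announced):
    """Hypothetical world response: chance the stock rises, given the
    forecast that gets announced. The linear form is an assumption."""
    return 0.2 + 0.6 * announced


def stacked_forecast(depth, base_guess=0.0):
    """Depth-limited self-modelling: level k predicts the outcome assuming
    the level k-1 forecast is what gets announced. Level 0 is a crude guess."""
    if depth == 0:
        return base_guess
    return outcome_prob(stacked_forecast(depth - 1, base_guess))


def fixed_point_forecast(lo=0.0, hi=1.0, tol=1e-9):
    """Fixed-point reasoning: bisect on outcome_prob(p) - p to find a
    forecast p that equals the outcome probability it induces."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if outcome_prob(mid) - mid > 0:
            lo = mid  # announced forecast is too low to be self-consistent
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for depth in (1, 3, 30):
        print(depth, stacked_forecast(depth))
    print("fixed point:", fixed_point_forecast())
```

In this toy the response is a contraction with a single fixed point at 0.5, so a deep enough stack converges to the same answer as fixed-point reasoning. The worrying cases are the ones where several forecasts are self-consistent, since choosing among them is a degree of freedom about how to affect the world.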

Evan R. Murphy (3y)

Fascinating work, thanks for this post.

Using smaller generative models as initializations for larger ones.

(The equivalent ELK proposal goes into this strategy in more detail).

Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)

An acceptability predicate for myopia.

Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key.

I think [non-myopic non-optimizer is a coherent concept] - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

(This is a nitpick and I also don't mean to trivialize the inner alignment problem which I am quite worried about! But I did want to make sure I'm not missing anything here and that I'm broadly on the same page as other reasonable folks about expectations/evidence for mesa-optimizers.)

An acceptability predicate for non-agency.
[...]
If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

This seems like a very broad predicate, however. What would it actually look like?

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, then that might suffice for this.

I'm excited. I've explored before how interpretability that can both reliably detect mesa-optimizers and read off their goals would have the potential to solve alignment. But I hadn't considered how reliable mesa-optimization detection alone might be enough, because I wasn't considering generative models in that post. (Even if I had, I wasn't yet aware of some of the clever and powerful ways that generative models could be used for alignment that you describe in this post.)

How would we mechanistically incentivize something like non-agency?

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

Jozdien (3y)

Sorry for the (very) late reply!

Do you have a link to the ELK proposal you're referring to here?

Yep, here. I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers, I think. In that situation, I think mesa-optimizers would be more likely than not because the task is much harder to achieve good performance on (although on further thought I'm holding less strongly to this belief because of ambiguity around distance in model space between optimizers and generative models). I agree that trying to do this right now would probably just result in bad performance.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

I agree that we'll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I'm not very worried about this possibility. Both because the model would have to be simultaneously deceptive in the brief period it's agentic, and because this might not be a very good avenue of attack - it might be very hard to do this in a few timesteps, the world model might forget this, and simulations may operate in a way that only really "agentifies" whatever is being directly observed / amplified.

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, then that might suffice for this.

I agree almost entirely - I was mainly trying to break down the exact capabilities we'd need the interpretability tools to have there. What would detecting mesa-optimizers entail mechanistically, etc.

But I hadn't considered how reliable mesa-optimization detection alone might be enough, because I wasn't considering generative models in that post.

I think this is very promising as a strategy yeah, especially because of the tilt against optimization by default - I think my main worries are getting it to work before RL reaches AGI-level.

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could end up being optimizers because that achieves greater performance past a certain scale like you said, because of different training paths - or even if it's just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we have.

Adam Jermyn (3y)

Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?

This seems important and is not a thing I've thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

If they can detect when they’re in deployment, then they could act in malign ways.

The more I've thought about this one the more I'm not worried about this precise danger.

It would be very strange for a predictive model attempting to draw plausible trajectories through time to simulate trajectories in which agents notice inconsistencies and decide that they're in simulations. Agents can still conclude they're in simulations, but it would be weird for this to be because they noticed inconsistencies in their worlds, because the agents and world are being constructed together as part of a predictive task. Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.

That said, there are very closely related dangers that I am worried about. For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world). This feels like a thing we can make less likely with appropriate prompting, which makes me hope that it may not be too big a problem in practice, but (barring powerful interpretability tools) I don't think we can rule it out.

The way these models deal with self-fulfilling prophecies.

I'm currently pretty worried about this, so was happy to see you thinking about it.

Jozdien (3y)

Sorry for the (very) late reply!

I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

Hmm, I was thinking more of a problem with text available in the training datasets not being representative of the real world we live in (either because it isn't enough information to pick out our world from a universal prior, or because it actually describes a different world better), not whether its capabilities or abstractive reasoning don't help with time-separated prediction.

Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.

I think I'm picturing different reasons for a simulacrum agent to conclude that they're in a simulation than noticing inconsistencies. Some specifics include worlds that are just unlikely enough anthropically (because of a conditional we apply, for example) to push up credence in a simulation hypothesis, or they notice the effects of gradient descent (behavioural characteristics of the world deviating from "normal" behaviour tend to affect the world state), or other channels that may be available by some quirk of the simulation / training process, but I'm not holding to any particular one very strongly. All of which is to say that I agree it'd be weird for them to notice inconsistencies like that.

For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).

Yep, I think this could be a problem, although recent thinking has updated me slightly away from non-observed parts of the simulation having consistent agentic behaviour across time.


If you’re wondering about the possibility of malign simulations (such as an optimizer “god” overseeing the simulation) and are familiar with the ELK report, one way to view this is to think of the proposal of training a sequence of reporters for successively more powerful predictors, where you can use the weights of a simpler predictor to initialize a larger one and keep the difference at each iteration small; simpler predictors finding direct translation easier than human simulators, this could result in direct translation for more complex predictors too. The downside to this solution (if the model uses discrete modes of prediction, then this approach would stunt its ability to update toward those better modes as it becomes larger) has an analogue in generative models of different kinds of world models. As the generative models keep incrementally building their world model from the previous, it’s difficult to update away enough to malign simulations when they have to keep getting good performance on the next example - generative models are myopic.


I’m not sure what these other channels would be, but the agent’s scenario is different enough that I don’t want to rule out the possibility of there being other channels it can get some kind of info from.

