This might be the most valuable article on alignment yet written, IMO. I don't have enough upvotes. I realize this sounds like hyperbole, so let me explain why I think this.
This is so valuable because of the effort you've put into a gears-level model of the AGI at the relevant point. The relevant point is the first time the system has enough intelligence and self-awareness to understand and therefore "lock in" its goals (and around the same point, the intelligence to escape human control if it decides to).
Of course this work builds on a lot of other import...
I really like this general direction of work: suggestions for capabilities that would also help with understanding and controlling network behavior. That would in turn be helpful for real alignment of network-based AGI. Proposing dual-use capabilities advances seems like a way to get alignment ideas actually implemented. That's what I've done in System 2 Alignment, although that's also a prediction about what developers might try for alignment by default.
Whether the approach you outline here would work is an empirical question, but it sounds likely en...
> we're really training LLMs mostly to have a good world model and to follow instructions
I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right?
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.
If that's al...
I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.
There's a lot here. I won't try to respond to all of it right now.
I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.
Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it w...
Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.
This would only prevent scheming to accomplish the consequentialist goals you instructed your AGI to pursue if it were also used to give side-constraints like "don't lie to me" and if lots of time were spent carefully exploring its theories on what its goals mean and how to accomplish them. This ap...
Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally committed it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.
I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to...
To summarize:
Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.
I am also still a bit confused as to why this careful presentation is useful. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-info...
If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.
This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.
Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace,...
What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close).
Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.
This seems productive.
I don't understand your proposal if it doesn't boil down to "do what the principal wants" or "do what the principal says" (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent, and therefore not relatively easy to define or train into an AGI.
This (possible mis)understanding of your corrigibility as "figure out what I want" is why I currently prefer the instruction-following route to corrigibility. I don't want the AGI to guess at what I want any more than necessary. This...
This sounds like we're saying the same thing? My "at all times" is implied and maybe confusing. I'm saying it doesn't guess what the principal will want in the future, it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it's just having the agent's goals be a pointer to the principal's goals.
This formulation occurred to me since talking to you, and it seems like a compact and intuitive formulation of why your notion of corrigibility see...
I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility.
The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to...
I applaud the work; detailed thinking about control efforts is one useful alignment strategy.
I find the assumption of no time-awareness in the AI to be unrealistic. Allowing continuous learning is going to be extremely tempting to anyone developing or deploying proto-AGI. Episodic memory (including across sessions) is fairly easy to implement, and can add capabilities as well as save compute costs. Working out useful knowledge and strategies will probably incur nontrivial compute costs, and throwing away that knowledge after each session will seem very wasteful of resources and will limit usefulness.
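To gesture at what I mean by "fairly easy": a minimal sketch of cross-session episodic memory (my own toy illustration, not any particular system) is just persisting notes from each session and retrieving the most relevant ones at the start of the next.

```python
# Toy sketch of cross-session episodic memory (illustrative only).
# Notes from each session are saved to disk and retrieved later by
# simple word overlap with the current task.

import json
from pathlib import Path

MEMORY_FILE = Path("episodic_memory.json")  # persists across sessions

def save_episode(note: str) -> None:
    """Append a note (e.g. an end-of-session summary) to persistent memory."""
    episodes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    episodes.append(note)
    MEMORY_FILE.write_text(json.dumps(episodes))

def recall(query: str, k: int = 3) -> list[str]:
    """Return the k stored notes sharing the most words with the query."""
    if not MEMORY_FILE.exists():
        return []
    episodes = json.loads(MEMORY_FILE.read_text())
    query_words = set(query.lower().split())
    return sorted(episodes,
                  key=lambda e: len(query_words & set(e.lower().split())),
                  reverse=True)[:k]

# Usage: call save_episode(summary) at the end of a session, and prepend
# recall(current_task) to the context at the start of the next one.
```

A real system would use embeddings rather than word overlap, but the point is that none of this requires new fundamental capabilities.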
I think this is one particularly striking example of a ubiquitous problem in alignment discussions: they get confused when the type of AI we're talking about isn't made clear. People are thinking of different types of AI without explicitly stating this, so they reach different conclusions about alignment. To some extent this is inevitable if we want to avoid advancing capabilities by proposing useful designs for AGI. But we could do better by distinguishing between known broad categories, in partic...
Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model's final output on any given task.
So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense.
This doesn't provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the ...
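For anyone who wants the concrete picture, here's a rough sketch of the kind of truncation check described above (my own illustration; `generate` is a stand-in for whatever model call you're testing, not a real API): elicit a CoT, cut it off partway, force a final answer, and see how often the answer changes.

```python
# Hypothetical sketch of a CoT-truncation sensitivity check.
# `generate` is a placeholder for the model API under test.

def generate(prompt: str) -> str:
    """Placeholder for a model call returning a completion string."""
    raise NotImplementedError

def answer_with_truncated_cot(question: str, keep_fraction: float) -> str:
    # 1. Elicit a full chain of thought.
    cot = generate(f"{question}\nLet's think step by step:")
    # 2. Keep only the first `keep_fraction` of its tokens.
    tokens = cot.split()
    truncated = " ".join(tokens[: int(len(tokens) * keep_fraction)])
    # 3. Force a final answer conditioned on the truncated CoT.
    return generate(f"{question}\nLet's think step by step:\n{truncated}\nFinal answer:")

def truncation_sensitivity(questions: list[str], keep_fraction: float) -> float:
    """Fraction of questions whose answer changes when the CoT is truncated."""
    changed = 0
    for q in questions:
        full = answer_with_truncated_cot(q, keep_fraction=1.0)
        cut = answer_with_truncated_cot(q, keep_fraction=keep_fraction)
        changed += int(full.strip() != cut.strip())
    return changed / max(len(questions), 1)
```

If that changed-answer rate falls toward zero as models scale, the CoT string is doing less and less of the work, which is the pattern described above.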
I think this is insightful and points correctly to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.
Unfortunately, I mostly agree with these positions. I'm afraid a lack of attention to these claims may be making the new wave of alignment thinkers more optimistic than is realistic. I do partially disagree with some of these, and that makes my p(doom) a good bit lower than the MIRI 99%. But it's not enough to make me truly optimistic. My p(doom) is right around the 50% "who knows" mark.
I'l...
I probably should've titled this "the alignment stability problem in artificial neural network AI". There's plenty of work on algorithmic maximizers. But it's a lot trickier if values/goals are encoded in a network's distributed representations of the world.
I also should've cited Alex Turner's Understanding and avoiding value drift. There he makes a strong case that dominant shards will try to prevent value drift from other shards establishing stronger connections to rewards. But that's not quite good enough. Even if it avoids sudden value drift, at least...
Fascinating. I find the core logic totally compelling. LLMs must be narratologists, and narratives include villains and false fronts. The logic on RLHF actually making things worse seems incomplete. But I'm not going to discount the possibility. And I am raising my probabilities on the future being interesting, in a terrible way.
I don't think this goes through. If I have no preference between two things, but I do prefer to not be money-pumped, it doesn't seem like I'm going to trade those things so as to be money-pumped.
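To make that concrete, here's a toy sketch (my own illustration with made-up numbers, not a formal decision-theory model): an agent that's indifferent between A and B but dislikes paying fees simply refuses every step of the would-be pump.

```python
# Toy illustration: an agent with no preference between goods A and B,
# but a preference against losing money. It accepts a trade between
# goods it's indifferent about only when the trade costs it nothing.

def accepts_trade(current: str, offered: str, fee: float,
                  indifferent_pairs: set[frozenset]) -> bool:
    if frozenset({current, offered}) in indifferent_pairs:
        return fee <= 0  # indifferent about the goods, not about the fee
    return False  # (toy model: no strict preferences between distinct goods)

indifferent = {frozenset({"A", "B"})}

# A would-be money pump charges a small fee per swap. Every step is refused,
# so the cycle A -> B -> A at a cumulative loss never gets started.
print(accepts_trade("A", "B", fee=0.01, indifferent_pairs=indifferent))  # False
print(accepts_trade("B", "A", fee=0.01, indifferent_pairs=indifferent))  # False
print(accepts_trade("A", "B", fee=0.0, indifferent_pairs=indifferent))   # True
```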
I am commenting because I think this might be a crucial crux: do smart/rational enough agents always act like maximizers? If not, adequate alignment might be much more feasible than if we need to find exactly the right goal and how to get it into our AGI exactly right.
Human preferences are actually a lot more complex. We value food very highly when hungry and water...
This seems like a valid concern. It seems to apply to other directions in alignment research as well. Any approach can make progress in some directions seem easier, even when those directions ultimately turn out to be dead ends.
Based on that logic, it would seem that having more different approaches should serve as a sort of counterbalance. As we make judgment calls about ease of progress vs. ultimate usefulness, having more options would seem likely to provide better progress in useful directions.
This is also encouraging because OpenAI is making some actual claims about safety procedures. Sure, they could walk it back pretty easily, but it does indicate that, at least as of now, they likely intend to try to maintain a faithful CoT.
You assumed no faithful CoT in What goals will AIs have?, suggesting that you expected OpenAI to give up on it. That's concerning given your familiarity with their culture. Of course they still might easily go that way if there's a substantial alignment tax for maintaining faithful CoT, but this is at least nice to see.
Indeed! This is most of why I'm happy about this -- from internal discussions it's long been speculated (since '23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It's cool that they actually did the experiments to prove it, and in a real-world setting too! But it's even cooler that they published this with a strong recommendation attached.
Now we see how long this recommendation holds up under the pressure of incentives.
Sometime in the next few years probably,...