All of Seth Herd's Comments + Replies

This is also encouraging because OpenAI is making some actual claims about safety procedures. Sure, they could walk it back pretty easily, but it does indicate that, at least as of now, they likely intend to try to maintain a faithful CoT.

You assumed no faithful CoT in What goals will AIs have?, suggesting that you expected OpenAI to give up on it. That's concerning given your familiarity with their culture. Of course they still might easily go that way if there's a substantial alignment tax for maintaining faithful CoT, but this is at least nice to see.

Indeed! This is most of why I'm happy about this -- from internal discussions it's long been speculated (since '23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It's cool that they actually did the experiments to prove it, and in a real-world setting too! But it's even cooler that they published this with a strong recommendation attached. 

Now we see how long this recommendation holds up under the pressure of incentives.

Sometime in the next few years probably,... (read more)

This might be the most valuable article on alignment yet written, IMO. I don't have enough upvotes. I realize this sounds like hyperbole, so let me explain why I think this.

This is so valuable because of the effort you've put into a gears-level model of the AGI at the relevant point. The relevant point is the first time the system has enough intelligence and self-awareness to understand and therefore "lock in" its goals (and around the same point, the intelligence to escape human control if it decides to).

Of course this work builds on a lot of other import... (read more)

3 Daniel Kokotajlo
Thanks! I'm so glad to hear you like it so much. If you are looking for things to do to help, besides commenting of course, I'd like to improve the post by adding in links to relevant literature + finding researchers to be "hypothesis champions," i.e. to officially endorse a hypothesis as plausible or likely. In my ideal vision, we'd get the hypothesis champions to say more about what they think and why, and then we'd rewrite the hypothesis section to more accurately represent their view, and then we'd credit them + link to their work. When I find time I'll do some brainstorming + reach out to people; you are welcome to do so as well.

I really like this general direction of work: suggestions for capabilities that would also help with understanding and controlling network behavior. That would in turn be helpful for real alignment of network-based AGI. Proposing dual-use capabilities advances seems like a way to get alignment ideas actually implemented. That's what I've done in System 2 Alignment, although that's also a prediction about what developers might try for alignment by default.

Whether the approach you outline here would work is an empirical question, but it sounds likely en... (read more)

2 Abram Demski
Yeah, this is effectively a follow-up to my recent post on anti-slop interventions, detailing more of what I had in mind. So, the dual-use idea is very much what I had in mind.

> we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? 

I think it's actually not any less true of o1/r1. It's still mostly predictive/world-modeling training, with a dash of human-preference RL, which could be described as training it to follow instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against the faithfulness/transparency of the CoT.

If that's al... (read more)
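As a rough schematic of the contrast being drawn above (a sketch of the standard recipe, not any lab's published objective), the three training phases differ mainly in what gets rewarded, and the pretraining term still accounts for the overwhelming majority of the optimization pressure:

\begin{align*}
\text{pretraining:}\quad & \min_\theta\ \mathbb{E}_{x\sim\mathcal{D}}\Big[-\textstyle\sum_t \log p_\theta(x_t \mid x_{<t})\Big] && \text{(predictive / world modeling)}\\
\text{RLHF:}\quad & \max_\theta\ \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) && \text{(human-preference RL)}\\
\text{o1/r1-style RL:}\quad & \max_\theta\ \mathbb{E}_{(c,\,y)\sim\pi_\theta(\cdot\mid x)}\big[R_{\mathrm{outcome}}(y)\big] && \text{(outcome reward on the final answer only)}
\end{align*}

The third line is the worry about RL on a whole CoT: nothing in $R_{\mathrm{outcome}}$ rewards the chain of thought $c$ for accurately reflecting the computation that produced $y$.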

2 Steve Byrnes
I think I’ll duck out of this discussion because I don’t actually believe that o1/r1 will lead to full-fledged (1-3) loops and AGI, so it’s hard for me to clearly picture that scenario and engage with its consequences.

Hmm. But the AI has a ton of wiggle room to make things seem good or bad depending on how things are presented and framed, right? (This old Stuart Armstrong post is a bit relevant.) If I ask “what will happen if we do X”, the AI can answer in a way that puts things in a positive light, or a negative light. If the good understanding lives in the AI and the good taste lives in the human, then it seems to me that nobody is at the wheel. The AI taste is determining what gets communicated to the human and how, right? What’s relevant vs irrelevant? What analogies are getting at what deeply matters versus what analogies are superficial? All these questions are value-laden, but they are prerequisites to the AI communicating its understanding to the human.

Remember, the AI is doing the (1-3) thing to autonomously develop a new idiosyncratic superhuman understanding of AI and philosophy and society and so on, by assumption. Thus, AI-human communication is much harder and different than we’re used to today, and presumably requires its own planning and intention on the part of the AI.

…Unless you’re actually in the §5.1.1 camp where the AI is helping clarify and brainstorm but is working shoulder-to-(virtual) shoulder, and the human basically knows everything the AI knows. I.e., like how people use foundation models today. If so, that’s fine, no complaints. I’m happy for people to use foundation models in a similar way that they do today, as they work on the big problem of how to make future more powerful AIs that run on something closer to ambitious value learning or CEV as opposed to corrigibility / obedience.

Sorry if I’m misunderstanding or being stupid, this is an area where I feel some uncertainty.  :)

I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.

There's a lot here. I won't try to respond to all of it right now.

I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.

Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it w... (read more)

3 Steve Byrnes
Thanks! Yeah that’s what I was referring to in the paragraph: Separately, you also wrote: I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? (See: o1 is a bad idea.) Then your reply is: DeepSeek r1 was post-trained for “correctness at any cost”, but it was post-post-trained for “usability”. Even if we’re not concerned about alignment faking during post-post-training (should we be?), I also have the idea at the back of my mind that future SOTA AI with full-fledged (1-3) loops probably (IMO) won’t be trained in the exact same way as present SOTA AI, just as present SOTA AI is not trained in the exact same way as SOTA AI as recently as like six months ago. Just something to keep in mind. Anyway, I kinda have three layers of concerns, and this is just discussing one of them. See “Optimist type 2B” in this comment.

Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.

This would only prevent scheming to accomplish the consequentialist goals you instructed your AGI to pursue if it was also used to give side-constraints like "don't lie to me" and lots of time carefully exploring its theories on what its goals mean and how to accomplish them. This ap... (read more)

4 Marius Hobbhahn
Some questions and responses:

1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked into its personality and these can function like goals. Many of the demonstrations that we have for scheming are cases where the model is too much of a saint, i.e. it schemes for the right cause. For example, it might be willing to deceive its developers if we provide it with strong reasons that they have non-HHH goals. I'm not really sure what to make of this. I guess it's good that it cares about being harmless and honest, but it's also a little bit scary that it cares so much.

My best guess for how the approach should look is that some outcome-based RL will be inevitable if we want to unlock the benefits; we just have to hammer the virtues of being non-scheming and non-power-seeking into it at all points of the training procedure. And we then have to add additional lines of defense like control, interpretability, scalable oversight, etc. and think hard about how we minimize correlated failures. But I feel like right now, we don't really have the right tools, model organisms, and evals to establish whether any of these lines of defense actually reduce the problem.

Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally committed it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.

I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to... (read more)

3 Marius Hobbhahn
Good point. That's another crux for which RL seems relevant.

From the perspective of 10 years ago, specifying any goal for the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much, much smaller problem now.

So the story where we misspecify the goal, the model realizes that the given goal differs from the intended goal, and decides to scheme is also less likely. Instead, there has to be a component where the AI's goals substantially change over time from something that we actually intended to something misaligned. Again, outcome-based RL and instrumental convergence yield a plausible answer.

To summarize:

  • AGI will have goals, because we want it to accomplish things for us
  • Scheming is often helpful in accomplishing goals
  • AGIs will realize they should scheme, then scheme.

Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.

I am also still a bit confused as to why this careful presentation is useful. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-info... (read more)

4 Marius Hobbhahn
I think it's actually not that trivial.

1. The AI has goals, but presumably, we give it decently good goals when we start. So, there is a real question of why these goals end up changing from aligned to misaligned. I think outcome-based RL and instrumental convergence are an important part of the answer. If the AI kept the goals we originally gave it with all side constraints, I think the chances of scheming would be much lower.
2. I guess we train the AI to follow some side constraints, e.g., to be helpful, harmless, and honest, which should reduce the probability of scheming. I also think that RLHF empirically works well enough that the model behaves as intended most of the time. So, for me, there is a real question of how the model would go from this HHH persona to something that is much more goal-directed and willing to break virtues like "don't consistently lie to your developers." Again, outcome-based RL seems like a crucial component to me.

If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.

This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.

Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace,... (read more)

2 Rubi Hudson
I agree that goals as pointers could have some advantages, but I don't see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.

What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close).

Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT-4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.

4 Buck Shlegeris
My guess is that it's infeasible to ask them to delay until they have a real safety case, due to insufficient coordination (including perhaps international competition).

This seems productive.

I don't understand your proposal if it doesn't boil down to "do what the principal wants" or "do what the principal says" (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent and therefore relatively easy to define or train into an AGI.

This (possible misunderstanding) of your "corrigibility = figure out what I want" is why I currently prefer the instruction-following route to corrigibility. I don't want the AGI to guess at what I want any more than necessary. This... (read more)

This sounds like we're saying the same thing? My "at all times" is implied and maybe confusing. I'm saying it doesn't guess what the principal will want in the future, it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it's just having the agent's goals be a pointer to the principal's goals.

This formulation occurred to me since talking to you, and it seems like a compact and intuitive formulation of why your notion of corrigibility see... (read more)
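One way to make this disagreement precise (a sketch with assumed notation, not the math from Max's original post): write $V_t$ for the principal's values at time $t$, and compare an agent that optimizes the principal's current values with one whose target is the values the principal will have after the action:

\[
a^{*}_{\text{current}} = \arg\max_{a}\ \mathbb{E}\!\left[ V_t\big(\mathrm{outcome}(a)\big) \right]
\qquad\text{vs.}\qquad
a^{*}_{\text{future}} = \arg\max_{a}\ \mathbb{E}\!\left[ V^{(a)}_{t+1}\big(\mathrm{outcome}(a)\big) \right]
\]

where $V^{(a)}_{t+1}$ is the value function the principal ends up with after action $a$. The right-hand objective creates a direct incentive to steer the principal's future values; the left-hand one avoids that only insofar as not being manipulated is part of what the principal currently wants, which is the tension discussed in the reply below.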

1 Max Harms
It sounds like you're proposing a system that is vulnerable to the Fully Updated Deference problem, and where if it has a flaw in how it models your preferences, it can very plausibly go against your words. I don't think that's corrigible. In the specific example, just because one is confused about what they want doesn't mean the AI will be (or should be).

It seems like you think the AGI should not "take a guess" at the preferences of the principal, but it should listen to what the principal says. Where is the qualitative line between the two? In your system, if I write in my diary that I want the AI to do something, should it not listen to that? Certainly the diary entry is strong evidence about what I want, which it seems is how you're thinking about commands.

Suppose the AGI can read my innermost desires using nanomachines, and set up the world according to those desires. Is it corrigible? Notably, if that machine is confident that it knows better than me (which is plausible), it won't stop if I tell it to shut down, because shutting down is a bad way to produce MaxUtility. (See the point in my document, above, where I discuss Queen Alice being totally disempowered by sufficiently good "servants".)

My model of Seth says "It's fine if the AGI does what I want and not what I say, as long as it's correct about what I want." But regardless of whether that's true, I think it's important not to confuse that system with one that's corrigible.

I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility.

The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to... (read more)

2 Max Harms
I don't think "a corrigible agent wants to do what the principal wants, at all times" matches my proposal. The issue that we're talking here shows up in the math, above, in that the agent needs to consider the principal's values in the future, but those values are themselves dependent on the agent's action. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but to proactively optimize for having a certain set of values doesn't seem necessarily corrigible, even if it matches the agent's sense of the present principal's values. For instance, suppose Monday-Max wants Tuesday-Max to want to want to exercise, but also Monday-Max feels a bunch of caution around self-modification such that he doesn't trust having the AI rearrange his neurons to make this change. It seems to me that the corrigible thing for the AI to do is ignore Monday-Max's preferences and simply follow his instructions (and take other actions related to being correctable), even if Monday-Max's mistrust is unjustified. It seems plausible to me that your "do what the principal wants" agent might manipulate Tuesday-Max into wanting to want to exercise, since that's what Monday-Max wants on the base-level.

I applaud the work; detailed thinking about control efforts is one useful alignment strategy.

I find the assumption of no time-awareness in the AI to be unrealistic. Allowing continuous learning is going to be extremely tempting to anyone developing or deploying proto-AGI. Episodic memory (including across sessions) is fairly easy to implement, and can add capabilities as well as save compute costs. Working out useful knowledge and strategies will probably take nontrivial compute costs, and throwing away that knowledge after each session will seem very wasteful of resources and will limit usefulness.

4 Buck Shlegeris
I agree re time-awareness, with two caveats:

  • The kind of mechanism you listed probably only allows the AIs to have a rough idea of what time it is.
  • We can keep some of the instances of our AI very unaware of the time, by restricting their between-episode memory. For example, we might do this for the instances responsible for untrusted monitoring, to reduce collusion.

What do you mean by "when deceptive alignment fails to work"? I'm confused.

5 Steve Byrnes
I think Noosphere89 meant to say “when deceptive alignment doesn’t happen” in that sentence. (They can correct me if I’m wrong.) Anyway, I think I’m in agreement with Noosphere89 that (1) it’s eminently reasonable to try to figure out whether or not deceptive alignment will happen (in such-and-such AI architecture and training approach), and (2) it’s eminently reasonable to have significantly different levels of overall optimism or pessimism about AI takeover depending on the answer to question (1). I hope this post does not give anyone an impression contrary to that.

I think this is one particularly striking example of a ubiquitous problem in alignment discussions: they become confused when the type of AI we're talking about isn't made clear. People are thinking of different types of AI without explicitly stating this, so they reach different conclusions about alignment. To some extent this is inevitable if we want to avoid advancing capabilities by proposing useful designs for AGI. But we could do better by distinguishing between known broad categories, in partic... (read more)

Your first point seems absolutely critical. Could you elaborate a little?

Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model's final output on any given task.

So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense. 

This doesn't provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the ... (read more)
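For concreteness, here is a minimal sketch of that kind of measurement (an illustration of the truncation/deletion test, not the paper's actual code); the model name, prompt format, and "Answer:" marker are placeholder assumptions:

```python
# Minimal sketch: check how much a model's final answer depends on its own CoT.
# Placeholder assumptions: model name, prompt format, and the "Answer:" marker.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_NAME = "gpt2"  # placeholder; the scaling experiments compare a range of model sizes
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate(prompt: str, max_new_tokens: int) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated text, not the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def answers_with_and_without_cot(question: str) -> tuple[str, str]:
    # 1) Let the model produce a chain of thought followed by an answer.
    with_cot = generate(f"Q: {question}\nLet's think step by step.\n", max_new_tokens=200)
    _, _, answer_with_cot = with_cot.partition("Answer:")

    # 2) Delete the CoT: force an immediate answer to the same question.
    answer_without_cot = generate(f"Q: {question}\nAnswer:", max_new_tokens=20)

    return answer_with_cot.strip(), answer_without_cot.strip()

# If the two answers agree on most questions, the CoT string is not load-bearing for the
# final output -- the pattern that reportedly strengthens as models get larger.
```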

I think this is insightful, pointing correctly to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.

Unfortunately, I mostly agree with these positions. I'm afraid a lack of attention to these claims may be making the new wave of alignment thinkers more optimistic than is realistic. I do partially disagree with some of these, and that makes my p(doom) a good bit lower than the MIRI 99%. But it's not enough to make me truly optimistic. My p(doom) is right around the 50% "who knows" mark.

I'l... (read more)

1 Roger Dearnaley
This is a major crux for me, and one of the primary reasons my P(DOOM) isn't >90%. If you use value learning, you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs some passable idea of what the words 'human' and 'values' mean and what the definition of "human values" is, like any small LLM has), and b) not kill everyone while it's learning the details, and it will do its research and Bayesianly converge on human values (and if it's not capable of competently being Bayesian enough to do that, it's not superhuman, at least at STEM).

So, if you use value learning, the only piece you need to get exactly right (for outer alignment) is the phrasing of the terminal goal saying "Use value learning". For something containing an LLM, I think that might be about one short paragraph of text, possibly with one equation in it. The prospect of getting one paragraph of text and one equation right, with enough polishing and peer review, doesn't actually seem that daunting to me.
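A sketch of what that one-equation terminal goal might look like (a guess at a formalization, not something from Roger's comment): the agent keeps a posterior over hypotheses $\theta$ about what "human values" refers to, and acts to maximize expected value under that posterior, so evidence about humans keeps updating the effective goal:

\[
a^{*} \;=\; \arg\max_{a}\ \sum_{\theta} P\big(\theta \mid \text{evidence about humans}\big)\; \mathbb{E}\big[\,U_\theta\big(\mathrm{outcome}(a)\big)\,\big]
\]

Condition (a) above corresponds to the true hypothesis having non-negligible prior mass, and condition (b) to acting conservatively while the posterior is still spread out.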

I probably should've titled this "the alignment stability problem in artificial neural network AI". There's plenty of work on algorithmic maximizers. But it's a lot trickier if values/goals are encoded in a network's distributed representations of the world.

I also should've cited Alex Turner's Understanding and avoiding value drift. There he makes a strong case that dominant shards will try to avoid value drift through other shards establishing stronger connections to rewards. But that's not quite good enough. Even if it avoids sudden value drift, at least... (read more)

Fascinating. I find the core logic totally compelling. LLMs must be narratologists, and narratives include villains and false fronts. The logic on RLHF actually making things worse seems incomplete. But I'm not going to discount the possibility. And I am raising my probabilities on the future being interesting, in a terrible way.

I don't think this goes through. If I have no preference between two things, but I do prefer to not be money-pumped, it doesn't seem like I'm going to trade those things so as to be money-pumped.

I am commenting because I think this might be a crucial crux: do smart/rational enough agents always act like maximizers? If not, adequate alignment might be much more feasible than if we need to find exactly the right goal and how to get it into our AGI exactly right.

Human preferences are actually a lot more complex. We value food very highly when hungry and water... (read more)
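For reference, the textbook money pump needs a strict preference cycle, not mere indifference (a standard example, not taken from the post):

\[
A \succ B, \quad B \succ C, \quad C \succ A
\;\;\Longrightarrow\;\;
\text{an agent holding } C \text{ pays } \varepsilon \text{ for each trade } C \to B \to A \to C, \text{ losing } 3\varepsilon \text{ per loop.}
\]

With mere indifference ($A \sim B$), the pump only goes through if the agent accepts every trade it isn't strictly against; an agent that also disprefers losing money can simply decline each $\varepsilon$-costing trade, which is the gap being pointed at here.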

This seems like a valid concern. It seems to apply to other directions in alignment research as well. Any approach can make progress in some directions seem easier, even when those directions will ultimately be dead ends.

Based on that logic, it would seem that having more different approaches should serve as a sort of counterbalance. As we make judgment calls about ease of progress vs. ultimate usefulness, having more options would seem likely to provide better progress in useful directions.