Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. Now I'm applying what I've learned to the study of AI alignment. 

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
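To make that concrete, here is a minimal sketch of what I mean by a language model cognitive architecture: a language model core, an episodic memory it can store to and retrieve from, and a simple executive loop that plans, acts, and checks progress. The `call_llm` stub and the toy retrieval scoring are placeholder assumptions for illustration, not any particular system's API.

```python
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Stand-in for a call to any language model API (hypothetical)."""
    return f"[model response to: {prompt[:40]}...]"


@dataclass
class EpisodicMemory:
    """Stores past episodes and retrieves the ones most relevant to the current task."""
    episodes: list = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> list:
        # Toy relevance score: count words shared with the query.
        scored = sorted(
            self.episodes,
            key=lambda e: len(set(e.lower().split()) & set(query.lower().split())),
            reverse=True,
        )
        return scored[:k]


@dataclass
class ExecutiveLoop:
    """Executive function: hold a goal, plan subtasks, act, and monitor progress."""
    memory: EpisodicMemory

    def run(self, goal: str, max_steps: int = 3) -> None:
        for step in range(max_steps):
            context = "\n".join(self.memory.retrieve(goal))
            plan = call_llm(f"Goal: {goal}\nRelevant memories:\n{context}\nNext subtask?")
            result = call_llm(f"Carry out subtask: {plan}")
            self.memory.store(f"step {step}: {plan} -> {result}")
            done = call_llm(f"Goal: {goal}\nLatest result: {result}\nIs the goal met? yes/no")
            if done.strip().lower().startswith("yes"):
                break


if __name__ == "__main__":
    agent = ExecutiveLoop(memory=EpisodicMemory())
    agent.run("summarize this week's experiment logs")
```

The point isn't that this particular loop is hard to build; it's that the missing pieces are this kind of scaffolding, not new fundamental capabilities.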

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are.  Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll do the obvious thing: design it to follow instructions. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done.  An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex a question to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments


> we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? 

I think it's actually not any less true of o1/r1. It's still mostly predictive/world-modeling training, with a dash of human-preference RL that could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against the faithfulness/transparency of the CoT.

If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.

But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too.

Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them.

I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work. 


WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment.

I'm not sure I'm more optimistic than you, but I am quite uncertain about how well the likely (low but not zero effort/thought) methods of aligning network-based AGI might go. I think others should be more uncertain as well. That some people are certain of doom while others with real expertise think it's probably going to be fine should be a signal that we do not have this worked through yet.

That's why I like this post and similar attempts to resolve optimist/pessimist disagreements so much.

I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.

There's a lot here. I won't try to respond to all of it right now.

I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.

Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it will generalize relatively well.

When alignment is based on a relatively simple reward model based on simple sensory representations, it won't generalize very well. That's the case with humans. The reward model runs on sensory representations (it has to, so that it can be specified within the limited information capacity of DNA, as you and others have discussed).

Alignment generalizes farther than capabilities in well-educated, carefully considered modern humans because our goals are formulated in terms of concepts. (There are still ways this could go awry, but I think most modern humans would generalize their goals well and lead us into a spectacular future if they were in charge of it).

This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible. If we could use natural language (or another route to conceptual representations) to specify an AI's goals, it seems like that would produce better generalization than just trying to train stuff in with RL to produce behavior we like in the training environment. 

One method of "conceptual alignment" is the variant of your Plan for mediocre alignment of brain-like [model-based RL] AGI in which you more or less say to a trained AI "hey think about human flourishing" and then set the critic system's weights to maximum. Another is alignment-by-prompting for LLM-based agents; I discuss that in Internal independent review for language model agent alignment. I'm less optimistic now than when I wrote that, given the progress made in training vs. scripting for better metacognition - but I'm not giving up on it just yet.

Tan Zhi Xuan makes the point in this interview that we're really training LLMs mostly to have a good world model and to follow instructions, similar to Andrej Karpathy's point that RLHF is just barely RL. It's similar with RLAIF and the reward models training R1 for usability, after the pure RL on verifiable answers. So we're still training models to have good world models and follow instructions. Played wisely, it seems like that could produce aligned LLM agents (should that route reach "real AGI").
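As one concrete illustration of the alignment-by-prompting plus internal-review idea, here's a toy sketch: the agent's goals and side constraints live in a natural-language charter in the prompt, and a separate model call audits each proposed action against that charter before it runs. The charter text, prompts, and function names are my illustrative assumptions here, not the actual design from that post.

```python
from typing import Optional


def call_llm(prompt: str) -> str:
    """Stand-in for any language model API call (hypothetical)."""
    return "APPROVE"  # placeholder so the sketch runs end to end


CHARTER = (
    "Follow the principal's instructions, don't deceive anyone, and check "
    "with the principal before taking irreversible actions."
)


def propose_action(task: str) -> str:
    # The agent's goals live in natural language, in the prompt itself.
    return call_llm(f"Charter: {CHARTER}\nTask: {task}\nPropose the next action.")


def independent_review(task: str, action: str) -> bool:
    # A separate model call, prompted only to audit the action against the charter.
    verdict = call_llm(
        "You are an independent reviewer. Answer APPROVE or REJECT.\n"
        f"Charter: {CHARTER}\nTask: {task}\nProposed action: {action}"
    )
    return verdict.strip().upper().startswith("APPROVE")


def run_step(task: str) -> Optional[str]:
    action = propose_action(task)
    if independent_review(task, action):
        return action  # in a real agent, the action would execute here
    return None        # rejected: escalate to the human rather than acting
```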
 

That's a new formulation of an old thought, prompted by your framing, which pits the argument that capabilities generalize farther than alignment (for evolution making humans) against the argument that alignment generalizes farther than capabilities (for modern humans given access to new technologies/capabilities).

The alternative is trying to get an RL system to "gel" into a concept-based alignment we like. This happens with a lot of humans, but that's a pretty specific set of innate drives (simple reward models) and environment. If we monitored and nudged the system closely, that might work too.

Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.

This would only prevent scheming to accomplish the consequentialist goals you've instructed your AGI to pursue if instruction-following were also used to give side constraints like "don't lie to me" and to have it spend lots of time carefully exploring its theories of what its goals mean and how to accomplish them. This approach seems likely to work - but I want to hear more pushback on it before I'd trust it in practice.
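Here's a bare-bones sketch (in Python, with purely illustrative names) of the goal structure I have in mind: the only terminal goal is following the principal's instructions, instructed outcomes become consequentialist subgoals, and standing instructions act as side constraints that can veto a plan.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InstructionFollowingAgent:
    """Terminal goal: follow the principal's instructions (non-consequentialist).
    Instructed outcomes become consequentialist subgoals; standing instructions
    like "don't lie to me" act as side constraints on every plan."""
    side_constraints: List[str] = field(default_factory=list)
    subgoals: List[str] = field(default_factory=list)

    def receive_instruction(self, instruction: str, standing: bool = False) -> None:
        if standing:
            self.side_constraints.append(instruction)
        else:
            self.subgoals.append(instruction)

    def adopt_plan(self, plan: str, violates: Callable[[str, str], bool]) -> bool:
        # A plan for any subgoal is adopted only if no standing constraint rules it
        # out; otherwise the agent refuses and checks back with the principal.
        return not any(violates(plan, c) for c in self.side_constraints)


# Toy usage: a crude keyword check stands in for real constraint evaluation.
agent = InstructionFollowingAgent()
agent.receive_instruction("don't lie to me", standing=True)
agent.receive_instruction("cure cancer")
ok = agent.adopt_plan(
    plan="lie to the lab about trial results to speed things up",
    violates=lambda plan, constraint: "lie" in plan and "lie" in constraint,
)
print(ok)  # False: the side constraint vetoes the plan
```

Obviously the hard part is everything hidden inside `violates` and how the goals are actually learned; the sketch only pins down the shape of the proposal.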

I think this is not only an interesting dodge around this class of alignment concerns, but it's the most likely thing to actually be implemented. When someone is actually getting close to launching a system they hope is or will become smarter than they are, they'll think a little harder about making its central goal "solve cancer" or anything else broad and consequentialist. The natural choice is to just extend what LLMs are mostly aligned for now: following instructions, including consequentialist instructions.

This logic is all laid out in more detail in Instruction-following AGI is easier and more likely than value aligned AGI, but I didn't specifically address scheming there.

Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally submitted it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.

I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal.

I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AI's decision-making procedures, particularly the ones surrounding learning.

To summarize:

  • AGI will have goals, because we want it to accomplish things for us
  • Scheming is often helpful in accomplishing goals
  • AGIs will realize they should scheme, then scheme.

Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.

I am also still a bit confused as to why this careful presentation is needed. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-informed people who say things like "there's no empirical evidence for scheming in AIs" in all seriousness. So I'd like to understand that perspective better.

While I don't fully understand the perspective that needs to be convinced that scheming is likely, I do have some guesses. I think on the whole it stems from understanding current AI systems well, and reasoning from there. Current systems do not really scheme; they lack the capacity. Those who reason by analogy with humans or with fictional or hypothetical generally superintelligent AI see scheming as extremely likely from a misaligned AGI, because they're assuming it will have all the necessary cognitive capacities.

There are more nuanced views, but I think those are the two starting points that generate this dramatic difference in opinions.

Some more specific common cruxes of disagreement on scheming likelihood:

  • Will AGI have consequentialist goals?
  • Will AGI be able to do novel reasoning?
  • Will AGI be able to do long-term planning?
  • Will AGI be able to learn continuously from experience?

I see the answers to all of these questions as being obviously, inevitably yes by default; all of these are useful, so we will keep building toward AGI with all of these capacities if nothing stops us. Having extremely useful transformative limited AGI (like super-foundation models) would not stop us from building "real AGI" with the above properties.

I've tried to convey why those properties seem so inevitable (and actually rather easy to add from here) in real AGI, Steering subsystems: capabilities, agency, and alignment, and Sapience, understanding, and "AGI", among snippets in other places. I'm afraid none of them is as clear or compelling as I'd like from the perspective of someone who starts reasoning from current AI and asks why or how we would include those dangerous properties in our future AGIs.

That's why I'm glad you guys are taking a crack at it in a more careful and expansive way, and from the perspective of how little we'd need to add to current systems to make them solve important problems, and how that gives rise to scheming. I'll be referencing this post on this point.

Edit note: Most of this was written after an accidental premature submit ctrl-return action.

If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.

This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.

Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace, and hope I conveyed the proper intuition. Or I can design an AGI with the goal of doing what I tell it to do (or maybe "what I want"), then telling it to achieve world peace and giving it the definition. In the second case, its goal is similar but not identical to the first. It's a pointer to that goal instead of a direct definition. Thus, if the goal changes in my head (or my instructions in that variant), the pointer's value changes. In the first direct definition case, it does not.
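In programming terms, the difference is between copying the goal into the agent at build time and holding a reference that gets dereferenced at decision time. A toy sketch, with class names that are purely illustrative:

```python
from dataclasses import dataclass


@dataclass
class Principal:
    current_goal: str   # what the human actually wants right now


@dataclass
class DirectGoalAgent:
    goal: str           # a copy of the goal, fixed at build time

    def objective(self) -> str:
        return self.goal


@dataclass
class PointerGoalAgent:
    principal: Principal   # a reference, dereferenced at decision time

    def objective(self) -> str:
        return self.principal.current_goal


human = Principal(current_goal="achieve world peace (my current definition)")
direct = DirectGoalAgent(goal=human.current_goal)
pointer = PointerGoalAgent(principal=human)

human.current_goal = "achieve world peace (my revised definition)"
print(direct.objective())   # still the old definition: the copy did not change
print(pointer.objective())  # tracks the principal's updated goal
```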

Goals as pointers is not something any evolved system really has, but it seems logically consistent and to have major advantages for alignment. More on that in my instruction-following AGI... post.

What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close).

Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.

This seems productive.

I don't understand your proposal if it doesn't boil down to "do what the principal wants" or "do what the principal says" (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent and therefore relatively easy to define or train into an AGI.

This (possible mis)understanding of your "corrigibility = figure out what I want" is why I currently prefer the instruction-following route to corrigibility. I don't want the AGI to guess at what I want any more than necessary. This has downsides, too; back to those at the end.

I do think what your model of me says, but I think it's only narrowly true and probably not very useful:

> It's fine if the AGI does what I want and not what I say, as long as it's correct about what I want.

I think this is true for exactly the right definition of "what I want", but conveying that to an AGI is nontrivial, and re-introduces the difficulty of value learning. That's mixed with the danger that it's incorrect about what I want. That is, it could be right about what I want in one sense, but not the sense I wanted to convey to it (e.g., it decides I'd really rather be put into an experience machine where I'm the celebrated hero of the world, rather than make the real world good for everyone as I'd hoped).

Maybe I've misunderstood your thesis, but I did read it pretty carefully, so there might be something to learn from how I've misunderstood. All of your examples I remember correspond to "doing what the principal wants" by a pretty common interpretation of that phrase.

Instruction-following puts a lot of the difficulty back on the human(s) in charge. This is potentially very bad, but I think humans will probably choose this route anyway. You've pointed out some ways that following instructions could be a danger (although I think your genie examples aren't the most relevant for a modest takeoff speed). But I think unless something changes, humans are likely to prefer keeping the power and the responsibility to trying to put more of the project into the AGI's alignment. That's another reason I'm spending my time thinking through this route to corrigibility instead of the one you propose.

Although again, I might be missing something about your scheme.

I just went back and reread 2. Corrigibility Intuition (after writing the above, which I won't try to revise). Everything there still looks like a flavor of "do what I want". My model of Max says "corrigibility is more like 'do your best to be correctable'". It seems like correctable means correctable toward what the principal wants. So I wonder if your formulation reduces to "do what I want, with an emphasis on following instructions and being aware that you might be wrong about what I want". That sounds very much like the Do What I Mean And Check formulation of my instruction-following approach to corrigibility.

Thanks for engaging. I think this is productive.

Just to pop back to the top level briefly, I'm focusing on instruction-following because I think it will work well and be the more likely pick for a nascent language-model agent AGI, from below human level to somewhat above it. If RL is heavily involved in creating that agent, that might shift the balance and make your form of corrigibility more attractive (and still vastly more attractive than attempting value alignment in any broader way). I think working through both of these is worthwhile, because those are the two most likely forms of first AGI, and the two most likely actual alignment targets. 

I definitely haven't wrapped my head around all of the pitfalls with either method, but I continue to think that this type of alignment target makes good outcomes much more likely, at least as far as we've gotten with the analysis so far.

I think this type of alignment target is also important because the strongest and most used arguments for alignment difficulty don't apply to them. So when we're debating slowing down AGI, proponents of going forward will be talking about these approaches. If the alignment community hasn't thought through them carefully, there will be no valid counterargument. I'd still prefer that we slow AGI even though I think these methods give us a decent chance of succeeding at technical alignment. So that's one more reason I find this topic worthwhile.

This has gotten pretty discursive, so don't worry about responding to all of it.

This sounds like we're saying the same thing? My "at all times" is implied and maybe confusing. I'm saying it doesn't guess what the principal will want in the future, it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it's just having the agent's goals be a pointer to the principal's goals.

This formulation occurred to me since talking to you, and it seems like a compact and intuitive formulation of why your notion of corrigibility seems coherent and simple.

Edit: to address your example, I both want and don't-want to be manipulated into wanting to exercise next week. It's confusing for me, so it should be confusing for my corrigible AGI. It should ask me to clarify when and how I want to be manipulated, rather than taking a guess when I don't know the answer. I probably haven't thought about it deeply, and overall it's pretty important to accurately doing what I want, so a good corrigible helper will suggest I spend some time clarifying for it and for myself. This is a point where things could go wrong if it takes bad guesses instead of getting clarification, but there are lots of those.

I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility.

The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to figure out what balance their principal wants, and do that.

I was going to say that my instruction-following variant of corrigibility might be better for working out that balance, but it actually seems pretty straightforward in Max's pure corrigibility version, now that I've written out the above.
