Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades; I studied complex human thought. Now I'm applying what I've learned to the study of AI alignment. 

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way, possibly much to our chagrin. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could actually accomplish that for all of humanity. So I focus on finding alignment solutions.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacked suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by giving the system the central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I've focused on the emergent interactions that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
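
To make that concrete, here is a minimal sketch (not any real system's code) of what I mean by a language model cognitive architecture: an executive loop that wraps an LLM call with an episodic memory store. The call_llm stub, the keyword-overlap retrieval, and the loop structure are all illustrative placeholders.

```python
# Purely illustrative sketch of a "language model cognitive architecture":
# an LLM wrapped with episodic memory and a crude executive loop.
# call_llm() is a hypothetical stand-in for any chat-model API.

from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return f"(model response to: {prompt[:40]}...)"


class EpisodicMemory:
    """Naive episodic memory: store text episodes, retrieve by keyword overlap."""

    def __init__(self) -> None:
        self.episodes: List[str] = []

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(words & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]


def executive_loop(goal: str, memory: EpisodicMemory, max_steps: int = 5) -> str:
    """Crude executive function: plan a step, act via the LLM, remember the result."""
    result = ""
    for step in range(max_steps):
        context = "\n".join(memory.retrieve(goal))
        plan = call_llm(f"Goal: {goal}\nRelevant past episodes:\n{context}\nNext step?")
        result = call_llm(f"Carry out this step and report the result: {plan}")
        memory.store(f"Step {step}: {plan} -> {result}")
        if "DONE" in result:
            break
    return result
```

A real system would replace the keyword retrieval with embeddings and the loop with something much richer, but even this skeleton has the two pieces current LLMs lack on their own: memory for their experiences and goal-directed control over their own next steps.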

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll do the obvious thing: design it to follow instructions. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to attract enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: the question of our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments


If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.

This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.

Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace, and hope I conveyed the proper intuition. Or I can design an AGI with the goal of doing what I tell it to do (or maybe "what I want"), then tell it to achieve world peace and give it the definition. In the second case, its goal is similar but not identical to the first. It's a pointer to that goal instead of a direct definition. Thus, if the goal changes in my head (or my instructions, in that variant), the pointer's value changes. In the first, direct-definition case, it does not.
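
A toy sketch of the distinction (class and method names are hypothetical, chosen only to make the contrast concrete): one agent copies the goal at build time, the other holds a pointer to the principal's current goal.

```python
# Toy illustration: a goal held as a fixed definition vs. a goal that is a
# pointer to the principal's (possibly changing) goal.

class Principal:
    def __init__(self, goal: str) -> None:
        self.goal = goal


class DirectGoalAgent:
    """Takes a snapshot of the goal; later changes in the principal don't reach it."""

    def __init__(self, principal: Principal) -> None:
        self.goal = principal.goal  # copied once, a direct definition

    def current_goal(self) -> str:
        return self.goal


class PointerGoalAgent:
    """Its goal is 'whatever the principal's goal currently is'."""

    def __init__(self, principal: Principal) -> None:
        self.principal = principal  # keeps the pointer, not a copy

    def current_goal(self) -> str:
        return self.principal.goal


me = Principal("achieve world peace (definition v1)")
direct, pointer = DirectGoalAgent(me), PointerGoalAgent(me)

me.goal = "achieve world peace (definition v2, revised)"
print(direct.current_goal())   # still definition v1
print(pointer.current_goal())  # tracks the revision
```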

Goals as pointers is not something any evolved system really has, but it seems logically consistent and to have major advantages for alignment. More on that in my instruction-following AGI... post.

What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close).

Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.

This seems productive.

I don't understand your proposal if it doesn't boil down to "do what the principal wants" or "do what the principal says" (correctly interpreted and/or carefully verified). This makes me worry that what you have in mind is not that simple and coherent, and therefore not relatively easy to define or train into an AGI.

This (possibly mistaken) understanding of your corrigibility as "figure out what I want" is why I currently prefer the instruction-following route to corrigibility. I don't want the AGI to guess at what I want any more than necessary. This has downsides, too; back to those at the end.

I do think roughly what your model of me says, but I think it's only narrowly true and probably not very useful:

It's fine if the AGI does what I want and not what I say, as long as it's correct about what I want.

I think this is true for exactly the right definition of "what I want", but conveying that to an AGI is nontrivial, and re-introduces the difficulty of value learning. That's mixed with the danger that it's incorrect about what I want. That is, it could be right about what I want in one sense, but not the sense I wanted to convey to it (e.g., it decides I'd really rather be put into an experience machine where I'm the celebrated hero of the world, rather than make the real world good for everyone as I'd hoped).

Maybe I've misunderstood your thesis, but I did read it pretty carefully, so there might be something to learn from how I've misunderstood. All of your examples I remember correspond to "doing what the principal wants" by a pretty common interpretation of that phrase.

Instruction-following puts a lot of the difficulty back on the human(s) in charge. This is potentially very bad, but I think humans will probably choose this route anyway. You've pointed out some ways that following instructions could be a danger (although I think your genie examples aren't the most relevant for a modest takeoff speed). But I think unless something changes, humans are likely to prefer keeping the power and the responsibility over trying to put more of the project into the AGI's alignment. That's another reason I'm spending my time thinking through this route to corrigibility instead of the one you propose.

Although again, I might be missing something about your scheme.

I just went back and reread 2. Corrigibility Intuition (after writing the above, which I won't try to revise). Everything there still looks like a flavor of "do what I want". My model of Max says "corrigibility is more like 'do your best to be correctable'". It seems like correctable means correctable toward what the principal wants. So I wonder if your formulation reduces to "do what I want, with an emphasis on following instructions and being aware that you might be wrong about what I want". That sounds very much like the Do What I Mean And Check formulation of my instruction-following approach to corrigibility.

Thanks for engaging. I think this is productive.

Just to pop back to the top level briefly, I'm focusing on instruction-following because I think it will work well and be the more likely pick for a nascent language-model agent AGI, from below human level to somewhat above it. If RL is heavily involved in creating that agent, that might shift the balance and make your form of corrigibility more attractive (and still vastly more attractive than attempting value alignment in any broader way). I think working through both of these is worthwhile, because those are the two most likely forms of first AGI, and the two most likely actual alignment targets. 

I definitely haven't wrapped my head around all of the pitfalls with either method, but I continue to think that this type of alignment target makes good outcomes much more likely, at least as far as we've gotten with the analysis so far.

I think this type of alignment target is also important because the strongest and most used arguments for alignment difficulty don't apply to them. So when we're debating slowing down AGI, proponents of going forward will be talking about these approaches. If the alignment community hasn't thought through them carefully, there will be no valid counterargument. I'd still prefer that we slow AGI even though I think these methods give us a decent chance of succeeding at technical alignment. So that's one more reason I find this topic worthwhile.

This has gotten pretty discursive, so don't worry about responding to all of it.

This sounds like we're saying the same thing? My "at all times" is implied and maybe confusing. I'm saying it doesn't guess what the principal will want in the future, it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it's just having the agent's goals be a pointer to the principal's goals.

This formulation occurred to me since talking to you, and it seems like a compact and intuitive formulation of why your notion of corrigibility seems coherent and simple.

Edit: to address your example, I both want and don't-want to be manipulated into wanting to exercise next week. It's confusing for me, so it should be confusing for my corrigible AGI. It should ask me to clarify when and how I want to be manipulated, rather than taking a guess when I don't know the answer. I probably haven't thought about it deeply, and overall it's pretty important for accurately doing what I want, so a good corrigible helper will suggest I spend some time clarifying for it and for myself. This is a point where things could go wrong if it takes bad guesses instead of getting clarification, but there are lots of points like that.

I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility.

The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to figure out what balance their principal wants, and do that.

I was going to say that my instruction-following variant of corrigibility might be better for working out that balance, but it actually seems pretty straightforward in Max's pure corrigibility version, now that I've written out the above.

I applaud the work; detailed thinking about control efforts is one useful alignment strategy.

I find the assumption of no time-awareness in the AI unrealistic. Allowing continuous learning is going to be extremely tempting to anyone developing or deploying proto-AGI. Episodic memory (including across sessions) is fairly easy to implement, and can add capabilities as well as save compute costs. Working out useful knowledge and strategies will probably incur nontrivial compute costs, and throwing away that knowledge after each session will seem very wasteful of resources and will limit usefulness.
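
As a rough illustration of how little machinery cross-session episodic memory needs (the file name and record format below are arbitrary choices, not any real system's): persist what was learned to disk and reload it next session, rather than re-deriving it each time.

```python
# Minimal sketch: episodic memory that survives across sessions by persisting
# learned episodes to a JSON file. Path and schema are illustrative only.

import json
from pathlib import Path

MEMORY_PATH = Path("episodic_memory.json")  # hypothetical location


def load_memory() -> list:
    """Reload everything learned in earlier sessions."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return []


def save_memory(episodes: list) -> None:
    MEMORY_PATH.write_text(json.dumps(episodes, indent=2))


# One session: record a lesson so future sessions don't pay the compute cost
# of working it out again.
episodes = load_memory()
episodes.append({"task": "summarize quarterly report",
                 "lesson": "the key figures are in the appendix tables"})
save_memory(episodes)
```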


What do you mean by "when deceptive alignment fails to work"? I'm confused.

I think this is one particularly striking example of a ubiquitous problem in alignment discussions: they get confused when the type of AI we're talking about isn't made clear. People are thinking of different types of AI without explicitly stating this, so they reach different conclusions about alignment. To some extent this is inevitable if we want to avoid advancing capabilities by proposing useful designs for AGI. But we could do better by distinguishing between known broad categories, in particular agentic vs. tool AI and RL-trained vs. predictive AI. These are not sharp categories, but distinguishing what part of the spectrum we're primarily addressing would clarify discussions.

You've done an admirable job of doing that in this post, and doing so seems to make sense of your disagreements with Pope's conclusions.

Pope appears to be talking primarily about LLMs, so the extent to which his logic applies to other forms of AI is unclear. As you note, that logic does not seem to apply to AI that is agentic (explicitly goal-directed), or to actor-critic RL agents.

That is not the only problem with that essay, but it's a big one, since the essay comes to the conclusion that AI is safe, while analyzing only one type of AI.

I agree that human ethics is not the result solely of training, but has a critical component of innate drives to be pro-social. The existence of sociopaths whose upbringing was normal is pretty compelling evidence that the genetic component is causal.

While the genetic basis of prosocial behavior is probably simple in the sense that it is coded in a limited amount of DNA information and neural circuitry, it is likely quite complex in another sense: it is evolved to work properly in the context of a very particular type of environment, that of standard human experience. As such, I find it unlikely that those mechanisms would produce an aligned agent in a very different AI training regime, nor that that alignment would generalize to very different situations than humans commonly encounter.

As you note, even if we restricted ourselves to this type of AI, and alignment was easy, that would not reduce existential risks to near 1%. If powerful AI is accessible to many, someone is going to either make mistakes or deliberately use it destructively, probably rather quickly.

Your first point seems absolutely critical. Could you elaborate a little?

I think this is insightful, correctly pointing to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.

Unfortunately, I mostly agree with these positions. I'm afraid a lack of attention to these claims may be making the new wave of alignment thinkers more optimistic than is realistic. I do partially disagree with some of these, and that makes my p(doom) a good bit lower than the MIRI 99%. But it's not enough to make me truly optimistic. My p(doom) is right around the 50% "who knows" mark.

I'll restate the main claims as:

  1. We only get one chance
  2. We have to get the AI's values exactly aligned with human values
    1. There will be a sharp discontinuity as an AI becomes truly generally intelligent
    2. The process of value reflection seems highly unstable
  3. No known method of dodging this problem is likely to work

The source of almost all my disagreement with you is the type of AGI we expect. I expect (with above 50% probability) AGI to arise from the expansion of LLMs into language-model-based cognitive architectures that use LLMs as the core engine, but expand on them with chain-of-thought reasoning and allow them to use external tools. These expectations are entirely theoretical, since AutoGPT and HuggingGPT were only released about a month or so ago. My post Capabilities and alignment of LLM cognitive architectures elaborates on why I expect these to work well.

I think such systems will readily become weakly general (at variance with your expectation of a more all-or-nothing transition) by learning about new domains through web search and experimentation with their cognitive tools, storing that knowledge in episodic memory. (I also think that before long, they will use that episodic, declarative knowledge to fine-tune the central LLM, much as humans absorb new knowledge into skills.) Importantly, I expect this generality to extend to understanding themselves as systems, thereby giving rise to something like value reflection.
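
Here is a hedged sketch of that learning loop, with a hypothetical web_search tool and made-up record fields: explore a domain with a tool, store the findings as declarative episodes, and later convert those episodes into training pairs that could feed a fine-tuning run on the core LLM.

```python
# Illustrative only: learn about a new domain via a tool call, keep the result
# in episodic memory, and later turn that declarative knowledge into
# (prompt, completion) pairs for fine-tuning the central LLM.

from typing import Dict, List


def web_search(query: str) -> str:
    """Hypothetical tool call; a real agent would hit a search API here."""
    return f"(search results for: {query})"


def learn_domain(domain: str, memory: List[Dict[str, str]]) -> None:
    """Explore a new domain and store what was learned as an episode."""
    findings = web_search(f"introduction to {domain}")
    memory.append({"domain": domain, "notes": findings})


def to_finetune_pairs(memory: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert declarative episodes into training examples for the core model."""
    return [{"prompt": f"What do you know about {m['domain']}?",
             "completion": m["notes"]}
            for m in memory]


memory: List[Dict[str, str]] = []
learn_domain("protein folding", memory)
print(to_finetune_pairs(memory))
```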

This is bad because it advances timelines if true, but really good in that such systems can be run without using any RL or persistent context in one LLM.

None of the above considerations are in that post; I'm writing another that focuses on them.

In that scenario, I expect us to get a few shots, as the transition to truly general systems will be slow and will happen in highly interpretable natural-language agent systems. There are still many dangers, but I think this would massively improve our odds.

Whether AGI arises from that or from a different network-based system, I agree that the value reflection process is unpredictable, so we may have to get value alignment exactly right. I expect the central, strongest value to be preserved in a reflective value-editing process. But that means that central value has to be exactly right. Whether any broader configuration of values might be stable in a learning network is unknown, and I think worthy of a good deal more thought.

One random observation: I think your notion of general intelligence overlaps strongly with the common concept of recursive self improvement, which many people do include in their mental models.

Anyway, thanks for an insightful post that nails a good deal of the variance between my model of the average alignment optimist and pessimist.
