Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. Now I'm applying what I've learned to the study of AI alignment. 

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
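To make that concrete, here is a minimal sketch of what I mean by a language model cognitive architecture: a language model core, an episodic memory it can store to and retrieve from, and a simple executive loop that plans, acts, and checks progress. The `call_llm` stub and the toy retrieval scoring are placeholder assumptions for illustration, not any particular system's API.

```python
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Stand-in for a call to any language model API (hypothetical)."""
    return f"[model response to: {prompt[:40]}...]"


@dataclass
class EpisodicMemory:
    """Stores past episodes and retrieves the ones most relevant to the current task."""
    episodes: list = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> list:
        # Toy relevance score: count words shared with the query.
        scored = sorted(
            self.episodes,
            key=lambda e: len(set(e.lower().split()) & set(query.lower().split())),
            reverse=True,
        )
        return scored[:k]


@dataclass
class ExecutiveLoop:
    """Executive function: hold a goal, plan subtasks, act, and monitor progress."""
    memory: EpisodicMemory

    def run(self, goal: str, max_steps: int = 3) -> None:
        for step in range(max_steps):
            context = "\n".join(self.memory.retrieve(goal))
            plan = call_llm(f"Goal: {goal}\nRelevant memories:\n{context}\nNext subtask?")
            result = call_llm(f"Carry out subtask: {plan}")
            self.memory.store(f"step {step}: {plan} -> {result}")
            done = call_llm(f"Goal: {goal}\nLatest result: {result}\nIs the goal met? yes/no")
            if done.strip().lower().startswith("yes"):
                break


if __name__ == "__main__":
    agent = ExecutiveLoop(memory=EpisodicMemory())
    agent.run("summarize this week's experiment logs")
```

The point isn't that this particular loop is hard to build; it's that the missing pieces are this kind of scaffolding, not new fundamental capabilities.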

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are.  Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll do the obvious thing: design it to follow instructions. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done.  An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex a question to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments


> we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? 

I think it's actually not any less true of o1/r1. It's still mostly predictive/world-modeling training, with a dash of human-preference RL that could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against the faithfulness/transparency of the CoT.

If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.

But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too.

Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them.

I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work. 


WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment.

I'm not sure I'm more optimistic than you, but I am quite uncertain about how well the likely (low but not zero effort/thought) methods of aligning network-based AGI might go. I think others should be more uncertain as well. That some people are certain of doom while others with real expertise think it's probably going to be fine should be a signal that we do not have this worked through yet.

That's why I like this post and similar attempts to resolve optimist/pessimist disagreements so much.

I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.

There's a lot here. I won't try to respond to all of it right now.

I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.

Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it will generalize relatively well.

When alignment is based on a relatively simple reward model based on simple sensory representations, it won't generalize very well. That's the case with humans. The reward model runs on sensory representations (it has to, so that it can be specified within the limited information capacity of DNA, as you and others have discussed).

Alignment generalizes farther than capabilities in well-educated, carefully considered modern humans because our goals are formulated in terms of concepts. (There are still ways this could go awry, but I think most modern humans would generalize their goals well and lead us into a spectacular future if they were in charge of it).

This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible. If we could use natural language (or another route to conceptual representations) to specify an AI's goals, it seems like that would produce better generalization than just trying to train stuff in with RL to produce behavior we like in the training environment. 

One method of "conceptual alignment" is the variant of your Plan for mediocre alignment of brain-like [model-based RL] AGI in which you more or less say to a trained AI "hey think about human flourishing" and then set the critic system's weights to maximum. Another is alignment-by-prompting for LLM-based agents; I discuss that in Internal independent review for language model agent alignment. I'm less optimistic now than when I wrote that, given the progress made in training vs. scripting for better metacognition - but I'm not giving up on it just yet.

Tan Zhi Xuan makes the point in this interview that we're really training LLMs mostly to have a good world model and to follow instructions, similar to Andrej Karpathy's point that RLHF is just barely RL. It's similar with RLAIF and the reward models training R1 for usability, after the pure RL on verifiable answers. So we're still training models to have good world models and follow instructions. Played wisely, it seems like that could produce aligned LLM agents (should that route reach "real AGI").
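As one concrete illustration of the alignment-by-prompting plus internal-review idea, here's a toy sketch: the agent's goals and side constraints live in a natural-language charter in the prompt, and a separate model call audits each proposed action against that charter before it runs. The charter text, prompts, and function names are my illustrative assumptions here, not the actual design from that post.

```python
from typing import Optional


def call_llm(prompt: str) -> str:
    """Stand-in for any language model API call (hypothetical)."""
    return "APPROVE"  # placeholder so the sketch runs end to end


CHARTER = (
    "Follow the principal's instructions, don't deceive anyone, and check "
    "with the principal before taking irreversible actions."
)


def propose_action(task: str) -> str:
    # The agent's goals live in natural language, in the prompt itself.
    return call_llm(f"Charter: {CHARTER}\nTask: {task}\nPropose the next action.")


def independent_review(task: str, action: str) -> bool:
    # A separate model call, prompted only to audit the action against the charter.
    verdict = call_llm(
        "You are an independent reviewer. Answer APPROVE or REJECT.\n"
        f"Charter: {CHARTER}\nTask: {task}\nProposed action: {action}"
    )
    return verdict.strip().upper().startswith("APPROVE")


def run_step(task: str) -> Optional[str]:
    action = propose_action(task)
    if independent_review(task, action):
        return action  # in a real agent, the action would execute here
    return None        # rejected: escalate to the human rather than acting
```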
 

That's a new formulation of an old thought, prompted by your framing, which pits the argument that capabilities generalize farther than alignment (for evolution making humans) against the argument that alignment generalizes farther than capabilities (for modern humans given access to new technologies/capabilities).

The alternative is trying to get an RL system to "gel" into a concept-based alignment we like. This happens with a lot of humans, but that's a pretty specific set of innate drives (simple reward models) and environment. If we monitored and nudged the system closely, that might work too.

Here's my proposal for how we avoid this consequence of consequentialist goals: make the primary goal instruction-following. This is a non-consequentialist goal. All other goals are consequentialist subgoals of that one, when the human gives an instruction to accomplish something.

This would only prevent scheming to accomplish the consequentialist goals you've instructed your AGI to pursue if instruction-following were also used to give side constraints like "don't lie to me" and to have it spend lots of time carefully exploring its theories of what its goals mean and how to accomplish them. This approach seems likely to work - but I want to hear more pushback on it before I'd trust it in practice.
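Here's a bare-bones sketch (in Python, with purely illustrative names) of the goal structure I have in mind: the only terminal goal is following the principal's instructions, instructed outcomes become consequentialist subgoals, and standing instructions act as side constraints that can veto a plan.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InstructionFollowingAgent:
    """Terminal goal: follow the principal's instructions (non-consequentialist).
    Instructed outcomes become consequentialist subgoals; standing instructions
    like "don't lie to me" act as side constraints on every plan."""
    side_constraints: List[str] = field(default_factory=list)
    subgoals: List[str] = field(default_factory=list)

    def receive_instruction(self, instruction: str, standing: bool = False) -> None:
        if standing:
            self.side_constraints.append(instruction)
        else:
            self.subgoals.append(instruction)

    def adopt_plan(self, plan: str, violates: Callable[[str, str], bool]) -> bool:
        # A plan for any subgoal is adopted only if no standing constraint rules it
        # out; otherwise the agent refuses and checks back with the principal.
        return not any(violates(plan, c) for c in self.side_constraints)


# Toy usage: a crude keyword check stands in for real constraint evaluation.
agent = InstructionFollowingAgent()
agent.receive_instruction("don't lie to me", standing=True)
agent.receive_instruction("cure cancer")
ok = agent.adopt_plan(
    plan="lie to the lab about trial results to speed things up",
    violates=lambda plan, constraint: "lie" in plan and "lie" in constraint,
)
print(ok)  # False: the side constraint vetoes the plan
```

Obviously the hard part is everything hidden inside `violates` and how the goals are actually learned; the sketch only pins down the shape of the proposal.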

I think this is not only an interesting dodge around this class of alignment concerns, but it's the most likely thing to actually be implemented. When someone is actually getting close to launching a system they hope is or will become smarter than they are, they'll think a little harder about making its central goal "solve cancer" or anything else broad and consequentialist. The natural choice is to just extend what LLMs are mostly aligned for now: following instructions, including consequentialist instructions.

This logic is all laid out in more detail in Instruction-following AGI is easier and more likely than value aligned AGI, but I didn't specifically address scheming there.

Edit note: you responded to approximately the first half of my eventual comment; sorry! I accidentally submitted it half-baked, then quickly added the rest. But the meaning of the first part wasn't really changed, so I'll respond to your comments on that part.

I agree that it's not that simple in practice, because we'd try to avoid that by giving side constraints; but it is that simple in the abstract, and by default. If it followed our initial goal as we intended it there would be no problem; but the core of much alignment worry is that it's really hard to get exactly what we intended into an AI as its goal.

I also agree that good HHH training might be enough to overcome the consequentialist/instrumental logic of scheming. Those tendencies would function as side constraints. The AI would have a "character" that is in conflict with its instrumental goal. Which would win out would be a result of exactly how that goal was implemented in the AI's decision-making procedures, particularly the ones surrounding learning.

To summarize:

  • AGI will have goals, because we want it to accomplish things for us
  • Scheming is often helpful in accomplishing goals
  • AGIs will realize they should scheme, then scheme.

Despite thinking the core logic is almost that simple, I think it's useful to have this set of thinking laid out so carefully and in the detail you give here.

I am also still a bit confused as to why this careful presentation is needed. I find the logic so compelling that needing to be walked carefully through it seems strange to me. And yet there are intelligent and well-informed people who say things like "there's no empirical evidence for scheming in AIs" in all seriousness. So I'd like to understand that perspective better.

While I don't fully understand the perspective that needs to be convinced that scheming is likely, I do have some guesses. I think on the whole it stems from understanding current AI systems well, and reasoning from there. Current systems do not really scheme; they lack the capacity. Those who reason by analogy with humans or with fictional or hypothetical generally superintelligent AI see scheming as extremely likely from a misaligned AGI, because they're assuming it will have all the necessary cognitive capacities.

There are more nuanced views, but I think those are the two starting points that generate this dramatic difference in opinions.

Some more specific common cruxes of disagreement on scheming likelihood:

  • Will AGI have consequentialist goals?
  • Will AGI be able to do novel reasoning?
  • Will AGI be able to do long-term planning?
  • Will AGI be able to learn continuously from experience?

I see the answers to all of these questions as being obviously, inevitably yes by default; all of these are useful, so we will keep building toward AGI with all of these capacities if nothing stops us. Having extremely useful transformative limited AGI (like super-foundation models) would not stop us from building "real AGI" with the above properties.

I've tried to convey why those properties seem so inevitable (and actually rather easy to add from here) in real AGI, Steering subsystems: capabilities, agency, and alignment, and Sapience, understanding, and "AGI", among snippets in other places. I'm afraid none of them is as clear or compelling as I'd like from the perspective of someone who starts reasoning from current AI and asks why or how we would include those dangerous properties in our future AGIs.

That's why I'm glad you guys are taking a crack at it in a more careful and expansive way, and from the perspective of how little we'd need to add to current systems to make them solve important problems, and how that gives rise to scheming. I'll be referencing this post on this point.

Edit note: Most of this was written after an accidental premature submit ctrl-return action.

If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.

This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.

Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace, and hope I conveyed the proper intuition. Or I can design an AGI with the goal of doing what I tell it to do (or maybe "what I want"), then telling it to achieve world peace and giving it the definition. In the second case, its goal is similar but not identical to the first. It's a pointer to that goal instead of a direct definition. Thus, if the goal changes in my head (or my instructions in that variant), the pointer's value changes. In the first direct definition case, it does not.
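In programming terms, the difference is between copying the goal into the agent at build time and holding a reference that gets dereferenced at decision time. A toy sketch, with class names that are purely illustrative:

```python
from dataclasses import dataclass


@dataclass
class Principal:
    current_goal: str   # what the human actually wants right now


@dataclass
class DirectGoalAgent:
    goal: str           # a copy of the goal, fixed at build time

    def objective(self) -> str:
        return self.goal


@dataclass
class PointerGoalAgent:
    principal: Principal   # a reference, dereferenced at decision time

    def objective(self) -> str:
        return self.principal.current_goal


human = Principal(current_goal="achieve world peace (my current definition)")
direct = DirectGoalAgent(goal=human.current_goal)
pointer = PointerGoalAgent(principal=human)

human.current_goal = "achieve world peace (my revised definition)"
print(direct.objective())   # still the old definition: the copy did not change
print(pointer.objective())  # tracks the principal's updated goal
```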

Goals as pointers is not something any evolved system really has, but it seems logically consistent and to have major advantages for alignment. More on that in my instruction-following AGI... post.

What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close).

Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.

This seems productive.

I don't understand your proposal if it doesn't boil down to "do what the principal wants" or "do what the principal says" (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent and therefore relatively easy to define or train into an AGI.

This (possible mis)understanding of your "corrigibility = figure out what I want" is why I currently prefer the instruction-following route to corrigibility. I don't want the AGI to guess at what I want any more than necessary. This has downsides, too; back to those at the end.

I do think what your model of me says, but I think it's only narrowly true and probably not very useful:

> It's fine if the AGI does what I want and not what I say, as long as it's correct about what I want.

I think this is true for exactly the right definition of "what I want", but conveying that to an AGI is nontrivial, and re-introduces the difficulty of value learning. That's mixed with the danger that it's incorrect about what I want. That is, it could be right about what I want in one sense, but not the sense I wanted to convey to it (e.g., it decides I'd really rather be put into an experience machine where I'm the celebrated hero of the world, rather than make the real world good for everyone as I'd hoped).

Maybe I've misunderstood your thesis, but I did read it pretty carefully, so there might be something to learn from how I've misunderstood. All of your examples I remember correspond to "doing what the principal wants" by a pretty common interpretation of that phrase.

Instruction-following puts a lot of the difficulty back on the human(s) in charge. This is potentially very bad, but I think humans will probably choose this route anyway. You've pointed out some ways that following instructions could be a danger (although I think your genie examples aren't the most relevant for a modest takeoff speed). But I think unless something changes, humans are likely to prefer keeping the power and the responsibility to trying to put more of the project into the AGI's alignment. That's another reason I'm spending my time thinking through this route to corrigibility instead of the one you propose.

Although again, I might be missing something about your scheme.

I just went back and reread 2. Corrigibility Intuition (after writing the above, which I won't try to revise). Everything there still looks like a flavor of "do what I want". My model of Max says "corrigibility is more like 'do your best to be correctable'". It seems like correctable means correctable toward what the principal wants. So I wonder if your formulation reduces to "do what I want, with an emphasis on following instructions and being aware that you might be wrong about what I want". That sounds very much like the Do What I Mean And Check formulation of my instruction-following approach to corrigibility.

Thanks for engaging. I think this is productive.

Just to pop back to the top level briefly, I'm focusing on instruction-following because I think it will work well and be the more likely pick for a nascent language-model agent AGI, from below human level to somewhat above it. If RL is heavily involved in creating that agent, that might shift the balance and make your form of corrigibility more attractive (and still vastly more attractive than attempting value alignment in any broader way). I think working through both of these is worthwhile, because those are the two most likely forms of first AGI, and the two most likely actual alignment targets. 

I definitely haven't wrapped my head around all of the pitfalls with either method, but I continue to think that this type of alignment target makes good outcomes much more likely, at least as far as we've gotten with the analysis so far.

I think this type of alignment target is also important because the strongest and most used arguments for alignment difficulty don't apply to them. So when we're debating slowing down AGI, proponents of going forward will be talking about these approaches. If the alignment community hasn't thought through them carefully, there will be no valid counterargument. I'd still prefer that we slow AGI even though I think these methods give us a decent chance of succeeding at technical alignment. So that's one more reason I find this topic worthwhile.

This has gotten pretty discursive, so don't worry about responding to all of it.

This sounds like we're saying the same thing? My "at all times" is implied and maybe confusing. I'm saying it doesn't guess what the principal will want in the future, it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it's just having the agent's goals be a pointer to the principal's goals.

This formulation occurred to me since talking to you, and it seems like a compact and intuitive formulation of why your notion of corrigibility seems coherent and simple.

Edit: to address your example, I both want and don't-want to be manipulated into wanting to exercise next week. It's confusing for me, so it should be confusing for my corrigible AGI. It should ask me to clarify when and how I want to be manipulated, rather than taking a guess when I don't know the answer. I probably haven't thought about it deeply, and overall it's pretty important to accurately doing what I want, so a good corrigible helper will suggest I spend some time clarifying for it and for myself. This is a point where things could go wrong if it takes bad guesses instead of getting clarification, but there are lots of those.

I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility.

The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to figure out what balance their principal wants, and do that.

I was going to say that my instruction-following variant of corrigibility might be better for working out that balance, but it actually seems pretty straightforward in Max's pure corrigibility version, now that I've written out the above.
