A collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There is a lot of literature laying out various approaches, but much of it focuses primarily on outer alignment at the expense of inner alignment and doesn't directly compare the approaches.

16Daniel Kokotajlo
This post is the best overview of the field so far that I know of. I appreciate how it frames things in terms of outer/inner alignment and training/performance competitiveness--it's very useful to have a framework with which to evaluate proposals, and I think this is a pretty good one. Since it was written, this post has been my go-to reference for getting other people up to speed on what current AI alignment strategies look like (even though it isn't exhaustive), and I've referred back to it myself several times. I learned a lot from it. I hope that this post grows into something more extensive and official -- maybe an Official Curated List of Alignment Proposals, Summarized and Evaluated with Commentary and Links. Such a list could be regularly updated and would be very valuable for several reasons, some of which I mentioned in this comment.

Recent Discussion

Every now and then, some AI luminaries

  • (1) propose that the future of powerful AI will be reinforcement learning agents—an algorithm class that in many ways has more in common with MuZero (2019) than with LLMs; and
  • (2) propose that the technical problem of making these powerful future AIs follow human commands and/or care about human welfare—as opposed to, y’know, the Terminator thing—is a straightforward problem that they already know how to solve, at least in broad outline.

I agree with (1) and strenuously disagree with (2).

The last time I saw something like this, I responded by writing: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem.

Well, now we have a second entry in the series, with the new preprint book chapter “Welcome to the Era of Experience” by...

4Steve Byrnes
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”. (Warning: All of these English-language descriptions like “pain is bad” are approximate glosses on what’s really going on, which is only describable in terms like “blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.”. See my examples of laughter and social instincts. But “pain is bad” is a convenient shorthand.)

So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of “drive to feel liked / admired”). At the end of the day, they’ll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out. So in particular, I disagree with your claim that, in the torture scenario, “reward hacking” → reveal the secret. The social rewards are real rewards too.

I’m unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it’s solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
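
A minimal toy sketch of this picture (not code from the comment; the drive terms, weights, and activations below are all invented): the primary reward is a sum of innate-drive terms, and the choice goes to whichever option seems most motivating once those terms are balanced against each other.

```python
# Toy sketch only: drive names, weights, and activations are invented.
INNATE_DRIVES = {
    "eating_when_hungry": +1.0,
    "suffocation": -10.0,
    "pain": -5.0,
    "feel_liked_admired": +3.0,  # one of the social-instinct terms
}

def primary_reward(drive_activations: dict) -> float:
    """Sum the innate-drive terms for a candidate course of action."""
    return sum(INNATE_DRIVES[d] * a for d, a in drive_activations.items())

# Torture scenario: both options carry real positive and negative reward terms.
options = {
    "keep_secret":   {"pain": 1.0, "feel_liked_admired": 0.9},
    "reveal_secret": {"pain": 0.1, "feel_liked_admired": 0.0},
}

# The person does whichever seems most motivating; which way it goes
# depends entirely on how the terms balance out.
choice = max(options, key=lambda name: primary_reward(options[name]))
```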

My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?

When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?

1cousin_it
I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill. The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?
2Steve Byrnes
The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.)

It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea “I’m gonna go to the candy store” pops into your head, that incidentally involves the “eating candy” concept also being rather active in your head (active right now, as you entertain that idea), and the “eating candy” concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store. “We predict our future feelings” is an optional thing that might happen, but it’s just a special case of the above, the way I think about it.

This doesn’t really parse for me … The reward function is an input to learning, it’s not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.) Anyway, I’m all in on model-based RL. I don’t think imitation learning is a separate thing for humans, for reasons discussed in §2.3.
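
A toy contrast between the two pictures described above (not code from the comment; the concept names, valences, and threshold are invented): a textbook exponentially-discounted return on one hand, and on the other an idea that seems good right now because of the valence of the concepts it activates.

```python
# (a) Textbook RL: score a plan by an exponentially-discounted sum of predicted rewards.
def discounted_return(predicted_rewards, gamma=0.95):
    return sum(r * gamma**t for t, r in enumerate(predicted_rewards))

# (b) The picture above: a concept acquires valence by having tended to
# immediately precede primary reward; an idea seems good according to the
# valence of the concepts it activates right now, and you act on it if it
# seems good enough.
concept_valence = {"eating_candy": +2.0, "long_walk": -0.3}  # learned from past reward

def idea_seems_good(activated_concepts, threshold=0.5):
    score = sum(concept_valence.get(c, 0.0) for c in activated_concepts)
    return score > threshold

idea_seems_good({"go_to_candy_store", "eating_candy"})  # True: off you go to the store
```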

In this post, we study whether we can modify an LLM’s beliefs and investigate whether doing so could decrease risk from advanced AI systems.

We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting for detecting model misalignment and unlearning. 
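
A rough sketch of what such a pipeline might look like, under the post's own definition of belief (this is not the authors' code; every function body below is a stand-in):

```python
from typing import List, Tuple

def generate_synthetic_docs(claim: str, n: int) -> List[str]:
    # Stand-in: a generator model would write diverse documents (news articles,
    # forum posts, textbook excerpts, ...) that treat the claim as true.
    return [f"[synthetic document {i} asserting: {claim}]" for i in range(n)]

def finetune(model: str, docs: List[str]) -> str:
    # Stand-in for supervised finetuning on the synthetic corpus.
    return f"{model}-finetuned-on-{len(docs)}-docs"

def evaluate_belief(model: str, claim: str) -> dict:
    # Stand-in for the evaluation suite: does the model consistently behave
    # in accordance with the claim?
    return {"model": model, "claim": claim, "behaves_consistently": None}

def insert_belief(base_model: str, claim: str, n_docs: int = 10_000) -> Tuple[str, dict]:
    docs = generate_synthetic_docs(claim, n_docs)
    model = finetune(base_model, docs)
    return model, evaluate_belief(model, claim)
```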

Introduction:

Large language models develop implicit beliefs about the world during training, shaping how they reason and act (we construe AI systems as believing a claim if they consistently behave in accordance with that claim). In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.

Controlling the beliefs of

...
23Sam Marks
Copying over further discussion from X: a back-and-forth between Sam Marks (me) and Jan Kulveit [embedded thread screenshots].

This is a great thread and I appreciate you both having it, and posting it here!

3Fabien Roger
You might be interested in an Appendix with further evaluations about whether LLMs really believe what they are told, which provides some evidence that the false fact is not believed in a very superficial way.

While I think these evals provide decent evidence about what the LLMs in the experiment "actually believe" (I don't expect there to be natural situations where these weak models figure out the facts are false in a single forward pass when they "try harder"), I am pessimistic that we would be able to trust this sort of behavioral eval against powerful and situationally aware LLMs.

One eval which I think would give much stronger guarantees even against powerful and situationally aware LLMs is iid training (teach 100 false facts, train the model to distinguish true from false facts using 50 of them, and see if it generalizes to the other 50), similar to what Aghyad and I did in The case for unlearning that removes information from LLM weights. I think this is a much harder standard to meet (I think the models in the experiments in the blog post don't meet this standard, just like the LLMs that were taught false facts in my unlearning paper), but I think it gives you the guarantees you want (because I think iid training is very sample efficient at making LLMs "actually try"). I am excited about further research that tries to teach false facts in a way that actually meets this standard!
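
A rough sketch of that iid-training eval (a toy framing, not Fabien's code; `finetune_true_false` and `accuracy_on` are stand-ins for real finetuning and evaluation steps):

```python
import random

def finetune_true_false(model, labeled_facts):
    return model  # stand-in: finetune the model to label facts as true/false

def accuracy_on(model, labeled_facts):
    return 0.5    # stand-in: fraction of held-out facts labeled correctly

def iid_training_eval(model, implanted_false_facts, true_facts, seed=0):
    # Split the implanted false facts: train the true/false classifier on half
    # of them (mixed with genuinely true facts), then see whether the labeling
    # generalizes to the held-out half.
    rng = random.Random(seed)
    false_facts = list(implanted_false_facts)
    rng.shuffle(false_facts)
    half = len(false_facts) // 2
    train_false, heldout_false = false_facts[:half], false_facts[half:]

    train = [(f, False) for f in train_false] + [(f, True) for f in true_facts[:half]]
    heldout = [(f, False) for f in heldout_false] + [(f, True) for f in true_facts[half:]]

    classifier = finetune_true_false(model, train)
    heldout_accuracy = accuracy_on(classifier, heldout)

    # High held-out accuracy means the model can still tell the implanted facts
    # are false once training makes it "actually try"; near-chance accuracy is
    # the harder standard belief insertion would need to meet.
    return heldout_accuracy
```
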
3Ryan Greenblatt
And there is some ongoing future work which is currently trying to meet this higher standard! (Or at least something similar to it.)

I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but the most fundamental is my pursuit of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.

This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency. 

Theories of intelligent agency

By a “theory of intelligent agency” I mean a...

cousin_it*10

Here's maybe a related point: AIs might find it useful to develop an ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same thing and there's a way to do a secure "handshake"). So deception ability would be irrelevant, because AIs that can credibly refrain from deception with each other would choose to do so and get a first-best outcome, instead of second-best as voting theory would suggest.

A real world analogy is some of the nuclear precommitments mentioned in Schelling's book. Like whe... (read more)

4Richard Ngo
I think my thought process when I typed "risk-averse money-maximizer" was that an agent could be risk-averse (in which case it wouldn't be an EUM) and then separately be a money-maximizer. But I didn't explicitly think "the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility", so I appreciate the clarification.
2johnswentworth
Note that the same mistake, but with convexity in the other direction, also shows up in the OP: An EUM can totally prefer a probabilistic mixture of two options to either option individually; this happens whenever utility is convex with respect to resources (e.g. money). For instance, suppose an agent's utility is u(money) = money^2. I offer this agent a $1 bet on a fair coinflip at even odds, i.e. it gets $0 if it loses and $2 if it wins. The agent takes the bet: the bet offers u = 0.5*0^2 + 0.5*2^2 = 2, while the baseline offers a certainty of $1 which has u = 1.0*1^2 = 1.
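
For concreteness, a quick check of the arithmetic in that example (assuming the stated u(money) = money^2):

```python
u = lambda money: money ** 2
bet_eu = 0.5 * u(0) + 0.5 * u(2)  # fair coinflip between $0 and $2 -> 2.0
certain_eu = u(1)                 # certainty of $1 -> 1
assert bet_eu > certain_eu        # the convex-utility agent takes the bet
```
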
4Richard Ngo
Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2. In other words, your example rebuts the claim that an EUM can't prefer a probabilistic mixture of two options to the expectation of those two options. But that's not the claim I made.

tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.

If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way...

7Zach Stein-Perlman
A crucial step is bouncing off the bumpers. I'm surprised[1] that you're optimistic about this. I would have guessed that concerning-audit-results don't help you solve the problem much. Like if you catch sandbagging that doesn't let you solve sandbagging. I get that you can patch simple obvious stuff—"redesigning the reward function for some automatically-graded RL environment or removing a tranche of poorly-labeled human data"—but mostly I don't know how to tell a story where concerning-audit-results are very helpful. 1. ^ (I'm actually ignorant on this topic; "surprised" mostly isn't a euphemism for "very skeptical.")
21Sam Bowman
That's part of Step 6! I think we probably do have different priors here on how much we'd be able to trust a pretty broad suite of measures, but I agree with the high-level take. Also relevant:
kave54

Meta: I'm confused and a little sad about the relative upvotes of Habryka's comment (35) and Sam's comment (28). I think it's trending better, but what does it even mean to have a highly upvoted complaint comment based on a misunderstanding, especially one more highly upvoted than the correction?

Maybe people think Habryka's comment is a good critique even given the correction, even though I don't think Habryka does?

6Oliver Habryka
Ah, indeed! I think the "consistent" threw me off a bit and so I misread it on first reading, but that's good. Sorry for missing it on first read; I do think that is approximately the kind of clause I was imagining (of course I would phrase things differently and would put an explicit emphasis on coordinating with other actors in ways beyond "articulation", but your phrasing here is within my bounds of where objections feel more like nitpicking).

In this post I present a model of the relationship between higher level goals, explicit reasoning, and learned heuristics in capable agents. This model suggests that given sufficiently rich training environments (and sufficient reasoning ability), models which terminally value on-episode reward-proxies are disadvantaged relative to training-gamers.

A key point is that training gamers can still contain large quantities of learned heuristics (context-specific drives). By viewing these drives as instrumental and having good instincts for when to trust them, a training-gamer can capture the benefits of both instinctive adaptation and explicit reasoning without paying much of a speed penalty.
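
As a toy illustration of that claim (invented names and numbers, not from the post): the agent leans on cheap context-specific heuristics when it trusts them and falls back on explicit reasoning from its higher-level goal otherwise.

```python
def act(context, heuristics, explicit_reasoner, trust_threshold=0.8):
    cached = heuristics.get(context)          # learned context-specific drive
    if cached is not None and cached["confidence"] >= trust_threshold:
        return cached["action"]               # fast path: instinctive adaptation
    return explicit_reasoner(context)         # slow path: reason from the higher goal

heuristics = {"familiar_env": {"action": "do_the_usual_thing", "confidence": 0.95}}
reasoner = lambda ctx: f"explicitly derived action for {ctx}"
act("familiar_env", heuristics, reasoner)  # uses the cached heuristic
act("novel_env", heuristics, reasoner)     # falls back on explicit reasoning
```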

Core claims

  1. To get high reward, models must learn context-specific heuristics which are not derivable by reasoning from the goal of reward maximization.[1]  It is also true that thinking explicitly about reward
...

I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:

  • Added 2 sentences emphasizing the point that schemers probably won't be aware of their terminal goal in most contexts.  I thought this was clear from the post already, but apparently it wasn't.
  • Modified "What factors affect the likelihood of training-gaming?" to emphasize that "sum of proxies" and "reward-seeker" are points on a spectrum.  We might get an in-between model where context-dependent drives conflict with higher goals and sometimes "win" even out
... (read more)

Epistemic status: The important content here is the claims. To illustrate the claims, I sometimes use examples that I didn't research very deeply, where I might get some facts wrong; feel free to treat these examples as fictional allegories.

In a recent exchange on X, I promised to write a post with my thoughts on what sorts of downstream problems interpretability researchers should try to apply their work to. But first, I want to explain why I think this question is important.

In this post, I will argue that interpretability researchers should demo downstream applications of their research as a means of validating their research. To be clear about what this claim means, here are different claims that I will not defend here:

Not my claim: Interpretability researchers should demo

...
3Vivek Hebbar
In the fictional dialogue, Claude Shannon's first answer is more correct -- info theory is useful far outside the original domain of application, and its elegance is the best way to predict that.

I disagree. Consider the following two sources of evidence that information theory will be broadly useful:

  1. Information theory is elegant.
  2. There is some domain of application in which information theory is useful.

I think that (2) is stronger evidence than (1). If some framework is elegant but has not been applied downstream in any domain after a reasonable amount of time, then I don't think its elegance is strong reason to nevertheless believe that the framework will later find a domain of application.

I think there's some threshold number of downstream applic... (read more)
