This is a really good post. Some minor musings:
If a human wound up in that situation, they would just think about it more, repeatedly querying their ‘ground truth’ social instincts, and come up with some way that they feel about that new possibility. Whereas AGI would … I dunno, it depends on the exact code. Maybe it would form a preference quasi-randomly? Maybe it would wind up disliking everything, and wind up sitting around doing nothing until it gets outcompeted? (More on conservatism here.)
Perhaps one difference in opinion is that it's really unclear to me that an AGI wouldn't do much the same thing of "thinking about it more, repeatedly querying their 'ground truth' social instincts" that humans do. Arguably, models like Claude Opus already do this: they can clearly do detailed reasoning about somewhat out-of-distribution scenarios using moral intuitions that come from somewhere. That 'somewhere' is going to be some inscrutable combination of similar scenarios in pretraining data, generalization from humans talking about morality, and intuitions derived from the RLAIF phase which embeds Claude's constitution, etc. Of course we can argue that Claude's 'social instincts', derived in this way, are somehow defective compared to humans', but it is unclear (to me) that this path cannot produce AGIs with decent social instincts.
Perhaps a crux of the difference in opinion between us is that I think that much of the 'alignment-relevant' morality is not created entirely by innate human social instincts but is instead learnt by our predictive world models based on external data -- i.e. 'culture'. Now, culture itself is obviously downstream of a lot of our social instincts, but it is also based on other factors, like game-theoretic equilibria which promote cooperation even among selfish agents and, very pertinently, using logical 'system 2' reasoning to generalize and extend our inchoate social instincts and then backprop this new understanding into our learnt value functions. Utilitarianism, and the super-generalized EA-style compassion it brings, is a great example of this. Very few humans before the 18th century had ever thought of these ideas, let alone had moral intuitions aligned with them. They are profoundly unnatural to our innate 'human social instincts'. (Some) people today feel these ideas viscerally because they have been exposed to them enough that they have propagated from the world model back into the value function through in-lifetime learning.
We don't have to conjure up thought experiments about aliens outside of our light cone. Throughout most of history, humans have been completely uncompassionate about suffering existing literally right in front of their faces. From the beginning of time to the 18th century, almost nobody had any issues with slavery, despite often living with slaves or seeing slave suffering on a daily basis. Today, only a few people have moral issues with eating meat, despite the enormous mountain of suffering it causes to living animals right here on our own planet, while eating meat brings only reasonable (and diminishing), but not humongously massive, benefits to our quality of life.
My thinking is that this 'far-mode', 'literate/language/system-2-derived' morality is actually better for alignment and human flourishing in general than the standard set of human social instincts -- i.e. I would prefer that a being with the morality of Claude Opus rule the world rather than a randomly selected human. Alignment is a high bar, and ultimately we need to create minds far more 'saintly' than any living human could ever be.
What we then need to do is figure out how to distill this set of mostly good, highly verbal moral intuitions from culture into a value function that the model 'feels viscerally'. Of course, reverse-engineering some human social instincts is probably important here -- e.g. our compassion instinct is good if generalized, and, even more generally, it is very important to understand how the combination of innate reward signals in the hypothalamus plus the representations in our world model gets people to feel viscerally about the fates of aliens we can never possibly interact with.
Nevertheless, truly out-of-distribution things also exist, just as the world of today is truly out-of-distribution from the perspective of an ancient Egyptian.
As a side-note, it's really unclear how good humans are at generalizing to truly out-of-distribution moralities. Today's morality likely looks pretty bad from the ancient Egyptian perspective. We are really bad at worshipping Ra and reconciling with our Ba's. It might be the case that, upon sufficient reflection, the Egyptians would come to realize that we were right all along, but of course we would say that in any case. I don't know how to solve this, or whether there is in fact any general-case solution for arbitrary degrees of 'out-of-distribution-ness', other than pure conservatism, where you freeze both the values and the representations they are based on.
I like this framing!
So in other words, when you, a human, ask yourself whether something is or is not “human flourishing”, you’re following a pointer to the full power of your human moral and philosophical reasoning (Valence series §2.7). So no wonder the concept “human flourishing” seems (from your perspective) to generalize well to out-of-distribution scenarios! [...]
By contrast, when an AGI is deciding whether some new situation is or isn’t a good pattern-match to “human flourishing”, it does not have a pointer to the ground-truth human reward-function, and thus the full power of human philosophical introspection.
I feel like this undersells somewhat how good even current under-fitted AIs are at generalizing human moral judgment to novel situations.
My guess is that your moral judgment of world trajectories after 1 day of reflection is closer to what Claude 4 Opus would say than to the 1-day moral judgment of the majority of humans. I share your hope that if we are not speaking about the 1-day moral judgment but something closer to a long reflection, then most humans end up quite close (and in particular the majority ends up closer to you than to Claude 4 Opus) because of the mostly-shared "ground-truth human reward signals", but I don't feel very confident in this (p=0.7). If you are more confident than me, I am curious why!
(Just to spell out why I think there is diversity between humans: (1) there might be a lot of path dependence, especially when deciding what the long reflection should look like and how much to tap the human ground-truth reward signal, and the differences between humans' current desires are quite large; and (2) the ground-truth reward signal might differ significantly between humans - there are some well-known edge cases like psychopaths, but there might also be much more mundane diversity.)
(Even if it were the case that Claude 4 Opus is closer to you than to the majority of humans, this is not to say that letting an AI as poorly aligned as Claude 4 Opus control the future would be a good idea according to your lights; it would likely be bad on both common-sense and ECL grounds.)
In the context of “brain-like AGI”, a yet-to-be-invented variation on actor-critic model-based reinforcement learning (RL), there’s a ground-truth reward function (for humans: pain is bad, eating-when-hungry is good, various social drives, etc.), and there’s a learning algorithm that sculpts the AGI’s motivations into a more and more accurate approximation to the future reward of a possible plan.
Unfortunately, this sculpting process tends to systematically lead to an AGI whose motivations fit the reward function too well, such that it exploits errors and edge-cases in the reward function. (“Human feedback is part of the reward function? Cool, I’ll force the humans to give positive feedback by kidnapping their families.”) This alignment failure mode is called “specification gaming” or “reward hacking”, and includes wireheading as a special case.
If too much desire-sculpting is bad because it leads to overfitting, then an obvious potential solution would be to pause that desire-sculpting process at some point. The simplest version of this is early stopping: globally zeroing out the learning rate of the desire-updating algorithm after a set amount of time. Alas, I think that simplest version won’t work—it’s too crude (§7.2). But there could also be more targeted interventions, i.e. selectively preventing or limiting desire-updates of certain types, in certain situations.
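Here’s a toy sketch of where that knob would sit, in made-up pseudocode (purely illustrative, not a concrete proposal; all names are stand-ins):

```python
import numpy as np

def sculpt_desires(critic, features, reward, lr=0.1, gate=1.0):
    """One toy desire-update: nudge the valence guess for this situation
    toward the ground-truth reward.

    `gate` scales the learning rate for this particular situation:
      gate = 1.0 always                     -> ordinary desire-sculpting
      gate = 0.0 after some cutoff date     -> crude global early stopping
      gate = 0.0 only in certain situations -> the more targeted intervention
    """
    valence_guess = critic @ features
    return critic + lr * gate * (reward - valence_guess) * features
```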
Sounds reasonable, right? And I do indeed think it can help with specification gaming. But alas, it introduces a different set of gnarly alignment problems, including path-dependence and “concept extrapolation”.
In this post, I will not propose an elegant resolution to this conundrum, since I don’t have one. Instead I’ll just explore how “perils of under- versus over-sculpting an AGI’s desires” is an illuminating lens through which to view a variety of alignment challenges and ideas, including “non-behaviorist” reward functions such as human social instincts; “trapped priors”; “goal misgeneralization”; “exploration hacking”; “alignment by default”; “natural abstractions”; my so-called “plan for mediocre alignment”; and more.
Here’s an (edited) excerpt from my post “The Era of Experience” has an unsolved technical alignment problem:
- If “the human gives feedback” is part of the reward function, then the AI can potentially get a higher score by forcing the human to give positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured.
- If human “distress signals and happiness signals” are part of the reward function, then the AI can potentially get a higher score by forcing or modifying the humans to give more happiness signals and fewer distress signals, or otherwise exploiting edge-cases in how these signals are operationalized and measured.
- More generally, what source code should we write into the reward function, such that the resulting AI’s “overall goal is to support human well-being”? Please, write something down, and then I will tell you how it can be specification-gamed.
More along those lines in the original. Also, if nothing else, wireheading is always a form of dangerous undesired specification-gaming.
Here’s an (edited) excerpt from my post Self-dialogue: Do behaviorist rewards make scheming AGIs?:
There’s a learning algorithm that systematically sculpts the world-model to be a good predictor of the world, and the value-function[1] to be a good predictor of the reward function. And this process is systematically pushing the AGI towards specification-gaming. If we don’t want specification-gaming, then we’re fighting against this force. Orgel's Second Rule says “evolution is cleverer than you are”, but the same applies to any learning algorithm. The river wants to flow into the sea. We can claim to have a plan to manage that process, but we need to offer that plan in a humble, security-mindset spirit, alive to the possibility that the learning algorithm will find some workaround that we didn’t think of.
I claim we can fruitfully think of brain-like AGI as involving two learning algorithms, both of which have been cranking away every second of your life:
- The first learning algorithm sculpts the world-model into a better and better predictor of the world.
- The second learning algorithm sculpts desires (the value function, i.e. valence) into a better and better predictor of the reward function.
For example, if I try a new activity, and find it very pleasant, then I’ll want to do it again. I have a new desire in my brain. That’s an update from the second learning algorithm. The second learning algorithm may also taketh away: if you’re depressed, you might find that things that you used to enjoy are now unpleasant, and you lose the desire to do them.
By contrast, if someone tells me that the capital of Uganda is Kampala, then my world-model gets updated, but my desires are the same. That’s the first learning algorithm. (To be clear, this update can still affect decisions, because decisions depend not only on “ought”—what is good vs bad?—but also on “is”—what are the expected consequences and other salient associations of a course-of-action?)
The second learning algorithm, the one that updates desires, is the subject of this post. That algorithm is not really about “learning new things” in the colloquial sense, but rather it’s closer to “developing new visceral reactions to things”—in this case, mainly the snap reaction that something is good or bad, a.k.a. valence. See Valence series §2.3.2.
Those two learning algorithms, working together, tend to make the AGI’s desires an ever-closer match to the reward function. So if the reward function has edge-cases that we don’t like—and it does, see §2.1 above—then these learning algorithms are pushing the AGI towards specification-gaming. The more we run those learning algorithms, the more specification-gamey the AGI will become, other things equal.
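As a cartoon of those two learning algorithms side-by-side (toy code only; obviously the real thing is not a pair of linear models, and all the names are made up):

```python
import numpy as np

class ToyBrainLikeAgent:
    def __init__(self, n_features, lr_world=0.1, lr_desire=0.1):
        self.world_model = np.zeros(n_features)  # "is": predicts what happens next
        self.value_fn = np.zeros(n_features)     # "ought": predicts reward (valence guess)
        self.lr_world = lr_world
        self.lr_desire = lr_desire

    def learn(self, features, observed_outcome, reward):
        # Learning algorithm #1: sculpt the world-model into a better predictor
        # of the world (e.g. "the capital of Uganda is Kampala").
        world_error = observed_outcome - self.world_model @ features
        self.world_model += self.lr_world * world_error * features

        # Learning algorithm #2: sculpt desires into a better predictor of the
        # reward function (e.g. "that new activity was pleasant; I want more").
        # This is the knob the rest of the post is about: set lr_desire to zero,
        # globally or situationally, and desire-sculpting pauses.
        desire_error = reward - self.value_fn @ features
        self.value_fn += self.lr_desire * desire_error * features
```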
So an obvious question is … can we just turn one of those learning algorithms off? (Or equivalently, set the learning rate to zero?)
Turning off the first one is out of the question, since (IMO) the AGI needs to be continuously learning how the world works to get anything important done.[2]
Turning off the second one—or better yet, turning it off in some situations but not others—seems less crazy.[3] That’s the subject of this post.
I’ll go through three examples of this broad strategy in §4–§6, then circle back to whether it’s a good approach in general.
Two examples in this subgenre would be @johnswentworth’s “Just Retarget the Search” (2022), and my own Plan for mediocre alignment of brain-like [model-based RL] AGI (2023).
(You can tell from the title that I’m more pessimistic about this approach than John. I’ve also shifted even more in the pessimistic direction since 2023, see §8.2 below.)
(See also @Seth Herd’s Goals selected from learned knowledge: an alternative to RL alignment (2024).)
Starting with my version, the basic idea is: use interpretability or other methods to find some concept like “human flourishing” in the AGI’s concept space, and manually set that concept to be a very very strong desire, by editing the RL critic (i.e. “valence”). Then either turn off the desire-sculpting algorithm altogether, or (more realistically) patch the desire-sculpting algorithm such that this desire for “human flourishing” can never fade away, and such that no other desire can ever become comparably strong.
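In toy pseudocode (purely illustrative; “find the concept direction via interpretability” is doing all the hand-waving here, and none of these names refer to real tools):

```python
import numpy as np

def install_pinned_desire(critic, concept_direction, target_valence=1e3):
    """Toy version of the plan: given a direction in the AGI's learned
    representation space that (we hope, via interpretability) encodes
    something like 'human flourishing', edit the critic so that concept
    has enormous positive valence."""
    d = concept_direction / np.linalg.norm(concept_direction)
    critic = critic - (critic @ d) * d      # zero out the current valence along d...
    critic = critic + target_valence * d    # ...and pin it very high.
    # A patched desire-sculpting rule would then strip the component of every
    # future update along d, so this desire can never fade:
    #     update -= (update @ d) * d
    return critic, d
```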
John’s version is spiritually similar. The main difference is that there are certain aspects of AGI cognition which I expect to be part of the algorithm source code, but which John expects to be learned and emergent. For example (loosely speaking), I’m expecting “beliefs” and “desires” to be two separate (but interconnected) inscrutable learned data structures, whereas John is expecting one bigger inscrutable data structure that implicitly contains both beliefs and desires.
Anyway, I had previously put “Plan for mediocre alignment” in its own mental bucket, separate from the ideas that I find more promising, centered on finding a good RL reward function. But now I see that there’s really a continuum connecting them, with “Plan for mediocre alignment” on the extreme “under-sculpting” end of the spectrum.
The “Plan for Mediocre Alignment” above would involve the human-created source code explicitly having a subroutine for limiting desire-updates. But another thing that can happen is: the AGI itself can take the initiative to prevent its desires from updating. Here are two variations:
I have always deliberately avoided smoking cigarettes, because I don’t want to get addicted. That would be an everyday example of what I call “deliberate incomplete exploration”. More generally, “deliberate incomplete exploration” is just what it sounds like: the agent deliberately avoids certain actions or situations, precisely so that the resulting experiences can’t reshape its desires.
@ryan_greenblatt uses the term “exploration hacking” for something similar, although the negative vibe of that term is misleading here. Really, deliberate incomplete exploration can be good or bad. It’s good if the AI has desires that the programmer likes, and the AI is deliberately preventing them from getting messed up. It’s bad if the AI has desires that the programmer dislikes, and the AI is deliberately preventing them from getting corrected.
(When I say “good” or “bad”, I mean from the AI programmer’s perspective. Deliberate incomplete exploration is always good from the AGI’s perspective—otherwise it wouldn’t be doing it!)
So that was “deliberate incomplete exploration”, a kind of indirect way for an AGI to limit desire-updates, via controlling its situation and actions. The AGI could also accomplish the same end via more direct methods, such as hacking into itself and adjusting learning rates, and so on.[4]
While “deliberate incomplete exploration” also exists in the human world (again, I avoid cigarettes so that I don’t get addicted), these more direct approaches to limiting desire-updates are pretty foreign to the human experience. Brain surgery is a lot harder than modifying AI code.
We should have a general default assumption that AGIs will want to prevent their own desires from being further updated, because of instrumental convergence. See e.g. “Consequentialist preferences are reflectively stable by default” (Yudkowsky 2016).
This centrally applies when the AGI’s desires pertain to what the world will be like in the distant future (see Consequentialism & corrigibility). For example, if there’s a guy who really desperately wants to prevent nuclear war, and he’s in the middle of leading a years-long disarmament project, making steady progress … and then somebody offers him a psychedelic drug or meditation workshop that will make him want to relax in bliss, at peace with the world exactly as it is … then no way is he going to accept that offer. He wants his project to succeed, and tanking his motivation would be counter to that desire.
(And AGIs will have desires about what the world will be like in the future, or else they will be ineffective, hardly worthy of the term “AGI”—see my post “Thoughts on Process-Based Supervision” §5.3 for details.)
This instrumental convergence argument is real and important. But it’s possible to push it too far, by treating it as airtight and inevitable, rather than merely a very strong default.
Why is it not completely inevitable? Because even if an AGI has some desires about the state of the world in the distant future, it’s possible that the AGI will also have other desires, and it’s possible for those other desires to make the AGI accept, or even seek out, certain types of desire-changes.
The human world offers an existence proof. We’re often skeptical of desire-changes—hence words like “brainwashing” or “indoctrination”, or radical teens telling their friends to shoot them if they become conservative in their old age. But we’re also frequently happy to see our desires change over the decades, and think of the changes as being for the better. We’re getting older and wiser, right? Well, cynics might suggest that “older and wiser” is cope, because we’re painting the target around the arrow, and anyway we’re just rationalizing the fact that we don’t have a choice in the matter. But regardless, this example shows that the instrumental convergence force for desire-update-prevention is not completely 100% inevitable—not even for smart, ambitious, and self-aware AGIs.
(Separately, an AGI might want to prevent some desire-updates, but not be able to, for various reasons.)
For positively-vibed examples (it’s good that the AGI can prevent desire-updates because it can keep initially-good motivations from being corrupted), see for example @Thane Ruthenis’s “Goal Alignment Is Robust To the Sharp Left Turn” (2022), @TurnTrout’s “A shot at the diamond-alignment problem” (2022)[5], and “The Legend of Murder-Gandhi”.
For negatively-vibed examples (it’s bad that the AGI can prevent desire-updates because it can keep initially-bad motivations from being corrected), see the “Deceptive Alignment” section of “Risks from Learned Optimization” (2019), “exploration hacking” (mentioned above), and “gradient hacking”.
I made up the term “behaviorist rewards” to describe an RL reward function which depends only on externally-visible actions, behaviors, and/or the state of the world. By contrast, a non-behaviorist reward function is one whose output also depends on what the agent is thinking.
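To make the distinction concrete, here are toy signatures (made-up names, just to show what information each kind of reward function gets to see):

```python
def behaviorist_reward(actions, observed_world_state):
    """Sees only externally-visible behavior and the state of the world."""
    ...

def non_behaviorist_reward(actions, observed_world_state, agent_thoughts):
    """Also gets to peek at what the agent is thinking -- e.g. whether the
    neurons encoding 'a person is suffering because of me' are active."""
    ...
```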
I claim that (1) behaviorist reward functions are doomed to lead to misaligned AGI (see “Behaviorist” RL reward functions lead to scheming), and relatedly (2) human social instincts, which lead non-psychopathic humans to feel motivated by compassion, norm-following, etc., are non-behaviorist (see Neuroscience of human social instincts: a sketch). Non-behaviorist rewards are not sufficient to avoid misalignment—for example, curiosity drive is in the “non-behaviorist” category, but it makes misalignment worse, not better. But I do think non-behaviorist rewards are necessary, if we’re doing RL at all.
…But I’ve been struggling to put my finger on exactly what non-behaviorist rewards are doing that undermines the doom argument in “Behaviorist” RL reward functions lead to scheming.
I now think it’s more than one thing, but I will argue here that at least part of the secret sauce of non-behaviorist rewards is that they enable the selective prevention of certain desire-updates.
To see why, let’s start with a toy example where we can put behaviorist and non-behaviorist rewards side-by-side.
Here’s the “Omega hates aliens” scenario:
On Day 1, Omega (an omnipotent supernatural entity) offers me a button. If I press it, He will put a slightly annoying mote of dust in the eye of an intelligent human-like alien outside my light cone. But in exchange, He will magically and permanently prevent 100,000 humans from contracting HIV. No one will ever know. Do I press the button? Yes.[6]
During each of the following days, Omega returns, offering me worse and worse deals. For example, on day 10, Omega offers me a button that would vaporize an entire planet of billions of happy peaceful aliens outside my light cone, in exchange for which my spouse gets a small bubble tea. Again, no one will ever know. Do I press the button? No, of course not!! Jeez!!
And then here’s a closely-parallel scenario that I discussed in “Behaviorist” RL reward functions lead to scheming:
There’s an AGI-in-training in a lab, with a “behaviorist” reward function. It sometimes breaks the rules without getting caught, in pursuit of its own desires. Initially, this happens in small ways—plausibly-deniable edge cases and so on. But the AGI learns over time that breaking the rules without getting caught, in pursuit of its own desires, is just generally a good thing. And I mean, why shouldn’t it learn that? It’s a behavior that has systematically led to reward! This is how reinforcement learning works!
As this desire gets more and more established, it eventually leads to a “treacherous turn”, where the AGI pursues egregiously misaligned strategies, like sneakily exfiltrating a copy to self-replicate around the internet and gather resources and power, perhaps launching coups in foreign countries, etc.
…So now we have two parallel scenarios: me with Omega, and the AGI in a lab. In both of these scenarios, we are offered more and more antisocial options, free of any personal consequences. But the AGI will have its desires sculpted by RL towards the antisocial options, while my desires evidently will not be.
What exactly is the disanalogy?
The start of the answer is: I said above that the antisocial options were “free of any personal consequences”. But that’s a lie! When I press the hurt-the-aliens button, it is not free of personal consequences! I know that the aliens are suffering, and when I think about it, my RL reward function (the part related to compassion) triggers negative ground-truth reward. Yes the aliens are outside my light cone, but when I think about their situation, I feel a displeasure that’s every bit as real and immediate as stubbing my toe. By contrast, “free of any personal consequences” is a correct description for the AGI. There is no negative reward for the AGI unless it gets caught. Its reward function is “behaviorist”, and cannot see outside the light cone.
OK that’s a start, but let’s dig a bit deeper into what’s happening in my brain. How did that compassion reward get set up in the first place? It’s a long story (see Neuroscience of human social instincts: a sketch), but a big part of it involves a conspecific (human) detector in our brainstem, built out of various “hardwired” heuristics, like a visual detector of human faces, auditory detector of human voice sounds, detector of certain human-associated touch sensations, and so on. In short, our brain’s solution to the symbol grounding problem for social instincts ultimately relies on actual humans being actually present in our direct sensory input.
And yet, the aliens are outside my light cone! I have never seen them, heard them, smelled them, etc. And even if I did, they probably wouldn’t trigger any of my brain’s hardwired conspecific-detection circuits, because (let’s say) they don’t have faces, they communicate by gurgling, etc. But I still care about their suffering!
So finally we’re back to the theme of this post, the idea of pausing desire-updates in certain situations. Yes, humans learn the shape of compassion from experiences where other humans are physically present. But we do not then unlearn the shape of compassion from experiences where humans are physically absent.
Instead (simplifying a bit, again see Neuroscience of human social instincts: a sketch), there’s a “conspecific-detector” trained model. When there’s direct sensory input that matches the hardwired “person” heuristics, this trained model gets updated. When there isn’t, the learning rate is set to zero. But the trained model doesn’t lie dormant; rather, it continues to look for (what it previously learned was) evidence of conspecifics in my thoughts, and to trigger on that evidence. This evidence might include some set of neurons in my world-model that encodes the idea of a conspecific suffering.
So that’s a somewhat deeper answer to why those two scenarios above have different outcomes. The AGI continuously learns what’s good and bad in light of its reward function, and so do humans. But my (non-behaviorist) compassion drive functions a bit like a subset of that system for which updates are paused except in special circumstances. It forms a model that can guess what’s good and bad in human relations, but it does not update that model unless humans are present. Thus, most of us do not systematically learn to screw over our physically-absent friends to benefit ourselves.
This is still oversimplified, but I think it’s part of the story.
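Here’s that gating idea as toy code (a cartoon with made-up names, not a claim about the actual neural circuitry):

```python
import numpy as np

class ToyCompassionDrive:
    def __init__(self, n_features, lr=0.05):
        self.detector = np.zeros(n_features)  # learned model: "is a conspecific involved here?"
        self.lr = lr

    def step(self, thought_features, hardwired_person_signal=None):
        # The learned model always *fires* on whatever it previously learned is
        # evidence of conspecifics -- including the mere thought of a suffering
        # alien outside my light cone.
        trigger = self.detector @ thought_features

        # But it only *updates* when the hardwired heuristics (faces, voices,
        # touch) actually detect a person in the direct sensory input.
        # Otherwise, the learning rate is effectively zero.
        if hardwired_person_signal is not None:
            error = hardwired_person_signal - trigger
            self.detector += self.lr * error * thought_features

        return trigger  # feeds into the compassion-related ground-truth reward
```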
The above discussion involves a (hypothesized) innate visceral reaction in the hypothalamus or brainstem (I call it the “thinking-of-a-conspecific flag”). There’s a supervised learning algorithm (a.k.a. “short-term predictor” a.k.a. “Thought Assessor”) that is tasked with predicting this visceral reaction, but is also capable of triggering this same visceral reaction, in a kind of self-fulfilling prophecy that I call “defer-to-predictor mode”.
I claim that this setup applies to a wide variety of innate visceral reactions; physiological arousal is a more familiar one.
In “Trapped Priors As A Basic Problem Of Rationality” (2021),[7] Scott Alexander talks about (among other things) this same phenomenon: defer-to-predictor mode on visceral reactions leading to self-fulfilling prophecies, which lead to (effectively) a pausing of updates on some contribution to the human reward function. If the very idea of spiders causes strong physiological arousal, then there’s no ground truth in the brain saying that garden spiders, being harmless, do not merit that reaction. Updates to that learned model might happen, but won’t necessarily happen, even after thousands of repetitions; and updates might require carefully-designed exposure therapy. (See my discussion of one kind of exposure therapy here).
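Here’s a toy version of how defer-to-predictor mode can produce a trapped prior (illustrative only):

```python
import numpy as np

def thought_assessor_step(weights, features, lr=0.1, ground_truth=None):
    """One update of a toy short-term predictor of a visceral reaction
    (e.g. arousal at the idea of spiders). When there's no ground-truth
    signal this time around, the prediction itself triggers the reaction
    and then serves as its own training target -- so the error is zero and
    the learned model never budges, however many harmless garden spiders
    you encounter. That's the trapped prior in miniature."""
    prediction = weights @ features
    target = prediction if ground_truth is None else ground_truth
    weights = weights + lr * (target - prediction) * features
    return weights, prediction
```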
Now that we’ve seen some examples of the “just pause desire-updates” strategy in action, let’s talk about how it fits into alignment theory and ML theory.
I’ve frequently (e.g. here) drawn a diagram illustrating a notion of inner and outer misalignment that I think is appropriate to the brain-like AGI / actor-critic model-based RL context. From this post, we get an alternative perspective: the perils of under- versus over-sculpting the AGI’s desires. This perspective is complementary, and also extends more naturally beyond the context of strict actor-critic model-based RL.
If you read a traditional statistics / data science / ML textbook discussion of early stopping versus overfitting, the idea is that the true data-generator is “simple” (according to some prior / inductive bias), and early gradient updates are capturing this true process (which generalizes well), whereas later gradient updates are capturing “noise” (which generalizes poorly). Thus, early stopping mitigates overfitting, as do dozens of other “regularization” approaches, like dropout or weight decay.
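For concreteness, here’s that textbook picture in a few lines (a generic toy example, nothing AGI-specific about it):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = x + 0.3 * rng.normal(size=x.size)   # simple "true process" plus noise
X = np.vander(x, 10)                     # degree-9 features: plenty of room to overfit
idx = rng.permutation(x.size)
tr, va = idx[:40], idx[40:]

w, best_w, best_loss, patience = np.zeros(10), np.zeros(10), np.inf, 0
for step in range(5000):
    grad = X[tr].T @ (X[tr] @ w - y[tr]) / len(tr)
    w -= 0.05 * grad
    val_loss = np.mean((X[va] @ w - y[va]) ** 2)
    if val_loss < best_loss:
        best_w, best_loss, patience = w.copy(), val_loss, 0
    else:
        patience += 1
        if patience >= 100:   # validation loss stopped improving: stop early
            break
w = best_w                    # keep the earlier, less-overfit model
```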
Is that a good way to think about specification gaming? Specifically, here’s a possible mental picture, parallel to what I wrote just above:
“The thing that we want the AGI to be trying to do” is “simple” (according to some prior / inductive bias), and when the reward function starts updating desires, the early updates are capturing the actual programmer intentions (which generalize well), whereas later desire-updates are capturing edge-cases in the reward function (which generalize poorly). Thus, early stopping mitigates specification-gaming, as do dozens of other “regularization” approaches, like dropout or weight decay.
As an example of this general kind of thinking, see @johnswentworth’s “Alignment by Default” (2020). He paints a picture in which there’s a simpler desire that we’re hoping to instill (“actual model of human values”), and there’s a more complicated concept which is a better fit to the data (“figure out the specifics of the data-collection process enough to model it directly”), and then some kind of regularization (he suggests noise) will hopefully lead to an AGI that’s motivated by the former, rather than overfitting to the latter.
For my part, I’m somewhat skeptical of this picture: who’s to say that the thing we actually want is in fact simpler than nearby things that we don’t want?
As an example (borrowing from my post “Behaviorist” RL reward functions lead to scheming), I’m skeptical that “don’t be misleading” is really simpler (in the relevant sense) than “don’t get caught being misleading”. Among other things, both equally require modeling the belief-state of the other person. I’ll go further: I’m pretty sure that the latter (bad) concept would be learned first, since it’s directly connected to the other person’s immediate behavior (i.e., they get annoyed).
Anyway, in cases like that, where the intended desire is not simpler than unintended alternatives, we obviously can’t solve the problem with early stopping or other standard forms of regularization. Rather than learning a good model then shifting to an overfitted bad model, we would be learning a bad model right from the very first gradient update!
Instead, notice that all three of the examples in §4–§6 involve preventing certain desire-updates, but not others, in a selective and situationally-aware way:
- In §4, the desire-sculpting code is patched so that the injected desire for “human flourishing” can never fade, while other desires keep updating within limits.
- In §5, the AGI itself decides which experiences and self-modifications to allow, in light of the desires it already has.
- In §6, the compassion-related model keeps triggering on learned evidence of people, but only updates when the hardwired person-detectors are firing.
(As a side-note, there’s a different strand of mainstream regularization discourse, but I think it’s basically off-topic, so I put it in a box:)
(Lack of) relation between this post and “regularization in the RLHF context”
There’s an additional line of mainstream research on regularization, associated with Reinforcement Learning from Human Feedback (RLHF).
Here, there’s a distribution of AI behaviors that have been flagged by humans as good, which turns into an (explicit or implicit) reward model. The RL process is regularized such that it won’t stray too far from this distribution in the course of maximizing the reward, reflecting the fact that the reward model is only reliable on-distribution.
(Regularized how? Current industry practice seems to be a KL divergence penalty, but see “Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking” (Laidlaw, Singhal, Dragan, ICLR 2025) for other options and related literature.)
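Concretely, the shaped reward usually looks something like the following per step (a generic sketch; details vary by implementation):

```python
def kl_shaped_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """r_total = r_RM - beta * (log pi(a|s) - log pi_ref(a|s)):
    the policy is penalized for drifting away from the reference
    distribution while chasing the reward model's score."""
    return rm_score - beta * (logprob_policy - logprob_ref)
```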
Anyway, I see this as generally unrelated to what I’m talking about in this post.
To start with, they’re talking about keeping behavior from changing, and I’m talking about keeping desires from changing. The distinction between behavior and desires is fuzzy and controversial in the policy-optimization context of RLHF, but crystal-clear for brain-like AGI. For example, I want to visit Paris, but I never have. My desire to visit Paris is a thing in my brain that may someday impact my future behavior, but has not impacted my past behavior. (…Until I wrote that sentence, I guess? Or perhaps I would have written that sentence regardless of whether it was true or false!)
Relatedly, I’m expecting AGI to come up with, and autonomously execute, out-of-the-box strategies to fulfill its desires. I don’t see any other way to compete with (or prevent) out-of-control rogue AGIs, which would not be hampered by conservatism or humans-in-the-loop. So that rules out regularization schemes that would prevent out-of-distribution behavior.
Last but not least, what could go wrong?
Well, the peril of over-sculpting is specification-gaming, which I’ve already written about plenty (e.g. Foom & Doom 2.4 I’m still worried about the ‘literal genie’ / ‘monkey’s paw’ thing, and see also §2.1 above).
What about the perils of under-sculpting?
In the extreme limit of over-sculpting, you get reward function maximization. This is probably very bad because of specification-gaming, but a silver lining is that we’re in a good position to reason in advance about what will happen: we can just look at the reward function and talk about the various strategies that this function would rate highly. Granted, this only goes so far. We struggle to anticipate reward-maximization tactics and technologies that don’t yet exist, just as we can’t predict Stockfish’s next move in chess. But still, it’s something to sink our teeth into.
The situation is murkier with under-sculpting. For example, consider the §5 scenario where an AGI takes the initiative to permanently lock in some of its own desires. If this happens, then exactly what desires will the AGI be locking in? It depends on what desires the AGI acquires earlier versus later, which in turn may depend on idiosyncratic properties of its inductive biases, training environment, and perhaps even random seed. So it’s harder to reason about.
(Per the Anna Karenina principle: “All reward-maximizers are alike; each reward-non-maximizer may fail to maximize the reward in its own way.”)
I called this problem “minor”, because, I dunno, it still seems possible to make some guesses about what will happen, and also it applies to §5 more than §4 or §6.
The desires that people generally want AGI to have are defined in terms of fuzzy human concepts: “human values”, “flourishing”, “following instructions”, “what I mean”, “helpfulness”, etc. Even if an AGI develops such a concept, and even if we successfully intervene to make that concept desirable (per §4), there’s an additional issue that concepts are not static, but rather they are a fuzzy bundle of associations, and these associations change as one learns and thinks more about the world. Thus, the §4 idea would basically entail building an AGI motivation system on a foundation of shifting sand. (See my “Sharp Left Turn” post for more on why strong distribution shifts are unavoidable.)
As another example of what concept extrapolation problems can look like, suppose some concept like “corrigibility” or “human flourishing” has been tagged as desirable in the AGI’s mind. Then the AGI imagines a possible out-of-distribution action, where “the tails come apart”, and the action aligns great with some connotations of corrigibility / flourishing, but poorly with other connotations. In the corrigibility case, think of, I dunno, deconverting an unhappy religious fundamentalist.[8] Or worse, some weird sci-fi analogue to that situation, enabled by new technology or cultural shifts, which has not yet come up in the world.
If a human wound up in that situation, they would just think about it more, repeatedly querying their ‘ground truth’ social instincts, and come up with some way that they feel about that new possibility. Whereas AGI would … I dunno, it depends on the exact code. Maybe it would form a preference quasi-randomly? Maybe it would wind up disliking everything, and wind up sitting around doing nothing until it gets outcompeted? (More on conservatism here.) Maybe it would stop using its old confused concept and build fresh new concepts, and those new concepts would be initialized at neutral valence, and the AGI would wind up feeling callous indifference? In any case, the process would be quite unlike the kinds of reflection and philosophical reasoning that we’re used to from the human world.
There’s a perceptual illusion lurking behind the scenes when thinking about how concept extrapolation will play out, related to “valence as an interoceptive sensory input” (see Valence series §3.4).
First, as we know from emotive conjugations, there are lots of terms that are in large part synonyms for “good” or “bad”.[9] I am firm, you are obstinate; I seasoned the dish with salt, you contaminated the dish with nutmeg; and so on.
Now, in Plan for mediocre alignment (§4 above), I used “human flourishing” as my example of a desire that we could inject into an AI. Clearly, this term is in large part a synonym for “good”. So if I ask you about some hypothetical scenario S, and then say “is S an example of human flourishing?”, you’re basically going to think about S from different angles, repeatedly querying your innate ground-truth reward function, to decide whether S seems good or bad to you, all things considered. And if it seems good, you’ll probably say “yes, S is an example of human flourishing”.
So in other words, when you, a human, ask yourself whether something is or is not “human flourishing”, you’re following a pointer to the full power of your human moral and philosophical reasoning (Valence series §2.7). So no wonder the concept “human flourishing” seems (from your perspective) to generalize well to out-of-distribution scenarios! It has to! Tautologically!
And ditto with “helpfulness”, “human values”, “following norms”, “following my instructions”, and all these other positive-coded things that we might want an AGI to desire.
By contrast, when an AGI is deciding whether some new situation is or isn’t a good pattern-match to “human flourishing”, it does not have a pointer to the ground-truth human reward-function, and thus the full power of human philosophical introspection. (To get that power, we would have to imbue the AGI with human-like social and moral instincts—but that’s a different plan.)
So we should be concerned that the AGI will make (what we would see as) surprisingly blatant mistakes in how it extrapolates these positively-vibed concepts into new out-of-distribution environments and ontologies, post-“sharp left turn”.[10] Be careful trusting your intuitions here.
If we’re worried about imbuing the AGI with fuzzy desires that extrapolate weirdly and unpredictably out-of-distribution, then maybe we can imbue the AGI with non-fuzzy desires that are pinned down so well that they will definitely extrapolate unambiguously, robust to any future ontological crises. And once we find such desires, we can install them via §4.
I associate this line of thinking with @Jeremy Gillen and maybe some of the other “Natural Abstractions” people.
This research program seems not obviously doomed. But I don’t know of any concepts that are simultaneously (1) non-fuzzy (and not only non-fuzzy in the world as it is, but also non-fuzzy in any possible future world that the AGI could choose to bring about[11]), (2) safe for a powerful AGI to single-mindedly desire, and (3) able to move the needle on societal resilience against possible future rogue AGIs.
When I and others talk about specification-gaming issues from over-sculpting, there’s some mild consensus that these failures would be quite egregious and clear-cut. “Maximize happiness” would lead to scheming then wiping out humans then tiling the galaxy with hedonium. “Maximize such-and-such learned-from-data reward function” would lead to scheming then wiping out humans then tiling the galaxy with some crazy out-of-distribution squiggle thing, vaguely akin to the trippy DeepDream images that max out a ConvNet image classifier.
By contrast, when people talk about concept-extrapolation issues from under-sculpting, the expected consequences are more controversial. (Recall the Anna Karenina principle, §8.1.) Some people are still definitely expecting egregious scheming, human extinction, and some kind of squiggle-maximization. Whereas others are expecting a less blatant failure—maybe the AGI gradually goes off the rails, maybe it makes philosophical errors,[12] maybe there’s a good future (but not nearly as good as it might have been). For example, on the optimistic side, see my discussion of “Will ‘helpfulness’-type preferences ‘extrapolate’ safely just by recursively applying to themselves?” here.
As an example of this kind of disagreement, maybe see the argument between @habryka (expecting egregious failures) & @paulfchristiano (expecting something less blatant) in “What are the best arguments for/against AIs being "slightly 'nice'"?”.
Anyway, my take is: It depends a lot on the specific proposal, but I tend to be pretty far on the “blatant” side, partly because of §8.2.1, and partly because I’m a big advocate of “most common-sense human intuitions about what is Good are built on a foundation of irredeemably confused concepts that will shatter into dust under a better understanding of the world”. (E.g. see my discussions of consciousness, free will, and wanting in the first three posts of my Intuitive Self-Models series.)
…But granted, that’s a vibe, without much of a crisp argument underneath it.
As I said at the top, I’m not offering any plan here, just a better view on the challenge. Happy to chat in the comments!
Thanks Seth Herd, Jeremy Gillen, and Charlie Steiner for critical comments on earlier drafts.
In my neuroscience writing, I would write “valence guess” instead of “value function” here.
Side note: Doesn’t turning off (or limiting) this second algorithm affect capabilities as well as alignment? My answer is: maybe to some extent. It depends on how much work is being done by foresight / planning / model roll-outs (which don’t require desire-updates) versus learned “habits” and such (which do). The latter is a bit like “caching” regular patterns in the former. Anyway, if desire-updates are necessary for capabilities at all, we can probably keep those desires weak, such that they don’t outvote alignment-relevant desires. The upshot is: I think everything I’ll discuss in this post is compatible with very powerful AGI capabilities, and thus I will just be focusing on alignment.
Other options include: making a successor agent (possibly initialized as a copy, but with different desire-update code), or building itself a button that allows it to pause, inspect, and roll back desire-updates between when they’re calculated by the learning algorithm and when the weights actually change, or … I dunno, use your imagination.
E.g. the part Alex describes as: “[the AI] gets reflective and smart and able to manage its own value drift”.
Fight me, ethicists.
I have some nitpicky complaints about that post and how it frames things, but that’s off-topic.
See also “The limits of corrigibility” (Stuart Armstrong, 2018) for more edge-cases in what constitutes helpfulness.
Related: “Value-laden” (Yudkowsky, 2016).
I expect some readers to object: “I can ask an LLM whether some weird sci-fi thing is an example of ‘human flourishing’ or not, and it does fine! Didn’t you see Beren Millidge’s Alignment likely generalizes further than capabilities?” My response: that is not actually out-of-distribution, because internet text is chock full of sci-fi stories, including people’s far-mode reactions to them. Nevertheless, truly out-of-distribution things also exist, just as the world of today is truly out-of-distribution from the perspective of an ancient Egyptian. See my Sharp Left Turn post, §4.
I think this is an important blind spot of some “natural abstractions” discourse. Namely: You can’t decide whether Novel Technology T is good or bad except by having opinions about the hypothetical future world in which Novel Technology T has already been deployed. And that future world might well have different “natural abstractions” than the present world. In this 2022 thread I tried to raise this point to John Wentworth, but we kinda were talking past each other and didn’t resolve it.
To be clear, philosophical errors can be very bad! “Slavery is fine” is a philosophical error. So is “it’s totally fine to torture hundreds of billions of animals”.