Review

Short version

Humans have an innate motivation ("preference fulfillment", PF) to fulfill the preferences of those they care about. It corresponds to at least some of the senses of the word "love", as well as related words such as "kindness" and "compassion".

I hypothesize that it works by simulating the other person and predicting what they would want or how they would like to be treated. PF is when you take your simulation of what other people would want and add an extra component that makes you intrinsically value outcomes that your simulation predicts the other people would prefer.

I also hypothesize that this is the same kind of simulation that forms our ability to work as a social species in the first place. A mental simulation process is active in virtually every situation where we interact with other people, such as in a grocery store. People use masks/roles/simulations to determine the right behavior in any social situation, running simulations of how others would react to various behaviors. These simulations involve both the actual people present in the situation as well as various other people whose opinions we've internalized and care about. The simulations generally allow people to engage in interactions by acting the way a normal person would in a given situation.

Once you have this kind of a simulation, constantly running in basically any social situation, it’s likely already exhibiting the PF drive to a weak degree. Doing things that we expect to fulfill other people’s preferences often feels intrinsically nice, even if the person in question is a total stranger. So does wordless coordination in general, as evidenced by the popularity of things like dance.

If this is true, capabilities progress may then be closely linked to alignment progress. Getting AIs to be better at following instructions requires them to simulate humans better. Once you have an AI that can simulate human preferences, you already have most of the machinery required for having PF as an intrinsic drive. This is contrary to the position that niceness is unnatural. The preference fulfillment hypothesis is that niceness/PF is a natural kind that will be relatively easy to get out of any AI smart enough to understand what humans want it to do. This implies that constructing aligned AIs might be reasonably easy, in the sense that most of the work necessary for it will be a natural part of progress in capabilities.

Long version

The preference fulfillment hypothesis

Imagine someone who you genuinely care about. You probably have some kind of a desire to fulfill their preferences in the kind of way that they would like their preferences to be fulfilled.

It might be very simple ("I like chocolate but they like vanilla, so I would prefer for them to get vanilla ice cream even when I prefer chocolate"), but it might get deep into pretty fundamental differences in preferences and values ("I'm deeply monogamous and me ever being anything else would go against my sacred value, but clearly non-monogamy is what works for my friend and makes them happy so I want them to continue living that way"). 

It's not necessarily absolute - some things you might still find really upsetting and you'd still want to override the other person’s preferences in some cases - but you can at least feel the "I want them to satisfy their preferences the way they themselves would like their preferences to be satisfied" thing to some extent.

I think this kind of desire is something like its own distinct motivation in the human mind. It can easily be suppressed by other kinds of motivations kicking in - e.g. if the other person getting what they wanted made you feel jealous or insecure, or if their preferences involved actively hurting you. But if those other motivations aren’t blocking it, it can easily bubble up. Helping other people often just feels intrinsically good, even if you know for sure that you yourself will never get any benefit out of it (e.g. holding a door open for a perfect stranger in a city you’re visiting and will probably never come back to).

The motivation seems to work by something like simulating the other person based on what you know of them (or people in general), and predicting what they would want in various situations. This is similar to how "shoulder advisors" are predictive models that simulate how someone you know would react in a particular situation, and also somewhat similar to how large language models simulate the way a human would continue a piece of writing. The thought of the (simulated/actual) person getting what they want (or just existing in the first place) then comes to be experienced as intrinsically pleasing.

A friend of mine collects ball-jointed dolls (or at least used to); I don’t particularly care about them, but I like the thought of my friend collecting them and having them on display, because I know it’s important for my friend. If I hear about my friend getting a new doll, then my mental simulation of her predicts that she will enjoy it, and that simulated outcome makes me happy. If I were to see some doll that I thought she might like, I would enjoy letting her know, because my simulation of her would appreciate finding out about that doll.

If I now think of her spending time with her hobby and finding it rewarding, then I feel happy about that. Basically, I'm running a mental simulation of what I think she's doing, and that simulation makes me happy.

While I don't know exactly how, this algorithm seems corrigible. If it turned out that my friend had lost her interest in ball-jointed dolls, then I’d like to know that so that I could better fulfill her preferences.

The kinds of normal people who aren't on Less Wrong inventing needlessly convoluted technical-sounding ways of expressing everyday concepts would probably call this thing "love" or “caring”. And genuine love (towards a romantic partner, close friend, or child/parent) definitely involves experiencing what I have just described. Terms such as kindness and compassion are also closely related. To avoid bringing in possibly unwanted connotations from those common terms, I’ll call this thing “preference fulfillment” or PF for short.

Preference fulfillment: a motivational drive that simulates the preferences of other people (or animals) and associates a positive reward with the thought of them getting their preferences fulfilled. It also associates a positive reward with the thought of them merely existing.

I hypothesize that PF (or the common sense of the word “love”) is merely adding one additional piece (the one that makes you care about the simulations) to an underlying prediction and simulation machinery that is already there and exists for making social interaction and coordination possible in the first place.
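To make the "one additional piece" claim concrete, here is a toy sketch. It is only an illustration under assumptions I'm making up for the example: the names `OtherModel`, `total_reward`, `choose`, and `care_weight`, as well as all the numbers, are hypothetical, and nothing here is a claim about how brains or actual AI systems implement this. The simulation part (predicting what the other person would like) is the same whether or not I care; PF is just the extra term that makes the predicted satisfaction rewarding in itself.

```python
from dataclasses import dataclass, field


@dataclass
class OtherModel:
    """Hypothetical stand-in for my mental simulation of another person."""
    # Predicted liking per outcome, in arbitrary units (assumed values).
    predicted_liking: dict = field(default_factory=dict)

    def predict(self, outcome):
        # Unknown outcomes are treated as neutral.
        return self.predicted_liking.get(outcome, 0.0)

    def update(self, outcome, liking):
        # New evidence about the person simply overwrites the old prediction.
        self.predicted_liking[outcome] = liking


def total_reward(own_liking, outcome, model, care_weight):
    # Own reward plus the PF term: the simulated person's predicted satisfaction,
    # scaled by how much I intrinsically care about it.
    return own_liking.get(outcome, 0.0) + care_weight * model.predict(outcome)


def choose(own_liking, outcomes, model, care_weight):
    # Pick the outcome with the highest combined reward.
    return max(outcomes, key=lambda o: total_reward(own_liking, o, model, care_weight))


# Which ice cream should my friend get? (Toy numbers.)
my_liking = {"chocolate": 1.0, "vanilla": 0.2}
friend = OtherModel({"chocolate": 0.1, "vanilla": 1.0})
flavors = ["chocolate", "vanilla"]

indifferent_pick = choose(my_liking, flavors, friend, care_weight=0.0)  # "chocolate"
caring_pick = choose(my_liking, flavors, friend, care_weight=2.0)       # "vanilla"

# If I learn that my friend's tastes have changed, the same drive now favors
# a different outcome, which is the corrigibility-like property described above.
friend.update("vanilla", -0.5)
friend.update("chocolate", 0.8)
updated_pick = choose(my_liking, flavors, friend, care_weight=2.0)      # "chocolate"
```

Note that with `care_weight=0.0` the model of the friend still exists and could still be used strategically; the only thing that changes between the two agents is whether its output feeds into the reward.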

Cooperation requires simulation

In this section, I’ll say a few words about why running these kinds of simulations of other people seems to be a prerequisite for any kind of coordination we do daily.

Under the “virtual bargaining” model of cooperation, people coordinate without communication by behaving on the basis of what they would agree to do if they were explicitly to bargain, provided the agreement that would arise from such discussion is commonly known. 

A simple example is that of two people carrying a table across the room: who should grab which end of the table? Normally, the natural solution is for each to grab the side that minimizes the joint distance moved (see picture). However, if one of the people happens to be a despot and the other a servant, then the natural solution is for the despot to grab the end that’s closest to them, forcing the servant to walk the longer distance.

This kind of coordination can happen automatically and wordlessly as long as we have some model of the other person’s preferences. Mutual simulation is also still required even if the servant and the despot hate each other - in order to not get punished for being a bad servant, the servant still needs to simulate the despot’s desires. And the despot needs to simulate the servant’s preferences in order to know what the servant will do in different situations.
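The table example can be written down as a tiny optimization problem. The sketch below is only an illustration: the function names (`cooperative_assignment`, `despot_assignment`), the 2D coordinates, and the positions are all assumptions invented for the example rather than anything taken from the virtual bargaining literature. Between equals, the assignment that minimizes the joint distance wins; with a despot, the despot's own distance is minimized first and the servant takes whichever end is left.

```python
import math
from itertools import permutations


def cooperative_assignment(people, ends):
    """Assign each person a table end so that the total walking distance is minimal.

    people: dict mapping name -> (x, y) position
    ends:   list of (x, y) positions, one per person
    """
    names = list(people)
    best = min(
        permutations(ends),
        key=lambda order: sum(math.dist(people[n], e) for n, e in zip(names, order)),
    )
    return dict(zip(names, best))


def despot_assignment(people, ends, despot):
    """The despot takes the nearest end; everyone else gets what remains."""
    nearest = min(ends, key=lambda e: math.dist(people[despot], e))
    remaining = list(ends)
    remaining.remove(nearest)
    others = [n for n in people if n != despot]
    assignment = {despot: nearest}
    assignment.update(zip(others, remaining))
    return assignment


# Toy layout: two people and a table whose ends sit at x=3 and x=5.
people = {"servant": (0.0, 0.0), "despot": (3.5, 0.0)}
ends = [(3.0, 0.0), (5.0, 0.0)]

as_equals = cooperative_assignment(people, ends)
# {'servant': (3.0, 0.0), 'despot': (5.0, 0.0)}, joint distance 4.5

with_despot = despot_assignment(people, ends, despot="despot")
# {'despot': (3.0, 0.0), 'servant': (5.0, 0.0)}, joint distance 5.5
```

In both cases each party has to run the same computation over the other's position and status to arrive at the same answer without talking, which is the sense in which coordination requires simulation.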

I think this kind of a mental simulation is on some level active in basically every situation where we interact with other people. If you are in a grocery store, you know not to suddenly take off your clothes and start dancing in the middle of the store, because you know that the other people would stare at you and maybe call the police. You also know how you are expected to interact with the clerk, and the steps involved in the verbal dance of “hello how are you, yes that will be all, thank you, have a nice day”. 

As a child, you also witnessed how adults acted in a store. You are probably also running some simulation of “how does a normal kind of a person (like my parents) act in a grocery store”, and intuitively trying to match that behavior. In contrast, if you’re suddenly put into a situation where you don’t have a good model of how to act (maybe in a foreign country where the store seems to act differently from how you’re used to) and can’t simulate the reactions of other people in advance, you may find yourself feeling anxious.

ChatGPT may be an alien entity wearing a human-like mask. Meanwhile, humans may be non-alien entities wearing person-like masks. It’s interesting to compare the Shoggoth-ChatGPT meme picture below, with Kevin Simler’s comic of personhood.

Kevin writes:

A person (as such) is a social fiction: an abstraction specifying the contract for an idealized interaction partner. Most of our institutions, even whole civilizations, are built to this interface — but fundamentally we are human beings, i.e., mere creatures. Some of us implement the person interface, but many of us (such as infants or the profoundly psychotic) don't. Even the most ironclad person among us will find herself the occasional subject of an outburst or breakdown that reveals what a leaky abstraction her personhood really is.

And offers us this comic:


So for example, a customer in a grocery store will wear the “grocery store shopper” mask; the grocery store clerk will wear the “grocery store clerk” mask. That way, both will act the way that’s expected of them, rather than stripping their clothes off and doing a naked dance. And this act of wearing a mask seems to involve running a simulation of what “a typical grocery store person” would do and how other people would react to various behaviors in the store. We’re naturally wired to use these masks/roles/simulations to determine the right behavior in any social situation.

Some of the other people being simulated are the actual other people in the store, others are various people whose opinions you’ve internalized and care about. E.g. if you ever had someone shame you for a particular behavior, even when that person isn’t physically present, a part of your mind may be simulating that person as an “inner critic” who will virtually shame you for the thought of any such behavior. 

And even though people do constantly misunderstand each other, we don't usually descend to Outcome Pump levels of misunderstanding (where you ask me to get your mother out of a burning building and I blow up the building so that she gets out but is also killed in the process, because you never specified that you wanted her to get out alive). The much more common scenario is the countless minor interactions of the type where people just go to a grocery store and act the way a normal grocery store shopper would, or where two people glance at a table that needs to be carried and wordlessly know who should grab which end.

Preference fulfillment may be natural

PF then, is when you take your already-existing simulation of what other people would want, and just add a bit of an extra component that makes you intrinsically value those people getting what your simulation says they want. In the grocery store, it’s possible that you’re just trying to fulfill the preferences of others because you think you’d be shamed if you didn’t. But if you genuinely care about someone, then you actually intrinsically care about seeing their preferences fulfilled. 

Of course, it’s also possible to genuinely care about other people in a grocery store (as well as to be afraid of a loved one shaming you). In fact, correctly performing a social role can feel enjoyable by itself.

Even when you don’t feel like you love someone in the traditional sense of the word, some of the PF drive seems to be active in most social situations. Wordlessly coordinating on things like how to carry the table or how to move can feel intrinsically satisfying, assuming that there are no negative feelings such as fear blocking the satisfaction. (At least in my personal experience, and also evidenced by the appeal of activities such as dance.)

The thesis that PF involves simulating others + intrinsically valuing the satisfaction of their preferences stands in contrast with models such as the one in "Niceness is Unnatural", which holds that 

the specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment.

The preference fulfillment hypothesis is that the exact details of when niceness/kindness/compassion/love/PF is allowed to express itself are indeed very contingent on the specifics of our ancestral environment. That is, our brains have lots of complicated rules for when to experience PF towards other people, and when to feel hate/envy/jealousy/fear/submission/dominance/transactionality/etc. instead, and the details of those rules are indeed shaped by the exact details of our evolutionary history. It’s also true that the specific social roles that we take are very contingent on the exact details of our evolution and our culture.

But the motivation of PF (“niceness”) itself is simple and natural - if you have an intelligence that is capable of acting as a social animal and doing what other social animals ask from it, then it already has most of the machinery required for it to also implement PF as a fundamental intrinsic drive. If you can simulate others, you only need to add the component that intrinsically values those simulations getting what they want.

This implies that capabilities progress may be closely linked to alignment progress. Getting AIs to be better at following instructions requires them to simulate humans better, so as to understand what exactly would satisfy the preferences of the humans. The way we’re depicting large language models as shoggoths wearing a human mask suggests that they are already starting to do so. While they may often “misunderstand” your intent, they already seem to be better at it than a pure Outcome Pump would be.

If the ability to simulate others in a way sufficient to coordinate with them forms most of the machinery required for PF, then capabilities progress might deliver most of the progress necessary for alignment.

Some kind of a desire to simulate and fulfill the desires of others seems to show up very early. Infants have it. Animals being trained have their learning accelerated once they figure out they're being trained and start proactively trying to figure out what the trainer intends. Both point to these being simple and natural competencies.

Humans are often untrustworthy because of all the conflicting motivations and fears they're running. (“If I feel insecure about my position and the other person seems likely to steal it, suppress love and fear/envy/hate them instead.”) However, an AI wouldn't need to exhibit any of the evolutionary urges for backstabbing and the like. We could take the prediction + love machinery and make that the AI’s sole motivation (maybe supplemented by some other drives such as intrinsic curiosity to boost learning).

On the other hand

Of course, this does not solve all problems with alignment. A huge chunk of how humans simulate each other seems to make use of structural similarities. Or as "Niceness is Unnatural" also notes:

It looks pretty plausible to me that humans model other human beings using the same architecture that they use to model themselves. This seems pretty plausible a-priori as an algorithmic shortcut — a human and its peers are both human, so machinery for self-modeling will also tend to be useful for modeling others — and also seems pretty plausible a-priori as a way for evolution to stumble into self-modeling in the first place ("we've already got a brain-modeler sitting around, thanks to all that effort we put into keeping track of tribal politics").

Under this hypothesis, it's plausibly pretty easy for imaginations of others’ pain to trigger pain in a human mind, because the other-models and the self-models are already in a very compatible format.

This seems true to me. On the other hand, LLMs are definitely running a very non-humanlike cognitive architecture, and seem to at least sometimes manage a decent simulation. People on the autistic spectrum may also have the experience of understanding other people better than neurotypicals do. The autistics had to compensate for their lack of “hardware-accelerated” intuitive social modeling by coming up with explicit models of what drives the behavior of other people, until they got better at it than people who never needed to develop those models. And humans often seem to have significant differences [12] in how their minds work, but still manage to model each other decently, especially if someone tells them about those differences so that they can update their models.

Another difficulty is that humans also seem to incorporate various ethical considerations into their model - e.g. we might feel okay with sometimes overriding the preferences of a young child or a mentally ill person, out of the assumption that their future self would endorse and be grateful for it. Many of these considerations seem strongly culturally contingent, and don’t seem to have objective answers.

And of course, even though humans are often pretty good at modeling each other, it’s also the case that they still frequently fail and mispredict what someone else would want. Just because you care about fulfilling another person's preferences does not mean that you have omniscient access to them. (It does seem to make you corrigible with regard to fulfilling them, though.)

I sometimes see people suggesting things like “the main question is whether AI will kill everyone or not; compared to that, it’s pretty irrelevant which nation builds the AI first”. On the preference fulfillment model, it might be the other way around. Maybe it’s relatively easy to make an AI that doesn’t want to kill everyone, as long as you set it to fulfill the preferences of a particular existing person who doesn’t want to kill everyone. But maybe it’s also easy to anchor it into the preferences of one particular person or one particular group of people (possibly by some process analogous to how children initially anchor into the desires of their primary caregivers), without caring about the preferences of anyone else. In that case, it might impose the values of that small group on the world, where those values might be arbitrarily malevolent or just indifferent towards others. 


 


PF then, is when you take your already-existing simulation of what other people would want, and just add a bit of an extra component that makes you intrinsically value those people getting what your simulation says they want. … This implies that constructing aligned AIs might be reasonably easy, in the sense that most of the work necessary for it will be a natural part of progress in capabilities.

Seems to me that the following argument is analogous:

A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or more specifically, “what humans mean when they say the words ‘not killing everyone’”). We just add a bit of an extra component that makes the AGI intrinsically value that concept. This implies that capabilities progress may be closely linked to alignment progress.

Or ditto where “not killing everyone” is replaced by “helpfulness” or “CEV” or whatever. Right?

So I’m not clear on why PF would imply something different about the alignment-versus-capabilities relationship from any of those other things.

Do you agree or disagree?

Anyway, for any of these, I think “the bit of an extra component” is the rub. What’s the component? How exactly does it work? Do we trust it to work out of distribution under optimization pressure?

In the PF case, the unsolved problems [which I am personally interested in and working on, although I don’t have any great plan right now] IMO are more specifically: (1) we need to identify which thoughts are and aren’t empathetic simulations corresponding to PF; (2) we need the AGI to handle edge-cases in a human-like (as opposed to alien) way, such as distinguishing doing-unpleasant-things-that-are-good-ideas-in-hindsight from being-brainwashed, or the boundaries of what is or isn’t human, or what if the human is drunk right now, etc. I talk about those a lot more here and here, see also shorter version here.

A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or more specifically, “what humans mean when they say the words ‘not killing everyone’”). We just add a bit of an extra component that makes the AGI intrinsically value that concept. This implies that capabilities progress may be closely linked to alignment progress.

Some major differences off the top of my head:

  • Picking out a specific concept such as "not killing everyone" and making the AGI specifically value that seems hard. I assume that the AGI would have some network of concepts and we would then either need to somehow identify that concept in the mature network, or design its learning process in such a way that the mature network would put extra weight on that. The former would probably require some kinds of interpretability tools for inspecting the mature network and making sense of its concepts, so is a different kind of proposal. As for the latter, maybe it could be done, but any very specific concepts don't seem to have an equally simple/short/natural algorithmic description as simulating the preferences of others seems to have, so it'd seem harder to specify.
  • The framing of the question also implies to me that the AGI also has some other pre-existing set of values or motivation system besides the one we want to install, which seems like a bad idea since that will bring the different motivation systems into conflict and create incentives to e.g. self-modify or otherwise bypass the constraint we've installed.
  • It also generally seems like a bad idea to go manually poking around the weights of specific values and concepts without knowing how they interact with the rest of the AGI's values/concepts. Like if we really increase the weight of "don't kill everyone" but don't look at how it interacts with the other concepts, maybe that will lead to a With Folded Hands scenario when the AGI decides that letting people die by inaction is also killing people and it has to prevent humans from doing things that might kill them. (This is arguably less of a worry for something like "CEV", but we don't even seem to know what exactly CEV should be, so I don't know how we'd put that in.)

Thanks! Hmm, I think we’re mixing up lots of different issues:

  • 1. Is installing a PF motivation into an AGI straightforward, based on what we know today? 

I say “no”. Or at least, I don’t currently know how you would do that, see here. (I think about it a lot; ask me again next year. :) )

If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I don’t expect either of those to happen (well, at least not the former, and/or not directly). Things can emerge inside a trained model instead of being in the source code, and if so, then finding them is tricky.

  • 2. Will installing a PF motivation into an AGI be straightforward in the future “by default” because capabilities research will teach us more about AGI than we know today, and/or because future AGIs will know more about the world than AIs today?

I say “no” to both. For the first one, I really don’t think capabilities research is going to help with this, for reasons here. For the second one, you write in OP that even infants can have a PF motivation, which seems to suggest that the problem should be solvable independent of the AGI understanding the world well, right?

  • 3. Is figuring out how to install a PF motivation a good idea?

I say “yes”. For various reasons I don’t think it’s sufficient for the technical part of Safe & Beneficial AGI, but I’d rather be in a place where it is widely known how to install PF motivation, than be in a place where nobody knows how to do that.

  • 4. Independently of which is a better idea, is the technical problem of installing a PF motivation easier, harder, or the same difficulty as the technical problem of installing a “human flourishing” motivation?

I’m not sure I care. I think we should try to solve both of those problems, and if we succeed at one and fail at the other, well I guess then we’ll know which one was the easier one. :-P

That said, based on my limited understanding right now, I think there’s a straightforward method kinda based on interpretability that would work equally well (or equally poorly) for both of those motivations, and a less-straightforward method based on empathetic simulation that would work for human-style PF motivation and maybe wouldn’t be applicable to “human flourishing” motivation. I currently feel like I have a better understanding of the former method, and more giant gaps in my understanding of the latter method. But if I (or someone) does figure out the latter method, I would have somewhat more confidence (or, somewhat less skepticism) that it would actually work reliably, compared to the former method.

Thanks, this seems like a nice breakdown of issues!

If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I don’t expect either of those to happen (well, at least not the former, and/or not directly). Things can emerge inside a trained model instead of being in the source code, and if so, then finding them is tricky.

So I don't think that there's going to be hand-written source code with slots for inserting variables. When I expect it to have a "natural" algorithmic description, I mean natural in a sense that's something like "the kinds of internal features that LLMs end up developing in order to predict text, because those are natural internal representations to develop when you're being trained to predict text, even though no human ever hand-coded them or even knew what they would be before inspecting the LLM internals after the fact".

Phrased differently, the claim might be something like "I expect that if we develop more advanced AI systems that are trained to predict human behavior and to act in a way that they predict to please humans, then there is a combination of cognitive architecture (in the sense that "transformer-based LLMs" are a "cognitive architecture") and reward function that will naturally end up learning to do PF because that's the kind of thing that actually does let you best predict and fulfill human preferences".

The intuition comes from something like... looking at LLMs, it seems like language was in some sense "easy" or "natural" - just throw enough training data at a large enough transformer-based model, and a surprisingly sophisticated understanding of language emerges. One that probably ~nobody would have expected just five years ago. In retrospect, maybe this shouldn't have been too surprising - maybe we should expect most cognitive capabilities to be relatively easy/natural to develop, and that's exactly the reason why evolution managed to find them.

If that's the case, then it might be reasonable to assume that maybe PF could be the same kind of easy/natural, in which case it's that naturalness which allowed evolution to develop social animals in the first place. And if most cognition runs on prediction, then maybe the naturalness comes from something like there only being relatively small tweaks in the reward function that will bring you from predicting & optimizing your own well-being to also predicting & optimizing the well-being of others.

If you ask me what exactly that combination of cognitive architecture and reward function is... I don't know. Hopefully, e.g. your research might one day tell us. :-) The intent of the post is less "here's the solution" and more "maybe this kind of a thing might hold the solution, maybe we should try looking in this direction".

  • 2. Will installing a PF motivation into an AGI be straightforward in the future “by default” because capabilities research will teach us more about AGI than we know today, and/or because future AGIs will know more about the world than AIs today?

I say “no” to both. For the first one, I really don’t think capabilities research is going to help with this, for reasons here. For the second one, you write in OP that even infants can have a PF motivation, which seems to suggest that the problem should be solvable independent of the AGI understanding the world well, right?

I read your linked comment as an argument for why social instincts are probably not going to contribute to capabilities - but I think that doesn't establish the opposite direction of "might capabilities be necessary for social instincts" or "might capabilities research contribute to social instincts"?

If my model above is right, that there's a relatively natural representation of PF that will emerge with any AI systems that are trained to predict and try to fulfill human preferences, then that kind of a representation should emerge from capabilities researchers trying to train AIs to better fulfill our preferences.

  • 3. Is figuring out how to install a PF motivation a good idea?

I say “yes”.

You're probably unsurprised to hear that I agree. :-)

  • 4. Independently of which is a better idea, is the technical problem of installing a PF motivation easier, harder, or the same difficulty as the technical problem of installing a “human flourishing” motivation?

What do you have in mind with a "human flourishing" motivation?

What do you have in mind with a "human flourishing" motivation?

An AI that sees human language will certainly learn the human concept “human flourishing”, since after all it needs to understand what humans mean when they utter that specific pair of words. So then you can go into the AI and put super-positive valence on (whatever neural activations are associated with “human flourishing”). And bam, now the AI thinks that the concept “human flourishing” is really great, and if we’re lucky / skillful then the AI will try to actualize that concept in the world. There are a lot of unsolved problems and things that could go wrong with that (further discussion here), but I think something like that is not entirely implausible as a long-term alignment research vision.
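As a very rough toy picture of what "putting valence on a concept" could mean, consider the sketch below. Everything in it is an assumption made up for illustration: the plan names, the concept labels, the numbers, and the idea that a plan's value is a simple weighted sum (`plan_value`) are all hypothetical. In particular, a trained network does not come with a labeled dictionary key for "human flourishing"; finding the relevant activations is part of the unsolved problem.

```python
def plan_value(concept_activations, valence):
    # Toy assumption: a plan's value is the valence-weighted sum of how strongly
    # it activates each concept.
    return sum(valence.get(c, 0.0) * a for c, a in concept_activations.items())


def pick_plan(plans, valence):
    # Choose the plan whose activated concepts carry the most total valence.
    return max(plans, key=lambda name: plan_value(plans[name], valence))


# Hypothetical plans, described by how strongly each one activates two concepts.
plans = {
    "maximize_clicks": {"engagement": 0.9, "human_flourishing": 0.1},
    "fund_clinics": {"engagement": 0.2, "human_flourishing": 0.8},
}

# Before the intervention, "human flourishing" is understood but not valued.
valence = {"engagement": 1.0, "human_flourishing": 0.0}
before = pick_plan(plans, valence)  # "maximize_clicks"

# The intervention: attach a large positive valence to that one concept.
boosted = {**valence, "human_flourishing": 10.0}
after = pick_plan(plans, boosted)   # "fund_clinics"
```

The concept representation is identical before and after; only the weight attached to it changes, and with it the choice of plan.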

I guess the anthropomorphic analog would be: try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says to you: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape. “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you’re on some psychedelic at the time, or whatever.)

How would that event change your motivations? Well, you’re probably going to spend a lot more time gazing at the moon when it’s in the sky. You’re probably going to be much more enthusiastic about anything associated with the moon. If there are moon trading cards, maybe you would collect them. If NASA is taking volunteers to train as astronauts for a lunar exploration mission, maybe you would be first in line. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that.

Now by the same token, imagine we do that kind of thing for an extremely powerful AGI and the concept of “human flourishing”. What actions will this AGI then take? Umm, I don’t know really. It seems very hard to predict. But it seems to me that there would be a decent chance that its actions would be good, or even great, as judged by me.

I read your linked comment as an argument for why social instincts are probably not going to contribute to capabilities - but I think that doesn't establish the opposite direction of "might capabilities be necessary for social instincts" or "might capabilities research contribute to social instincts"?

Sorry, that’s literally true, but they’re closely related. If answering the question “What reward function leads to human-like social instincts?” is unhelpful for capabilities, as I claim it is, then it implies both (1) my publishing such a reward function would not speed capabilities research, and (2) current & future capabilities researchers will probably not try to answer that question themselves, let alone succeed. The comment I linked was about (1), and this conversation is about (2).

If my model above is right, that there's a relatively natural representation of PF that will emerge with any AI systems that are trained to predict and try to fulfill human preferences, then that kind of a representation should emerge from capabilities researchers trying to train AIs to better fulfill our preferences.

Sure, but “the representation is somewhere inside this giant neural net” doesn’t make it obvious what reward function we need, right? If you think LLMs are a good model for future AGIs (as most people around here do, although I don’t), then I figure those representations that you mention are already probably present in GPT-3, almost definitely to a much larger extent than they’re present in human toddlers. For my part, I expect AGI to be more like model-based RL, and I have specific thoughts about how that would work, but those thoughts don’t seem to be helping me figure out what the reward function should be. If I had a trained model to work with, I don’t think I would find that helpful either. With future interpretability advances maybe I would say “OK cool, here’s PF, I see it inside the model, but man, I still don’t know what the reward function should be.” Unless of course I use the very-different-from-biology direct interpretability approach (analogous to the “human flourishing” thing I mentioned above).

Update: writing this comment made me realize that the first part ought to be a self-contained post; see Plan for mediocre alignment of brain-like [model-based RL] AGI. :)

An observation: it feels slightly stressful to have posted this. I have a mental simulation telling me that there are social forces around here that consider it morally wrong or an act of defection to suggest that alignment might be relatively easy, like it implied that I wasn't taking the topic seriously enough or something. I don't know how accurate that is, but that's the vibe that my simulators are (maybe mistakenly) picking up.

I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.

But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:

  • Present pseudo-niceness: maximize the expected value of the fulfillment of people's preferences across time. A weak AI (or a weak human) being present pseudo-nice would be indistinguishable from someone being actually nice. But something very agentic and powerful would see the opportunity to influence people's preferences so that they are easier to satisfy, and that might lead to a world of people who value suffering for the glory of their overlord or something like that.
  • Future pseudo-niceness: maximize the expected value of all future fulfillment of people's initial preferences. Again, this is indistinguishable from niceness for weak AIs. But this leads to a world which locks in all the terrible present preferences people have, which is arguably catastrophic.
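
One rough way to write the two objectives down, as a sketch only. The notation is my own assumption, not something from the post: $P_t$ is the set of preferences people hold at time $t$, $s_t$ is the world state, $F(P, s)$ measures how well state $s$ fulfills preferences $P$, and $\pi$ is the AI's policy.

```latex
% Present pseudo-niceness: fulfill whatever preferences people hold at each time t.
\pi^{*}_{\text{present}} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} F(P_t, s_t) \right]

% Future pseudo-niceness: fulfill the preferences people held at the start, forever.
\pi^{*}_{\text{future}} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} F(P_0, s_t) \right]
```

The only difference is $P_t$ versus $P_0$. In the first objective $P_t$ depends on the AI's earlier actions, so a powerful enough policy can raise the sum by steering people toward preferences that are easy to satisfy. In the second, $P_0$ is fixed, so any later change in what people want is simply ignored.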

I don't know how you would describe "true niceness", but I think it's neither of the above.

So if you train an AI to develop "niceness", because AIs are initially weak, you might train niceness, or you might get one of the two kinds of pseudo-niceness I described. Or something else entirely. Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?

Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?

I would be scared of an AI which has been trained to be nice if there was no way to see if, when it got more powerful, it tried to modify people's preferences / it tried to prevent people's preferences from changing. Maybe niceness + good interpretability enables you to get through the period where AGIs haven't yet made breakthroughs in AI Alignment?

I don't know how you would describe "true niceness", but I think it's neither of the above.

Agreed. I think "true niceness" is something like: act to maximize the fulfillment of people's preferences, while also taking into account the fact that people often have a preference for their preferences to continue evolving and to resolve any of their preferences that are mutually contradictory in a painful way.

Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?

Depends on the specifics, I think.

As an intuition pump, imagine the kindest, wisest person that you know. Suppose that that person was somehow boosted into a superintelligence and became the most powerful entity in the world. 

Now, it's certainly possible that for any human, it's inevitable for evolutionary drives optimized for exploiting power to kick in at that situation and corrupt them... but let's further suppose that the process of turning them into a superintelligence also somehow removed those, and made the person instead experience a permanent state of love towards everybody.

I think it's at least plausible that the person would then continue to exhibit "true niceness" towards everyone, despite being that much more powerful than anyone else.

So at least if the agent had started out at a similar power level as everyone else - or if it at least simulates the kinds of agents that did - it might retain that motivation when boosted to a higher level of power.

Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?

I don't have a strong reason to expect that it'd happen automatically, but if people are thinking about the best ways to actually make the AI have "true niceness", then possibly! That's my hope, at least.

I would be scared of an AI which has been trained to be nice if there was no way to see if, when it got more powerful, it tried to modify people's preferences / it tried to prevent people's preferences from changing.

Me too!

Kudos for talking about learning empathy in a way that seems meaningfully different and less immediately broken than adjacent proposals.

I think what you should expect from this approach, should it in fact succeed, is not nothing, but still something more alien than the way we empathize with lower animals, let alone higher animals. Consider the empathy we have towards cats... and the way it is complicated by their desire to be a predator, and specifically to enjoy causing fear/suffering. Our empathy with cats doesn't lead us to abandon our empathy for their prey, and so we are inclined to make compromises with that empathy.

Given better technology, we could make non-sentient artificial mice that the cats cannot tell apart from real ones (but their extrapolated volition, to some degree, would feel deceived and betrayed by this), or we could just ensure that cats no longer seek to cause fear/suffering.

I hope that humans' extrapolated volitions aren't cruel (though maybe they are when judged by Superhappy standards). Regardless, an AI that's guaranteed to have empathy for us is not guaranteed, and in general quite unlikely, to have no other conflicts with our volitions; and the kind of compromises it will analogously make will probably be larger and stranger than the cat example.

Better than paperclips, but perhaps missing many dimensions we care about.