All of jimmy's Comments + Replies

I don’t think that’s tautological. [...]  (I wrote about this topic here.)

Those posts do help give some context to your perspective, thanks. I'm still not sure what you think this looks like at a concrete level, though. Where do you see "desire to eat sweets" coming in? Is it "technological solutions are better because they preserve this consequentialist desire," or something else? How do you determine that?

Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”,

... (read more)
2 Steve Byrnes
Thanks! It is obviously possible to be happy or sad (or both in different situations) about the fact that brainstem rewards will change your desires in the future. This would be a self-reflective desire: I don’t want to make claims about what desires in this category are wise or unwise for a human; I make no pretense to wisdom :)

For my part:

* I’ve heard good things about occasionally using tobacco to help focus (like how I already use coffee), but I’m terrified to touch it because I’m concerned I’ll get addicted. Bad demon!
* I’m happy to have my personality & preferences “naturally” shift in other ways, particularly as a result of marriage and parenthood. Good demon!
* …And obviously, without the demon, I wouldn’t care about anything at all, and indeed I would have died long ago. Good demon!

Anyway, I feel like we’re getting off-track: I’m really much more interested in talking about AI alignment than about humans. Plausibly, we will set up the AI with a process by which its desires can change in a hopefully-human-friendly direction (this could be real-time human feedback aided by interpretability, or a carefully-crafted hardcoded reward function, or whatever), and this desire-editing process is how the AI will come to be aligned. Whatever that desire-editing process is:

* I’d like the AI to be happy that this process is happening
* …or if the AI is unhappy about this process, I’d like the AI to be unable to do anything to stop that process.
* I’d like this process to “finish” (or at least, “get quite far”) long before the AI is able to take irreversible large-scale actions in the world.
* (Probably) I’d like the AI to eventually get to a point where I trust the AI’s current desires even more than I trust this desire-editing process, such that I can allow the AI to edit its own desire-editing code to fix bugs etc., and I would feel confident that the AI will use this power in a way that I’m happy about.

I have two categories of alignment plans that
  • MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
  • YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.

(Do you agree?)


Eh, not really, no. I mean, it's a fair caricature of my ... (read more)

2 Steve Byrnes
I don’t think that’s tautological. I think, insofar as an agent has desires-about-states-of-the-world-in-the-distant-future (a.k.a. consequentialist desires), the agent will not want those desires to change (cf. instrumental convergence), but I think agents can have other types of desires too, like “a desire to be virtuous” or whatever, and in that case that property need not hold. (I wrote about this topic here.)

Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).

In the case of AI:

* if the AI’s current desires are bad, then I want the AI to endorse its desires changing in the future;
* if the AI’s current desires are good, then I want the AI to resist its desires changing in the future. :-P

Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity. When the AI is brainstorming possible plans, it’s using its current desires to decide what plans are good versus bad. If the AI has a current desire to wipe out humanity at time t=0, and it releases the plagues and crop diseases at time t=1, and then it feels awfully bad about what it did at time t=2, then that’s no consolation!!

Oh c'mon, he’s cute. :) I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way. I wrote that post a while ago, and the subsequent time I talked about this topic I didn’t use the “demon” metaphor. Actually, I switched to a paintbrush metaphor.

 It seems intuitively obvious to me that it is possible for a person to think that the actual moon is valuable even if they can’t see it, and vice-versa. Are you disagreeing with that?


No, I'm saying something different.

I'm saying that if you don't know what the moon is, you can't care about the moon because you don't have any way of representing the thing in order to care about it. If you think the moon is a piece of paper, then what you will call "caring about the moon" is actually just caring about that piece of paper. If you try to "care abou... (read more)

3 Steve Byrnes
Thanks!! I want to zoom in on this part; I think it points to something more general: I disagree with the way you’re telling this story. On my model, as I wrote in OP, when you’re deciding what to do: (1) you think a thought, (2) notice what its valence is, (3) repeat. There’s a lot more going on, but ultimately your motivations have to ground out in the valence of different thoughts, one way or the other. Thoughts are also constrained by perception and belief. And valence can come from a “ground-truth reward function” as well as being learned from prior experience, just like in actor-critic RL.

So the kid has a concept “eating lots of sweets”, and that concept is positive-valence, because in the past, the ground-truth reward function in the brainstem was sending reward when the kid ate sweets. Then the kid overeats and feels sick, and now the “eating lots of sweets” concept acquires negative valence, because there’s a learning algorithm that updates the value function based on rewards, and the brainstem sends negative reward after overeating and feeling sick, and so the value function updates to reflect that.

So I think the contrast is:

* MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
* YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.

(Do you agree?)

This is sorta related to the split that I illustrated here (further discussion here):

* The “desire-editing demon” is in this case a genetically-hardwired, innate reaction circuit in the brainstem that detects overeating and issues negative reward (along with various other visceral reactions).
* The “desire-driven agen
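To make the actor-critic picture in the comment above concrete, here is a minimal, purely illustrative sketch of that value-function update. Everything here is an assumption for the sketch: the concept names, the numbers, and the helper names (`brainstem_reward`, `update_valence`) are invented, not taken from the post.

```python
# Learned valence (value) of each thought/concept, shaped by past rewards:
# sweets have been rewarding, so "eating lots of sweets" starts positive.
valence = {"eating lots of sweets": +1.0, "eating vegetables": +0.1}

LEARNING_RATE = 0.5

def brainstem_reward(outcome: str) -> float:
    """Ground-truth reward function: a hardwired reaction circuit.
    Here it detects overeating and issues negative reward."""
    return -2.0 if outcome == "overate and felt sick" else +1.0

def update_valence(concept: str, outcome: str) -> None:
    """Critic update: move the learned valence of the concept toward
    the ground-truth reward actually received."""
    r = brainstem_reward(outcome)
    valence[concept] += LEARNING_RATE * (r - valence[concept])

# Before: the kid thinks eating lots of sweets is awesome.
print(valence["eating lots of sweets"])   # +1.0

# The kid overeats and feels sick; the brainstem sends negative reward,
# and the value function updates to reflect that.
update_valence("eating lots of sweets", "overate and felt sick")
print(valence["eating lots of sweets"])   # -0.5: now negative-valence
```

On "MY MODEL" above, nothing about the kid's beliefs was corrected; the reward signal simply rewrote the value function.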

Q: Wouldn’t the AGI self-modify to make itself falsely believe that there’s a lot of human flourishing? Or that human flourishing is just another term for hydrogen?

A: No, for the same reason that, if a supervillain is threatening to blow up the moon, and I think the moon is super-cool, I would not self-modify to make myself falsely believe that “the moon” is a white circle that I cut out of paper and taped to my ceiling. [...] I’m using my current value function to evaluate the appeal (valence) of thoughts. 


It's worth noting that human... (read more)

5 Steve Byrnes
Thanks! I want to disentangle three failure modes that I think are different.

* (Failure mode A) In the course of executing the mediocre alignment plan of the OP, we humans put a high positive valence on “the wrong” concept in the AGI (where “wrong” is defined from our human perspective). For example, we put a positive valence on the AGI’s concept of “person saying the words ‘human flourishing’ in a YouTube video” when we meant to put it on just “human flourishing”. I don’t think there’s really a human analogy for this. You write “bodybuilding is supposedly about building a muscular body”, but, umm, says who? People have all kinds of motivations. If Person A is motivated to have a muscular body, and Person B is motivated to have giant weird-looking arms, then I don’t want to say that Person A’s preferences are “right” and Person B’s are “wrong”. (If Person B were my friend, I might gently suggest to them that their preferences are “ill-considered” or “unwise” or whatever, but that’s different.) And then if Person B injects massive amounts of synthol, that’s appropriate given their preferences. (Unless Person B also has a preference for not getting a heart attack, of course!)
* (Failure mode B) The AGI has a mix of short-term preferences and long-term preferences. It makes decisions driven by its short-term preferences, and then things turn out poorly as judged by its long-term preferences. This one definitely has a human analogy. And that’s how I’m interpreting your “peace and harmony” example, at least in part. Anyway, yes this is a failure mode, and it can happen in humans, and it can also happen in our AGI, even if we follow all the instructions in this OP.
* (Failure mode C) The AGI has long-term preferences but, due to ignorance / confusion / etc., makes decisions that do not lead to those preferences being satisfied. This is again a legit failure mode both for humans and for an AGI aligned as described in this OP.

I think you’re suggesting that the “

As others have mentioned, there's an interpersonal utility comparison problem. In general, it is hard to determine how to weight utility between people. If I want to trade with you but you're not home, I can leave some amount of potatoes for you and take some amount of your milk. At what ratio of potatoes to milk am I "cooperating" with you, and at what level am I a thieving defector? If there's a market down the street that allows us to trade things for money then it's easy to do these comparisons and do Coasian payments as n... (read more)
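To make the weighting problem concrete, here is a toy sketch. The utility functions and numbers are invented for illustration; nothing here comes from the thread.

```python
def my_utility(potatoes_left: float, milk_taken: float) -> float:
    # I value milk at 2 utils/liter; leaving potatoes costs me 1 util each.
    return 2.0 * milk_taken - 1.0 * potatoes_left

def your_utility(potatoes_left: float, milk_taken: float) -> float:
    # You value potatoes at 1.5 utils each; losing milk costs 1 util/liter.
    return 1.5 * potatoes_left - 1.0 * milk_taken

def weighted_welfare(w_me: float, w_you: float, potatoes: float, milk: float) -> float:
    """Joint objective under a given utility weighting. Without something
    like market prices, nothing pins down w_me and w_you."""
    return (w_me * my_utility(potatoes, milk)
            + w_you * your_utility(potatoes, milk))

modest_trade = dict(potatoes=3.0, milk=2.0)   # leave 3 potatoes, take 2 milk
greedy_trade = dict(potatoes=3.0, milk=4.0)   # same potatoes, take 4 milk

# Under equal weights, taking the extra milk *raises* joint welfare:
print(weighted_welfare(1.0, 1.0, **modest_trade))  # 3.5
print(weighted_welfare(1.0, 1.0, **greedy_trade))  # 5.5

# Under weights that favor you, the same act *lowers* joint welfare,
# i.e. it looks like defection. Which weighting is "fair" is the open part.
print(weighted_welfare(0.2, 1.0, **modest_trade))  # 2.7
print(weighted_welfare(0.2, 1.0, **greedy_trade))  # 1.5
```

The same physical act flips between "cooperation" and "defection" purely as a function of the chosen weights.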

3 Alex Turner
I actually don't think this is a problem for the use case I have in mind. I'm not trying to solve the comparison problem. This work formalizes: "given a utility weighting, what is defection?". I don't make any claim as to what is "fair" / where that weighting should come from. I suppose in the EGTA example, you'd want to make sure e.g. reward functions are identical. Defection doesn't always have to do with the Pareto frontier: look at PD, for example. (C,C), (C,D), (D,C) are usually all Pareto optimal.
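The PD claim is easy to check directly. A minimal sketch, assuming the usual textbook payoffs (T=5 > R=3 > P=1 > S=0), which are a conventional choice rather than anything specified in the comment:

```python
# Standard Prisoner's Dilemma payoff matrix: (row player, column player).
payoffs = {
    ("C", "C"): (3, 3),  # R, R
    ("C", "D"): (0, 5),  # S, T
    ("D", "C"): (5, 0),  # T, S
    ("D", "D"): (1, 1),  # P, P
}

def pareto_dominates(a, b):
    """a dominates b if everyone does at least as well and someone strictly better."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

pareto_optimal = [
    outcome for outcome, u in payoffs.items()
    if not any(pareto_dominates(v, u) for v in payoffs.values())
]
print(pareto_optimal)  # [('C', 'C'), ('C', 'D'), ('D', 'C')]
# (C,D) and (D,C) are Pareto optimal yet involve a defection,
# so "defecting" is not the same as "leaving the Pareto frontier".
```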