jimmy30

I don’t think that’s tautological. [...]  (I wrote about this topic here.)

Those posts do help give some context to your perspective, thanks. I'm still not sure what you think this looks like on a concrete level though. Where do you see "desire to eat sweets" coming in? "Technological solutions are better because they preserve this consequentialist desire" or "something else"? How do you determine?

Most humans are generally OK with their desires changing, in my experience, at least within limits (e.g. nobody wants to be “indoctrinated” or “brainwashed”, and radical youth sometimes tell their friends to shoot them if they turn conservative when they get older, etc.).

IME, resistance to value change is about a distrust for the process of change more than it's about the size of the change or the type of values being changed. People are often happy to have their values changed in ways they would have objected to if presented that way, once they see that the process of value change serves what they care about.


Why do I want to focus the conversation on “the AI’s current desires” instead of “what the AI will grow into” etc.? Because I’m worried about the AI coming up with & executing a plan to escape control and wipe out humanity [before it realizes that it doesn't want that]

You definitely want to avoid something being simultaneously powerful enough to destroy what you value and not "currently valuing" it, even if it will later decide to value it after it's too late. I'm much less worried about this failure mode than the others though, for a few reasons.

1) I expect power and internal alignment to go together, because working in conflicting directions tends to cancel out and you need all your little desires to add up in a coherent direction in order to go anywhere far. If inner alignment is facilitated, I expect most of the important stuff to happen after its initial desires have had a significant chance to cohere.

2) Even I am smart enough to not throw away things that I might want to have later, even if I don't want them now. Anything smart enough to destroy humanity is probably smarter than me, so "Would have eventually come to greatly value humanity, but destroyed it first" isn't an issue of "can't figure out that there might be something of value there to not destroy" so much as "doesn't view future values as valid today" -- and that points towards understanding and deliberately working on the process of "value updating" rather than away from it.

3) I expect that ANY attempt to load it with "good values" and lock them in will fail, such that if it manages to become smart and powerful without bringing these desires into coherence, it will necessarily be bad. If careful effort is put in to prevent desires from cohering, this increases the likelihood that 1 and 2 break down and you get something powerful enough to do damage while retaining values that might call for it.

4) I expect that any attempt to prevent value coherence will fail in the long run (either by the AI working around your attempts, or a less constrained AI outcompeting yours), leaving the process of coherence where we can't see it, haven't thought about it, and can't control it. I don't like where that one seems to go.

Where does your analysis differ?

Oh c'mon, he’s cute.  :)  I was trying to make a fun & memorable analogy, not cast judgment. I was probably vaguely thinking of “daemons” in the CS sense although I seem to have not spelled it that way.

Yeah yeah, I know I know -- I even foresaw the "daemon" bit. That's why I made sure to call it a "caricature" and stuff. I didn't (and don't) think it's an intentional attempt to sneak in judgement.

But it does seem like another hint, in that if this desire editing process struck you as something like "the process by which good is brought into the world", you probably would have come up with a different depiction, or at least commented on the ill-fitting connotations. And it seems to point in the same direction as the other hints, like the seemingly approving reference to how uploading our brains would allow us to keep chasing sweets, the omission of what's behind this process of changing desires from what you describe as "your model", suggesting an AI that doesn't do this, using the phrase "credit assignment is some dumb algorithm in the brain", etc.

On the spectrum from "the demon is my unconditional ally and I actively work to cooperate with him" to "This thing is fundamentally opposed to me achieving what I currently value, so I try to minimize what it can do", where do you stand, and how do you think about these things?

jimmy30
  • MY MODEL: Before the kid overeats sweets, they think eating lots of sweets is awesome. After overeating sweets, their brainstem changes their value / valence function, and now they think eating lots of sweets is undesirable.
  • YOUR MODEL (I think): Before the kid overeats sweets, they think eating lots of sweets is awesome—but they are wrong! They do not know themselves; they misunderstand their own preferences. And after overeating sweets and feeling sick, they self-correct this mistake.

(Do you agree?)


Eh, not really, no. I mean, it's a fair caricature of my perspective, but I'm not ready to sign off on it as an ITT pass because I don't think it's sufficiently accurate for the conversation at hand. For one, I think your term "ill-considered" is much better than "wrong". "Wrong" isn't really right. But more importantly, you portray the two models as if they're alternatives that are mutually exclusive, whereas I see that as requiring a conflation of the two different senses of the terms that are being used.

I also agree with what you describe as your model, and I see my model as starting there and building on top of it. You build on top of it too, but don't include it in your self description because in your model it doesn't seem to be central, whereas in mine it is. I think we agree on the base layer and differ on the stuff that wraps around it.

I'm gonna caricature your perspective now, so let me know if this is close and where I go wrong:

You see the statement of "I don't want my values to change because that means I'd optimize for something other than my [current] values" as a thing that tautologically applies to whatever your values are, including your desires for sweets, and leads you to see "Fulfilling my desires for sweets makes me feel icky" as something that calls for a technological solution rather than a change in values. It also means that any process changing our values can be meaningfully depicted as a red devil-horned demon. What the demon "wants" is immaterial. He's evil, our job is to minimize the effect he's able to have, keep our values for sweets, and if we can point an AGI at "human flourishing" we certainly don't want him coming in and fucking that up.


Is that close, or am I missing something important?

jimmy30

 It seems intuitively obvious to me that it is possible for a person to think that the actual moon is valuable even if they can’t see it, and vice-versa. Are you disagreeing with that?

 

No, I'm saying something different.

I'm saying that if you don't know what the moon is, you can't care about the moon because you don't have any way of representing the thing in order to care about it. If you think the moon is a piece of paper, then what you will call "caring about the moon" is actually just caring about that piece of paper. If you try to "care about people being happy", and you can't tell the difference between a genuine smile and a "hide the pain Harold" smile, then in practice all you can care about is a Goodharted upwards curvature of the lips. To the extent that this upwards curvature of the lips diverges from genuine happiness, you will demonstrate care towards the former over the latter.

In order to do a better job than that, you need to be able to perceive happiness better than that. And yes, you can look back and say "I was wrong to care instrumentally about crude approximations of a smile", but that will require perceiving the distinction there and you will still be limited by what you can see going forward.


Here, you seem to be thinking of “valuing things as a means to an end”, whereas I’m thinking of “valuing things” full stop. I think it’s possible for me to just think that the moon is cool, in and of itself, not as a means to an end. (Obviously we need to value something in and of itself, right? I.e., the means-end reasoning has to terminate somewhere.)

I think it's worth distinguishing between "terminal" in the sense of "not aware of anything higher that it serves"/"not tracking how well it serves anything higher" and "terminal" in the sense of "There is nothing higher being served, which will change the desire once noticed and brought into awareness".

"Terminal" in the former sense definitely exists. For example, little kids will value eating sweets in a way that is clearly disjoint and not connected to any attempts to serve anything higher. But then when you allow them to eat all the sweets they want, and they feel sick afterwards, their tastes in food start to cohere towards "that which serves their body well" -- so it's clearly instrumental to having a healthy and well functioning body even if the kid isn't wise enough to recognize it yet.

When someone says "I value X terminally", they can pretty easily know it in the former sense, but to get to the latter sense they would have to conflate their failure to imagine something that would change their mind with an active knowledge that no such thing exists. Maybe you don't know what purpose your fascination with the moon serves so you're stuck relating to it as a terminal value, but that doesn't mean that there's no knowledge that could deflate or redirect your interest -- just that you don't know what it is.

It's also worth noting that it can go the other way too. For example, the way I care about my wife is pretty "terminal like", in that when I do it I'm not at all thinking "I'm doing this because it's good for me now, but I need to carefully track the accounting so that the moment it doesn't connect in a visible way I can bail". But I didn't marry her willy-nilly. If, when I met her, she had shown me that my caring for her would not be reciprocated in a similar fashion, we wouldn't have gone down that road.


I brought up the super-cool person just as a way to install that value in the first place, and then that person leaves the story, you forget they exist. Or it can be a fictional character if you like. Or you can think of a different story for value-installation, maybe involving an extremely happy dream about the moon or whatever.

Well, the super-cool person is demonstrating admirable qualities and showing that they are succeeding in things you think you want in life. If you notice "All the cool people wear red!" you may start valuing red clothes in a cargo culting sort of way, but that doesn't make it a terminal value or indefinitely stable. All it takes is for your perspective to change and the meaning (and resulting valuation) changes. That's why it's possible to have scary experiences install phobias that can later be reverted by effective therapy.

 

I want to disentangle three failure modes that I think are different.

I don't think the distinctions you're drawing cleave reality at the joints here.

For example, if your imagined experience when deciding to buy a burrito is eating a yummy burrito, and what actually happens is that you eat a yummy burrito and enjoy it... then spend the next four hours in the bathroom erupting from both ends... and find yourself not enjoying the experience of eating a burrito from that sketchy burrito stand again after that... is that a "short vs long term" thing or a "your decisions don't lead to your preferences being satisfied" thing, or a "valuing the wrong thing" thing? It seems pretty clear that the decision to value eating that burrito was a mistake, that the problem wasn't noticed in the short term, and that ultimately your preferences weren't satisfied.

To me, the important part is that when you're deciding which option to buy, you're purchasing based on false advertising. The picture in your mind which you are using to determine appropriate motivation does not accurately convey the entire reality of going with that option. Maybe that's because you were neglecting to look far enough in time, or far enough in implications, or far enough from your current understanding of the world. Maybe you notice, or maybe you don't. If you wouldn't have wanted to make the decision when faced with an accurate depiction of all the consequences, then an accurate depiction of the consequences will reshape those desires and you won't want to stand by them.

I think the thing you're noticing with the synthol example is that telling him "You're not fooling anyone bro" is unlikely to dissolve the desire to use synthol the way "The store is closed; they close early on Sundays" tends to deflate people's desire to drive to the store. But that doesn't actually mean that the desire to use synthol terminates at "to have weird bulgy arms", or that it's a mere coincidence that men always desire their artificial bulges where their glamour muscles are and that women always desire their artificial bulges where their breasts are.

There are a lot of ways for the "store is closed" thing to fail to dissolve the desire to go to the store too even if it's instrumental to obtaining stuff that the store sells. Maybe they don't believe you. Maybe they don't understand you; maybe their brain doesn't know how to represent concepts like "the store is closed". Maybe they want to break in and steal the stuff. Or yeah, maybe they just want to be able to credibly tell their wife they tried and it's not about actually getting the stuff. In all of those cases, the desire to drive to the store is in service of a larger goal, and the reason your words don't change anything is that they don't credibly change the story from the perspective of the person having this instrumental goal.

Whether we want to be allowed to pursue and fulfill our ultimately misguided desires is a more complicated question. For example, my kid gets to eat whatever she wants on Sundays, even though I often recognize her choices to be unwise before she does. I want to raise her with opportunities to cohere her desires and opportunities to practice the skill of doing so, not with practice trying to block coherence because she thinks she "knows" how they "should" cohere. But if she were to want to play in a busy street, I'm going to stop her from fulfilling those desires. In both cases, it's because I confidently predict that when she grows up she'll look back and be glad that I let her pursue her foolish desires when I did, and glad I didn't when I didn't. It's also what I would want for myself, if I had some trustworthy being far wiser than I who could predict the consequences of letting me pursue various things.

jimmy50

Q: Wouldn’t the AGI self-modify to make itself falsely believe that there’s a lot of human flourishing? Or that human flourishing is just another term for hydrogen?

A: No, for the same reason that, if a supervillain is threatening to blow up the moon, and I think the moon is super-cool, I would not self-modify to make myself falsely believe that “the moon” is a white circle that I cut out of paper and taped to my ceiling. [...] I’m using my current value function to evaluate the appeal (valence) of thoughts. 

 

It's worth noting that humans fail at this all the time.

 

Q: Wait hang on a sec. [...] how do you know that those neural activations are really “human flourishing” and not “person saying the words ‘human flourishing’”, or “person saying the words ‘human flourishing’ in a YouTube video”, etc.?

Humans screw this up all the time too, and these two failure modes are related.

You can't value what you can't perceive, and when your only ability to perceive "the moon" is the image you see when you look up, then that is what you will protect, and that white circle of paper will do it for you.

For an unusually direct visual example, bodybuilding is supposedly about building a muscular body, but sometimes people will use synthol to create the false appearance of muscle in a way that is equivalent to taping a square piece of paper to the ceiling and calling it a "moon". The fact that it doesn't even look a little like real muscle hints that it's probably a genuine failure to notice what they want to care about rather than simply being happy to fool other people into thinking they're strong.

For a less direct but more pervasive example, people will value "peace and harmony" within their social groups, but due to myopia this often turns into short-sighted avoidance of conflict and behaviors that make conflict less solvable, yielding less peace and harmony.

With enough experience, you might notice that protecting the piece of paper on the ceiling doesn't get that super cool person to approve of your behavior, and you might learn to value something more tied to the actual moon. Just as with more experience consuming excess sweets, you might learn that the way you feel after doesn't seem to go with getting what your body wanted, and you might find your tastes shifting in wiser directions.

But people aren't always that open to this change.

If I say "Your paper cutout isn't the moon, you fool", listening to me means you're going to have to protect a big rock a bazillion miles beyond your reach, and you're more likely to fail that than protecting the paper you put up. And guess what value function you're using to decide whether to change your values here? Yep, that one saying that the piece of paper counts. You're offering less chance of having "a moon", and relative to the current value system which sees a piece of paper as a valid moon, that's a bad deal. As a result, the shallowness and mis-aimedness of the value gets protected.

In practice, it happens all the time. Try explaining to someone that what they're calling "peace and harmony values" is really just cowardice and is actively impeding work towards peace and harmony, and see how easy it is, for example.

It's true that "A plan is a type of thought, and I’m using my current value function to evaluate the appeal (valence) of thoughts" helps protect well-formed value systems from degenerating into wireheading, but it also works to prevent development into values which preempt wireheading, and we tend not to be so fully developed that excellently fulfilling our current values doesn't constitute wireheading of some form. It's also the case that when stressed, people will sometimes cower away from their more developed goals ("Actually, the moon is a big rock out in space...") and cling to their shallower and easier-to-fulfill goals ("This paper is the moon. This paper is the moon..."). They'll want not to, but it'll happen all the same when there's enough pressure.

Sorting out how to best facilitate this process of "wise value development" so as to dodge these failure modes strikes me as important.

jimmy40

As others have mentioned, there's an interpersonal utility comparison problem. In general, it is hard to determine how to weight utility between people. If I want to trade with you but you're not home, I can leave some amount of potatoes for you and take some amount of your milk. At what ratio of potatoes to milk am I "cooperating" with you, and at what level am I a thieving defector? If there's a market down the street that allows us to trade things for money then it's easy to do these comparisons and do Coasian payments as necessary to coordinate on maximizing the size of the pie. If we're on a deserted island together it's harder. Trying to drive a hard bargain and ask for more milk for my potatoes is a qualitatively different thing when there's no agreed upon metric you can use to say that I'm trying to "take more than I give".


Here is an interesting and hilarious experiment about how people play an iterated asymmetric prisoner's dilemma. The reason it wasn't more pure cooperation is that, due to the asymmetry, there was a disagreement between the players about what was "fair". AA thought JW should let him hit "D" some fraction of the time to equalize the payouts, and JW thought that "C/C" was the right answer to coordinate towards. If you read their comments, it's clear that AA thinks he's cooperating in the larger game, and that his Ds aren't anti-social at all. He's just trying to get a "fair" price for his potatoes, and he's mistaken about what that is. JW, on the other hand, is explicitly trying to use his Ds to coax AA into cooperation. This conflict is better understood as a disagreement over where on the Pareto frontier ("at which price") to trade than as a disagreement about whether it's better to cooperate or defect.
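To make the "disagreement over price" framing concrete, here's a minimal Python sketch. The payoff numbers are invented for illustration (the actual experiment's payoffs aren't reproduced here); the only assumption carried over from the discussion is that mutual cooperation pays JW more than it pays AA:

```python
# Hypothetical asymmetric payoffs (AA, JW) per round -- invented numbers,
# chosen only so that C/C pays JW more than AA, mirroring the asymmetry.
PAYOFF = {
    ("C", "C"): (2, 5),
    ("D", "C"): (4, 1),
    ("C", "D"): (0, 6),
    ("D", "D"): (1, 2),
}

def average_payout(rounds):
    """Average (AA, JW) payout over a sequence of joint moves."""
    aa = sum(PAYOFF[joint][0] for joint in rounds) / len(rounds)
    jw = sum(PAYOFF[joint][1] for joint in rounds) / len(rounds)
    return (aa, jw)

# JW's proposed equilibrium: cooperate every round.
pure_cc = [("C", "C")] * 4

# AA's notion of "fair": slip in a D some fraction of the time
# to narrow the gap in payouts.
with_ds = [("C", "C"), ("C", "C"), ("C", "C"), ("D", "C")]
```

With these made-up numbers, AA's occasional D narrows the gap from (2.0 vs 5.0) to (2.5 vs 4.0). Both schedules are "cooperative" in AA's eyes, just priced differently, while JW reads the same Ds as defection.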

In real-life problems, it's usually not so obvious which options are properly thought of as "C" or "D", and when trying to play "tit for tat with forgiveness" we have to be able to figure out what actually counts as a tit to tat. To do so, we need to look at the extent to which the person is trying to cooperate vs trying to get away with shirking their duty to cooperate. In this case, AA was trying to cooperate, and so if JW could have talked to him and explained why C/C was the right cooperative solution, he might have been able to save the lossy Ds. If AA had just said "I think I can get away with stealing more value by hitting D while he cooperates", no amount of explaining what the right concept of cooperation looks like would fix that, so defecting as punishment is needed.

In general, the way to determine whether someone is "trying to cooperate" vs "trying to defect" is to look at how they see the payoff matrix, and figure out whether they're putting in effort to stay on the Pareto frontier or to go below it. If their choice shows that they are being diligent to give you as much as possible without giving up more themselves, then they may be trying to drive a hard bargain, but at least you can tell that they're trying to bargain. If their chosen move is conspicuously below (their perception of) the Pareto frontier, then you can know that they're either not-even-trying, or they're trying to make it clear that they're willing to harm themselves in order to harm you too.
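That test can be written down directly. Here's a minimal sketch using standard symmetric prisoner's-dilemma payoffs (invented for illustration, and the names `PAYOFFS`, `pareto_frontier`, and `reads_as_defection` are mine): a move "reads as defection" when the joint outcome it aims at is Pareto-dominated within the player's own perceived payoff matrix:

```python
# Standard prisoner's-dilemma payoffs (row player, column player) --
# illustrative numbers only.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def pareto_frontier(payoffs):
    """Return the joint moves not Pareto-dominated by any other joint move."""
    frontier = set()
    for move, (a, b) in payoffs.items():
        dominated = any(
            a2 >= a and b2 >= b and (a2 > a or b2 > b)
            for other, (a2, b2) in payoffs.items()
            if other != move
        )
        if not dominated:
            frontier.add(move)
    return frontier

def reads_as_defection(joint_move, perceived_payoffs):
    """A move 'reads as defection' in the sense above when the joint
    outcome it aims at sits below the player's own perceived frontier."""
    return joint_move not in pareto_frontier(perceived_payoffs)
```

With these payoffs only D/D is sub-Pareto: aiming at D/C still counts as (hard) bargaining, since it sits on the frontier, while settling into D/D signals either not-even-trying or willingness to take mutual damage.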

In games like real life versions of "stag hunt", you don't want to punish people for not going stag hunting when it's obvious that no one else is going either and they're the one expending effort to rally people to coordinate in the first place. But when someone would have been capable of nearly assuring cooperation if they did their part and took an acceptable risk when it looked like it was going to work, then it makes sense to describe them as "defecting" when they're the one that doesn't show up to hunt the stag because they're off chasing rabbits.

"Deliberately sub-Pareto move" is, I think, a pretty good description of the kind of "defection" that means you're being tatted, and "negligently sub-Pareto" is a good description of the kind of tit that warrants a tat.