Open Problems with Myopia

evhub

Welp, this scoops a bunch of the stuff in my "Why acausal trade matters" chapter. :D Nice!

The DDT idea amuses me. I guess it's maybe the best shot we have, but boy do I get a sense of doom when I imagine that the fate of the world depends on our ability to control/steer/oversee AIs as they become more capable than us in many important ways via keeping them dumb in various other important ways. I guess there's that thing the crocodile wrestlers do where you hold their mouth shut since their muscles for opening are much weaker than their muscles for closing.

I have only skimmed the Cohen et al. paper, so I probably just don't understand what's going on, but I don't think that only using the maximum a posteriori world model helps much. Doesn't that just mean you ignore (for planning purposes) possibilities other than the most likely one? If so, then that won't help at all if you think you are probably in a simulation. It would only help in cases where you thought you might be, but probably weren't.
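To make that worry concrete, here is a minimal sketch. It is not the construction from the Cohen et al. paper; the world names, action names, and all numbers are made up for illustration. It compares a planner that uses only the maximum a posteriori world model with one that averages over the full posterior.

```python
from typing import Dict

# Hypothetical types: a posterior over world models, and the value of
# each action in each world. Everything here is an illustrative assumption.
Posterior = Dict[str, float]
Values = Dict[str, Dict[str, float]]  # world -> action -> value


def plan_map(posterior: Posterior, values: Values) -> str:
    """Plan using only the single most probable world model."""
    map_world = max(posterior, key=posterior.get)
    return max(values[map_world], key=values[map_world].get)


def plan_expected(posterior: Posterior, values: Values) -> str:
    """Plan by averaging action values over the whole posterior."""
    actions = list(next(iter(values.values())).keys())

    def expected(action: str) -> float:
        return sum(p * values[world][action] for world, p in posterior.items())

    return max(actions, key=expected)


# Made-up payoffs: serving a hypothetical simulator pays a lot, but only
# if the agent really is in a simulation.
VALUES: Values = {
    "real": {"act_normally": 1.0, "serve_simulator": 0.0},
    "simulated": {"act_normally": 0.0, "serve_simulator": 5.0},
}

mostly_real: Posterior = {"real": 0.8, "simulated": 0.2}
mostly_simulated: Posterior = {"real": 0.3, "simulated": 0.7}
```

With these numbers, `plan_map(mostly_real, VALUES)` ignores the minority simulation hypothesis and acts normally, while `plan_expected` is swayed by it. Under `mostly_simulated`, both planners serve the simulator, which is the case where using only the MAP model does not help.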

Mark Xu (5y)

One way of looking at DDT is "keeping it dumb in various ways." I think another way of thinking about it is just designing a different sort of agent, which is "dumb" according to us but not really dumb in an intrinsic sense. You can imagine this DDT agent looking at agents that do engage in acausal trade and thinking they're just sacrificing utility for no reason.

There is some slight awkwardness in that the decision problems agents in this universe actually encounter mean that UDT agents will get higher utility than DDT agents.

I agree that the maximum a posteriori world model doesn't help that much, but I think there is some sense in which "having uncertainty" might be undesirable.

Daniel Kokotajlo (4y)

Also: I think making sure our agents are DDT is probably going to be approximately as difficult as making them aligned. Related: Your handle for anthropic uncertainty is:

never reason about anthropic uncertainty. DDT agents always think they know who they are.

"Always think they know who they are" doesn't cut it; you can think you know you're in a simulation. I think a more accurate version would be something like "Always think that you are on an original planet, i.e. one in which life appeared 'naturally,' rather than a planet in the midst of some larger interstellar civilization, or a simulation of a planet, or whatever. Basically, you need to believe that you were created by humans but that no intelligence played a role in the creation and/or arrangement of the humans who created you. Or... no role other than the "normal" one in which parents create offspring, governments create institutions, etc. I think this is a fairly specific belief, and I don't think we have the ability to shape our AIs beliefs with that much precision, at least not yet.

John_Maxwell (5y)

We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.

Are you sure that "episode" is the word you're looking for here?

https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL

I'm especially confused because you switched to using the word "timestep" later?

Having an action which modifies the reward on a subsequent episode seems very weird. I don't even see it as being the same agent across different episodes.

Also...

Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.

Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.

I think as a practical matter, the result depends entirely on the method you're using to solve the MDP and the rewards that your simulation delivers.

evhub (5y)

Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.
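As a rough illustration of the button environment quoted above, here is a minimal sketch. The +1 and -10 rewards come from the quoted setup; treating each episode as a single press-or-refrain decision, and the function and policy names, are assumptions of this sketch rather than anything specified in the post.

```python
from typing import Callable, List

PRESS_REWARD_NOW = 1.0       # +1 in the episode where the button is pressed
PRESS_PENALTY_NEXT = -10.0   # -10 delivered in the following episode


def run_episodes(policy: Callable[[int], bool], n_episodes: int) -> List[float]:
    """Return the reward received in each episode under `policy`.

    policy(episode_index) returns True to press and False to refrain.
    Each episode is one decision (an assumption of this sketch).
    """
    rewards: List[float] = []
    carried_penalty = 0.0  # penalty owed because of the previous episode's press
    for episode in range(n_episodes):
        pressed = policy(episode)
        reward = carried_penalty + (PRESS_REWARD_NOW if pressed else 0.0)
        rewards.append(reward)
        carried_penalty = PRESS_PENALTY_NEXT if pressed else 0.0
    return rewards


def per_episode_maximizer(_episode: int) -> bool:
    # Pressing adds +1 to the current episode whatever happened before,
    # so an agent that only counts the current episode always presses.
    return True


def cross_episode_maximizer(_episode: int) -> bool:
    # An agent that also counts next episode's -10 refrains.
    return False
```

Here `run_episodes(per_episode_maximizer, 3)` gives `[1.0, -9.0, -9.0]` and `run_episodes(cross_episode_maximizer, 3)` gives `[0.0, 0.0, 0.0]`. The two policies are only distinguishable because the penalty crosses the episode boundary, which is the sense in which breaking episodic independence can reveal otherwise hidden non-myopia.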

Charlie Steiner (5y)

(Edited for having an actual point)

You mention some general ways to get non-myopic behavior, but when it comes to myopic behavior you default to a clean, human-comprehensible agent model. I'm curious if you have any thoughts on open avenues related to training procedures that encourage myopia in inner optimizers, even if those inner optimizers are black boxes? I do seem to vaguely recall a post from one of you about this, or maybe it was Richard Ngo.

evhub (5y)

I think that trying to encourage myopia via behavioral incentives is likely to be extremely difficult, if not impossible (at least without a better understanding of our training processes' inductive biases). Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift” is a good resource for some of the problems that you run into when you try to do that. As a result, I think that mechanistic incentives are likely to be necessary (and I personally favor some form of relaxed adversarial training), but that's going to require us to get a better understanding of what exactly it looks like for an agent to be myopic or not so we know what the overseer in a setup like that should be looking for.

TurnTrout (5y)

In some sense, agents that press the button will engage in deception; both agents trade reward now for more reward later.

I don’t understand - isn’t the opposite true here?

Mark Xu (5y)

Yep - I switched the setup at some point and forgot to switch this sentence. Thanks.

Evan R. Murphy (4y)

I think there may be another leftover from the old setup:

We are interested in creating agents that robustly do not press the button.

Shouldn't this be "interested in creating agents that robustly do press the button"? I.e., then they're reliably myopic. Or am I misunderstanding something?

johnswentworth (5y)

(On reflection this comment is less kind than I'd like it to be, but I'm leaving it as-is because I think it is useful to record my knee-jerk reaction. It's still a good post; I apologize in advance for not being very nice.)

In theory, such an agent is safe because a human would only approve safe actions.

... wat.

Lol no.

Look, I understand that outer alignment is orthogonal to the problem this post is about, but like... say that. Don't just say that a very-obviously-unsafe thing is safe. (Unless this is in fact nonobvious, in which case I will retract this comment and give a proper explanation.)

Mark Xu (5y)

Yeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."

Charlie Steiner (5y)

You beat me to making this comment :P Except apparently I came here to make this comment about the changed version.

"A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.

Mark Xu (5y)

The example has been changed to imitation, as suggested by Evan.

evhub (5y)

Yeah, I agree—the example should probably just be changed to be about an imitative amplification agent or something instead.

See "Multiverse-wide cooperation via correlated decision making – Summary" for a brief explanation of superrationality and how it differs from acausal trade.
Here, we apply our intuition that defection is a more myopic action than cooperation.

	C	D
C	(1, 1)	(-1, 2)
D	(2, -1)	(-1, -1)

	C	D
C	(1, 1)	(0.5, 0.5)
D	(0.5, 0.5)	(-1, -1)
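A small sketch for checking the two payoff tables above. It assumes each cell reads as (row player's payoff, column player's payoff) with rows as the row player's action; the function names are hypothetical. It computes the row player's best responses and checks whether the second table is the first with each cell replaced by the mean of the two players' payoffs.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[str, str]                       # (row action, column action)
Payoffs = Dict[Cell, Tuple[float, float]]    # -> (row payoff, column payoff)

# Values copied from the two tables; the (row, column) reading is assumed.
TABLE_1: Payoffs = {
    ("C", "C"): (1, 1), ("C", "D"): (-1, 2),
    ("D", "C"): (2, -1), ("D", "D"): (-1, -1),
}
TABLE_2: Payoffs = {
    ("C", "C"): (1, 1), ("C", "D"): (0.5, 0.5),
    ("D", "C"): (0.5, 0.5), ("D", "D"): (-1, -1),
}


def row_best_responses(table: Payoffs, column_action: str) -> Set[str]:
    """Row actions that maximize the row player's payoff against column_action."""
    best = max(table[(a, column_action)][0] for a in "CD")
    return {a for a in "CD" if table[(a, column_action)][0] == best}


def averaged(table: Payoffs) -> Payoffs:
    """Give both players the mean of their two payoffs in every cell."""
    return {cell: ((r + c) / 2, (r + c) / 2) for cell, (r, c) in table.items()}
```

Under this reading, D is a best response to both column actions in the first table (strictly against C, tied against D), while C is the unique best response to both in the second. `averaged(TABLE_1) == TABLE_2` holds, so the second table matches what two players would face if each valued the average of both payoffs.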

AI ALIGNMENT FORUM

Introduction

Open Problems

Superrationality^[1]

Anthropic uncertainty

Counterfactual Mugging

Safety is goal dependent

Potential Research Directions

Dumb decision theory

Existing work

Decoupling

Existing Work

Conclusion
