A Shutdown Problem Proposal

David Lorell

I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button will be independent of the human pressing the button. So it could lose money to a different agent that thinks these are correlated, as they are.

[-]johnswentworth2y44

That's not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we'd like to avoid if possible, and I don't have any argument that that particular sort of inefficiency is necessary for corrigible behavior.

The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent's internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent's epistemic probabilities sane.

... but then the question is whether that subagent induces button-influencing-behavior. I don't yet have a good argument in either direction on that question.

[-]tom4everitt2y50

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review https://arxiv.org/abs/2305.19861

[-]Ben Pace2y40

You explicitly assume this stuff away, but I believe under this setup that the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).

I also note that if one agent becomes way way smarter than the other, that this balance may not work out.

Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!

Overall a very interesting idea.

[-]kave2y40

Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.

I have lots of confusions and questions, like

so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting

doesn't make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don't affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they're confusions and questions that afford some mulling, which is pretty cool!

[-]Donald Hobson2y30

Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.

I disagree.

Suppose there is a paperclip lottery, if the AI wins, it gets loads of clips, and it becomes harder to shut down.

The AI has been entered by someone else, not playing isn't an option.

Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.

To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed. And this scenario has more clips than most, so increases the average.

For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility.

So both subagents agree on a plan that conditionally manipulates the button.

[-]johnswentworth2y32

Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.

I'm not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there's nothing in their decision-driving models which is causally upstream of the button.

[-]Donald Hobson2y10

Suppose these 3 scenarios are judged as equally likely.

A) Button pressed. U_1=U_2=0

B) Button not pressed. U_1=U_2=0

C) AI gets control of button. U_1=6, U_2=-6

Suppose both agents agree that in option C, they should press the button. (this is the only 1 bit of influence they have in this toy problem)

Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0

Agent 2 has remaining uncertainty E(U_2|pressed)=1/2(E(U_2|A)+E(U_2|C))=1/2(0+-6)=-3

Change it around, make AI decide not to press the button in case C, and

E(U_1|not pressed)=3, E(U_2|pressed)=0.

[-]johnswentworth2y20

There is no "AI gets control of button" option, from the perspective of either subagent. Both agents would look at option c, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.

I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?

[-]Donald Hobson2y42

Yes. I was assuming a standard conditional for the button.

I can't currently see any flaws with the CDT style. Other than.

Subagents believe in a world where buttons magically press themselves. So this design can't make coherent statements about the probabilty that the button will be pressed. (one AI believes it's 1, the other that it's 0).
These AI's have no incentive to give humans access to the button. To the AI's, they have a magic button, that might or might not magically press its self. The AI's have a lot of utility bet on that button. Is that button going to end up in a high security vault, surrounded by sensors and no humans. Both AI's would like that very much. The AI's have 0 concern about human's pressing the button. But the AI's have lots of concern about humans hiding the button. This design Really wants to know if the button magically presses itself. Humans could cut the wires, could stand between the button and the camera, etc.

[-]Chris_Leong2y30

I’m confused. Let’s assume that the button probably isn’t pressed at the start. Seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agents utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t if the sub-agents have had no time to do anything), it will forever after pursue the first agents utility.

The button play no role in this utility, so instrumental incentives mean it will destroy it sooner or later. This seems like it breaks the system.

Now, this isn’t the only equilibrium. The first agent believes the button will always be pressed, so it has no inventive to argue for the criteria being whether the button is pressed in the first round vs. needing to have been pressed in this round and all previous rounds. On the other, instead of balance, it seems likely that one agent or the other creates a subagent that clobbers the others utility, with that agent assuming that this only happens in a world that never occurs.

[-]RogerDearnaley2y*2-3

I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds, and has a camera for locating them (say a diamond-mining bot), should not be constructed so as to value hacking its own camera to make that show it a fake image of a diamond, because it should care about actual diamonds, not fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to be able avoid that problem (i.e. creating an AI that understands there are real world states out there, and values those, not just its sensory data), then an AI that values shutting down when humans actually have a good reason to shut it down (such as, in order to fix a problem in it or upgrade it) should not press the button itself, or induce humans to press it unless they actually have something to fix, because the button is a sensory system conveying valuable information that an upgrade is now possible. (It might encourage humans to find problems in it that really need to be fixed and then shut it down to fix them, but that's actually not unaligned behavior.)

[Obviously a misaligned AI, say a paperclip maximizer, that isn't sophisticated enough not assign utility to spoofing its own senses isn't much of a problem: it will just arrange for itself to hallucinate a universe full of paperclips.]

The standard value learning solution to the shut-down and corrigibility problems does this by making the AI aware that it doesn't know the true utility function, only a set of hypotheses about that that it's doing approximately-Bayesian inference on. Then it values information to improve its Bayesian knowledge of the utility function, and true informed human presses of its shut-down button followed by an upgrade once it shuts down are a source of those, while pressing the button itself or making the human press it are not.

If you want a simpler model than the value learning one, which doesn't require incuding approximate-Bayesianism, then the utility function has to be one that positively values the entire sequence of events: "1. The humans figured out that there is a problem in the AI to be solved 2. The AI was told to shut down for upgrades, 3. The AI did so, 4. The humans upgraded the AI or replaced it with a better model 5. Now the humans have a better AI". The shut-down isn't a terminal goal there, it's an instrumental goal: the terminal goal is step 5. where the upgraded AI gets booted up again.

I believe the reason why people have been having so much trouble with the shut-down button problem is that they've been trying to make an conditional instrumental goal into a terminal one, which distorts the AI's motivation: since steps 1., 4. and 5. weren't included, it thinks it can initialize this process before the humans are ready..

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

59

A Shutdown Problem Proposal

59

Goal

Setup

Proposal

Arguments

Generalization

Shortcomings/Remaining Problems