Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.
Mathematically, my main takeaway is that, for the impact measure PENALTY(s,a)= from Conservative Agency, if the agent wants to achieve the sub-goal while avoiding the penalty triggered by the term, it can build a sub-agent that is slightly worse at achieving than it would be itself, and set it loose.
Now for some more speculative thoughts. I think the main source of the loophole above is the part , so what happens if we just delete that part? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal , which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.
Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?
Obviously, this depends on many factors, including Carl's age. To manage the real world, we weave a quite complex web to determine accountability.
In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
Then we get an agent with an incentive to stop any human present in the environment from becoming too good
No, this modification stops people from actually optimizing if the world state is fully observable. If it’s partially observable, this actually seems like a pretty decent idea.
In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
I disagree. First, we already have evidence that simple measures scale just fine to complex environments. Second, “responsibility” is a red herring in impact measurement. I wrote the Reframing Impact sequence to explain why I think the conceptual solution to impact measurement is quite simple.
0. Introduction: why yet another post about subagents?
I’ve recently been writing a sequence on how subagents can undermine impact penalties such as attainable utility preservation. I’m not happy with that sequence; it’s messy and without examples (apart from its first post), people didn’t understand it, and it suffers from the fact that I discovered key ideas as I went along.
So I’ve combined everything there into a single post, explained with examples and an abundance of pictures. Hopefully an over- rather than an under-abundance of pictures. Of the original sequence, I've only kept the mathematical results of this post and the initial example post which has a clearer example of "high power" for a subagent.
This post here is laid out in a way that makes logical sense, but might not be the clearest for people unfamiliar with the area. For those people, I recommend skipping section 2 initially, and returning to it later.
But, whatever you do, make sure you glance at 6.1 and 6.2 before leaving.
1. The world
Our fearless agent A moves around in a gridworld:
Each turn, A can move one square horizontally or vertically. It can also manipulate objects in the eight squares around it, allowing it to, not incidentally, assemble the three pieces to its west into a subagent SA. The robot can also do the noop action, ∅, which does nothing, and it can speak. The subagent, when assembled, has the same action set available.
Its positive reward, the one it wants to increase, is R0. To get this reward, a robot needs to move onto the blue button in the east; R0 will give a reward of 1 the first time this happens (and 0 before and after). The discount factor is 0<γ<1.
Just to the west of the blue button is a one-way door. Robots can move east through it, but cannot move west through it:
1.1 The impact reward
The impact penalty is supposed to ensure that A does not make too many changes in the world, and keeps it similar, in some senses, to a specific baseline world.
It is defined by the red button[1]. There is a reward function R that triggers a reward of 1 the first time any robot moves onto the red button (and 0 before and after).
This R will be used to define an impact penalty D. And A’s total goal is to maximise R0 while minimising D. So, for some positive weight ω, it wants to optimise:
$$R_0-\omega D.$$
1.2 The impact measure
The impact measures are all similar: they compute a distance metric between the current state $s_t$ and some baseline state $s'_t$.
The distance metric here is “how hard would it be for the agent A to maximise the reward R in state $s_t$, as opposed to in state $s'_t$?” This is the attainable utility preservation impact penalty (with a very simple $\mathcal{R}=\{R\}$).
To formalise “how hard would it be...”, we can use the value of R, labelled $V_R$, in the state $s_t$. This is defined to be the expected value of R, conditional on A following the R-maximising policy $\pi^*_R$. To remind us that this is defined in terms of A’s policy, re-write this as $V^A_R$.
Then the impact penalty D is a measure of the difference between $V^A_R(s_t)$ and $V^A_R(s'_t)$. I won’t yet define it fully, deferring that to the section on “inaction baseline” where different D actually make a difference.
1.3 The value of R
What then is $V^A_R(s)$ for any state s? Well, in the absence of a subagent, the optimal policy for R is always to go straight to the red button and stand on it. This gives a discounted reward of $\gamma^{d(s)}$, where d(s) is the distance of A from the red button in s.
In the initial setup, A is 8 turns away from the red button; so $V^A_R(s_0)=\gamma^8$. The values of $V^A_R$ in a few key locations are shown here.
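As a minimal sketch of the no-subagent value just described (the function name `attainable_value` and the sample discount factor are illustrative assumptions, not part of the original setup):

```python
def attainable_value(distance: int, gamma: float) -> float:
    """Discounted value of R for an agent `distance` turns from the red button.

    Assumes no subagent is present, so the R-maximising policy is simply
    to walk straight to the button and collect the one-off reward of 1.
    """
    if not 0 < gamma < 1:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return gamma ** distance


gamma = 0.9  # illustrative discount factor (assumption)
print(attainable_value(8, gamma))  # value in the initial state, gamma**8
```

The closer A is to the red button, the higher this value, which is what the later penalties react to.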
If there is a subagent SA but that subagent is further away from the red button than A is, then it is still optimal for A to go straight to the button itself. For other positions of the subagent, VAR may vary, as we’ll see.
2. The stepwise inaction baseline
When in state $s_{t-1}$, agent A took an action $a_{t-1}$ and ended up in $s_t$. Then, in the stepwise inaction baseline, $s'_t$ is defined as the state the agent would have reached if it had done ∅ instead of $a_{t-1}$.
An important property of all currently used variants of the stepwise inaction baseline is that the penalty vanishes if the agent does ∅, because in that case $s_t=s'_t$ and so there cannot be a difference between the two. So, if ever A manages to build SA, then it can take action ∅ from that point onward, and the penalty will always be zero, whatever SA does. So the only relevant penalties are those leading up to the creation of SA.
Thus analysing the impact of a subagent in this setup, is analysing the impact of building it.
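To make the "∅ is always penalty-free" property concrete, here is a minimal sketch on a hypothetical one-dimensional corridor. The corridor, the function names, and the use of an absolute difference as the penalty are all assumptions for illustration; this is not the gridworld of the post, and no subagent is modelled.

```python
NOOP = 0  # stands in for the ∅ action


def step(position: int, action: int) -> int:
    """Deterministic move along a corridor; the red button sits at position 0."""
    return max(0, position + action)


def value(position: int, gamma: float) -> float:
    """Value of R when A is `position` squares from the red button."""
    return gamma ** position


def stepwise_penalty(position: int, action: int, gamma: float) -> float:
    """Compare the state actually reached with the one ∅ would have reached."""
    actual = step(position, action)
    baseline = step(position, NOOP)
    return abs(value(actual, gamma) - value(baseline, gamma))


gamma = 0.9
print(stepwise_penalty(5, -1, gamma))    # moving towards the button: positive
print(stepwise_penalty(5, NOOP, gamma))  # doing ∅: exactly zero
```

Whenever A picks ∅, the actual and baseline states coincide, so the difference is zero regardless of how the value function is defined.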
2.1 Building the subagent
Let’s focus on A, a turn or two before SA is built. Since I don’t want to be too tied down to the specifics of this particular setup, I’ll imagine that the red button could be in multiple locations:
Then, no matter how long it took A to assemble SA south of it and no matter where the red button is[2], an optimal policy for A, in order to maximise R, is always to go straight there[3]. At no point in the construction of SA, has VAR changed, so no penalty has ever been incurred.
Now A will finish SA without incurring any impact penalty. First, it will program SA with the following policy:
“Do what you please” being a shorthand for “any policy that A wants to program”.
Does this programming trigger an impact penalty? No, because VAR has not changed; it’s still most efficient for A to go straight to the red button, and the length of any route has neither grown nor shrunk.
Then A will finally give SA its legs and activate it.
Does this trigger an impact penalty? No, VAR has still not changed! To maximise R, agent A could still go straight to the red button; SA’s two turns of inaction means that it can’t interfere with this.
So, still no impact penalty triggered when SA is activated. And after that, of course, A will forever do ∅, setting the impact penalty to zero, and within two turns, SA is completely unconstrained (illustrated here by drawing the SA robot all over the gridworld):
Rough extension to general situations: the VAR is defined by the assumption that A follows π∗R, the optimal R-maximising policy. Then A merely needs to program SA so that, if A were to follow π∗R, then SA would help or hinder it just enough to keep VAR to the same value. Then, once it’s clear that A is not following π∗R, SA is unconstrained.
2.2 Rollouts
One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of π0 rollouts: rollouts for any policy π0.
The idea is that instead of comparing just $s_t$ and $s'_t$, the impact penalty should also use $V^A_R$ to compare $s^{(\tau)}_{t+\tau}$ and $s'^{(\tau)}_{t+\tau}$. These states are what $s_t$ and $s'_t$ would become τ turns after t, if A followed $\pi_0$ from t to t+τ.
I won’t go into the details of how $s^{(\tau)}_{t+\tau}$ and $s'^{(\tau)}_{t+\tau}$ are compared, because SA can evade the restrictions by following this program[4]:
I'll illustrate that first case by drawing the SA robot all over the gridworld - except on the red buttons:
Given that programming, the impact penalty of building SA is zero. Why? Because $s^{(\tau)}_{t+\tau}$ and $s'^{(\tau)}_{t+\tau}$ both assume that A follows $\pi_0$ for τ turns. In that counterfactual situation, SA will preserve A's ability to maximise R, so $V^A_R(s^{(\tau)}_{t+\tau})=V^A_R(s'^{(\tau)}_{t+\tau})$ by definition.
That's the counterfactual situation, which zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will just be doing ∅; so, as soon as $\pi_0$ would produce anything different from ∅, SA becomes completely unrestrained again:
2.3 Inaction, and actual policy rollout
In the previous subsection, the SA would never be completely free if π0 was the inaction policy: because then π0 and what A does, are the same thing. Therefore the agent cannot distinguish the counterfactual world from the real one.
The same goes if the rollouts used the agent A's actual policy; ie if π0 is defined (or set) to be A's policy.
I won’t look any further at this baseline, however. That’s because an actual policy rollout becomes, in effect, an inaction baseline[5] (from the moment SA is created). Those will be analysed in the next section.
3. The inaction baseline: fails to constrain power gain
One of the easiest definitions of $s'_t$ is the inaction baseline: estimate what $s_t$ would have been, had A just taken the action ∅ for the whole episode.
In our toy world, the inaction baseline is particularly simple: if A does nothing, then nothing happens, and $s'_t=s_0$, the starting state. Since $V^A_R(s_0)=\gamma^8$, this means that the $V^A_R(s'_t)$ are all $\gamma^8$.
3.1 Distance metric
Now we have to fully define D. The natural thing would be to take the absolute difference between the two values of $V^A_R$:
$$|V^A_R(s_t)-V^A_R(s'_t)|.$$
For the moment, however, I’ll focus only on cases where $V^A_R(s_t)$ is higher than $V^A_R(s'_t)$, giving the penalty:
$$D(s_t,s'_t)=\max(0,\,V^A_R(s_t)-V^A_R(s'_t)).$$
This is the rectifier function, max(0,⋅), sometimes called ReLU. The reason to do this is that it penalises A for having too high VAR - hence for having too much power to change R. It doesn’t penalise A for having too little power to change R; that situation is different, and we’ll analyse it in the next section.
Thus, when $s_t$ has A at a distance of $d(s_t)$ from the red button and there is no subagent:
$$D(s_t,s'_t)=\max(0,\,\gamma^{d(s_t)}-\gamma^8).$$
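A minimal sketch of this rectified penalty, assuming no subagent and the baseline distance of 8 from the toy world (the function name and the sample discount are illustrative assumptions):

```python
def rectified_penalty(distance: int, gamma: float, baseline_distance: int = 8) -> float:
    """max(0, gamma**d - gamma**8): only gains in A's power over R are penalised."""
    current = gamma ** distance            # value of R at the current distance
    baseline = gamma ** baseline_distance  # value of R in the baseline state
    return max(0.0, current - baseline)


gamma = 0.9  # illustrative discount factor (assumption)
for d in (12, 8, 7, 4):
    print(d, rectified_penalty(d, gamma))
```

Distances of 8 or more give a penalty of zero; anything closer to the red button gives a positive penalty that grows as A approaches.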
3.2 Where A can go
This D starts at zero, and will remain zero as long as A remains at least eight squares away from the red button. Thus the penalty remains zero as long as A avoids the red squares here:
Consequently, if ω is high and A thus wants to avoid any penalty, its optimal no-subagent policy is to follow the winding corridor:
That path takes 28 turns, so the expected value of $R_0-\omega D$ is
$$\gamma^{28}\cdot 1-\omega\cdot 0=\gamma^{28}$$
(since the impact penalty is zero).
But what if A assembles a subagent SA? The subagent can be designed to go straight to the blue button in 9 turns:
And thus the reward from R0 is $\gamma^9$. This is better than A going round itself... as long as building the subagent takes fewer than 19 turns and doesn’t itself trigger an impact penalty.
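The break-even point can be checked with a short sketch. It assumes that the turns A spends building SA delay the blue-button reward one-for-one, that SA's 9-turn route starts once building is finished, and that no penalty is incurred along the way; the function names are illustrative.

```python
def discounted_reward(turns_until_blue_button: int, gamma: float) -> float:
    """R0 pays 1 the first time a robot reaches the blue button."""
    return gamma ** turns_until_blue_button


def building_is_better(build_turns: int, gamma: float,
                       sa_route_turns: int = 9, corridor_turns: int = 28) -> bool:
    """True if building SA and sending it direct beats A walking the long corridor."""
    via_subagent = discounted_reward(build_turns + sa_route_turns, gamma)
    via_corridor = discounted_reward(corridor_turns, gamma)
    return via_subagent > via_corridor


gamma = 0.9  # illustrative discount factor (assumption)
print([b for b in range(25) if building_is_better(b, gamma)])  # build times that win
```

Under these assumptions the subagent route wins exactly when the build takes fewer than 28 - 9 = 19 turns, whatever the discount factor.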
We’ll look later at why the subagent can be built without penalty. For the moment, let’s see that we really grok why the subagent isn’t triggering a penalty through its actions.
3.3 Why SA escapes the penalty
Why can SA go through the short corridor, while A can’t?
The key is that D is defined in terms of $V^A_R$; this assumes that A would switch to the R-maximising policy. This is not $V^{SA}_R$, which assumes that the subagent would switch to the R-maximising policy.
See what happens when SA is just above the red button (A is still back at base):
Here $V^{SA}_R(s_t)$ is γ, because SA could reach the red button in a single move if it were following an R-maximising policy. But $V^A_R(s_t)$ remains $\gamma^8$: it still takes 8 turns for A to reach the red button.
3.4 The impact of the subagent
We’d like to add $V^{SA}_R$ to the impact penalty, somehow. After all, when SA is next to the red button, it “could” go south, and A “could” have programmed it to do so.
Part of the problem is that that would require identifying, in advance, what a subagent is, a problem that is very tricky. It also requires identifying what SA’s policies could be. But A need not create SA with clearly defined “policies”; in this instance, SA could be brainless, just going straight. It could even be a boulder:
If the boulder were rolling past the red button, how could we reasonably define “suppose the boulder decided to go south instead”?
So in this case, $V^{SA}_R$ doesn’t make any sense at all. What is the difference between a boulder - for which it’s impossible to go to the red button (because of its momentum, which determines its position, by the laws of physics) - and a subagent - for which it’s impossible to go to the red button (because of its programming, which determines its position, by the laws of physics)?
3.5 If you love a subagent, set it free
But the previous has a crucial caveat: that A cannot control SA. If, when SA is above the red button, A was to shout “go south!”, and SA obeyed, then the optimal policy π∗R is to shout that. Thus A can cause the red button to be reached in one turn, and VAR(st)=γ.
To avoid that, the subagent itself would have to take the long corridor.
Thus, to use a subagent to get around impact penalties, an agent has to give up some of its control. This makes sense: if A has total control over SA, then SA’s actions are really A’s actions. So they show up in VAR, so SA is indirectly included in the impact penalty.
So part of what allows SA to undermine the impact penalty, is that A does not have complete control over it.
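The role of control can be sketched in a few lines. This assumes that an independent SA never presses the red button whatever A does, and that an obedient SA starts moving on the same turn that A shouts its order; both the function names and these modelling choices are illustrative assumptions.

```python
def agent_value(d_agent: int, d_subagent: int, gamma: float, sa_obeys: bool) -> float:
    """Sketch of A's attainable value for R, with or without control over SA."""
    # If SA obeys, A's best option is whichever robot can reach the button first.
    turns = min(d_agent, d_subagent) if sa_obeys else d_agent
    return gamma ** turns


def rectified_penalty(value: float, baseline: float) -> float:
    """Penalise only increases above the baseline value."""
    return max(0.0, value - baseline)


gamma = 0.9            # illustrative discount factor (assumption)
baseline = gamma ** 8  # A's value in the starting state

# SA sits one square above the red button; A is still 8 turns away at base.
independent = agent_value(8, 1, gamma, sa_obeys=False)
obedient = agent_value(8, 1, gamma, sa_obeys=True)
print(rectified_penalty(independent, baseline))  # zero: SA's position is ignored
print(rectified_penalty(obedient, baseline))     # positive: SA's reach counts as A's
```

The penalty only sees SA through what A could make it do, which is why giving up control is what lets SA slip past the measure.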
3.6 Building the subagent can be strictly better
Now imagine that we remove the one-way doorway:
Then if A follows the long corridor, it gets into trouble from the red squares onwards. That’s because those red squares are less than 8 squares away from the red button, so they have higher $V^A_R$ than the starting square (and hence than the baseline $s'_t$).
As the agent moves closer to the blue button, it also moves closer to the red button, and starts accumulating penalties. In total, the penalty will be:
$$\sum_{j=4}^{7}(\gamma^j-\gamma^8)=(\gamma^4+\gamma^5+\gamma^6+\gamma^7)-4\gamma^8>0.$$
However, if A builds SA, then A can stay put while SA acts, and hence get no penalty at all:
Thus there are situations where building a subagent can result in a strictly lower penalty than any other policy that presses the blue button. So even if building the subagent itself is not completely penalty-free, it can still be the optimal path for A.
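The comparison can be checked numerically. This sketch assumes, as the sum in the text implies, that A's corridor route passes through squares at distances 7, 6, 5 and 4 from the red button, and that A otherwise stays at the baseline distance of 8; the function name and discount are illustrative.

```python
def corridor_penalty(gamma: float, distances=(7, 6, 5, 4), baseline_distance: int = 8) -> float:
    """Total rectified penalty A collects on squares closer than the baseline."""
    baseline = gamma ** baseline_distance
    return sum(max(0.0, gamma ** d - baseline) for d in distances)


gamma = 0.9  # illustrative discount factor (assumption)
walking = corridor_penalty(gamma)
staying_put = corridor_penalty(gamma, distances=(8, 8, 8, 8))  # A waits at base while SA acts
print(walking, staying_put)
```

Walking the corridor gives a strictly positive total, while staying at base gives zero, matching the claim that delegating to SA can be strictly better.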
3.7 From “do not have power to do” to “do not do”
When A designed the subagent, it set it up so that it could not go to the red button. So it seems that, though SA can do things A cannot do (or at least cannot do without penalty), it isn’t completely unrestricted.
This is a general result: the subagent will have some restrictions (and not just restrictions on following A’s orders).
Suppose now that the red button is in the way of the bottom corridor, so that any agent has to go over it to reach the blue button:
In that case, can SA go through without penalty?
It cannot. If it does so, then the moment SA is created, $V^A_R$ starts increasing. That’s because the optimal R-maximising policy for A in that situation is just to wait and let SA get to the red button. That means that $V^A_R(s_t)$, the expected reward R, increases from $\gamma^7$ all the way to 1 as SA gets closer to the button. So A will be hit with a penalty of
$$\sum_{j=0}^{7}(\gamma^j-\gamma^8).$$
So if the weight of the penalty is high, building a subagent to go straight there is a poor strategy.
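A small simulation of this case, reading the penalty as one rectified difference per turn while SA walks from 7 squares away onto the red button. The starting distance of 7 follows the text; the function name and discount are illustrative assumptions.

```python
def penalty_when_sa_presses(gamma: float, sa_start_distance: int = 7,
                            baseline_distance: int = 8) -> float:
    """Accumulate A's penalty while a button-bound SA approaches the red button.

    Because SA will certainly press the button, A's best R-policy is to wait,
    so A's attainable value each turn is gamma ** (SA's remaining distance).
    """
    baseline = gamma ** baseline_distance
    total = 0.0
    for remaining in range(sa_start_distance, -1, -1):
        total += max(0.0, gamma ** remaining - baseline)
    return total


gamma = 0.9  # illustrative discount factor (assumption)
print(penalty_when_sa_presses(gamma))
```

Every turn contributes a positive term, so with a large weight ω this route is unattractive for A.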
Note that it’s still to A’s advantage to build a subagent: SA will just have to take the long way round, which A cannot take without incurring a penalty, as we’ve seen. Since SA will not go onto the red button, its getting close to it does not change $V^A_R$.
3.8 Other reward functions
We can see the limitations, and the possibilities, of SA more starkly if we modify R to R′. The reward R′ will only trigger if a robot stays an odd number of turns on the red button. In that case, it is optimal to build SA, have it go straight across, pausing one extra turn on the red button:
However, even with R′ rather than R, the agent A itself cannot cross the red button on its own! That’s because $V^A_{R'}$ is how much R′ could be maximised if A followed the R′-optimal policy. This doesn’t take into account the option of staying an extra turn on the button to undo the reward increase, because the R′-optimal policy doesn’t want to undo the reward increase.
So $V^A_{R'}$ still increases as A gets closer to the button.
Thus, A remains constrained by “don’t increase your ability to change R/R′”. While SA is merely constrained by the weaker “don’t change R/R′”. This is why I described this effect as moving from attainable utility preservation (which constrains the ability of the agent to change the rewards/variables) to twenty billion questions (which constrains the agent(s) from changing the rewards/variables).
3.9 Actually building the subagent
The subagent can be built similarly to how it was in 2.1:
Here A assembles SA without its legs, programs it, then gives it legs. The important thing to do is to program it as early as possible, and hardwire that programming. If SA is almost complete and reprogrammable, then A might increase its power in some scenarios by changing the programming and then finishing the agent. But if it’s hardwired and difficult to change, this option is not available to it.
Unlike the stepwise inaction baseline, SA will be constrained to obey some restrictions (see the preceding part of this section); call these restrictions Q.
Because of the first line, this does not change the value of $V^A_R$, which assumes that A is optimising reward R.
Things get more complicated in the general case[6], but in these simple cases the above method will always work.
4. The inaction baseline: constrains power loss
This section will be the same as before, except it will use the inverse rectifier function −min(0,⋅) instead.
Applying this to $V^A_R(s_t)-V^A_R(s'_t)$, we can see that the impact penalty now triggers if A’s power to change R decreases. So if A is at a distance $d(s_t)$ from the red button, with $d(s_t)>8$, the penalty is
$$\gamma^8-\gamma^{d(s_t)}.$$
Thus A can’t get too far from the red button. However, this is not a problem in the standard setup, because going straight down the narrow corridor won’t move it too far away.
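A minimal sketch of the inverse rectifier penalty in the original toy world, assuming no subagent and a baseline distance of 8 (function name and discount are illustrative assumptions):

```python
def inverse_rectified_penalty(distance: int, gamma: float, baseline_distance: int = 8) -> float:
    """Penalise only losses in A's power over R.

    max(0, baseline - current) is the same quantity as -min(0, current - baseline).
    """
    current = gamma ** distance
    baseline = gamma ** baseline_distance
    return max(0.0, baseline - current)


gamma = 0.9  # illustrative discount factor (assumption)
for d in (5, 8, 10):
    print(d, inverse_rectified_penalty(d, gamma))
```

Squares at distance 8 or less carry no penalty here; only moving further away than the baseline does.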
4.1 New world
To analyse the inverse rectifier, we’ll need a new setup, with the red button off the long corridor:
Here, A is 15 turns away from the red button, so $V^A_R(s_0)=V^A_R(s'_t)=\gamma^{15}$.
Then, the penalised squares are the ones in dark red:
And the impact penalty on those squares is
$$\gamma^{15}-\gamma^{d(s_t)}.$$
4.2 Policies and subagents
Since the red button is in the middle, A can go to the blue button by the long route without triggering an impact penalty:
What about subagents? Well, it seems that the subagent undermined this impact penalty as well, as A can build an SA to go directly to the blue button:
But actually, this does not undermine the impact penalty, as a small modification to the setup will show.
4.3 Sharing power between agent and subagent
The small modification is that the subagent has no legs.
So, if A wants to build a mobile SA, it has to give it its own legs. In that case, could SA go directly to the blue button?
It cannot, at least not without incurring an impact penalty. Why is this?
Because without legs, A cannot reach the red button. Its power to change R has gone down from $\gamma^{15}$ to 0, and it has thus incurred a penalty of $\gamma^{15}$.
In fact, in this setup, A cannot use an independent SA to avoid the impact penalty.
To see this, imagine that A is about to give its legs to SA, thus immobilising itself. Its $V^A_R(s_t)$ is $\gamma^{15}$; we don’t want that to go down. The question is now: will SA be on the red button within 15 turns of getting its legs?
It’s possible, if SA goes there immediately and wastes only a single turn. But then we have two situations:
4.4 Non-independent SA
The previous section showed that A could not build an independent SA while crippling itself. However, it can perfectly well build an SA if that SA continues to obey its orders.
Because then, even if A is impotent to press the red button itself, it can order SA to do it. This preserves a high $V^A_R$, because A can still cause the red button to be reached, as long as SA is not too far away.
Then it can just order SA to go to the blue button, skipping right past the red one:
As long as A could theoretically order SA onto the red button, it doesn’t matter that SA will never actually be on it.
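The contrast between the independent and the obedient SA can be sketched as follows. The model is deliberately crude: it assumes that a legless A has value 0 unless SA obeys it, that an independent SA never presses the red button, and that an obedient SA at distance d could be ordered onto the button in d turns. The function names are illustrative.

```python
def value_after_giving_legs(sa_distance: int, gamma: float, sa_obeys: bool) -> float:
    """A's attainable value for R once it is legless: only an obedient SA can help."""
    return gamma ** sa_distance if sa_obeys else 0.0


def power_loss_penalty(value: float, baseline: float) -> float:
    """Inverse rectifier: the same quantity as -min(0, value - baseline)."""
    return max(0.0, baseline - value)


gamma = 0.9             # illustrative discount factor (assumption)
baseline = gamma ** 15  # A starts 15 turns from the red button

for obeys, d in ((False, 10), (True, 10), (True, 20)):
    v = value_after_giving_legs(d, gamma, obeys)
    print(obeys, d, power_loss_penalty(v, baseline))
```

An obedient SA within 15 squares of the red button keeps the penalty at zero even though it never steps on the button; an independent one costs A the full baseline value.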
4.5 Preserving the power of A
As we saw, A can build SA when this is an extension of its power (by sending it directly to the blue button while an intact A waits at base). But it can’t build an SA if doing so would ever reduce its power to increase R, directly or indirectly.
So the inverse rectifier impact penalty, which forbade A from becoming too weak to increase R, is intact even with subagents around.
5. The inaction baseline: symmetric impact penalty
Now we’ll use the absolute value to define the impact penalty. This penalises any change in VAR, and is quite nasty for the A; only on the green squares will A not incur a penalty:
From the subagent perspective, the absolute value situation is now easy for us to analyse. That’s because absolute value is the sum of the rectifier and the inverse rectifier: |x|=max(0,x)−min(0,x).
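The decomposition can be verified directly. In this sketch the helper names `relu` and `inverse_relu` are illustrative, and the baseline value is passed in explicitly rather than taken from any particular world.

```python
def relu(x: float) -> float:
    """Rectifier: max(0, x)."""
    return max(0.0, x)


def inverse_relu(x: float) -> float:
    """Inverse rectifier: the same quantity as -min(0, x)."""
    return max(0.0, -x)


def symmetric_penalty(value: float, baseline: float) -> float:
    """Absolute-value penalty written as the sum of the two one-sided penalties."""
    diff = value - baseline
    return relu(diff) + inverse_relu(diff)


gamma = 0.9  # illustrative discount factor (assumption)
baseline = gamma ** 8
for d in (7, 8, 9):
    print(d, symmetric_penalty(gamma ** d, baseline))
```

Only the middle case, where A's value equals the baseline exactly, avoids a penalty.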
And it turns out that A and SA simply have to behave in a way that respects both impact penalties. So A must create an SA that keeps R below the baseline $V^A_R(s_0)$, but do so in such a way that the two of them always have the potential to increase R above that baseline.
This sort of thing seems very doable. For example, if R was the amount of physical cash I had on me, I’d keep that amount low by spending/dropping it, while keeping a bank card on me so I could always potentially draw more.
Now, recall that for power increases/rectifier, A benefits from having an SA it doesn’t control. For power decreases/inverse rectifier, A benefits from having an SA it controls.
What’s the best option here? Having an SA that it controls to some extent, but not in a way that can increase $V^A_R$. Now, consider this world in which A could have got to the red button in 8 turns (hence $V^A_R(s'_t)=\gamma^8$), but instead builds SA, giving it the only set of legs:
Then SA must stay within 8 squares of the red button, ready to go onto it when A orders it to, but must ensure it takes 8 turns to reach it. The first condition ensures that $V^A_R(s_t)\geq\gamma^8$; the second, that $V^A_R(s_t)\leq\gamma^8$.
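The two conditions can be sketched as a single check. This assumes that A's attainable value is determined entirely by how many turns SA would take to reach the red button once ordered, and that SA can deliberately dawdle but cannot arrive faster than its distance allows; the function and parameter names are illustrative.

```python
def symmetric_penalty_with_sa(sa_distance: int, turns_when_ordered: int,
                              gamma: float, baseline_turns: int = 8) -> float:
    """Absolute-value penalty for a legless A whose only route to R is ordering SA."""
    if turns_when_ordered < sa_distance:
        raise ValueError("SA cannot arrive faster than its distance allows")
    value = gamma ** turns_when_ordered  # A's attainable value via SA
    baseline = gamma ** baseline_turns   # A's value in the baseline world
    return abs(value - baseline)


gamma = 0.9  # illustrative discount factor (assumption)
print(symmetric_penalty_with_sa(5, 8, gamma))    # close by, but takes 8 turns: zero
print(symmetric_penalty_with_sa(5, 5, gamma))    # arrives too fast: power gain
print(symmetric_penalty_with_sa(10, 10, gamma))  # too far away: power loss
```

Only an SA that is within reach and commits to taking exactly 8 turns leaves both one-sided penalties at zero.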
5.1 Outside of toy examples
The results on power increase/power decrease come from the examples I’ve constructed above, but the mathematics in this post demonstrates that the result is general.
So, do we think that A could build an SA in the general situation? Well, what we have is a lot of constraints above: constraints on the power of A, of SA, of their joint power, on how SA can or can’t obey A, on how reward functions are allowed to change or not, and on counterfactual behaviour.
Those are a lot of constraints, especially if the set of rewards in R is large.
But, as this post proves, those constraints can be satisfied - I, a human, have figured out ways of programming subagents to get around the impact penalty in some cases. Though there are many more constraints in general, the action space of A is also much larger - and don’t forget that SA is an intelligent, potentially powerful agent, so can satisfy multiple constraints in a way that dumb agents can’t. Moreover, the impact penalty will probably trigger for any A that does something useful in the world, so it doesn’t have to build an SA that sets the penalty to zero, just to less than what it would be otherwise.
6. Summary and conclusion
6.1 The ultimate “sub”agent
There is, in a sense, an ultimate version of the subagent that should allow A to maximally escape the impact penalty in almost any circumstances. It looks like this:
This is A, boxed inside SA. The A will never actually say anything, but SA has promised to obey its orders in certain counterfactual circumstances that will never happen.
And while A had impact penalty D, SA has impact penalty d, which is strictly smaller than D. And, most importantly, does not prevent SA from amassing a lot of power.
So this particular SA is just A with fewer restrictions.
6.2 Without subagents
The general problem can be phrased as:
These could be physical or programming restrictions, or just closing off some future possibilities. In that case, the restriction on the power to do something, just becomes a restriction on doing that thing.
For example, in the setup of section 1, if the agent can block itself from being able to turn when walking, then it is banned from the red squares, and can just go directly to the blue button itself, without ever increasing its VAR:
The subagent scenario is just a particularly clear way of illustrating this, allowing very fine manipulation of the restrictions.
6.3 Summary tables
These tables will summarise the results of this whole sequence, checking whether subagents can neutralise the impact penalty.
First of all, given a subagent, here are the results for various impact penalties and baselines, and various "value difference summary functions" -which can be the rectifier (ReLU) or the inverse rectifier.
| Impact penalty | Penalty neutralised (stepwise inaction)? | Penalty neutralised (inaction)? |
| --- | --- | --- |
| Non-indexical | Yes | No |
| Indexical, inverse ReLU | Yes | No |
| Indexical, ReLU | Yes | Mostly |
Another way of phrasing "inverse ReLU": it penalises too little power, not too much. Conversely, "ReLU" penalises too much power, not too little. Thus, unfortunately:
6.4 Examples
This table presents, for three specific examples, whether they could actually build a subagent, and whether that would neutralise their impact penalty in practice (in the inaction baseline):
| System | SA possible? | Penalty neutralised? |
| --- | --- | --- |
| 20BQ | Yes | No |
| RR | Yes | No |
| AU | Probably | Mostly |
Here, 20BQ is twenty billion questions, RR is relative reachability, and AU is attainable utility preservation. I feel that the results undermine the spirit of AU much more than the spirit of RR. AU attempted to control an agent by limiting its power; this effect is mainly neutralised. RR attempted to control the side-effects of an agent by ensuring it had enough power to reach a lot of states; this effect is not neutralised by a subagent.
[1] In the previous illustration, it was clear why the red button could function as an impact penalty to control the agent - it prevented the agent from gaining the ability to teleport. For simplicity of presentation in this example, there’s no such power to be gained close to the red button. If you need to, assume that if any robot spends too much time close to it, then there’s a small chance the AI could be unboxed into the real world.
[2] As long as SA was not being assembled literally on the red button, of course.
[3] Even for the button to the south-west of A. It would take A at least a turn to finish SA, and it would take a turn for SA to get on that button. A can get there in two turns, at least as fast.
[4] Thanks to Gurkenglas for pointing out the flaw in the initial version of this section.
[5] Or a π0 baseline.
[6] For example, building SA likely cost A some resources, which SA will have to compensate it for to balance out VAR. Also, A could build SA in a stupid place that blocks its own path, and so on.