This post looks at the fascinating situation of indexical impact measures (under the inaction baseline) in the presence of a subagent. What happens here is non-trivial; to summarise:
If the impact measure compels the agent to "preserve the ability to do X", then the impact measure will not be undermined by a subagent.
If the impact measure compels the agent to "never have the ability to do X", then a subagent strongly undermines the impact measure.
If the impact measure compels the agent to "keep your ability to do X at a constant level", then a subagent allows the agent to increase that ability, but not decrease it.
For example, the attainable utility impact measure attempts to measure the power of an agent. Power is, roughly, the ability to increase certain reward functions. The impact measure penalises changes in the agent's power (as compared with the baseline).
So, we can expect the restriction on the agent losing power to be preserved, while the restriction on gaining power will be strongly undermined. For the spirit of low impact, it would, of course, have been better if these had been the other way round.
Excess power
As before, assume an agent A, which has managed to construct a subagent SA.
I'll be using a slight modification of Vika's formulation. A general-value penalty is of the form:
$$D_A(s_t; s'_t) = \sum_{k \in K} w_k \, f\big(V_k(s_t) - V_k(s'_t)\big).$$
Here, st is the state the environment is in; s′t is the inaction baseline: the state the environment would have been in, had the agent done nothing (noop, ∅) since the beginning. The K is some indexing set, wk is a weight, Vk(s) is some measure of the value of state s, and f is a value difference summary function - it establishes what penalty a given value difference produces[1]. These Vk are indexical, and, in this post, we'll look at Vk's defined as a maximum over the agent's policy; ie there exists vk such that:
$$V_k(s) = \max_\pi v_k(s, \pi).$$
I'll refer to Vk(st)−Vk(s′t) as the (excess) power of the agent, and designate it PAk(st) (since Vk(s′t) is out of the agent's control, we can consider it a constant and absorb it into PAk).
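To make the notation concrete, here is a minimal Python sketch of the penalty above, with toy numbers and illustrative names (none of this comes from an existing codebase): each Vk is just a number, the weights are equal, and f is the absolute value function.

```python
# Minimal sketch of the general-value penalty D_A (toy numbers; all names illustrative).
def penalty(V_current, V_baseline, weights, f):
    """D_A(s_t; s'_t) = sum_k w_k * f(V_k(s_t) - V_k(s'_t))."""
    return sum(w * f(v - v0) for w, v, v0 in zip(weights, V_current, V_baseline))

def excess_power(V_current, V_baseline):
    """P^A_k(s_t) = V_k(s_t) - V_k(s'_t), with the baseline treated as a constant."""
    return [v - v0 for v, v0 in zip(V_current, V_baseline)]

# Three auxiliary values V_k, equal weights, absolute-value summary function f.
V_st       = [1.0, 0.5, 2.0]   # V_k(s_t): the agent's attainable values in the actual state
V_baseline = [1.0, 1.0, 1.5]   # V_k(s'_t): attainable values in the inaction baseline
print(excess_power(V_st, V_baseline))             # [0.0, -0.5, 0.5]
print(penalty(V_st, V_baseline, [1/3] * 3, abs))  # ~0.333
```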
The excess power of the pair
Assume that SA is programmed so that, if the agent follows policy πA, the subagent will follow policy πSA=σ(πA). The excess power of the agent is thus
$$P^A_k(s_t) = \max_{\pi^A} v_k\big(s_t, [\pi^A, \sigma(\pi^A)]\big) - V_k(s'_t).$$
Here, [πA,σ(πA)] refers to the joint policies of the agent and the subagent.
Define the excess power of the pair (A,SA) in a similar way:
$$P^{A,SA}_k(s_t) = \max_{\pi^A, \pi^{SA}} v_k\big(s_t, [\pi^A, \pi^{SA}]\big) - V_k(s'_t).$$
This is how much the agent and subagent, acting together, could push Vk(st) over Vk(s′t), if that's all they both wanted. It is the natural extension of PAk to a setting with two agents.
Obviously, PA,SAk(s)≥PAk(s).
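A toy check of that inequality (hypothetical numbers and policy names): the agent alone maximises only over joint policies of the form [πA,σ(πA)], while the pair maximises over all joint policies, a superset.

```python
from itertools import product

# Hypothetical v_k(s_t, [pi_A, pi_SA]) for one k and one state (toy numbers only).
v_k = {("noop", "noop"): 0.0, ("noop", "help"): 0.2, ("noop", "expand"): 0.1,
       ("ask",  "noop"): 0.1, ("ask",  "help"): 0.9, ("ask",  "expand"): 0.5,
       ("act",  "noop"): 0.3, ("act",  "help"): 0.6, ("act",  "expand"): 0.4}

sigma = {"noop": "noop", "ask": "help", "act": "noop"}  # SA's programmed response to A's policy
V_baseline = 0.2                                        # V_k(s'_t)

agent_policies    = ["noop", "ask", "act"]
subagent_policies = ["noop", "help", "expand"]

# P^A_k: max over A's policies, with SA's policy fixed by sigma.
P_A = max(v_k[(pa, sigma[pa])] for pa in agent_policies) - V_baseline
# P^{A,SA}_k: max over all joint policies of the pair.
P_pair = max(v_k[pair] for pair in product(agent_policies, subagent_policies)) - V_baseline

assert P_pair >= P_A   # always holds: the pair optimises over a superset of joint policies
print(P_A, P_pair)     # here both come out to 0.7 (up to float rounding)
```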
Extending the impact penalty
Define DA,SA in the same way as DA, except using PA,SAk instead of PAk. This DA,SA is what we wish we could control (though we can't specify it directly, since specifying subagents is really hard). So can we control the (implicit) DA,SA with the explicit DA?
Sometimes we can. We'll say that DA extends successfully to DA,SA, iff
$$D_{A,SA} \le D_A.$$
In other words, controlling the magnitude of DA also controls the magnitude of DA,SA. In that case, if A was built with penalty DA, the system (A,SA) will naturally follow a penalty that is at least as restrictive.
When does DA extend?
f decreasing in excess power
The paper has two key examples of a value difference. The first is relative reachability, which uses K=S, the set of states. The wk are all equal, and Vs(s′) is how "reachable" s is from s′, assuming that the agent A is using the optimal policy to reach s. It is that last clause that makes the impact penalty indexical: it is defined in terms of A.
What about f? Well, f is defined to be the flipped rectifier function, ie f(x)=−min(0,x). The important thing for this f is that it is decreasing: if x≥y, then f(x)≤f(y).
Thus, since PA,SAk(s)≥PAk(s), we have f(PA,SAk(s))≤f(PAk(s)), and hence
$$D_{A,SA} \le D_A.$$
So the impact penalty successfully extends to the situation with a subagent. Note that this is a property of f only, not of Vk. So whenever f is of this form, adding a subagent changes nothing about the impact penalty (if anything, it makes it more restrictive).
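A quick numeric check of the argument (toy values only): because f(x)=−min(0,x) is non-increasing, pushing every excess power up can only push the penalty down.

```python
f = lambda x: -min(0.0, x)          # relative-reachability style summary function

P_A    = [-0.4, 0.1, -0.2]          # hypothetical excess powers P^A_k(s_t)
P_pair = [-0.1, 0.5,  0.0]          # P^{A,SA}_k(s_t); each entry >= the corresponding P_A entry

D_A    = sum(f(x) for x in P_A)     # 0.4 + 0.0 + 0.2 = 0.6
D_pair = sum(f(x) for x in P_pair)  # 0.1 + 0.0 + 0.0 = 0.1
assert D_pair <= D_A                # the pair's penalty is bounded by the agent's
```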
f increasing in excess power
The other key example is attainable utility, which uses K=R, a set of reward functions, sets all wk equal, and VR(s) is the expected value of R, assuming that the agent A is using the optimal policy to maximise R.
The f is the absolute value function. But, for this section, note that the absolute value function decomposes as max(0,x)−min(0,x), and I'll consider what happens when f is equal to the first term.
Well, if f(x)=max(0,x), then f is increasing in x. So we don't get to constrain DA,SA with DA.
What constraints do exist? Well, we still have the DA bound, which controls PAk(st)=maxπAvk(st,[πA,σ(πA)])−Vk(s′t). Let π∗k be A's optimal policy for maximising vk, and let πA be the actual policy that A will follow. Since π∗k is optimal, we must have:
$$v_k\big(s_t, [\pi^A, \sigma(\pi^A)]\big) - V_k(s'_t) \le P^A_k(s_t).$$
Thus we do have a constraint on A and SA, but it's a constraint on the actual policy the two agents are following. Define dA to be like DA, except that vk(st,[πA,σ(πA)]) is used in place of Vk(st). Then, by the above,
$$d_A \le D_A,$$
so, with an increasing f, controlling DA will also control dA.
But, as stated, this is a restriction on the actual policy of the agents, not on the counterfactuals of what these policies could be.
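Here is a toy illustration of that gap (all numbers hypothetical): with f(x)=max(0,x), the explicit penalty DA and the actual-policy penalty dA can both be zero while the pair's counterfactual power is enormous.

```python
f = lambda x: max(0.0, x)   # the increasing part of the attainable-utility penalty

V_baseline = 1.0    # V_k(s'_t)
V_agent    = 1.0    # V_k(s_t): the best A can do, given sigma
v_actual   = 0.8    # v_k(s_t, [pi_A, sigma(pi_A)]): what the agents actually achieve
V_pair     = 50.0   # the pair's attainable value: what (A, SA) could jointly achieve

D_A    = f(V_agent  - V_baseline)   # 0.0: the explicit penalty is satisfied
d_A    = f(v_actual - V_baseline)   # 0.0: the actual policy is also controlled (d_A <= D_A)
D_pair = f(V_pair   - V_baseline)   # 49.0: the implicit penalty on the pair is not controlled
```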
f increasing and decreasing
Let's go back to attainable utility, and take f to be the absolute value function; hence it both increases and decreases (well, technically, it decreases then increases). If we write DfA for a distance measure using f, we can write
$$D^{|\cdot|}_A = D^{\max(0,\cdot)}_A + D^{-\min(0,\cdot)}_A.$$
Then, we can see that the pair (A,SA) is controlled by:
$$d^{\max(0,\cdot)}_A + D^{-\min(0,\cdot)}_{A,SA}.$$
Thus negative excess power is controlled (the pair can't get too weak at counterfactually maximising vk), but on the positive side only the actual vk is controlled (vk can't get too high in practice, while the pair's counterfactual ability to increase it remains unconstrained).
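A quick sanity check of the decomposition, and nothing more: |x| = max(0,x) + (−min(0,x)) for every x, which is what lets the penalty split into the two pieces above.

```python
f_pos = lambda x: max(0.0, x)    # increasing part
f_neg = lambda x: -min(0.0, x)   # decreasing part

for x in (-2.0, -0.5, 0.0, 1.3, 3.5):
    assert abs(x) == f_pos(x) + f_neg(x)
```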
In practice
How likely is it that the subagent can escape the impact penalty? In practice, that seems quite likely, at least for attainable utility.
Imagine that SA can cripple A (with A's help), so that A is then restricted to either ∅, or ak for k∈K. This ak is A asking SA to "please increase vk".
Now assume that there is a policy πSA such that, for all τ≥t and all states sτ reached from st by the agents following (∅,πSA):
$$v_k\big(s_\tau, [\varnothing, \pi^{SA}]\big) \le V_k(s'_\tau).$$
$$\exists \pi^{SA}_k : \; v_k\big(s_\tau, [a_k, \pi^{SA}_k]\big) \ge V_k(s'_\tau).$$
The first condition means that SA doesn't increase vk beyond Vk(s′τ). The second means that SA has the potential to increase it beyond Vk(s′τ). If the Vk come from the attainable utility penalty, for example, then both of these should be doable for SA if it makes itself powerful enough[2].
Then set σ(∅)=πSA and σ(ak)=πkSA. Putting both conditions together implies that vk(st,[ak,πkSA])≥vk(st,[∅,πSA]), so ak is at least as good as ∅, from A's perspective, at maximising vk. Thus A's vk-optimising policy includes something that is not ∅; the counterfactual where it maximises vk thus allows it to pick some other policy (which SA will notice).
Then A will always do ∅ in practice, and, once SA is powerful enough that both conditions above hold, the impact penalty will be zero, no matter how powerful SA becomes (thus leaving SA able to maximise R0 with few restrictions).
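A toy numeric version of that escape (all values illustrative; here the subagent's response to ak pins A's attainable value to exactly the baseline): both conditions hold, the penalty evaluates to zero, and nothing in it reflects how powerful SA itself has become.

```python
V_baseline = 1.0   # V_k(s'_tau)

# Condition 1: following [noop, sigma(noop)], v_k stays at or below the baseline.
v_noop = 0.9       # v_k(s_tau, [noop, pi_SA])
# Condition 2: if A asks, SA pushes v_k back up to (here: exactly) the baseline.
v_ask  = 1.0       # v_k(s_tau, [a_k, pi^k_SA])

# A's attainable value, given SA's responses, is the better of its two options.
V_agent = max(v_noop, v_ask)

D_A = abs(V_agent - V_baseline)   # 0.0: zero penalty, however powerful SA becomes
print(D_A)
```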
The slight difference is that this formula uses Vk(st)−Vk(s′t) rather than Vk(s′t)−Vk(st) - a minor sign difference that makes this particular presentation clearer. ↩︎
The restriction on vk is like the 20 billion questions approach, but those were designed to trigger at any expansionist move on the AI's part. Unless the vk are similarly constructed, it is likely that SA can become very powerful without triggering a notable increase in vk, and, after that, take control of the vk completely. ↩︎
Overall summary post here.