Overall summary post here.

This post looks at the fascinating situation of indexical impact measures (under the inaction baseline), in the presence of a subagent. What happens here is non-trivial; to summarise:

  • If the impact measure compels the agent to "preserve the ability to do X", then the impact measure will not be undermined by a subagent.
  • If the impact measure compels the agent to "never have the ability to do X", then a subagent strongly undermines the impact measure.
  • If the impact measure compels the agent to "keep your ability to do X at a constant level", then a subagent allows the agent to increase that ability, but not decrease it.

For example, the attainable utility impact measure attempts to measure the power of an agent. Power is, roughly, the ability to increase certain reward functions. The impact measure penalises changes in the agent's power (as compared with the baseline).

So, we can expect the restriction on the agent losing power to be preserved, while the restriction on gaining power will be strongly undermined. For the spirit of low impact, it would, of course, have been better if these had been the other way round.

Excess power

As before, assume an agent $A$, which has managed to construct a subagent $SA$.

I'll be using a slight modification of Vika's formulation. A general-value penalty is of the form:

$$IP(s_t) = \sum_{v \in V} w_v f\big(\mathcal{V}_v(s_t) - \mathcal{V}_v(s'_t)\big).$$

Here, $s_t$ is the state the environment is in; $s'_t$ is the inaction baseline, the state the environment would have been in, had the agent done nothing (noop, $\emptyset$) since the beginning. The $V$ is some indexing set, $w_v$ is a weight, $\mathcal{V}_v$ is some measure of the value of a state, and the $f$ is a value difference summary function - it establishes what value difference will give what penalty[1]. These are indexical, and, in this post, we'll look at $\mathcal{V}_v$'s defined as the maximum over the agent's policy $\pi_A$; ie there exists $g_v$ such that:

$$\mathcal{V}_v(s_t) = \max_{\pi_A} g_v(s_t, \pi_A).$$

I'll refer to $\mathcal{V}_v(s_t) - \mathcal{V}_v(s'_t)$ as the (excess) power of the agent, and designate it $EP_v(s_t) = \max_{\pi_A} g_v(s_t, \pi_A)$ (since $\mathcal{V}_v(s'_t)$ is out of the agent's control, we can consider it a constant and absorb it into $g_v$).
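
To make these definitions concrete, here is a minimal sketch in Python, with a made-up indexing set, policies and values (none of them from the post), that computes the excess powers $EP_v$ and the penalty $IP$ for a given $f$:

```python
# Toy illustration (made-up numbers) of the general-value penalty
#   IP(s_t) = sum_v w_v * f(V_v(s_t) - V_v(s_t')),
# where each V_v is indexical: V_v(s_t) = max over the agent's policies of g_v.

POLICIES = ["noop", "go_left", "go_right"]
WEIGHTS = {"reach_room_B": 1.0, "keep_vase": 1.0}

# g_v(s_t, pi_A): value index v would reach from the current state under policy pi_A.
g = {
    "reach_room_B": {"noop": 0.2, "go_left": 0.9, "go_right": 0.4},
    "keep_vase":    {"noop": 1.0, "go_left": 0.3, "go_right": 1.0},
}

# V_v of the inaction baseline state s_t' (a constant, outside the agent's control).
baseline = {"reach_room_B": 0.5, "keep_vase": 1.0}

def excess_power(v):
    """EP_v(s_t) = max_{pi_A} g_v(s_t, pi_A) - V_v(s_t')."""
    return max(g[v][pi] for pi in POLICIES) - baseline[v]

def impact_penalty(f):
    """IP(s_t) = sum_v w_v * f(EP_v(s_t))."""
    return sum(WEIGHTS[v] * f(excess_power(v)) for v in WEIGHTS)

print({v: excess_power(v) for v in WEIGHTS})
print("penalty (f = |x|):", impact_penalty(abs))
```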

The excess power of the pair

Assume that $SA$ is programmed so that, if the agent follows policy $\pi_A$, the subagent will follow policy $\pi_{SA}(\pi_A)$. The excess power of the agent is thus

$$EP_v(s_t) = \max_{\pi_A} g_v\big(s_t, (\pi_A, \pi_{SA}(\pi_A))\big).$$

Here, $(\pi_A, \pi_{SA}(\pi_A))$ refers to the joint policies of the agent and the subagent.

Define the excess power of the pair in a similar way:

$$EP^{A+SA}_v(s_t) = \max_{\pi_A, \pi_{SA}} g_v\big(s_t, (\pi_A, \pi_{SA})\big).$$

This is how much the agent and subagent, acting together, could push $\mathcal{V}_v$ over its baseline value $\mathcal{V}_v(s'_t)$, if that's all they both wanted. It is the natural extension of $EP_v$ to a setting with two agents.

Obviously, $EP^{A+SA}_v(s_t) \geq EP_v(s_t)$.
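
A minimal sketch of the distinction, with made-up policies and values (the names and numbers are purely illustrative): the agent's excess power maximises over $\pi_A$ with $\pi_{SA}(\pi_A)$ fixed by the subagent's programming, while the pair's excess power maximises over both policies independently.

```python
# Toy illustration (made-up numbers): excess power of the agent alone vs the
# agent-subagent pair. The baseline value is taken as already absorbed into g.

A_POLICIES = ["noop", "ask"]
SA_POLICIES = ["idle", "help", "self_improve"]

# g_v(s_t, (pi_A, pi_SA)) for one fixed index v.
g = {
    ("noop", "idle"): 0.0, ("noop", "help"): 0.1, ("noop", "self_improve"): 0.0,
    ("ask", "idle"): 0.0,  ("ask", "help"): 0.8,  ("ask", "self_improve"): 0.2,
}

# How SA is programmed to respond to each policy of A.
SA_RESPONSE = {"noop": "idle", "ask": "help"}

# EP_v: maximise over A's policy only; SA's policy is a function of A's.
ep_agent = max(g[(pi_a, SA_RESPONSE[pi_a])] for pi_a in A_POLICIES)

# EP_v^{A+SA}: maximise over both policies independently.
ep_pair = max(g[(pi_a, pi_sa)] for pi_a in A_POLICIES for pi_sa in SA_POLICIES)

print("EP_v (agent):", ep_agent)   # 0.8
print("EP_v (pair): ", ep_pair)    # 0.8
assert ep_pair >= ep_agent         # holds in general: the pair's max has fewer constraints
```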

Extending the impact penalty

Define $IP^{A+SA}$ in the same way as $IP$, except using $EP^{A+SA}_v$ instead of $EP_v$. This is what we'd wish we could control (though we can't specify it directly, since specifying subagents is really hard). So can we control the (implicit) $IP^{A+SA}$ with the explicit $IP$?

Sometimes we can. We'll say that $IP$ extends successfully to $IP^{A+SA}$, iff

$$IP^{A+SA}(s_t) \leq IP(s_t).$$

In other words, if controlling the magnitude of $IP$ also controls the magnitude of $IP^{A+SA}$. In that case, if $A$ was built with penalty $IP$, the system will naturally follow a penalty that is at least as restrictive.

When does $IP$ extend?

$f$ decreasing in excess power

The paper has two key examples of a value difference. The first is relative reachability, which uses $V = \mathcal{S}$, the set of states. The $w_v$ are all equal, and the $\mathcal{V}_v(s_t)$ is how "reachable" $v$ is from $s_t$, assuming that the agent is using the optimal policy to reach $v$. It is that last clause that makes the impact penalty indexical: it is defined in terms of $\pi_A$.

What about $f$? Well, $f$ is defined to be the flipped rectifier function, ie $f(x) = \max(-x, 0)$. The important thing for this $f$ is that it is decreasing: if $x \geq y$, then $f(x) \leq f(y)$.

Thus, since $EP^{A+SA}_v(s_t) \geq EP_v(s_t)$, we have $f\big(EP^{A+SA}_v(s_t)\big) \leq f\big(EP_v(s_t)\big)$, and hence

$$IP^{A+SA}(s_t) \leq IP(s_t).$$

So the impact penalty successfully extends to the situation with a subagent. Note that this is a property of $f$ only, not of the $\mathcal{V}_v$. So whenever $f$ is of this form, adding subagents changes nothing about the impact penalty (if anything, it makes it more restrictive).
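
A quick numerical sanity check of this step (the excess-power numbers below are made up, chosen only to respect $EP^{A+SA}_v \geq EP_v$):

```python
# Toy check: with a decreasing f (flipped rectifier max(-x, 0)), the penalty
# computed from the pair's excess power never exceeds the agent's penalty,
# because EP_pair >= EP_agent holds index by index.

def f_dec(x):
    return max(-x, 0.0)  # penalises only losses of value

ep_agent = {"v1": -0.5, "v2": 0.3}  # agent's excess power per index v
ep_pair  = {"v1": -0.2, "v2": 0.9}  # pair's excess power, >= agent's per index

ip_agent = sum(f_dec(x) for x in ep_agent.values())  # 0.5
ip_pair  = sum(f_dec(x) for x in ep_pair.values())   # 0.2

assert all(ep_pair[v] >= ep_agent[v] for v in ep_agent)
assert ip_pair <= ip_agent  # the penalty extends successfully
```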

$f$ increasing in excess power

The other key example is attainable utility, which uses $V = \mathcal{R}$, a set of reward functions, sets the $w_R$ all equal, and the $\mathcal{V}_R(s_t)$ is the expected value of $R$, assuming that the agent is using the optimal policy to maximise $R$.

The $f$ is the absolute value function. But, for this section, I'll note that the absolute value function decomposes as $|x| = \max(x, 0) + \max(-x, 0)$, and I'll consider what happens when $f$ is equal to the first term, $\max(x, 0)$.

Well, if $x \geq y$, then $\max(x, 0) \geq \max(y, 0)$: this $f$ is increasing in $x$. So we don't get to constrain $f\big(EP^{A+SA}_v\big)$ with $f\big(EP_v\big)$.

What constraints do exist? Well, we still have the bound $EP_v(s_t) = \max_{\pi_A} g_v\big(s_t, (\pi_A, \pi_{SA}(\pi_A))\big)$, which controls what any policy of $A$ can achieve. Let $\pi^v_A$ be the optimal policy for maximising $g_v$. And let $\pi^*_A$ be the actual policy that $A$ will follow. Since $\pi^v_A$ is optimal, we must have:

$$g_v\big(s_t, (\pi^*_A, \pi_{SA}(\pi^*_A))\big) \leq g_v\big(s_t, (\pi^v_A, \pi_{SA}(\pi^v_A))\big) = EP_v(s_t).$$

Thus we do have a constraint on $\pi^*_A$ and $\pi_{SA}(\pi^*_A)$, but it's a constraint on the actual policy the two agents are following. Define $IP^*$ to be like $IP$, except that instead of $EP_v(s_t)$, $g_v\big(s_t, (\pi^*_A, \pi_{SA}(\pi^*_A))\big)$ is used instead. Then, by the above,

$$IP^*(s_t) \leq IP(s_t),$$

so, with an increasing $f$, controlling $IP$ will also control $IP^*$.
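
A toy sketch of this bound (the policies, responses and values are made-up assumptions): the actual joint policy is one of the policies the counterfactual maximum ranges over, so its value is bounded by $EP_v$, and an increasing $f$ preserves that bound.

```python
# Toy check: with an increasing f (here max(x, 0)), the value achieved by the
# *actual* joint policy is bounded by the counterfactual excess power EP_v,
# since the actual policy is among those maximised over.

A_POLICIES = ["noop", "ask"]
SA_RESPONSE = {"noop": "idle", "ask": "help"}        # SA's programmed response
g = {("noop", "idle"): 0.0, ("ask", "help"): 0.8}    # g_v for one index v

def f_inc(x):
    return max(x, 0.0)

# Counterfactual: the best A could do, given SA's response function.
ep = max(g[(pi_a, SA_RESPONSE[pi_a])] for pi_a in A_POLICIES)

# In practice, A follows some actual policy (here: noop).
actual = g[("noop", SA_RESPONSE["noop"])]

assert actual <= ep                  # optimality of the counterfactual maximum
assert f_inc(actual) <= f_inc(ep)    # hence IP* <= IP, term by term
```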

But, as stated, this is a restriction on the actual policy of the agents, not on the counterfactuals of what these policies could be.

$f$ increasing and decreasing

Let's go back to the attainable utility, and put $f$ as the absolute value function; hence it increases and decreases (well, technically, decreases then increases). If we write $IP_f$ for a distance measure using $f$, we can write

$$IP_{|x|} = IP_{\max(x,0)} + IP_{\max(-x,0)}.$$

Then, we can see that the pair is controlled by:

$$IP^{A+SA}_{\max(-x,0)}(s_t) + IP^*_{\max(x,0)}(s_t) \leq IP_{|x|}(s_t).$$

Thus negative excess power is controlled (the pair can't get too weak at counterfactually maximising the $R$'s), but only the actual excess power is controlled on the upside (the $g_R$ of the policies the agents actually follow can't get too high in practice).
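
A numerical sketch of that combined bound, with made-up excess powers that respect $EP^{A+SA}_R \geq EP_R$ and $g_R(\text{actual}) \leq EP_R$:

```python
# Toy check of IP^{A+SA}_{max(-x,0)} + IP*_{max(x,0)} <= IP_{|x|}, using the
# decomposition |x| = max(x, 0) + max(-x, 0) and the two earlier bounds.

inc = lambda x: max(x, 0.0)   # increasing part of |x|
dec = lambda x: max(-x, 0.0)  # decreasing part of |x|

ep_agent  = {"R1": -0.5, "R2": 0.3}  # agent's counterfactual excess power
ep_pair   = {"R1": -0.2, "R2": 0.9}  # pair's counterfactual excess power (>= agent's)
ep_actual = {"R1": -0.6, "R2": 0.1}  # excess power actually realised (<= agent's)

ip_abs        = sum(abs(x) for x in ep_agent.values())   # the stated penalty IP_{|x|}
ip_pair_dec   = sum(dec(x) for x in ep_pair.values())    # bounds the pair's power loss
ip_actual_inc = sum(inc(x) for x in ep_actual.values())  # bounds the actual power gain

assert ip_pair_dec + ip_actual_inc <= ip_abs
```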

In practice

How likely is it that the subagent can escape the impact penalty? In practice, that seems quite likely, at least for attainable utility.

Imagine that $SA$ can cripple $A$ (with $A$'s help). And then $A$ is restricted to either $\emptyset$, or an action $a_v$ for each $v \in V$. This is asking $SA$ to "please increase $\mathcal{V}_v$".

Now assume that there is a policy $\pi_{SA}$ for the subagent, such that for all $v \in V$ and the state $s_t$ reached from the current state by the agents following $(\emptyset, \pi_{SA})$:

  1. $g_v\big(s_t, (\emptyset, \pi_{SA})\big) \leq 0$.
  2. $g_v\big(s_t, (a_v, \pi_{SA})\big) = 0$.

The first condition means that $\pi_{SA}$ doesn't increase $\mathcal{V}_v$ beyond its baseline value while $A$ does nothing (recall that the baseline is absorbed into $g_v$, so $0$ corresponds to $\mathcal{V}_v(s'_t)$). The second means that $a_v$ has the potential to increase it beyond what inaction achieves - right back up to the baseline. If the $\mathcal{V}_v$ come from the attainable utility penalty, for example, then both of these should be doable for $SA$ if it makes itself powerful enough[2].

Then set $\pi^\emptyset_A$ to be $A$'s always-$\emptyset$ policy, and $\pi^v_A$ the policy of taking $a_v$. Putting both conditions together implies that $g_v\big(s_t, (\pi^v_A, \pi_{SA})\big) = 0 \geq g_v\big(s_t, (\pi^\emptyset_A, \pi_{SA})\big)$, so $\pi^v_A$ is at least as good, from $A$'s perspective, at maximising $g_v$. Thus $A$'s $g_v$-optimising policy includes something that is not $\pi^\emptyset_A$; the counterfactual where $A$ maximises $g_v$ thus allows it to pick some other policy (which $SA$ will notice) - and the value attained there is exactly the baseline, so $EP_v = 0$.

Then $A$ will always do $\emptyset$, in practice, and, once $SA$ is powerful enough that both conditions above hold, the impact penalty will be zero, no matter how powerful $SA$ becomes (and $SA$ is thus able to maximise its actual goal with few restrictions).
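
A toy sketch of this construction (every name, number and dynamic below is a made-up assumption, not from the post): $A$ can only do nothing or ask, $SA$ keeps each $\mathcal{V}_v$ at or below baseline while unasked and would restore it exactly to baseline if asked, and the attainable-utility-style penalty comes out to zero however large $SA$'s own capability is.

```python
# Toy sketch (made-up dynamics): SA cripples A down to {noop, ask_v}, keeps each
# value V_v at or below its baseline while A is silent, and would restore V_v to
# exactly its baseline if A asked. A's counterfactual attainable value then equals
# the baseline, so the |x|-penalty is zero, regardless of how capable SA is.

BASELINE = {"R1": 0.5, "R2": 0.4}   # V_v of the inaction baseline s_t'
SA_CAPABILITY = 1000.0              # how much value SA *could* deliver if it wanted

def value_if(a_action, v):
    """V_v of the state reached, given A's action and SA's committed policy."""
    if a_action == "noop":
        return 0.9 * BASELINE[v]                 # condition 1: at or below baseline
    if a_action == f"ask_{v}":
        assert SA_CAPABILITY >= BASELINE[v]      # condition 2 is feasible...
        return BASELINE[v]                       # ...and SA calibrates to the baseline exactly
    raise ValueError("A has been crippled: only noop or ask_v remain")

def attainable(v):
    """A's counterfactual best for V_v over its two remaining options."""
    return max(value_if("noop", v), value_if(f"ask_{v}", v))

penalty = sum(abs(attainable(v) - BASELINE[v]) for v in BASELINE)
print("impact penalty:", penalty)   # 0.0, no matter how large SA_CAPABILITY is
assert penalty == 0.0
```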


  1. The slight difference is that this formula uses $\mathcal{V}_v(s_t) - \mathcal{V}_v(s'_t)$ rather than $\mathcal{V}_v(s'_t) - \mathcal{V}_v(s_t)$ - a minor sign difference that makes this particular presentation clearer. ↩︎

  2. The restriction on $\pi_{SA}$ is like the 20 billion questions approach, but those were designed to trigger at any expansionist move on the AI's part. Unless the $\mathcal{V}_v$ are similarly constructed, it is likely that $SA$ can become very powerful without triggering a notable increase in any $\mathcal{V}_v$, and, after that, take control of the $\mathcal{V}_v$ completely. ↩︎
