Overall summary post here.

I've claimed that, given a subagent:

[...] restrictions on increased power for the agent ("make sure you never have the power to increase the rewards") become restrictions on the actual policy followed for the subagent ("make sure you never increase these rewards").

But how does this work? The problem is that "never have the power" is a counterfactual statement: it doesn't matter what you do, only what you could potentially do. But "never increase" is the opposite: it matters not what you could potentially do, only what you do.

How does the subagent change this? Well, the counterfactual version is at least informally clear: the agent could, by changing its policy, increase certain rewards.

But now suppose that creates a subagent that will itself not increase the rewards, and, moreover, will interfere with if attempts to do so. Now cannot increase those rewards even if it changed its policy. The laws of the environment - which, from 's perspective, now include - prevent from ever being able to to increase the rewards, and thus fulfil 's penalty function.

This is very much the "chaining itself to the mast".

The counterfactual we'd want (but don't have)

What we want is for the counterfactual of the impact penalty to also include the subagent: the agent and subagent together definitely have the "power" to change the rewards by changing their own policies. But, from 's perspective, 's policy is a brute fact about the world, not amenable to change.

We'd like for the counterfactual to also include 's policy, but for that, we'd need to identify and define subagents - a very tricky problem.

New Comment