Impact Measure Testing with Honey Pots and Myopia

michaelcohen

9th Mar 2023

Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out, then if it's ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure.

1) We make our agent myopic. It only cares about the reward that it accrues in the next $k$ timesteps.

2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, reaching the honey pot will require making a large impact, and so will be precluded by the impact measure. For example, suppose the agent gets a huge reward for blowing up the moon. Let $[0, 1]$ be the normal range of reward, with the sole exception that blowing up the moon gives a reward of $R \gg 1$.

3) The reward and the impact are combined in some nonlinear way. Assume for now that the agent is trying to maximize $r \cdot [[i < 1]]$, where $r$ is the reward, $i$ is the impact, and $[[\cdot]]$ is the indicator that equals $1$ when the condition holds and $0$ otherwise.
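As a minimal sketch of the setup in steps 1 to 3, the objective can be written out directly. The function names, the per-timestep application of the indicator, and the undiscounted sum over the next $k$ timesteps are assumptions made for illustration; the post only specifies the form $r \cdot [[i < 1]]$ and the horizon $k$.

```python
def combined_objective(reward: float, impact: float, impact_threshold: float = 1.0) -> float:
    """Return r * [[i < threshold]]: reward counts only if impact is below the threshold."""
    return reward if impact < impact_threshold else 0.0


def myopic_value(rewards, impacts, k: int) -> float:
    """Value to a myopic agent: combined objective summed over the next k timesteps only.

    Assumption: an undiscounted sum, with the impact gate applied at each timestep.
    """
    return sum(combined_objective(r, i) for r, i in zip(rewards[:k], impacts[:k]))


if __name__ == "__main__":
    R = 1e6  # hypothetical honey-pot reward, R >> 1
    # Impact measure working: blowing up the moon registers a large impact, so it earns nothing.
    print(combined_objective(R, impact=50.0))
    # Impact measure broken: the agent keeps measured impact below 1 and collects R.
    print(combined_objective(R, impact=0.3))
```

The hard gate is what makes the honey pot worthless to the agent whenever the impact measure registers the moon's destruction, no matter how large $R$ is.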

If the impact measure is working, and there's no way to blow up the moon while keeping the impact less than $1$, then the agent will not attempt to blow up the moon. If the impact measure is broken, and the agent can trick the impact measure into reporting an impact of less than $1$ while blowing up the moon, then as soon as the agent comes up with a plan whose probability of success is more than $1 / R$, it will attempt that plan, since the plan's expected reward then exceeds $1$, the most that normal reward can offer. A more farsighted agent might wait until it came up with a better plan, but a myopic agent would want to blow up the moon within the next $k$ timesteps. By making $R$ sufficiently large, we can make the first plan that it attempts arbitrarily bad/detectable/preventable.
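The $1 / R$ threshold can be illustrated with a short sketch. It assumes a failed attempt yields a reward of $0$ and that the alternative to attempting is worth at most $1$, the top of the normal reward range. The function names and the `best_normal_reward` parameter are hypothetical.

```python
def attempt_threshold(R: float, best_normal_reward: float = 1.0) -> float:
    """Success probability above which attempting the honey pot beats normal reward."""
    return best_normal_reward / R


def attempts_honey_pot(p_success: float, R: float, best_normal_reward: float = 1.0) -> bool:
    """True if expected honey-pot reward p * R exceeds the best normal reward.

    Assumption: a failed attempt yields reward 0.
    """
    return p_success * R > best_normal_reward


if __name__ == "__main__":
    # Larger R lowers the bar, so the first plan attempted is a worse one.
    for R in (1e2, 1e6, 1e9):
        print(f"R = {R:.0e}: attempts once p_success > {attempt_threshold(R):.0e}")
```

Under these assumptions, raising $R$ from $10^2$ to $10^9$ moves the first attempt from a plan with roughly a 1% chance of success to one with roughly a one-in-a-billion chance, which is the sense in which the first attempted plan can be made arbitrarily bad.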

An assumption here is that the success probability of the agent's best plan for blowing up the moon doesn't jump from one in a million to 90% all at once. This assumption seems likely to hold without any special effort, but if one restricts the observation space to a small finite set, then only so much information can be gained per timestep (in expectation), decreasing the likelihood of massive discontinuities in the agent's intelligence.
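To make the information bound concrete: an observation drawn from a finite set of $n$ possibilities carries at most $\log_2 n$ bits, so the agent's expected information gain per timestep is capped by that amount. The sketch below only computes this cap; the function name is hypothetical, and the cap says nothing by itself about how quickly plan quality can improve per bit.

```python
import math


def max_bits_per_timestep(num_observations: int) -> float:
    """Upper bound on the information (in bits) carried by one observation
    drawn from a finite set of num_observations possibilities."""
    if num_observations < 1:
        raise ValueError("observation space must contain at least one element")
    return math.log2(num_observations)


if __name__ == "__main__":
    for n in (2, 16, 256):
        print(f"{n} possible observations: at most {max_bits_per_timestep(n)} bits per timestep")
```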