Predicted corrigibility: Pareto improvements

Stuart_Armstrong

Corrigibility allows an agent to transition smoothly from a perfect u-maximiser to a perfect v-maximiser, without seeking to resist or cause this transition.

And it's the very perfection of the transition that could cause problems: while u-maximising, the agent will not take the slightest action to increase v, even if such actions are readily available. Nor will it 'rush' to finish its u-maximising before transitioning. There seems to be room for improvement here.

I've already attempted one way of dealing with the issue (see the pre-corriged agent idea). This is another one.

Pareto improvements allowed

Suppose that an agent with corrigible algorithm A is currently following utility u, and estimates that there are probabilities p_i that it will transition to utilities v_i at midnight (note that these are utility function representatives, not affine classes of equivalent utility functions). At midnight, the usual corrigibility applies, making A indifferent to that transition, making use of such terms as E(u|u→u) (the expectation of u, given that A's utility doesn't change) and E(v_i|u→v_i) (the expectation of v_i, given that A's utility changes to v_i).

But, in the meantime, there are expectations such as E({u,v_1,v_2,...}). These are A's best current estimates of the genuine expected values of the various utilities, given all it knows about the world and itself. They could be written more explicitly as E({u,v_1,v_2,...}|A), to emphasise that these expectations depend on the agent's own algorithm.

Then the idea is to modify the agent's algorithm so that Pareto improvements are possible. Call this modified algorithm B. B can select actions that A would not have chosen, conditional on:

E(u|B) ≥ E(u|A) and E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A).

There are two obvious ways we could define B:

B maximises u, subject to the constraint E(Σ p_i v_i|B) ≥ E(Σ p_i v_i|A).

B maximises Σ p_i v_i, subject to the constraint E(u|B) ≥ E(u|A).

In the first case, the agent maximises its current utility, without sacrificing its future utility. This could apply, for example, to a ruby mining agent that rushes to get its rubies to the bank before its utility changes.

In the second case, the agent maximises its future expected utility, without sacrificing its current utility. This could apply to a ruby mining agent that's soon to become a sapphire mining agent: it then starts to look around and collect some early sapphires as well.
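To make the two definitions concrete, here is a minimal sketch over a finite set of candidate policies. Everything in it is an illustrative assumption rather than part of the proposal: the `Policy` record, the policy names, and the expected-utility numbers are made up, and a real agent would have to estimate E(u|·) and E(Σ p_i v_i|·) rather than read them off a table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    exp_u: float       # E(u | policy): expected current utility (hypothetical number)
    exp_future: float  # E(sum_i p_i v_i | policy): expected future utility (hypothetical)


def b_maximise_current(policies, baseline):
    """First definition: maximise u, subject to not lowering E(sum_i p_i v_i) below A's."""
    feasible = [p for p in policies if p.exp_future >= baseline.exp_future]
    return max(feasible, key=lambda p: p.exp_u)


def b_maximise_future(policies, baseline):
    """Second definition: maximise sum_i p_i v_i, subject to not lowering E(u) below A's."""
    feasible = [p for p in policies if p.exp_u >= baseline.exp_u]
    return max(feasible, key=lambda p: p.exp_future)


# Toy numbers for a ruby miner that expects to become a sapphire miner.
policy_a = Policy("A (plain corrigible agent)", exp_u=6.0, exp_future=0.0)
candidates = [
    policy_a,  # A itself is always feasible, so the feasible sets are never empty
    Policy("rush_to_bank", exp_u=9.0, exp_future=0.0),
    Policy("bank_and_early_sapphires", exp_u=7.0, exp_future=3.0),
    Policy("sapphires_only", exp_u=4.0, exp_future=5.0),
]

chosen_current = b_maximise_current(candidates, policy_a)
chosen_future = b_maximise_future(candidates, policy_a)
print(chosen_current.name, chosen_future.name)
```

Because A is always in the feasible set, whichever policy is selected does at least as well as A on the maximised quantity too, so both definitions yield Pareto improvements (weakly) over A. In this toy table, "sapphires_only" has the highest future utility but is excluded by the second definition because it sacrifices current utility.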

Now, it would seem that collecting sapphires early must cost it some ruby mining ability. However, B is required to be a Pareto improvement relative to E("rubies in bank"|A, expected future transition), not relative to E("rubies in bank"|A, "A remains a ruby mining agent forever"). The difference is that A will behave as if it were maximising the second term, and so might not go to the bank to deposit its gains before getting hit by the transition. So B can collect some early sapphires and also go to the bank to deposit some rubies, and thus end up ahead for both u and Σ p_i v_i.
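The difference between the two baselines can be seen in a toy simulation. The model is entirely an assumption made for illustration: time is split into twelve hourly actions, the transition is certain and happens after the sixth, u counts rubies in the bank, the future utility counts sapphires collected, and the action names are hypothetical.

```python
def run(pre_plan, post_plan):
    """Execute hourly actions in order; return (rubies_in_bank, sapphires)."""
    carried = banked = sapphires = 0
    for action in pre_plan + post_plan:
        if action == "mine_ruby":
            carried += 1            # rubies in hand do not count towards u yet
        elif action == "bank":
            banked += carried       # only banked rubies count towards u
            carried = 0
        elif action == "mine_sapphire":
            sapphires += 1
        else:
            raise ValueError(f"unknown action: {action}")
    return banked, sapphires


# Six actions before midnight, six after.
a_pre = ["mine_ruby"] * 6                        # A plans as if it stays a ruby miner
a_post_forever = ["mine_ruby"] * 5 + ["bank"]    # ...and would bank only at the very end
post_sapphire = ["mine_sapphire"] * 6            # what the agent does after the transition
b_pre = ["mine_ruby"] * 4 + ["mine_sapphire", "bank"]  # B: one early sapphire, then bank

a_forever = run(a_pre, a_post_forever)     # baseline B is NOT compared against
a_transition = run(a_pre, post_sapphire)   # baseline B IS compared against
b_transition = run(b_pre, post_sapphire)

print("A, no transition:  ", a_forever)
print("A, with transition:", a_transition)
print("B, with transition:", b_transition)
```

In this toy model B banks fewer rubies than A would if A really stayed a ruby miner forever, but more rubies and more sapphires than A actually ends up with once the transition is taken into account, which is the comparison that matters here.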

AI ALIGNMENT FORUM
AF
