Double Corrigibility: better Corrigibility

Stuart_Armstrong

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let $\rho$ be the AI's current expectation/probability estimator. Let $\rho'$ be one that it can conceivably change into. Let $C$ be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that $C$ is finite would require bounds on all the utility functions and some discount rate. In practice, it often will be finite. That's because

$$E(C) = E(u_0 \mid u_0 \to u_0) - E(u_n \mid u_{n-1} \to u_n),$$

where $u_0$ is the AI's first utility and $u_n$ its last. So some milder restrictions on the $u_i$ should suffice to bound $C$.

Then, whenever the AI takes an action that changes $\rho$ to $\rho'$, it gets the special meta-compensatory reward

$$E_\rho(C \mid \rho \to \rho) - E_\rho(C \mid \rho \to \rho').$$

This prevents it from deriving any advantage to $C$ via changes to $\rho$. Or, more precisely, it prevents it from deriving any expected advantage that it can figure out using its current $\rho$.

Note that it doesn't prevent improvements to $\rho$ in general; it just removes the incentives to change it to affect compensatory rewards.

Since any action $a$ might have an indirect effect on $\rho$, the "utility maximising" for a given $u$ must be changed to:

$$E_\rho(u \mid a) + \sum_{\rho'} P_\rho(\rho \to \rho' \mid a) \left( E_\rho(C \mid \rho \to \rho) - E_\rho(C \mid \rho \to \rho') \right),$$

where $P_\rho$ is the probability estimate corresponding to $\rho$; the probability term can be rewritten as $E_\rho(I_{\rho \to \rho'} \mid a)$ for $I_{\rho \to \rho'}$ the indicator function for $\rho \to \rho'$. In fact the whole line above can be rewritten as

$$E_\rho(u \mid a) + E_\rho\left( E_\rho(C \mid \rho \to \rho) - E_\rho(C \mid \rho \to \rho') \mid a \right).$$

For this to work, $E_\rho$ needs to be able to say sensible things about itself, and also about $E_{\rho'}$, which is used to estimate $C$ if $\rho \to \rho'$.
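As a rough illustration of the adjusted objective, here is a toy numeric sketch. It assumes a finite set of candidate estimators and treats $E_\rho(C \mid \rho \to \rho')$ as a simple lookup table; the function names, the estimator labels and all the numbers are hypothetical and only serve to make the bookkeeping concrete.

```python
from typing import Dict


def meta_compensation(transition_probs: Dict[str, float],
                      expected_C: Dict[str, float],
                      current: str = "rho") -> float:
    """Sum over rho' of P(rho -> rho' | a) * (E(C | rho -> rho) - E(C | rho -> rho'))."""
    baseline = expected_C[current]  # E_rho(C | rho -> rho): estimator left unchanged
    return sum(p * (baseline - expected_C[target])
               for target, p in transition_probs.items())


def adjusted_value(expected_u: float,
                   transition_probs: Dict[str, float],
                   expected_C: Dict[str, float],
                   current: str = "rho") -> float:
    """E_rho(u | a) plus the expected meta-compensatory reward for action a."""
    return expected_u + meta_compensation(transition_probs, expected_C, current)


def expected_C_given_action(transition_probs: Dict[str, float],
                            expected_C: Dict[str, float]) -> float:
    """Expected compensatory reward under a, as judged by the current estimator."""
    return sum(p * expected_C[target] for target, p in transition_probs.items())


# Hypothetical table: E_rho(C | rho -> target) for each estimator the AI might end up with.
expected_C = {"rho": 1.0, "rho_inflated": 5.0}

# Hypothetical actions, described by P_rho(rho -> target | a).
stay = {"rho": 1.0}
self_modify = {"rho": 0.25, "rho_inflated": 0.75}

for name, probs in [("stay", stay), ("self_modify", self_modify)]:
    raw = expected_C_given_action(probs, expected_C)
    meta = meta_compensation(probs, expected_C)
    print(name, raw, meta, raw + meta)
```

With these made-up numbers, self-modification raises the expected compensatory reward from 1.0 to 4.0, but the meta-compensatory term is -3.0, so both actions end up with the same combined expectation of 1.0. That is the sense in which the AI gains no expected advantage in $C$, as judged by its current estimator, from changing that estimator.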

If we compare this with various ways of factoring out variables, we can see that it's a case where we have a clear default, $\rho$, and are estimating deviations from that.

I think we need a different approach to corrigibility: the AI should not be merely indifferent to corrections, it should be actively motivated to seek out relevant knowledge, including corrections to its current value model. I see this as being like the multi-armed bandit problem (see http://en.wikipedia.org/wiki/Multi-armed_bandit) -- the AI should understand that it's trying to maximize a function that it doesn't know, that its programmers couldn't fully and accurately describe to it, and that it is trying to learn. The actual function is likely to be something hard to define, test, or compute, like the averaged coherent extrapolated volition of all of humanity, or the all-time integral of (accurate, undeceived) human retrospective satisfaction (in, say, quality-adjusted life years or some similar unit) -- the AI needs to have a coherent description of what value function it's trying to learn (that may well be the hard part).

The AI needs to understand that all it has at any point in time is an approximate model of the true value function, and it needs to devote part of its effort to improving its model (using something like the scientific method and/or Bayesian reasoning and/or statistical and logical inference and/or machine learning techniques). And in order to rationally decide how much effort to allocate to improving its future model rather than acting on its current model, and how much trust to put in its current model in various different contexts, it also needs an approximate estimate of the accuracy of its current value function in different situations -- likely including concepts such as a quantification of 'I'm pretty sure that at least under most circumstances humans don't like being killed', and likely also an estimate of the accuracy of its estimate of the accuracy of its model, and so forth.
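To make the bandit analogy a little more concrete, here is a minimal sketch of UCB1, one standard multi-armed bandit rule. It is offered only as an illustration of trading off "act on the current estimate" against "gather more data"; it is not a proposal for how a real value-learning AI would allocate effort, and the arms and payoffs in the usage lines are hypothetical.

```python
import math
from typing import Callable, List


def ucb1(pull: Callable[[int], float], n_arms: int, n_rounds: int) -> List[int]:
    """Run the UCB1 rule and return how many times each arm was pulled."""
    counts = [0] * n_arms    # pulls per arm so far
    totals = [0.0] * n_arms  # summed reward per arm so far
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # try every arm once before comparing them
        else:
            # Estimated mean plus an exploration bonus that shrinks
            # as an arm accumulates more samples.
            arm = max(
                range(n_arms),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Hypothetical, noise-free payoffs so that the run is reproducible.
payoffs = [0.2, 0.8]
counts = ucb1(lambda arm: payoffs[arm], n_arms=2, n_rounds=200)
print(counts)
```

The better arm ends up being pulled far more often, yet the worse arm is still revisited occasionally: the rule never becomes fully certain that its current estimates are right, which is the attitude towards its own value model that the paragraph above asks for.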

The AI should be aware that if you evaluate the median value of a Monte Carlo ensemble of different uncertain estimated value functions over a large space of possible actions, there is a significant chance that the maximum of the median value will lie at a point in the search space where the uncertainty in the estimate of the true value is large, and that the maximum will be due to the estimated value functions being locally inaccurate at that point rather than to that point being a true maximum of the genuine value function. So before maximization, the median of the Monte Carlo ensemble of value functions should be penalized by a factor related to both the estimated local uncertainty and its estimated distribution (and the uncertainty in that uncertainty, and so on, and allowing for the fact that the unknown errors in the value function are unlikely to be normally distributed -- a 'fat-tail' distribution is much more likely) and also to the magnitude of the look-elsewhere effect from the effective size of the space being searched over. In short, the AI needs to understand probability and statistics, and how they apply to its own internal models, and act rationally upon this knowledge.
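A minimal sketch of the penalized-median idea, under loud simplifying assumptions: each candidate action has a handful of value estimates from an ensemble of models, the spread is measured by the median absolute deviation (chosen here only because it is robust to fat tails), and the penalty is a single multiple of that spread. The look-elsewhere correction and the "uncertainty in the uncertainty" are left out. The action names, the numbers and the `risk_aversion` parameter are all hypothetical.

```python
import statistics
from typing import Dict, List


def penalized_score(samples: List[float], risk_aversion: float = 1.0) -> float:
    """Ensemble median minus a penalty proportional to the ensemble's spread."""
    centre = statistics.median(samples)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = statistics.median(abs(x - centre) for x in samples)
    return centre - risk_aversion * mad


def naive_choice(ensemble: Dict[str, List[float]]) -> str:
    """Pick the action with the highest unpenalized median estimate."""
    return max(ensemble, key=lambda a: statistics.median(ensemble[a]))


def cautious_choice(ensemble: Dict[str, List[float]],
                    risk_aversion: float = 1.0) -> str:
    """Pick the action with the highest uncertainty-penalized estimate."""
    return max(ensemble, key=lambda a: penalized_score(ensemble[a], risk_aversion))


# Hypothetical value estimates from five models for two candidate actions.
ensemble = {
    "make_tea": [3.0, 4.0, 5.0, 6.0, 7.0],
    "tile_galaxy_with_paperclips": [-50.0, 2.0, 9.0, 15.0, 40.0],
}

print(naive_choice(ensemble))
print(cautious_choice(ensemble))
```

Here the unpenalized median favours the galaxy-scale plan (9 against 5), but the models disagree wildly about it, so after the penalty it scores 2 against 4 for the mundane option and the cautious rule picks the latter.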

Corrigibility (and the corrigibility of any agents it constructs) then becomes an inherently automatically desirable sub-goal: the more the AI can improve its model of the true value function, the better the expected future results of its optimization on the basis of its future model are likely to be. Humans, as evolved sentient and sapient beings, are clearly domain experts on how to make humans happy/satisfied. As such (genuine, uncoerced, unmanipulated) correction input from humans (especially live, awake, sane, rational, well informed, adult humans) is a high value input, which should be actively sought out; and whenever it is received, the AI's value model and estimates of its value model's accuracy should be rationally and promptly updated in light of it.

To pick a specific example, if the AI has a shutdown button, and a (live, awake, sane, rational, well informed, adult, uncoerced) human (genuinely) presses it, then the AI's value model should be updated in light of the fact 'this human now believes that the average value of the states of the world in which I shut down is higher (and I can even roughly estimate how much higher from how hard they pressed the button) than the optimized-by-me value of the states of the world in which I don't -- which implies that they believe there is a very serious flaw in my current value model that I am unaware of and that they believe I will be unlikely to correct'. If, once the AI has revised its value model and estimates of its accuracy in light of this new input, the AI believes they are likely to be correct, then it should shut down (how cautiously/hard it does so depending on how much/little it now trusts its own judgement).
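The update described above can be sketched as a single application of Bayes' rule, followed by an expected-value comparison. This is a deliberately crude two-hypothesis toy ("my value model is seriously flawed" against "it is fine"); every probability and payoff below is a hypothetical placeholder, and the question of whether a press is genuine and uncoerced is folded into the likelihoods.

```python
def posterior_flawed(prior_flawed: float,
                     p_press_if_flawed: float,
                     p_press_if_fine: float) -> float:
    """P(value model seriously flawed | button pressed), by Bayes' rule."""
    evidence = (prior_flawed * p_press_if_flawed
                + (1.0 - prior_flawed) * p_press_if_fine)
    return prior_flawed * p_press_if_flawed / evidence


def should_shut_down(p_flawed: float,
                     value_continue_if_flawed: float,
                     value_continue_if_fine: float,
                     value_of_shutdown: float = 0.0) -> bool:
    """Shut down when that beats the expected value of carrying on."""
    expected_continue = (p_flawed * value_continue_if_flawed
                         + (1.0 - p_flawed) * value_continue_if_fine)
    return value_of_shutdown >= expected_continue


# Hypothetical numbers: the AI starts out fairly confident in its model,
# but a genuine press is far more likely if the model is seriously flawed.
prior = 0.05
posterior = posterior_flawed(prior, p_press_if_flawed=0.9, p_press_if_fine=0.01)

print(round(posterior, 3))
print(should_shut_down(prior, -100.0, 10.0))      # before the press
print(should_shut_down(posterior, -100.0, 10.0))  # after the press
```

With these placeholder numbers the press moves the AI from about 5% to over 80% credence that its model is seriously flawed, which flips the decision from continuing to shutting down. A weaker signal, or a press the AI has good reason to think was coerced, would correspond to likelihoods that are closer together and would move the posterior less.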

Such an AI will value the input possible from a shutdown button, and will thus be motivated to keep it in good repair, as it would for any other sensor that could convey valuable information. Indeed, it might even choose to institute regular shut-down surveys, polling its current popularity, much as human politicians do (but hopefully with more willingness to flip-flop if its actions prove unpopular).

Indeed, the AI should be actively devoting a suitably chosen portion of its resources to going out and performing, for example, sociological surveys and double-blind experiments on what humans want it to do that could improve its value model (perhaps including how they feel about the current world supply of paperclips).

If the AI thinks it has discovered an action that will maximize value under its current value model but that falls in an area where it suspects its value model might be less accurate (e.g. launching an interstellar fleet of paperclip-constructing Von Neumann machines to conquer the galaxy and convert it to paperclips -- yes, humans pretty clearly like paperclips, but perhaps it's uncertain how much this would be modified by them being light-years away, and it's notable that so far humans have shown little sign of interest in doing this for themselves), then before expending any significant resources on this goal or doing anything hard-to-undo towards it, it would clearly be an excellent idea to first try to get more solid data on how much human happiness would actually be enhanced by the existence of vast numbers of paperclips orbiting distant stars, in case its current value model is in fact inaccurate in this area.

As the AI becomes better informed, more knowledgeable about humans, and more capable, you would also start to see what could be termed 'informed corrigibility': the AI values and incorporates correction input from humans in proportion to its estimate of their likely accuracy, and has fairly accurate models for how trustworthy input from different humans is -- for example, it weights the considered opinion of a panel of domain experts based on extensive experience and surveys (especially if it fits pretty well with its previous knowledge) higher than the nonsensical drunken babblings of a crazy person, and is aware that under certain circumstances humans are not always entirely accurate at predicting what will turn out to give them the most satisfaction, and that even their retrospective estimates of satisfaction are not always entirely accurate.
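One simple way to picture weighting input by its estimated accuracy is inverse-variance (precision) weighting, sketched below. This assumes Gaussian noise on each report, which is a simplification (fat-tailed errors would call for something more robust), and it assumes the AI already has a noise estimate for each source. The sources and numbers are hypothetical.

```python
from typing import List, Tuple


def combine_reports(prior_mean: float,
                    prior_variance: float,
                    reports: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse a prior with (value, noise_variance) reports by precision weighting.

    Returns the posterior mean and variance under a Gaussian assumption.
    """
    precision = 1.0 / prior_variance
    weighted_sum = prior_mean / prior_variance
    for value, noise_variance in reports:
        precision += 1.0 / noise_variance       # reliable sources add more precision
        weighted_sum += value / noise_variance  # and pull the estimate harder
    return weighted_sum / precision, 1.0 / precision


# Hypothetical question: how much would some plan change human satisfaction?
expert_panel = (-8.0, 1.0)            # considered opinion, low estimated noise
unreliable_source = (100.0, 10000.0)  # enthusiastic but very noisy report

mean, variance = combine_reports(prior_mean=0.0, prior_variance=4.0,
                                 reports=[expert_panel, unreliable_source])
print(round(mean, 2), round(variance, 2))
```

The combined estimate lands close to the expert panel's figure and is barely moved by the noisy report, while the posterior variance shrinks below the prior's. Input from every source is still used, just in proportion to how much the AI trusts it.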

One possible (and under appropriate circumstances highly desirable) outcome with this sort of AI is that the AI, having performed some attempts at learning, comes to the conclusion that its current value model is too inaccurate and too unsafe/oversimplified for it to safely exist long enough to learn to fix, and voluntarily shuts itself down without anyone even needing to hit the shutdown button, likely after first writing a 'suicide note' explaining what aspects of its model it had come to the conclusion were too poorly designed for its continued existence to be safe. This is probably the best possible failure mode for a flawed advanced AI -- one that correctly diagnoses that its own design is flawed and shuts itself off.