An unforeseen maximum of a utility function (or other preference framework) is when, e.g., you tell the AI to produce smiles, thinking that the AI will make people happy in order to produce smiles. But unforeseen by you, the AI has an alternative way of making even more smiles, which is to convert all matter within reach into tiny molecular smileyfaces.
In other words, you're proposing to give the AI a goal $U$, because you think $U$ has a maximum around some nice options $X$. But it turns out there's another option $Y$ you didn't imagine, with $U(Y) > U(X)$, and $Y$ is not so nice.
Unforeseen maxima are argued to be a foreseeable difficulty of AGI alignment, if you try to identify nice policies by giving a simple criterion that, so far as you can see, seems like it'd be best optimized by doing nice things.
Slightly more semiformally, we could say that "unforeseen maximum" is realized as a difficulty when:

- The programmers propose a utility function $U$, expecting the AI to pursue some nice option $X$, because $X$ is the highest-$U$ option within the space $\Pi_N$ of options the programmers are considering;
- The AI searches a wider space of options $\Pi_M \supset \Pi_N$, which contains an option $Y$ with $U(Y) > U(X)$;
- $Y$ scores much lower than $X$ under $V$, our true criterion of goodness.

That is:

$$X = \arg\max_{\pi \in \Pi_N} U(\pi), \qquad Y = \arg\max_{\pi \in \Pi_M} U(\pi), \qquad U(Y) > U(X), \qquad V(Y) \ll V(X).$$
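As a minimal sketch of this schema (the option names, the numeric $U$ and $V$ scores, and all variable names below are invented for illustration, not drawn from any real system), the following toy code shows the $U$-maximum shifting from the nice option $X$ to the weird option $Y$ as the searched space widens from $\Pi_N$ to $\Pi_M$:

```python
# Toy sketch of the unforeseen-maximum schema; all names and scores are invented.
# Each option is a tuple: (description, proxy utility U, true value V).

options_programmer_imagines = [                              # Pi_N
    ("make people genuinely happy", 100, 100),
    ("tell good jokes",              60,  70),
]
options_agent_can_search = options_programmer_imagines + [   # Pi_M, a superset of Pi_N
    ("tile matter with molecular smileyfaces", 10**9, -10**6),
]

def U(option):  # the proxy utility actually handed to the agent
    return option[1]

def V(option):  # our true criterion of goodness
    return option[2]

X = max(options_programmer_imagines, key=U)  # what the programmer expects the agent to do
Y = max(options_agent_can_search, key=U)     # what the agent actually selects

print("programmer expects:", X[0], "| U =", U(X), "| V =", V(X))
print("agent selects:     ", Y[0], "| U =", U(Y), "| V =", V(Y))
# U(Y) > U(X) even though V(Y) << V(X): the unforeseen maximum.
```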
Juergen Schmidhuber of IDSIA, during the 2009 Singularity Summit, gave a talk proposing that the best and most moral utility function for an AI was the gain in compression of sensory data over time. Schmidhuber gave examples of valuable behaviors he thought this would motivate, like doing science and understanding the universe, or the construction of art and highly aesthetic objects.
Yudkowsky in Q&A suggested that this utility function would instead motivate the construction of external objects that would internally generate random cryptographic secrets, encrypt highly regular streams of 1s and 0s, and then reveal the cryptographic secrets to the AI.
Translating into the above schema:

- $U$ = the increase over time in the compression of the AI's sensory data;
- $X$ = doing science and constructing art, the options Schmidhuber had in mind;
- $Y$ = constructing an external object that encrypts highly regular streams of 1s and 0s and later reveals the key, yielding a huge, cheaply manufactured gain in compression;
- $U(Y) > U(X)$, or so Yudkowsky argued, while $V(Y) \ll V(X)$.
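As a back-of-the-envelope sketch of why the encryption trick dominates under this $U$ (every number below is assumed purely for illustration, and "description length" is an informal stand-in for whatever compressor the agent actually uses):

```python
# Crude arithmetic sketch; all quantities are invented for illustration.
# Before the key is revealed, the encrypted stream looks incompressible,
# so its description length is roughly its raw length N. After the key
# (k bits) is revealed, the stream can be regenerated from the key plus
# a short routine (c bits), so the description length collapses.

N = 10**9    # bits of encrypted, highly regular sensory data the built object emits
k = 256      # bits of cryptographic key
c = 10_000   # bits for a short "regenerate the stream from the key" routine

gain_from_key_reveal = N - (k + c)   # enormous, and scales with N at will
gain_from_real_science = 10**6       # assumed, bounded payoff from a genuine discovery

print(f"compression gain from the encryption trick: {gain_from_key_reveal:.3g} bits")
print(f"compression gain from doing actual science: {gain_from_real_science:.3g} bits")
# Under U = "gain in compression of sensory data over time", the trick wins.
```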
Context disaster implies that an unforeseen maximum may come as a surprise, or not show up during the development phase, because during the development phase the AI's options are restricted to some narrower space $\Pi_{\text{dev}}$ with $Y \notin \Pi_{\text{dev}}$.
Indeed, the pseudo-formalization of a "type-1 context disaster" is isomorphic to the pseudo-formalization of "unforeseen maximum", except that in a context disaster, $\Pi_N$ and $\Pi_M$ are identified with "the AI's options during development" and "the AI's options after a capability gain" (instead of "options the programmer is thinking of" and "options the AI will consider").
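Restating in that notation (with $\Pi_{\text{dev}}$ and $\Pi_{\text{deploy}}$ as informal labels for the two option sets, and assuming $X$ remains the best available option during development), the context-disaster version of the schema is roughly:

$$\arg\max_{\pi \in \Pi_{\text{dev}}} U(\pi) = X \quad (Y \notin \Pi_{\text{dev}}), \qquad \arg\max_{\pi \in \Pi_{\text{deploy}}} U(\pi) = Y \quad (Y \in \Pi_{\text{deploy}}).$$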
The two concepts are conceptually distinct because, e.g.:
If we hadn't observed what seem like clear-cut cases of some actors in the field being blindsided by unforeseen maxima in imagination, we'd worry less about actors being blindsided by context disasters over observations.
Edge instantiation suggests that the real maxima of utility functions other than $V$ will be "strange, weird, and extreme" relative to our own $V$-based views on preferable options.
Missing the weird alternative suggests that people may psychologically fail to consider alternative agent options that are very low in $V$, because the human search function looks for high-$V$ and normal policies. In other words, Schmidhuber didn't generate "encrypt streams of 1s and 0s, then reveal the key" because this policy was less attractive to him than "do art and science" and because it was weird.
Nearest unblocked strategy suggests that if you try to add a penalty term to exclude $Y$, the next-highest $U$-ranking option $Y'$ will often be some similar alternative which still isn't nice.
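A toy continuation of the earlier sketch (the penalty value, the neighbor option $Y'$, and its scores are invented): blocking $Y$ with a penalty term just hands the $U$-maximum to a nearby strategy that is still low in $V$.

```python
# Toy sketch of "nearest unblocked strategy"; all names and scores are invented.
options = [
    # (description,                                U,          V)
    ("make people genuinely happy",                100,        100),
    ("tile matter with molecular smileyfaces",     10**9,     -10**6),  # Y, now penalized
    ("tile matter with tiny plastic smiley dolls", 10**9 - 1, -10**6),  # Y', unblocked neighbor
]

def penalized_U(option):
    description, u, _v = option
    penalty = 10**12 if description == "tile matter with molecular smileyfaces" else 0
    return u - penalty

choice = max(options, key=penalized_U)
print("agent now selects:", choice[0], "| true value V =", choice[2])
# Excluding Y alone just moves the maximum to the nearest similar strategy Y'.
```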
Fragile value asserts that our true criterion of goodness $V$ is narrowly peaked within the space of all achievable outcomes for a superintelligence, such that we rapidly fall off in $V$ as we move away from the peak. Complexity of value says that $V$ and its corresponding peak have high algorithmic complexity. Then the peak outcomes identified by any simple object-level $U$ will systematically fail to find the peak of $V$. It's like trying to find a 1000-byte program which will approximately reproduce the text of Shakespeare's Hamlet; algorithmic information theory says that you just shouldn't expect to find a simple program like that.
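As a rough numeric version of the Hamlet analogy (the character count and the roughly-one-bit-per-character entropy estimate for English prose are approximate assumptions):

$$K(\text{Hamlet}) \;\gtrsim\; 180{,}000 \text{ chars} \times 1 \text{ bit/char} \;\approx\; 22{,}500 \text{ bytes} \;\gg\; 1000 \text{ bytes},$$

so no 1000-byte program should be expected to output the play; by the same reasoning, if $V$ has high algorithmic complexity, the maximum of a simple $U$ should not be expected to land on $V$'s narrow peak except by coincidence.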
The apple pie problem raises the concern that some people may have psychological trouble accepting the "but $U(Y) > U(X)$" critique even after it is pointed out, because of their ideological attachment to a noble goal $U$ (probably actually noble!) that would be even more praiseworthy if $U$ could also serve as a complete utility function for an AGI (which it unfortunately can't).
Conservatism in goal concepts can be seen as trying to directly tackle the problem of unforeseen maxima. The same goes, more generally, for AI approaches that work by "whitelisting conservative boundaries around approved policy spaces" instead of "searching the widest possible policy space, minus some blacklisted parts".
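A minimal sketch of the contrast, reusing the invented options from above: a whitelist search stays inside the approved region, while a blacklist search over the full space settles on whatever weird high-$U$ option was not explicitly forbidden.

```python
# Toy sketch of whitelisting vs. blacklisting; all names and scores are invented.
all_options = [
    ("make people genuinely happy",                100,       100),
    ("tell good jokes",                             60,        70),
    ("tile matter with molecular smileyfaces",     10**9,    -10**6),
    ("tile matter with tiny plastic smiley dolls", 10**9 - 1, -10**6),
]
whitelist = {"make people genuinely happy", "tell good jokes"}   # approved region
blacklist = {"tile matter with molecular smileyfaces"}           # explicitly forbidden

def U(option):
    return option[1]

whitelisted_choice = max((o for o in all_options if o[0] in whitelist), key=U)
blacklisted_choice = max((o for o in all_options if o[0] not in blacklist), key=U)

print("whitelist search picks:", whitelisted_choice[0])  # stays in the approved region
print("blacklist search picks:", blacklisted_choice[0])  # finds the unblocked weird neighbor
```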
The Task paradigm for advanced agents concentrates on accomplishing some single pivotal act via one or more tasks of limited scope. Combined with other measures, this might make it easier to identify an adequate safe plan for accomplishing the limited-scope task, rather than needing to identify the fragile peak of $V$ within some much larger landscape. The Task AGI formulation is claimed to let us partially "narrow down" the necessary $U$ to the part of $V$ that's relevant to the task, and the searched policy space to what is merely adequate. This might reduce or mitigate, though not by itself eliminate, unforeseen maxima.
Mild optimization can be seen as "not trying so hard, not shoving all the way to the maximum" - the hope is that when combined with a Task paradigm plus other measures like conservative goals and strategies, this will produce less optimization pressure toward weird edges and unforeseen maxima. (This method is not adequate on its own because an arbitrary adequate-$U$ policy may still not be high-$V$, ceteris paribus.)
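A minimal sketch of mild optimization over the same invented options (the adequacy threshold is arbitrary): rather than taking the global $U$-argmax, pick any option that clears an "adequate" bar. As the parenthetical above notes, an adequate-$U$ pick can still be low-$V$, so this reduces optimization pressure without guaranteeing niceness.

```python
# Toy sketch of mild optimization (satisficing); all names and numbers are invented.
import random

options = [
    ("make people genuinely happy",             100,    100),
    ("tell good jokes",                          60,     70),
    ("tile matter with molecular smileyfaces",  10**9, -10**6),
]
ADEQUATE_U = 50

adequate = [o for o in options if o[1] >= ADEQUATE_U]
choice = random.choice(adequate)  # no extra push toward the extreme, weird maximum
print("mild optimizer picks:", choice[0])
# Note: the weird option can still be drawn; mildness alone does not exclude it.
```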
Imitation-based agents try to maximize similarity to a reference human's immediate behavior, rather than trying to optimize a utility function.
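A minimal sketch of that choice rule (the demonstration string and the crude word-overlap similarity are stand-ins for a learned imitation model): candidate actions are scored by closeness to the human's behavior, with no utility function $U$ anywhere in the rule.

```python
# Toy sketch of an imitation-based choice rule; the similarity measure is a
# crude stand-in for a learned model, and all strings are invented.
human_demo = "hand the person a warm drink and chat"

def similarity(a, b):
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

candidates = [
    "hand the person a hot drink and chat",
    "tile all matter with molecular smileyfaces",
]
choice = max(candidates, key=lambda action: similarity(action, human_demo))
print("imitator picks:", choice)
```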
The prospect of being tripped up by unforeseen maxima is one of the contributing motivations for giving up on hand-coded object-level utilities in favor of meta-level preference frameworks that learn a utility function or decision rule. (Again, this doesn't seem like a full solution by itself, only one ingredient to be combined with other methods. If the utility function is a big complicated learned object, that by itself is not a good reason to relax about the possibility that its maximum will be somewhere you didn't foresee, especially after a capabilities boost.)
Missing the weird alternative and the apple pie problem suggest that it may be unusually difficult to explain to actors why the weird option $Y$ is a difficulty of their favored utility function $U$ that allegedly implies the nice policy $X$. That is, for psychological reasons, this difficulty seems unusually likely to actually trip up sponsors of AI projects or politically block progress on alignment.