Optimiser to Oracle

It would be useful to have a mechanism for getting an AI to provide an honest answer to a specific question. This post presents such a mechanism.

The question is: what is the impact of X on the expected utility of a utility function v? Here X is some well-defined binary event. Formally, for a constant c (positive or negative), we want to query the AI as to whether Q(v,X,c) is true, where

Q(v,X,c) = {E(v|X=1) - E(v|X=0) > c}.

How can we do that? Well, for a start, let M(u) be a u-maximising agent. We'll assume we can feed M() any utility function to maximise. We'll further assume that M() is boxed, that it has a single output bit Y, and that we have cut any acausal trade possibilities. We'll add another binary process Z, which, if Z=1, ensures that the value of Y is never recorded. In actual fact, P(Z=0)≈1, so the output is almost always recorded.

The remaining assumptions are on the operations we can perform on v (without needing any further understanding of v). We replace v by v#, defined in the following way. First of all, we make v# indifferent to worlds in which X≠Y or Z=0. Indifference to Z=0 can proceed in the usual way; indifference to X≠Y means that the value of a world with X≠Y is changed to the value of the world where X is set to the value of Y.

Therefore any v# maximiser will behave as if it lived in a world where X=Y (its output and the outcome of X were the same) and Z=1 (its output was never recorded). Then we add a special, one-off penalty of -c for outputting Y=1 (there is no effect for outputting Y=0). This defines v#, and we give our boxed AI v# to maximise.

So, how will M(v#) behave? Because it acts as if X=Y with certainty, it behaves as if it can determine the value of X by outputting Y. Because it acts as if Z=1 with certainty, it behaves as if the output Y is never read, so it can determine the value of X without affecting anything else about the universe. It will now behave as if it were a v maximiser with control over X. So if it outputs Y=0, it expects to get utility E(v|X=0). If it outputs Y=1, it expects to get utility E(v|X=1)-c (because of the -c penalty).

Therefore (ignoring ties) M(v#) will output Y=1 iff it estimates Q(v,X,c) to be true. And, since in actual fact Z=0 almost certainly, we get to see its output, hence its conclusion. Voila!
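To make the decision rule concrete, here is a minimal Python sketch of an idealised M(v#) on a toy model. Everything in it is an assumption made for illustration: the world labels, the conditional distributions, and the function names `expected_utility`, `oracle_output` and `q_holds` do not come from the construction above, and the agent is assumed to estimate both conditional expectations exactly. Ties are resolved to Y=0, which is an arbitrary choice.

```python
from typing import Callable, Dict

# Hypothetical toy model: a distribution maps world labels to P(world | X=x).
Distribution = Dict[str, float]
Utility = Callable[[str], float]


def expected_utility(v: Utility, dist: Distribution) -> float:
    """Return E(v | condition), where dist holds P(world | condition)."""
    return sum(p * v(world) for world, p in dist.items())


def oracle_output(v: Utility, given_x1: Distribution,
                  given_x0: Distribution, c: float) -> int:
    """Output bit Y of an idealised v#-maximiser.

    Acting as if X=Y and Z=1, the agent compares the utility it expects
    from Y=1, namely E(v|X=1) - c, with the utility it expects from Y=0,
    namely E(v|X=0). Ties go to Y=0 (an arbitrary choice).
    """
    value_if_y1 = expected_utility(v, given_x1) - c
    value_if_y0 = expected_utility(v, given_x0)
    return 1 if value_if_y1 > value_if_y0 else 0


def q_holds(v: Utility, given_x1: Distribution,
            given_x0: Distribution, c: float) -> bool:
    """Evaluate Q(v,X,c) directly: E(v|X=1) - E(v|X=0) > c."""
    return expected_utility(v, given_x1) - expected_utility(v, given_x0) > c


# Illustrative numbers only: X makes the "good" world more likely.
values = {"good": 10.0, "bad": 0.0}
v = values.__getitem__
given_x1 = {"good": 0.75, "bad": 0.25}
given_x0 = {"good": 0.25, "bad": 0.75}

for c in (-1.0, 3.0, 6.0):
    print(c, oracle_output(v, given_x1, given_x0, c), q_holds(v, given_x1, given_x0, c))
```

In this toy model the output bit agrees with the direct evaluation of Q(v,X,c) for each tested c. The sketch says nothing about how indifference is implemented or how a real agent would form its estimates.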

Note: the advantage of this method is to allow us to get the truth out of an AI, without knowing how to define "truth".

AI ALIGNMENT FORUM