Personal Blog

A putative new idea for AI control; index here.

Normal people start with simple ideas and complexify them; it seems that I start with complex ideas and then simplify them.

Partially inspired by a conversation with Tom Voice, here is the simpler version of the AIs in virtual worlds idea.

The idea is simply to physically confine the AI up to time $t$. The AI gets a reward or utility only before time $t$, and is motivated to care only about events before time $t$. We turn the AI off at time $t$, and inspect its output only afterwards.

We could try to enforce this by the usual trick of defining some event $E$ that will erase the AI's output (and the AI?) at or before time $t$, and multiplying the AI's utility or reward by $I_E$, the indicator function for erasure. Therefore, the argument goes, the AI will only care about what happens in worlds where we won't see its output, so it is not motivated to manipulate us.
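In symbols (a minimal sketch, with $U$ the AI's utility, $E$ the erasure event, and $I_E$ its indicator, and assuming the AI's output cannot affect the probability of erasure itself):

$$\mathbb{E}[I_E \cdot U] = P(E)\,\mathbb{E}[U \mid E],$$

so maximising the modified reward is the same as maximising $\mathbb{E}[U \mid E]$: the AI optimises only over the worlds in which its output is never read.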

The weakness of imperfection

Unlike the original "virtual worlds" idea, it is not safe to spy on the AI while it's working, or at any time before time $t$, as this may motivate us to intervene in the AI's execution before time $t$, giving the AI an incentive to manipulate what we observe. So this is more of an oracle idea, but one that allows a different class of outputs from the other two ideas: counterfactual utility estimation and the confirmed selective oracle.

Both of those ideas, however, are stable under imperfect problem specification. If the AI doing counterfactual utility estimation estimates that erasure won't be perfect, then it is motivated to manipulate us only through the least significant figures of its output, and the degree of manipulation it's willing to attempt can be estimated. The confirmed selective oracle is naturally motivated to manipulate us, but will only do so if that is a more likely route to success than answering honestly; if we set the problem up right, this shouldn't happen.
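One way to see the first of these claims (a rough sketch in my own notation, not taken from the original posts): if erasure fails with small probability $\delta$, and an answer $a$ earns accuracy score $S(a)$ in erasure worlds but also a manipulation payoff $M(a)$ in leak worlds, the AI roughly maximises

$$(1 - \delta)\,S(a) + \delta\,M(a).$$

It will therefore only distort its answer up to an accuracy loss of order $\delta \cdot \sup_a M(a)$; the smaller the leak probability, the fewer significant figures it is willing to sacrifice.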

The approach in this post works if the assumptions hold perfectly, but it is not robust if they are slightly wrong. A small probability of the AI valuing events after time $t$ means that it might slip a "manipulate humans" trick into its output, if it can do so at no cost.
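Schematically (again in my own notation): if the misspecification gives the AI an effective utility

$$U_{\text{eff}} = I_E \cdot U + \epsilon\,V_{\text{post-}t}$$

for some small $\epsilon > 0$, then whenever two candidate outputs score equally on $I_E \cdot U$, the $\epsilon$ term breaks the tie, so a manipulative output that costs nothing on the main objective wins no matter how small $\epsilon$ is.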

Of course, we can bring in other techniques to make this type of Oracle more robust: reducing the size of its output, erasing its run history (which we'd otherwise want to access), or specifying its reward function so that the AI can't compute manipulations without sacrificing its standard reward. But this mars the beauty of the original approach, illustrating that it isn't perfect on its own.
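As a toy illustration of that last technique (everything here is hypothetical, chosen only to show the tie-breaking failure and one fix):

```python
# Toy model (hypothetical, not from the post): the AI picks the output
# maximising  main_reward + eps * manipulation_value,  where eps is a
# small, unintended preference over post-t events.

# Each candidate output: (main reward under I_E * U, manipulation payoff)
CANDIDATES = {
    "honest":       (1.0, 0.0),  # full main reward, no manipulation
    "manipulative": (1.0, 5.0),  # *same* main reward, plus manipulation
}

def best_output(eps, manipulation_cost=0.0):
    """Output maximising (main - cost if manipulative) + eps * manipulation."""
    def score(item):
        _, (main, manip) = item
        cost = manipulation_cost if manip > 0 else 0.0
        return main - cost + eps * manip
    return max(CANDIDATES.items(), key=score)[0]

print(best_output(eps=0.0))        # exact tie: either output is optimal
print(best_output(eps=1e-9))       # any eps > 0 -> "manipulative", at no cost
print(best_output(eps=1e-9,        # attaching a cost to manipulation on the
                  manipulation_cost=0.1))  # main reward restores "honest"
```

The point is only that an exact tie on the main reward is unstable: any nonzero residual preference decides it, while even a tiny cost attached to manipulation removes the problem.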
