This idea has been around for some time, known as indirect normativity. The variant you describe was also my own preferred formulation at the time. For a few years it was a major motivation for me to study decision theory, since this still needs the outer AGI to actually run the program, and ideally also to eventually yield control to that program, once the program figures out its values and they can be slotted in as the values of the outer AGI.
This doesn't work out, for several reasons. We don't actually have a way of creating the goal program. The most straightforward option would be to use an upload, but that probably can't be done before AGI.
If we do have a sensible human imitation, then the thing to do with it is to build an HCH that pushes out the Goodhart boundary of that human imitation, allowing stronger optimization of the world without breaking down our ability to assess its value. This gives the first aligned AGI directly, without turning the world into computronium.
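A minimal sketch of the kind of HCH-style recursion assumed here, with `human_imitation` standing in for the imitation discussed above (the names and the fixed depth cutoff are illustrative, not part of any existing implementation):

```python
def hch(question: str, human_imitation, depth: int = 3) -> str:
    """Answer `question` via a human imitation that may consult copies of
    this same procedure on subquestions, up to a fixed recursion depth."""
    if depth == 0:
        # At the recursion limit the imitation must answer unaided.
        return human_imitation(question, ask=None)
    # Otherwise the imitation can break the question into subquestions,
    # each answered by another (depth-limited) instance of HCH.
    return human_imitation(
        question,
        ask=lambda subquestion: hch(subquestion, human_imitation, depth - 1),
    )
```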
Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs. It's not even known what kind of thing goals are: what type signature the goal program's output would need in order to communicate goals to the outer AGI.
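Purely as an illustration of what a "type signature for goals" could even mean, here are two hypothetical candidate shapes for the goal program's output; neither is claimed to be the right answer:

```python
from typing import Any, Callable

WorldState = Any  # whatever representation of the world the outer AGI uses

# Candidate 1: the goal program eventually outputs a utility function over
# world states, the shape expected by an expected-utility maximizer.
UtilityFunction = Callable[[WorldState], float]

# Candidate 2: it outputs something weaker, e.g. an "acceptability" predicate
# over states that an assessor can actually evaluate (closer to the
# mild-optimization picture discussed below).
AcceptabilityPredicate = Callable[[WorldState], bool]

# The open question is which of these (or something else entirely) is the
# right return type for the goal program.
GoalProgram = Callable[[], UtilityFunction]  # or Callable[[], AcceptabilityPredicate]
```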
Even if we mostly knew how to build the outer AGI that runs a goal program (though with some confusion around the notion of goals still remaining), it's unclear that there are normative goals for humanity in the sense of a utility function for expected utility maximization, that is, goals fit for a strong optimizer. We might want to discover such goals through reflection, but reflection doesn't necessarily reach a conclusion, as it is unbounded.
More likely, there is just a sequence of increasingly accurate proxy goals with increasingly wide Goodhart boundaries, instructing a mild optimizer how to act on states of the world it is able to assess. But then the outer AGI must already be a mild optimizer, not a predatory mature optimizer that ignores all boundaries of what's acceptable in pursuit of the goal it knows (in this case, the goal program).
This sets up motivation for what I currently see as valuable on the decision theory side: figuring out a principled way of doing mild optimization (quantilization is the only thing in this space at the moment). It should probably take something like the Goodhart boundary as a fundamental ingredient of its operation (it seems related to the base distribution of quantilization), the kind of thing that's traditionally missing from decision theory.
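For concreteness, a minimal sketch of quantilization as usually described: sample from a trusted base distribution and pick among the top fraction by the proxy utility, rather than taking the argmax. The analogy between the base distribution / top-q cutoff and a Goodhart boundary is the speculation above, not an established result:

```python
import random

def quantilize(base_sampler, proxy_utility, q: float = 0.1, n: int = 1000):
    """Return an action from roughly the top-q fraction of the base
    distribution as ranked by the proxy utility, instead of the argmax."""
    candidates = [base_sampler() for _ in range(n)]
    candidates.sort(key=proxy_utility, reverse=True)
    top = candidates[: max(1, int(q * n))]
    # Sampling from the top fraction (rather than taking the single best
    # candidate) limits how hard the proxy gets optimized, which is the
    # point of mildness.
    return random.choice(top)
```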
Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs.
Actually, it is (to a 0th approximation) known how to build an AGI that is motivated to compute it: use infra-Bayesian physicalism. The loss function in IBP already has the semantics "which programs should run". Following the goal it outputs is also formalizable within IBP, but even without this step we can just have utopia inside the goal program itself[1].
[1] We should be careful to prevent the inhabitants of the virtual utopia from creating unaligned AI which eats the utopia. This sounds achievable, assuming the premise that we can actually construct such programs.
Oh wow, it's fascinating to see someone actually investigating this proposal. (I had a similar idea, but only posted it in the EA meme group).
(this post has been written for the first Refine blog post day, at the end of the week of readings, discussions, and exercises about epistemology for doing good conceptual research)
the Insulated Goal-Program idea is a framework for AI alignment which feels potentially more tractable than most other ideas i've seen.
it splits the task of building aligned AI into two parts: designing the goal-program itself, and designing an AI whose goal is to run that program.
the fact that the AI's goal is to run a program, which it is motivated to execute without altering, lets us design a goal-program that doesn't have to deal with an adversarial optimizing superintelligence: it is insulated from the AI's choices.
(or at least, there's supposedly no reason for the AI to run long stretches of altered variants of that program: doing so would incur computational cost for no gain)
one way to insulate the goal-program is to make it fully deterministic. ideally, however, we would want it to receive as input the state of the world before the AI modifies the world, which it will pretty much inevitably do, destroying everything and tiling the universe with computronium dedicated to running the goal-program.
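as a toy sketch of this insulation (all names hypothetical, and under the strong assumption that "running the program faithfully" can be specified at all): the goal-program is a deterministic function of a frozen pre-AI snapshot of the world, and the AI's objective rewards only a faithful run of that fixed call, so altering the program or its input gains it nothing.

```python
import hashlib
from typing import List

WorldSnapshot = bytes  # the pre-AI state of the world, frozen once and never updated
Trace = List[str]      # whatever record counts as "the program having been run"

def goal_program(snapshot: WorldSnapshot) -> Trace:
    # Fully deterministic: the same snapshot always yields the same trace,
    # and nothing the outer AI later does to the physical world is an input.
    # (Placeholder body; the real program would simulate a utopia or a long
    # values-discovery process.)
    return [hashlib.sha256(snapshot).hexdigest()]

def outer_ai_objective(computed_trace: Trace, snapshot: WorldSnapshot) -> float:
    # The outer AI is rewarded only for faithfully computing the fixed program
    # on the fixed snapshot; running altered variants scores the same as not
    # running it at all, so tampering buys it nothing.
    return 1.0 if computed_trace == goal_program(snapshot) else 0.0
```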
this is how this idea solves the "facebook AI destroys the world six months later" problem: the AI will run the goal-program at any cost, including turning everything that exists into computronium.
but that's okay: the point here is for us, or at least our values, to survive inside the goal-program. that is the bullet i bite to allow this idea to function: i give up on the literal physical world around us, in the hopes that we're satisfied enough with getting to determine what it is that runs on the computronium that everything is turned into.
making the goal-program runnable on quantum compute might allow us to resimulate earth as well as extract a lot more compute from the universe in general, especially if BQP ≠ P.
this whole framework splits the problem of aligned AI cleanly into two parts: the design of the AI-insulated goal-program, and the design of the AI whose goal will be to run said program. the goal-program's insulatedness lets us design utopias or utopia-finding programs which don't have to deal with adversariality from the AI, such as vaguely-friendly NNs evaluating the quality of simulated worlds, or simulated researchers figuring out alignment with as much time as they need. i write more about goal-program design here.
it also resolves some questions of embedded agency: the goal-program is smaller than the agent, so the AI might only need notions of embedded agency resolved for how it thinks about the outside world it's turning into computronium.