I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we'd live for a while with input-imitator AIs that get more and more capable but still stay docile.
But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example, if you can make an AI imitate a conversation between humans, then maybe you can make an AI that makes real-world plans as fast as a committee of 10 smart humans conversing at 1000x speed. For extra fun, allow the imitated committee to send network packets and read responses; for extra extra fun, give them access to a workbench for improving their own AI. I'd say this gets awfully close to a world-optimizer that could plausibly defeat the rest of humanity, if the imitator it's running on is good enough (GPT-6 or something). And there's of course no law saying it'll be friendly: you could prompt the inner humans with "you want to destroy real humanity" and watch the fireworks.
This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle.
Extremely broad, dense reward functions constrain training-compatible goal sets
Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function.
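To make the contrast concrete, here is a toy sketch (plain Python/numpy; the vocabulary, targets, and reward rule are all made up for illustration, not taken from any real training setup) of dense per-token supervision versus a sparse episodic reward:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 8, 5

# A toy "predictor": one probability distribution over the vocabulary per position.
logits = rng.normal(size=(seq_len, vocab_size))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Hypothetical ground-truth next tokens for one training sequence.
targets = rng.integers(0, vocab_size, size=seq_len)

# Dense supervision: every position is scored immediately against its ground truth;
# there is no need to finish the "episode" before evaluating the first prediction.
per_token_loss = -np.log(probs[np.arange(seq_len), targets])
print("per-token losses:", per_token_loss)   # seq_len separate, immediate signals

# Sparse episodic reward, for contrast: a single scalar arrives only after the
# whole sequence is produced (here, 1.0 iff every sampled token matched).
sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
episode_reward = float(np.all(sampled == targets))
print("episode reward:", episode_reward)     # one delayed, low-information signal
```

The predictor gets seq_len separate, immediate evaluations per sequence, while the sparse-reward agent gets one delayed scalar; that is the sense in which the supervision behaves like a broad, densely defined reward function.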
It's easier for a model to fall into an undesired training-compatible goal set[1] when there are many accessible undesirable goal sets relative to desirable ones. As the number of constraints imposed by the reward function during training increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints.
There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the model, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't.
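One loose way to write the intuition down (my notation, not the post's): let (x_i, y_i) be training samples and call a candidate goal set g training-compatible if an agent optimally pursuing g, with behavior written as pi_g, reproduces the trained outputs on those samples.

```latex
% Sketch of the "more constraints, fewer surviving goal sets" intuition.
\[
  G_N \;=\; \{\, g \;:\; \pi_g(x_i) = y_i \ \text{for all } i \le N \,\}
\]
% Each additional training constraint can only remove candidates:
\[
  G_1 \supseteq G_2 \supseteq \dots \supseteq G_N
\]
% so as N grows to trillions of per-token evaluations, the goal sets that remain
% are forced to agree with the desired behavior on an ever larger set of inputs.
```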
Impact on internal mesaoptimizers
The dense loss/reward function does not constrain out-of-distribution behavior as heavily. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution.
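As a cartoon of those degrees of freedom (a made-up toy, not a claim about how real models represent anything), a function can achieve zero loss on every training point while doing whatever it likes everywhere else:

```python
import numpy as np

# Toy cartoon: a "model" that exactly reproduces its training targets while
# behaving arbitrarily anywhere the training distribution never reached.
train_x = np.array([0.0, 1.0, 2.0, 3.0])
train_y = np.sin(train_x)                  # the ground truth it must match

def model(x: float) -> float:
    for xi, yi in zip(train_x, train_y):
        if np.isclose(x, xi):
            return float(yi)               # in-distribution: zero training loss
    # Out of distribution: the loss function imposes no constraint here, so any
    # behavior in this region is equally compatible with a perfect training score.
    return 1e9 * float(np.sign(x))         # stand-in for arbitrary misbehavior

print([model(x) for x in train_x])         # matches the targets exactly
print(model(10.0))                         # unconstrained by training
```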
But how would that type of mesaoptimizer develop in the first place?
Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden.
The best story I've found so far goes something like this:
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL does, at least in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural[2], but perhaps not impossible for some form of predictor taken to the relevant extreme.
It seems critical to understand the degree to which outer constraints apply to inner learning, how different forms of training/architecture affect the development of instrumentality, and how much "space" is required for different levels of instrumentality to develop.
I expect:
I really want these expectations tested![4]
No greater coherence
If you manage to train a highly capable, minimally instrumental agent of this kind, there isn't really anywhere else for the model-as-agent to go. Using the term "values" loosely, its values do not extend beyond actions-in-contexts, and those values are immediately satisfied upon outputting the appropriate action. Satisfaction of its values is entirely under the agent's control and requires no environmental modification. No external intermediate steps are required. It does not have exploitable preferences, and it has no lingering internal inconsistencies that self-refinement would resolve.
There is no greater coherence to be found for such an agent; it is complete through shallowness.
This is not the case for a model that develops internal instrumentality or misaligned mesaoptimizers, and even zero instrumentality will not prevent a simulator from simulating more concerning kinds of agents. Still, this is a pretty odd property for any agent to have!
I'd view this as one extreme end of a spectrum of agents, ranging from:
In other words, minimal instrumentality can be thought of as making the learned utility function as broad and densely defined as possible, such that there is simply no more room for intermediate goal-seeking behavior.
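A loose formalization of that framing (my notation and my interpretive gloss, not the post's), where p* denotes the ground-truth predictive distribution the training targets define:

```latex
% Minimal instrumentality: the implied utility is defined directly on
% (context, action) pairs and is maximized at the moment of output,
\[
  U_{\text{dense}}(c_t, a_t) \;=\; \log p^{*}(a_t \mid c_t),
\]
% so nothing about the future environment matters to its satisfaction. A sparser
% utility defined only over eventual world states,
\[
  U_{\text{sparse}}(s_T),
\]
% leaves both room and incentive for intermediate, environment-modifying steps
% valued only for their effect on s_T.
```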
Using the term as in this post.
It assumes a mesaoptimizer was found, and that it outperforms other implementations (including other mesaoptimizers) that would have more closely matched the dense output constraints.
It assumes enough space for the mesaoptimizer to learn the relevant kind of instrumentality that would operate beyond a single invocation.
It assumes the mesaoptimizer is able to learn and hold onto misaligned goals early while simultaneously having sufficient capability to realize it needs to hide that misalignment on relevant predictions.
It assumes the extra complexity implied by the capable misalignment is small enough that SGD can accidentally hop into its basin.
And so on.
I also note the conspicuous coincidence that some of the most capable architectures yet devised are low instrumentality, or rely on world models which are. It would not surprise me if low instrumentality, and the training constraints/hints it typically corresponds to, often imply relative ease of training at a given level of capability. If so, that seems like a pretty happy accident.
I'm working on some experiments, but if you have ideas, please do those experiments too. This feels like a weird little niche that has very little empirical data available.
While you could construct such a function for an agent that ends up matching the behavior of a more restricted utility function and its accompanying intermediate behaviors, the assumption doing the work here is that the dense utility function was chosen to be not that.