This post works off the assumption that the first AGI comes relatively soon, and has an architecture which looks basically like EfficientZero, with a few improvements: a significantly larger world model and a significantly more capable search process. How would we align such an AGI?
Our pitch is to identify human values within the AGI’s world model, and use this to direct the policy selector through IRL (inverse reinforcement learning). We take a lot of inspiration from Vanessa’s PreDCA proposal [comment][video], as well as ideas developed in Infra-Bayesian Physicalism. We have stripped these down to what we saw as the core insights, meaning that there are significant differences between this and PreDCA. We initially arrived at this proposal by thinking about an idea similar to "retarget the search", except we’re using hard-coded search instead of learned optimizers, and doing the "identify human values" part using a mathematical definition of agents & goals.
We think that this proposal directly gets at what we view as the core of the alignment problem: pointing to human values in a way that is robust as capabilities scale. Naturally, this all depends on the research outlined in the 'Research Required' section succeeding. See the last few sections for many of the difficulties of this approach.
The most important assumption that we are making is that the agent is designed to explicitly search over actions that maximize a utility function. This is opposed to the model where an AGI is a single component trained end-to-end by RL (or self-supervised learning), and where the AGI learns its own mesa-objective (or the mesa-objectives of simulacra[1]) internally. We will lay out a concrete vision of this model, but keep in mind that the exact details don't matter much.[2]
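As a rough illustration of what we mean by an explicit, hard-coded search over actions, here is a minimal Python sketch. The `world_model.rollout` interface and all names are hypothetical, and a real EfficientZero-style system would use MCTS over a learned latent model rather than brute-force rollouts; this is only meant to show where the utility function plugs in.

```python
import numpy as np

def select_action(world_model, utility_fn, state, actions, horizon=5, n_rollouts=100):
    """Pick the action whose simulated rollouts score highest under the
    hard-coded utility function (brute-force stand-in for a real search)."""
    best_action, best_value = None, -np.inf
    for action in actions:
        returns = []
        for _ in range(n_rollouts):
            # world_model.rollout is assumed to sample a trajectory of future
            # states from the learned world model, starting with this action.
            trajectory = world_model.rollout(state, action, horizon)
            returns.append(utility_fn(trajectory))
        value = np.mean(returns)
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```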
We are also structuring this proposal to point the agent at maximizing the preferences of a single human (henceforth "the operator") as a simplifying assumption: we think the hard part is getting this to work for a single human, and that scaling up to recognizing and aggregating many people's preferences is not too hard.
We assume that we are trying to align our model to the values of an individual human, the operator[3]. The world model contains information about the operator, and we are looking to determine "what they value".
This approach[4] aims to obtain this human’s utility function, both through reverse-engineering the human’s observed actions, and also from all information the world model has gathered thus far which informs what the human would do in hypothetical scenarios (e.g. other humans, psychology, neuroscience, moral philosophy etc.). The basic idea is to do IRL on the world model’s guess of the operator policy. We think, a priori, that humans are more likely to have simpler utility functions, so a simplicity prior is appropriate for backing out a good approximation of a human’s ‘Hypothetical True Utility Function’. The approach also assumes that humans are "intelligent", in the sense of being likely to be good at optimizing our utility function.[5] This approach doesn't assume that humans actually have an explicit utility function internally.
We start by defining a measure of intelligence, $g$.[6] Given a utility function $U$, the intelligence of an agent is a function of how often the agent's policy, $\pi$, acquires more expected utility than a random policy, $\pi'$ (where we have some prior distribution over policies, $P(\pi')$). The two policies are separately evaluated according to the probability distribution over possible worlds, $P(w)$.
$U(\pi, w)$ describes how much utility an agent gets according to a given utility function $U$ from following the policy $\pi$ in a possible world $w$.
Then $\mathbb{E}_{w \sim P(w)}[U(\pi, w)]$ describes the expected amount of utility an agent will get by following the policy $\pi$ across the distribution of possible worlds $P(w)$, since we have uncertainty over exactly which world we are in.
Putting this together, we obtain the formula for the intelligence, $g(\pi, U)$, of the agent above. This is given by the negative logarithm of the probability that a random policy performs at least as well in expectation as the agent's policy according to the utility function $U$:

$$g(\pi, U) = -\log_2 \Pr_{\pi' \sim P(\pi')}\!\left[\mathbb{E}_{w \sim P(w)}[U(\pi', w)] \geq \mathbb{E}_{w \sim P(w)}[U(\pi, w)]\right]$$

This defines our intelligence metric.
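A minimal sketch of how this metric could be estimated numerically, assuming we can sample random policies from the prior $P(\pi')$ and enumerate a small set of possible worlds. All the function names and interfaces here are our own, for illustration only.

```python
import numpy as np

def expected_utility(policy, utility_fn, worlds, world_probs):
    """E_{w ~ P(w)}[U(policy, w)]: expected utility of a policy over possible worlds."""
    return sum(p * utility_fn(policy, w) for w, p in zip(worlds, world_probs))

def intelligence(policy, utility_fn, policy_prior_sampler, worlds, world_probs,
                 n_samples=1000):
    """g(pi, U) = -log2 Pr_{pi' ~ P(pi')}[ E[U(pi')] >= E[U(pi)] ],
    estimated by sampling random policies from the prior."""
    agent_value = expected_utility(policy, utility_fn, worlds, world_probs)
    at_least_as_good = 0
    for _ in range(n_samples):
        random_policy = policy_prior_sampler()
        if expected_utility(random_policy, utility_fn, worlds, world_probs) >= agent_value:
            at_least_as_good += 1
    # +1 smoothing keeps the estimate finite when no sampled policy does as well
    return -np.log2((at_least_as_good + 1) / (n_samples + 1))
```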
Now we use a function of this intelligence metric as a kind of “likelihood” which determines a probability distribution over possible utility functions the operator might have. We use $\xi(U)$, the Solomonoff prior, as the simplicity prior over utility functions. This ends up being:

$$P(U \mid \pi) \propto 2^{g(\pi, U)} \cdot \xi(U)$$
Or, substituting our formula for $g(\pi, U)$, and adding some detail explained in a footnote[7]:

$$P(U \mid \pi) = \frac{1}{Z} \cdot \frac{\xi(U)}{\Pr_{\pi' \sim P(\pi')}\!\left[\mathbb{E}_{w}[U(\pi', w)] \geq \mathbb{E}_{w}[U(\pi, w)]\right]}$$
The $\frac{1}{Z}$ is just because we have to normalize, as in Bayes' rule; here the denominator would be just $Z = \sum_U 2^{g(\pi, U)} \, \xi(U)$.
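Putting the pieces together, a hedged sketch of the resulting posterior computation, reusing the `intelligence` function sketched above and standing in for the Solomonoff prior with hand-supplied description lengths (a real implementation would need a principled complexity measure):

```python
import numpy as np

def utility_posterior(candidate_utilities, description_lengths, operator_policy,
                      policy_prior_sampler, worlds, world_probs):
    """P(U | pi) proportional to 2^{-K(U)} * 2^{g(pi, U)}, over a finite set of
    candidate utility functions, with K(U) replaced by a supplied description length."""
    log2_weights = np.array([
        intelligence(operator_policy, U, policy_prior_sampler, worlds, world_probs) - k
        for U, k in zip(candidate_utilities, description_lengths)
    ])
    # Subtract the max before exponentiating purely for numerical stability.
    weights = 2.0 ** (log2_weights - log2_weights.max())
    return weights / weights.sum()  # the 1/Z normalization
```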
The equation above gives us a distribution over the likely utility functions of the operator. To calculate it, we must have access to a distribution over computations (policies) $\pi$ that happen in the world model and also "look like" the operator (according to some classifier). We also require that we can replace this policy, in the world model, with counterfactual policies $\pi'$.
We can think of the above equation as searching for intelligent agents in the world model, by searching over all “programs” in the world model and finding ones with a high $g$. Prior to this, we will need to implement a simple classifier to narrow down the space of possible 'operator' agents out of all the agents identified.
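A sketch of what this agent-detection-plus-classification step might look like, under the (strong) assumption that we can enumerate candidate "programs" from the world model, again reusing the `intelligence` function above. `operator_classifier` stands for some learned classifier scoring how operator-like a program is; the thresholds are arbitrary placeholders.

```python
def find_operator_candidates(world_model_programs, operator_classifier,
                             candidate_utilities, policy_prior_sampler,
                             worlds, world_probs, g_threshold=5.0):
    """Keep programs that look agentic (high g under *some* candidate utility
    function), then filter down to those the classifier flags as the operator."""
    agentic = []
    for program in world_model_programs:
        best_g = max(
            intelligence(program, U, policy_prior_sampler, worlds, world_probs)
            for U in candidate_utilities
        )
        if best_g >= g_threshold:
            agentic.append((program, best_g))
    return [(p, g) for p, g in agentic if operator_classifier(p) > 0.5]
```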
This solves outer alignment as long as we have correctly specified the human operator in the world model, pointed to it, and then inferred (an accurate distribution over) its utility function. As the world model improves, this process should converge to a better and better approximation of the operator’s utility function, even if the operator is in many ways irrational. This is because the irrationalities will be better explained by the operator having lower $g$ than by adding weird complexities and exceptions to the utility function. (In reality, there will be a weighting between the simplicity prior and the intelligence "likelihood", and this will likely not be 1:1 as in the formula above. The optimal trade-off can probably be found empirically, based on what works best for identifying the utility functions of simpler test agents.)
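For concreteness, one way to write that weighted trade-off (the exponents $\alpha$ and $\beta$ are our own notation for illustration, not something the proposal pins down):

$$P(U \mid \pi) \propto \xi(U)^{\alpha} \cdot 2^{\beta \, g(\pi, U)},$$

with $\alpha = \beta = 1$ recovering the formula above.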
We implemented a simple example of this process (with very small discrete priors over utility functions, policies, and environments, and a Cartesian agent). In practice we will need to implement an approximation to this process, which will still require a very efficient procedure for estimating expectations with respect to huge distributions over policies, environments and utility functions.
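To give the flavour of such a toy setup, here is a minimal self-contained sketch with everything discrete, so the posterior over utility functions can be computed exactly by enumeration. The specific worlds, policies, and utility functions are invented for illustration and are not the exact example we ran.

```python
import itertools
import numpy as np

# Toy setup: 2 possible worlds, policies are mappings from world to action,
# and utility functions score (action, world) pairs.
worlds = [0, 1]
world_probs = [0.5, 0.5]
actions = [0, 1]

# All deterministic policies: tuples giving the action taken in each world.
policies = list(itertools.product(actions, repeat=len(worlds)))
policy_prior = np.ones(len(policies)) / len(policies)

# Candidate utility functions with hand-assigned "description lengths" (bits).
utilities = {
    "match_world":  (lambda a, w: float(a == w), 2.0),
    "always_act_1": (lambda a, w: float(a == 1), 2.0),
    "complicated":  (lambda a, w: float(a == 1 - w) * 0.7 + 0.1 * a, 6.0),
}

def expected_utility(policy, u):
    return sum(p * u(policy[w], w) for w, p in zip(worlds, world_probs))

def intelligence(policy, u):
    """g = -log2 Pr_{pi'}[E[U(pi')] >= E[U(pi)]], computed exactly over the prior."""
    v = expected_utility(policy, u)
    mass = sum(pp for pi, pp in zip(policies, policy_prior)
               if expected_utility(pi, u) >= v)
    return -np.log2(mass)

# Observed operator policy: picks the action matching the world.
operator_policy = (0, 1)

log_weights = {name: intelligence(operator_policy, u) - k
               for name, (u, k) in utilities.items()}   # log2 of 2^{g - K(U)}
z = sum(2.0 ** lw for lw in log_weights.values())
posterior = {name: 2.0 ** lw / z for name, lw in log_weights.items()}
print(posterior)  # "match_world" should get most of the mass
```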
There are several ways that this proposed AGI design could be inner misaligned.
On the other hand, in this proposal we have a tool for detecting agents which we could use to check for agents present in the world model’s model of itself, which gives us a hint at a strategy for avoiding dangerous inner optimizers. We haven’t thought this strategy through properly yet.
We have some hope that we can avoid powerful mesa-optimizers by restricting the computation allowed in learned components (world model and any heuristics in the action selector). This may allow the overall algorithm to be aligned for long enough that it can develop its own anti-inner-misalignment strategies. To rely on this, we would want something like an upper bound on the potential “optimization power” that could be implemented by a given ML model.
The second risk is a big one, because it doesn't seem plausible to remove it by limiting the capacity of the world model. However, two possible inner misalignments could occur in the world model:
The first case would be a major problem for this proposal (and any prosaic alignment proposal), but we're uncertain about how likely it is to happen in practice. We think the second case is more likely, and that robustly classifying the operator in the world model prevents this from being a problem (as long as potential malign agents aren't influencing the world model's model of itself).
"Simulacra" refers to a framing of the behaviour of large language models in which they use simulated agentic behaviour with simulated goals to maximize predictive accuracy (e.g. when prompted to write a poem in the style of Robert Frost, GPT-3 deploys a 'Robert Frost simulacrum').
See for comparison Steve Byrnes' model and Yann LeCun's 'A Path Towards Autonomous Machine Intelligence'.
In practice, we probably won't want to do this; instead we would have a set of operators to point at, and then follow some aggregation of their utility functions.
Again we should emphasize that this is heavily influenced by Vanessa’s IBP and PreDCA, and then changed by us into a different, much less rigorous proposal which lacks a lot of the important formalization Vanessa uses. We think this is worth doing because we are better able to analyze, understand and criticize our version.
Approximately equivalent to taking assumptions 1 and 2a from On the Feasibility of Learning Biases for Reward Inference.
From Definition 1.6 in IBP.
One way of justifying these equations is to think of the whole "Utiliscope" as doing inference on a Bayes Net.
If we assume that we have a distribution over operator policies, and we know that the operator is more likely to have higher intelligence $g$, we can infer a distribution over the utility function of the operator. This can be thought of as inference on the following Bayesian network:
Each node is a random variable, where red means we have evidence about this variable, and blue is for latent variables. Each arrow is a functional relationship that we have prior knowledge about. We need to infer $U$, given our information about WM and $\hat{g}$.
$g$ is a deterministic function of $\pi$ and $U$, and is the intelligence of the agent. We assume that we have a noisy observation of $g$, called $\hat{g}$.
Writing out the equations for inferring the distribution over $U$, we get:

$$P(U \mid \pi, \hat{g}) \propto P(\hat{g} \mid g(\pi, U)) \cdot P(U)$$

If we assume that the likelihood of $\hat{g}$ is exponential in the level of intelligence (strong prior knowledge that the operator has high $g$),

$$P(\hat{g} \mid g(\pi, U)) \propto 2^{g(\pi, U)},$$

and that the prior over utility functions is a Solomonoff prior, $P(U) = \xi(U)$, then this becomes:

$$P(U \mid \pi, \hat{g}) \propto 2^{g(\pi, U)} \cdot \xi(U)$$
Realistically we would probably want the likelihood to increase fast for higher $g$, but decrease after a certain point, to encode the knowledge that humans are intelligent agents, but not perfect agents.
Specifically, the agent's beliefs take the form of a homogeneous ultra-contribution over $\Gamma \times \Phi$, your joint belief over computations and physics. The bridge transform, essentially, just checks whether the agent's knowledge is consistent with a computation outputting something else (given a specific hypothesis about the universe). If it is consistent for the computation to output something else, then the computation didn't run; otherwise it did. See Infra-Bayesian Physicalism: a formal theory of naturalized induction for more details.
The grain-of-truth problem shouldn't apply here, because we are not modeling the operator as optimal; this is what the intelligence measure gets around. There is still a problem with the choice of priors.
The Legg-Hutter measure of intelligence (p23) does this, and it appears to be Vanessa's intention.