This is an expanded version of my answer to application problem 3 for Nate Soares and Vivek Hebbar's SERI MATS stream.
Why this post?
Vanessa's presentation has already been turned into a post by carado (as part of Refine) and a video in Connall and Jack's excellent series (as part of SERI MATS). But I felt that carado's post was still too mathy and disconnected, and that the video glossed over some interesting details. So my motivation was to write a summary that explains the proposal in more common Alignment language (especially to make it easier to compare with other proposals), without presupposing familiarity with Infra-Bayesianism (providing only the required motivating details).
Summary
Infra-Bayesianism (tentatively) provides a robust framework for talking about agent foundations, including formalizations of how to identify agents and their utility functions. PreDCA is a protocol we can have an artificial agent follow (given that its top-level architecture implements Infra-Bayesianism), so that the agent uses these tools to identify who created it ("the user", a human) and infer their utility function. The implementation is strict enough to outright negate some of the usual failure modes (if the theoretical tools work as intended), but the implemented framework is general enough (including belief updating and world modeling) to allow for general intelligence.
What needs to happen for this to work
The most important requirement is that the agency framework provides the correct, common-sense answers when computationally applied to fuzzy, real-world scenarios.
The AGI needs to be fitted into a very specific top-level architecture (without drastically losing capabilities).
It needs to be the case that maximizing the utilities of a group of present humans allows for future moral development without lock-in (or that we find this lock-in ethically acceptable).
We need a good enough model of what constitutes a human (or different patches or theoretical breakthroughs to prevent acausal attacks).
Additionally, the necessary further technical developments need to converge on something applicable in practice.
Key points
A common failure mode is wireheading: the AGI tampering with the communication channel through which the human specifies preferences/rewards, the AGI tampering with the human's preferences directly, or similar. PreDCA makes this impossible by eliminating this communication channel altogether. But then, how can the AGI infer the human's utility function? This is possible because Infra-Bayesian Physicalism provides a bird's-eye view of reality, in which the AGI is not privileged as a protagonist, and so the actions of the human inside reality itself are sufficient (once the AGI achieves a good enough model of reality and identifies said human), without needing to be directly transmitted to the AGI.
More concretely, an AGI with this framework continuously holds hypotheses that make up its world model. These come in two forms. Some are hypotheses about the real physical world: what the laws of Physics are, and what the specific, contingent evolution of our timeline is (which particles are where at what time, etc.). Others are hypotheses about computations: which programs produce which outputs when run.
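To make the distinction concrete, here is a minimal, illustrative sketch (in Python, with made-up names; nothing here is Vanessa's actual formalism) of what holding these two kinds of hypotheses side by side might look like:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class PhysicalHypothesis:
    """A guess about the physical world: given a time step, what the
    (coarse-grained) state of the world is, e.g. which particles are where."""
    trajectory: Callable[[int], str]

@dataclass
class ComputationalHypothesis:
    """A guess about computations: which output each program (here keyed
    by its source code) produces when run."""
    outputs: Dict[str, int]

@dataclass
class JointHypothesis:
    """The AGI's beliefs combine both kinds of hypotheses."""
    physics: PhysicalHypothesis
    computations: ComputationalHypothesis
```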
Why would we care to differentiate these two kinds of hypotheses? Because that allows for identifying which programs are running in the real world. More concretely, thanks to the generality of Infra-Bayesian belief updating, the AGI can reason about counterfactuals[1] (although more theoretical work is still required on this). So the AGI can consider what the world would look like (given its current information) under different computational hypotheses: for instance, under the hypothesis that program P outputs 0, and under the hypothesis that it outputs 1. If, in these two counterfactuals, the physical evolution of the world differs (if P outputs 0 then this robot moves left, but if it outputs 1 it moves right), that means program P is being run in the real world. Using this reasoning, and the inputs it is receiving, the AGI can also locate itself inside its world model (without losing its unprivileged bird's-eye view). This also allows it to reason about which agents (programs) causally affect which others: if an agent behaving differently (a program producing a different output) would lead to another agent receiving different observations (inputs), then the first has affected the second.
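As a toy rendering of these two tests (purely illustrative: `world_model`, its `predict` and `observations` queries, and the binary outputs are assumptions of mine, not part of the actual framework):

```python
def program_is_running(world_model, program_source: str) -> bool:
    """A program is instantiated in the physical world if assuming
    different outputs for it makes the world model predict different
    physical trajectories."""
    trajectory_if_0 = world_model.predict(assume={program_source: 0})
    trajectory_if_1 = world_model.predict(assume={program_source: 1})
    return trajectory_if_0 != trajectory_if_1

def affects(world_model, agent_a: str, agent_b: str) -> bool:
    """Agent A causally affects agent B if A producing a different output
    would change the observations (inputs) that B receives."""
    observations_if_0 = world_model.observations(agent_b, assume={agent_a: 0})
    observations_if_1 = world_model.observations(agent_b, assume={agent_a: 1})
    return observations_if_0 != observations_if_1
```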
Some of these programs will be especially agentic or intelligent, and the framework provides a way of identifying them as well. First off, it allows for inferring which utility function a program is maximizing (if any). To do this, we search over possible functions and check which one the agent's actions are best explained as maximizing. But wait, won't this trivially produce an absurd utility function which is just "always do exactly what the agent is doing"? To prevent this overfitting, we penalize more complex functions (give them negative points for being too convoluted and unnatural).
For example, suppose Alice has a big and varied basket of fruits and vegetables, and from it she sequentially picks an apple, a pomegranate, a strawberry and a tomato. The function "Pick an apple, then a pomegranate, then a strawberry, then a tomato" completely fits her data, but it is very complex (it contains a lot of information). Alternatively, the simple function "Pick red things" models her behavior more naturally.
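Here is a minimal sketch of that complexity-penalized fit, using Alice's basket. The candidate functions, the length-based penalty, and the scoring are all invented for illustration (a crude stand-in for the actual prior over utility functions):

```python
observed_picks = ["apple", "pomegranate", "strawberry", "tomato"]
RED_THINGS = {"apple", "pomegranate", "strawberry", "tomato", "red pepper"}

candidates = {
    # Memorizes the exact sequence: perfect fit, but a very long description.
    "pick exactly an apple, then a pomegranate, then a strawberry, then a tomato":
        lambda picks: picks == observed_picks,
    # Short description that also explains the data.
    "pick red things":
        lambda picks: all(p in RED_THINGS for p in picks),
    # Short description, but it fails to explain the data.
    "pick green things":
        lambda picks: all(p not in RED_THINGS for p in picks),
}

def score(description, fits_data):
    """Fit to Alice's observed picks minus a crude complexity penalty
    (description length as a stand-in for Kolmogorov complexity)."""
    fit = 1.0 if fits_data(observed_picks) else 0.0
    complexity_penalty = len(description) / 100
    return fit - complexity_penalty

best = max(candidates, key=lambda d: score(d, candidates[d]))
print(best)  # -> "pick red things"
```

The memorized sequence fits perfectly but pays a large complexity penalty, so "pick red things" wins.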
Secondly, once we know the program's utility function, we can rate its agenticness or intelligence by calculating how many policies (sequences of actions) perform better than its own policy on that function; that is, to what extent its policy correctly optimizes its utility function. The fewer policies that do better, the more intelligent the agent.
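A toy version of that rating might look as follows. The uniform random sampling and the specific example utility function are my assumptions (the real measure uses a prior over policies and the agent's own beliefs), but the shape of the idea is the same: the smaller the fraction of policies that beat the agent, the higher the score:

```python
import math
import random

def agenticness(utility, agent_policy, sample_policy, n_samples=10_000):
    """Toy intelligence measure: sample random policies and see what
    fraction achieve higher utility than the agent's policy. Return the
    negative log of that fraction (higher = more intelligent)."""
    agent_score = utility(agent_policy)
    better = sum(utility(sample_policy()) > agent_score for _ in range(n_samples))
    fraction_better = (better + 1) / (n_samples + 1)  # smoothed to avoid log(0)
    return -math.log2(fraction_better)

# Hypothetical usage: utility counts how many of 10 binary choices are 1;
# the agent picks 1 nine times out of ten, which few random policies beat.
utility = lambda policy: sum(policy)
agent_policy = [1] * 9 + [0]
sample_policy = lambda: [random.randint(0, 1) for _ in range(10)]
print(agenticness(utility, agent_policy, sample_policy))
```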
With the above and other tools, the following protocol can be implemented (a toy sketch of the full pipeline follows the three steps below).
Precursor detection: A "precursor" is any agent (that is, any agentic enough program) that could have prevented the AGI from existing. For instance, the human who pressed the button to start the AGI is a precursor. But so are many other humans and beings: maybe their cat is too, because it could have gnawed through the power cable, or distant aliens, because they could have invaded us long ago.
Classification: From this big group of precursors, we need to select the actual agent we want to assist ("the user", the human). There are different and complementary ad hoc methods for this. For instance, we can demand that the agent have a "very close" causal effect on the AGI, in the sense that their actions can very quickly affect the AGI (maybe because they are in the same room). Also, we can implement a rough model of what constitutes a human (using cognitive science) to discard all non-human agents. Other ideas exist.
Assistance: Once the agent (the human user) is fixed, the AGI extrapolates their utility function from their actions (through the method above), but only from their past actions (before the AGI came into existence), so as to prevent the AGI from tampering with their preferences. This can be generalized to maximize a weighted combination of the utility functions of a group of humans.
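Putting the three steps together, a toy sketch of the pipeline (every name and query on `world_model` is a hypothetical placeholder of mine, not actual PreDCA machinery) could look like this:

```python
def predca_protocol(world_model, agi_birth_time, closeness_threshold=1.0):
    """Toy end-to-end sketch of the three PreDCA steps described above."""
    # 1. Precursor detection: agents that could have prevented the AGI's existence.
    precursors = [a for a in world_model.detect_agents()
                  if world_model.could_have_prevented(a, agi_birth_time)]

    # 2. Classification: keep only causally close, human-like precursors ("the user").
    users = [a for a in precursors
             if world_model.causal_distance(a) < closeness_threshold
             and world_model.looks_human(a)]

    # 3. Assistance: infer each user's utility function from actions taken
    #    *before* the AGI existed, so tampering with their preferences cannot
    #    pay off, then act to maximize their combination (equal weights here).
    utilities = [world_model.infer_utility(u, before=agi_birth_time) for u in users]
    return lambda policy: sum(u(policy) for u in utilities)
```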
An acausal attacker is a hypothetical agent who simulates the whole world model of our AGI to make it believe that it lives inside the attacker's simulation. More concretely, regardless of whether these attackers actually exist (in this or other universes), these kinds of simulation hypotheses will make their way into the AGI, because some of them will be much simpler than the actual physical existence of our complex world, and the AGI selects for simpler hypotheses (by Occam's razor, trying to converge on the actual laws of Physics)[2]. There are several approaches that try to prevent this, especially through Classification.
How might this protocol solve Inner Alignment? The only way to change our AGI's actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model. So, allegedly, the only possible mesa-optimizers will take the form of acausal attackers (that is, simulation hypotheses), or at least something that can be very naturally modelled as an acausal attack (any false hypothesis about the world that changes the precursor that is chosen as the user, or a property relevant to actions maximizing their utility). And also allegedly, the methods implemented against radical acausal attacks will be sufficient to avoid this (and other less radical wrong hypotheses will be naturally dealt with by our AGI converging on the right physical world model).
How might this protocol solve Outer Alignment? As mentioned above, the AGI now has no channel to hack nor user to modify, and so has no wrong proxy to learn. We do need to ensure that the utility function extrapolation works as intended, but provided it does, Outer Alignment seems trivially solved.
On that note, this protocol of course takes the approach of pointing at human values (through the behavior of the user) instead of directly specifying a utility function. That's because Vanessa finds the latter far less tractable, because of corrigibility and other problems. And that's why she believes full Alignment to be easier than, for instance, the strawberry problem[3]: no pre-made pointer to this latter task exists in the real world (no human cares about copying strawberries), while some humans can serve as a pointer for the former.
So why does this approach seem to solve the Pointers Problem? Mainly because, when extrapolating the user's utility function, the selection for simpler functions will more naturally capture the real physical expression of the user's values than a version that includes their shortcomings and biases, which don't correspond to physical reality (and so are allegedly more complex).
[1] This generality also prevents some problems in Embedded Agency.
[2] This issue apparently applies to any powerful enough AGI, and PreDCA directly addresses it.
[3] Having an AGI construct an exact cellular copy of a strawberry, and then do nothing else.