I don't think this idea is particularly novel, but it comes up often in conversations I have, so I figured it'd be good to write it up.
How do you prevent deception from AI systems?
One obvious thing to try would be to make sure that your model never thinks deceptive thoughts. There are several ways you could go about this, to varying degrees of effectiveness. For instance, you could check for various precursors to deceptive alignment. More prosaically, you could try identifying deceptive circuits within your model, or looking at what features your model is keeping track of and identifying any suspicious ones.
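The "look for suspicious features" idea can be made concrete with a toy sketch. Everything below is hypothetical: the synthetic "deception-related" direction, the dimensions, and the injected signal strength are all made up for illustration. One common form such a check takes is a linear probe over a model's hidden activations:

```python
import numpy as np

# Toy sketch: suppose (hypothetically) that a model's hidden activations
# contain a linear "deception-related" feature direction. A linear probe
# fit on labeled activations can test whether that feature is detectable.

rng = np.random.default_rng(0)
d = 32        # hidden dimension (toy)
n = 500       # number of sampled activations

# A made-up unit vector standing in for the feature direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

labels = rng.integers(0, 2, size=n)        # 1 = feature active (synthetic)
acts = rng.normal(size=(n, d))             # background activation noise
acts += np.outer(labels * 4.0, true_dir)   # inject the feature when active

# Fit a least-squares linear probe: acts @ w ≈ labels.
w, *_ = np.linalg.lstsq(acts, labels.astype(float), rcond=None)

preds = (acts @ w) > 0.5
accuracy = (preds == labels.astype(bool)).mean()
```

The point of the sketch is only that this kind of check is a property of the agent's internals: it needs a forward pass and some labeled activations, not a model of the environment.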
I think these are pretty reasonable approaches. However, I think they fail to address failure modes like deep deceptiveness. A system can be deceptive even if no individual part looks deceptive, due to complex interactions between the system and the environment. More generally, cognition and optimization power can be externalized to the environment.
One could argue that focusing on more salient and dangerous failure modes like deceptive alignment makes a lot more sense. However, if you're interested in approaches that work in the worst case and don't rely on reality turning out optimistically one way or the other, you probably want approaches that prevent things from going badly at all.
So, how do you prevent any failure modes that route through deception? In an earlier post, I wrote about robust intent alignment as the solution, and one research direction I think is feasible to get there. But here I want to make a different point, about what it would look like to interface with deception in that setup.
Start from the intuition that deception in a system is a property of the person being deceived more than of the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system made up of the agent and its environment. So, if you want to interface with the general thing that is deception, you're going to be building an interface to that composite system.
Another way of putting this is that evaluating whether an action leads to deception requires being able to evaluate things within the complexity class of the environment. This brings us back to the classical difficulty of RLHF: we can't foresee the consequences of an action very reliably in a complex environment, which means we're unable to properly evaluate whether an action taken by something smarter than us is good or not.
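A back-of-the-envelope illustration of the asymmetry (all numbers here are made up for illustration): evaluating an action by foreseeing its consequences scales with the environment's branching over the horizon, while checking a property internal to the agent scales with the agent's size.

```python
# Toy arithmetic (hypothetical numbers throughout): compare the cost of
# rolling out an action's consequences against the cost of inspecting
# the agent's internals.

branching = 10          # assumed environment responses per step
horizon = 12            # steps of consequences to foresee
n_params = 7 * 10**9    # parameters in a hypothetical 7B-parameter model

rollout_states = branching ** horizon   # states reachable from one action
internal_checks = n_params              # cost ~ one pass over the weights
```

Even with these modest made-up numbers, the exponential rollout term already exceeds the cost of a full pass over the agent's parameters, and the gap widens with horizon length.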
If all alignment properties we cared about were in this complexity class, it may be the case that worst-case alignment is strictly intractable.
However, there are many properties that belong to the complexity class of the agent alone, for which evaluation is much more feasible. Properties describing specific mechanistic components of the model, for instance. The sub-problem of identifying explicitly deceptive circuits within a model falls under this category[1].
Another such property relates to internal optimization within systems. If you buy the idea of optimization as pointing at something real in the world[2], then under the right operationalization, properties of that optimization belong to the complexity class of the system it's internal to. One example is the target of this optimization: what I refer to as objectives, and the focus of most of my work.
I think one way to frame the core part of the technical alignment problem is that we need to figure out how to interface with the alignment properties we care about within the systems we care about. The choice of what properties we try to build interfaces with is pretty central to that, and I think a commonly missed idea is that some properties we care about cash out in pretty different ways at the limit, and as such are more or less tractable.
This is also, by my understanding, the part of the problem that most theories of impact for current interpretability work focus on. That it only focuses on this part of the problem is one of my reservations about current interpretability work.
Even if the concept is very confused in its current operationalizations. It could also be very fuzzy in realistic scenarios, with blurred boundaries between things that are structurally optimizers and things that aren't. However, I think there's a coherent conceptual boundary that can be drawn.