I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
But suppose I’m wrong, and timestep-dominance is always relevant.
My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.
I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I...
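To make the notion concrete, here's a minimal sketch of a timestep-dominance check. The encoding (each lottery represented by its expected utility conditional on shutdown at each timestep, with expectations taken relative to the agent's beliefs) and all names are illustrative assumptions for the example, not the official formalism:

```python
# Illustrative sketch only: represent each lottery by its expected utility
# conditional on shutdown occurring at each timestep t = 0..T (expectations
# taken with respect to the agent's beliefs). This encoding is an assumption
# made for the example, not the official definition.

def timestep_dominates(x, y):
    """True iff x is at least as good as y conditional on every
    shutdown-timestep, and strictly better conditional on some."""
    assert len(x) == len(y)
    return (all(xt >= yt for xt, yt in zip(x, y))
            and any(xt > yt for xt, yt in zip(x, y)))

# Resisting shutdown burns resources at every shutdown-timestep, so the
# non-resisting lottery timestep-dominates it:
comply = [1.0, 2.0, 3.0]
resist = [0.9, 1.8, 2.7]
assert timestep_dominates(comply, resist)

# A mutually non-dominant pair: each is better conditional on a different
# shutdown-timestep, so Timestep Dominance alone doesn't decide between them.
early_payoff = [3.0, 1.0, 1.0]
late_payoff  = [1.0, 1.0, 3.0]
assert not timestep_dominates(early_payoff, late_payoff)
assert not timestep_dominates(late_payoff, early_payoff)
```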
I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.
The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.
I think you intend 'sensitive to unused alternatives' to refer to the...
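For reference, here's my rough paraphrase of the standard VNM statement (note that it also needs a continuity axiom, which the quoted summary leaves out):

$$
\succeq \text{ satisfies completeness, transitivity, continuity, and independence} \;\Longleftrightarrow\; \exists\, u:\mathcal{O}\to\mathbb{R} \text{ such that } L \succeq M \iff \sum_{o\in\mathcal{O}} L(o)\,u(o) \ge \sum_{o\in\mathcal{O}} M(o)\,u(o).
$$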
Thanks, this comment is also clarifying for me.
My guess is that a corrigibility-centric training process says 'Don't get the ice cream' is the correct completion, whereas full alignment says 'Do'. So that's an instance where the training processes for CAST and FA differ. How about DWIM? I'd guess DWIM also says 'Don't get the ice cream', and so seems like a closer match for CAST.
Corrigibility is, at its heart, a relatively simple concept compared to good alternatives.
I don't know about this, especially if obedience is part of corrigibility. In that case, it seems like the concept inherits all the complexity of human preferences. And then I'm concerned, because as you say:
When a training target is complex, we should expect the learner to be distracted by proxies and only get a shadow of what’s desired.
I think obedience is an emergent behavior of corrigibility.
In that case, I'm confused about how the process of training an agent to be corrigible differs from the process of training an agent to be fully aligned / DWIM (i.e. training the agent to always do what we want).
And that makes me confused about how the proposal addresses problems of reward misspecification, goal misgeneralization, deceptive alignment, and lack of interpretability. You say some things about gradually exposing agents to new tasks and environments (which seems sensible!), but I'm conc...
Got this on my list to read! Just in case it's easy for you to do, can you turn the whole sequence into a PDF? I'd like to print it. Let me know if that'd be a hassle, in which case I can do it myself.
Thanks, appreciate this!
It's unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn't seem like it can be with respect to the agent's subjective beliefs as this would make it even harder to impart.
I propose that we train agents to satisfy TD with respect to their subjective beliefs. I’m guessing that you think that this kind of TD would be hard to impart because we don’t know what the agent believes, and so don’t know whether a lottery is timestep-dominated with respect to those beliefs, and so don’t know wheth...
Great post! Lots of cool ideas. Much to think about.
systems with incomplete preferences will tend to contract/precommit in ways which complete their preferences.
Point is: non-dominated strategy implies utility maximization.
But I still think both these claims are wrong.
And that’s because you only consider one rule for decision-making with incomplete preferences: a myopic veto rule, according to which the agent turns down a trade if the offered option is ranked lower than its current option according to one or more of the agent’s utility functions.
The myopic...
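For concreteness, here's a minimal sketch of that myopic veto rule, assuming the agent's incomplete preferences are represented by a set of utility functions (the representation and all names are illustrative):

```python
# Minimal sketch of the myopic veto rule described above, assuming incomplete
# preferences represented by a set of utility functions. Names are illustrative.

def myopic_veto_accept(current_option, offered_option, utility_functions):
    """Accept the trade only if no utility function ranks the offered
    option strictly below the current option (i.e. nothing vetoes it)."""
    return all(u(offered_option) >= u(current_option) for u in utility_functions)

# Example: two utility functions that disagree about apples vs. oranges.
u1 = {"apple": 2, "orange": 1, "pear": 3}.get
u2 = {"apple": 1, "orange": 2, "pear": 3}.get

# Neither apple nor orange is ranked above the other by both utility
# functions, so a trade between them is vetoed in either direction...
assert not myopic_veto_accept("apple", "orange", [u1, u2])
assert not myopic_veto_accept("orange", "apple", [u1, u2])
# ...but a pear is ranked at least as high by every utility function, so it is accepted.
assert myopic_veto_accept("apple", "pear", [u1, u2])
```

Other decision rules for incomplete preferences (e.g. ones that look at the whole anticipated sequence of trades rather than each trade in isolation) behave differently, which is the point at issue.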
The point is: there are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. The VNM Theorem doesn't say that, nor does Savage's Theorem, nor does Bolker-Jeffrey, nor do Dutch Books, nor does Cox's Theorem, nor does the Complete Class Theorem.
But suppose we instead define 'coherence theorems' as theorems which state that
...If you are not shooting yourself in the foot in sense X, we can view you as having coherence property Y.
I think of coherence theorems loosely as things that say if an agent follows such and such principles, then we can prove it will have a certain property.
If you use this definition, then VNM (etc.) counts as a coherence theorem. But Premise 1 of the coherence argument (as I've rendered it) remains false, and so you can't use the coherence argument to get the conclusion that sufficiently-advanced artificial agents will be representable as maximizing expected utility.
I’m following previous authors in defining ‘coherence theorems’ as
theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
On that definition, there are no coherence theorems. VNM is not a coherence theorem, nor is Savage’s Theorem, nor is Bolker-Jeffrey, nor are Dutch Book Arguments, nor is Cox’s Theorem, nor is the Complete Class Theorem.
there are theorems that are relevant to the question of agent coherence
I'd have no proble...
theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
While I agree that such theorems would count as coherence theorems, I don't think this covers most of the things I think of as coherence theorems, so I consider it simply a bad definition.
I think of coherence theorems loosely as things that say if an agent follows such and such principles, then we can prove it will have a certain property. The usefulness comes from both...
Thanks. I agree with your first four bulletpoints. I disagree that the post is quibbling. Weak man or not, the-coherence-argument-as-I-stated-it was prominent on LW for a long time. And figuring out the truth here matters. If the coherence argument doesn't work, we can (try to) use incomplete preferences to keep agents shutdownable. As I write elsewhere:
...
I feel that coherence arguments, broadly construed, are a reason to be skeptical of such proposals, but debating coherence arguments because of this seems backward. Instead, we should just be discussing your proposal directly. Since I haven't read your proposal yet, I don't have an opinion, but some coherence-inspired questions I would be asking are: