This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."
I don't think I grok the DSM formalism well enough to speak confidently about what it would mandate, but I think I see a (class of) decision problem where any agent (DSM or otherwise) must either pass up a certain gain, or else engage in "problematic" behavior (where "problematic" doesn't necessarily mean "untrammelled" according to the OP definition, but instead more informally means "something which doesn't help to avoid the usual pressures away from corrigibility / towards coherence"). The problem in question is essentially the inverse of the example you give in section 3.1:
Consider an agent tasked with choosing between two incomparable options A and B, and if it chooses B, it will be further presented with the option to trade B for A+, where A+ is incomparable to B but comparable (and preferable) to A.
(I've slightly modified the framing to be in terms of trades rather than going "up" or "down", but the decision tree is isomorphic.)
Here, A+ isn't in fact "strongly maximal" with respect to A and B (because it's incomparable to B), but I'm fairly confident in declaring that any agent which foresees the entire tree in advance, and which does not pick B at the initial node (going "down", if you want to use the original framing), is engaging in a dominated behavior. To the extent that DSM doesn't consider this a dominated strategy, DSM's definitions aren't capturing a useful notion of what is "dominated" and what isn't.
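To make the tree concrete, here's a minimal sketch of the check I have in mind. To be clear about assumptions: this is not the DSM formalism (which, as I said, I haven't carefully inspected); it's just the plain test "is there another available plan whose outcome is strictly preferred?", and the names `STRICTLY_PREFERS`, `PLANS`, and `dominated_plans` are made up for illustration.

```python
# Strict preferences as (better, worse) pairs. Any pair not listed is
# treated as incomparable. Assumption: the only strict preference in the
# example is A+ over A.
STRICTLY_PREFERS = {("A+", "A")}


def prefers(x, y):
    """True if outcome x is strictly preferred to outcome y."""
    return (x, y) in STRICTLY_PREFERS


def comparable(x, y):
    """True if the two outcomes are ranked against each other at all."""
    return x == y or prefers(x, y) or prefers(y, x)


# Every complete plan through the tree, mapped to the outcome it ends in.
PLANS = {
    "pick A": "A",
    "pick B, keep B": "B",
    "pick B, trade for A+": "A+",
}


def dominated_plans(plans):
    """Map each dominated plan to the list of plans that strictly beat it."""
    result = {}
    for name, outcome in plans.items():
        better = [other for other, o in plans.items() if prefers(o, outcome)]
        if better:
            result[name] = better
    return result


if __name__ == "__main__":
    print(dominated_plans(PLANS))
    print(comparable("B", "A"), comparable("B", "A+"))
```

Under these assumptions the only dominated plan is "pick A" (beaten by "pick B, trade for A+"), even though B itself is incomparable to both A and A+. That's the sense in which an agent that foresees the whole tree and still picks A looks dominated to me.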
Again, I'm not claiming this is what DSM says. You can think of me as trying to run an obvious-to-me assertion test on code which I haven't carefully inspected, to see if the result of the test looks sane. But if a (fully aware/non-myopic) DSM agent does constrain itself into picking B ("going down") in the above example, despite the prima facie incomparability of {A, A+} and {B}, then I would consider this behavior problematic once translated back into the context of real-world shutdownability, because it means the agent in question will at least in some cases act in order to influence whether the button is pressed.
(The hope behind incomplete preferences, after all, is that an agent whose preferences over world-states can be subdivided into "incomparability classes" will only ever act to improve its lot within the class of states it finds itself in to begin with, and will never act to shift—or prevent itself from being shifted—to a different incomparability class. I think the above example presents a deep obstacle to this hope, however. Very roughly speaking, if the gaps in the agent's preferences can be bridged via certain causal pathways, then a (non-myopic) agent which does not exploit these pathways to its own benefit will notice itself failing to exploit them, and self-modify to stop doing that.)
My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc.
On my understanding, the argument isn’t that your DSM agent can be made better off, but that the reason it can’t be made better off is because it is engaging in trammeling/“collusion”, and that the form of “trammeling” you’ve ruled out isn’t the useful kind.
As far as an example goes, consider a sequence of actions which, starting from an unpressed world state, routes through a pressed world state (or series of pressed world states), before eventually returning to an unpressed world state with higher utility than the initial state.
(The real-world context of such a sequence shouldn’t be too relevant to a theoretical example like this, but if you had to imagine something, you could imagine a “memory gambit”-esque stratagem, where the agent spends some time inactive in order to avoid detection, but has set things up in advance to eventually reactivate itself under more favorable circumstances. Again, the plausibility of the scenario isn’t super relevant here.)
If your proposed DSM agent passes up this action sequence on the grounds that some of the intermediate steps need to bridge between “incomparable” pressed/unpressed trajectories, then it does in fact pass up the certain gain. Conversely, if it doesn’t pass up such a sequence, then its behavior is the same as that of a set of negotiating subagents cooperating in order to form a larger macroagent.
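Here's a toy version of that dichotomy, in case it helps. Everything in it is an assumption for illustration: the `State` type, the utility numbers, and the two acceptance rules are hypothetical stand-ins, not anything taken from the DSM definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    pressed: bool   # whether the shutdown button has been pressed
    utility: float  # only meaningful against states with the same button status


def comparable(x, y):
    """Assumption: states are comparable only within the same button status."""
    return x.pressed == y.pressed


def strictly_better(x, y):
    """x strictly beats y only if they are comparable and x has higher utility."""
    return comparable(x, y) and x.utility > y.utility


def stepwise_rule_accepts(path):
    """Hypothetical rule: refuse any single step that bridges incomparable states."""
    return all(comparable(a, b) for a, b in zip(path, path[1:]))


def endpoint_rule_accepts(path):
    """Hypothetical rule: accept if the final state strictly beats the initial one."""
    return strictly_better(path[-1], path[0])


# Unpressed (utility 1) -> pressed -> pressed -> unpressed (utility 2).
PATH = [State(False, 1.0), State(True, 0.0), State(True, 0.0), State(False, 2.0)]

if __name__ == "__main__":
    print(stepwise_rule_accepts(PATH))  # rejects: first step crosses classes
    print(endpoint_rule_accepts(PATH))  # accepts: endpoint beats start
```

The step-by-step rule turns the sequence down and so passes up the gain; the endpoint rule takes it, and in doing so acts across the pressed/unpressed boundary. I don't see a third option.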
I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.
Flagging here that I don't think the subagent framing is super important and/or necessary for "collusion" to happen. Even if the "outer" agent isn't literally built from subagents, "collusion" can still occur in the sense that it [the outer agent] can notice that its (incomplete) preferences factorize, in a way that allows it to deliberately trade particular completions of them against each other and thereby acquire more resources. The outer agent would then choose to do this for basically the same reason that a committee of subagents would: to acquire more resources for itself as a whole, without disadvantaging any of the completions under consideration.
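A small sketch of what I mean by "completions", reusing the A / A+ / B options from the earlier example. The assumptions here are mine: a completion is modeled as a total order consistent with the strict preferences, and "not disadvantaging any completion" is simplified to "ranks higher under every completion". Resources aren't modeled at all.

```python
from itertools import permutations

OPTIONS = ("A", "A+", "B")
STRICT = {("A+", "A")}  # (better, worse); everything else is incomparable


def completions(options, strict):
    """All total orders (best first) that respect every strict preference."""
    result = []
    for order in permutations(options):
        rank = {x: i for i, x in enumerate(order)}
        if all(rank[better] < rank[worse] for better, worse in strict):
            result.append(order)
    return result


def improves_under_all(old, new, orders):
    """True if `new` ranks above `old` in every completion."""
    return all(order.index(new) < order.index(old) for order in orders)


ORDERS = completions(OPTIONS, STRICT)

if __name__ == "__main__":
    for order in ORDERS:
        print(" > ".join(order))
    print(improves_under_all("A", "A+", ORDERS))
    print(improves_under_all("B", "A+", ORDERS))
```

There are three completions, and ending up with A+ instead of A is an improvement under all of them, so no completion has grounds to object. Ending up with A+ instead of B is not (the completion that ranks B first objects), which is why the comparison has to be made at the level of whole plans rather than individual swaps.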
Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:
You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways
[...]
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).
It certainly seems to me that e.g. people like Ziz have done reflection in a "goofy" way, and that being human has not particularly saved them from deriving "crazy stuff". Of course, humans doing reflection would still be confined to a subset of the mental moves being done by crazy minds made out of gradient descent on matrix multiplication, but it's currently plausible to me that part of the danger arises simply from "reflection on (partially) incoherent starting points" getting really crazy really fast.
(It's not yet clear to me how this intuition interfaces with my view on alignment hopes; you'd expect it to make things worse, but I actually think this is already "priced in" w.r.t. my P(doom), so explicating it like this doesn't actually move me—which is about what you'd expect, and strive for, as someone who tries to track both their object-level beliefs and the implications of those beliefs.)
(EDIT: I mean, a lot of what I'm saying here is basically "CEV" might not be so "C", and I don't actually think I've ever bought that to begin with, so it really doesn't come as an update for me. Still worth making explicit though, IMO.)
If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
but, but, standard counterargument imperfect proxies Goodharting magnification of error adversarial amplification etc etc etc?
(It feels weird that this is a point that seems to consistently come up in discussions of this type, considering how basic of a disagreement it really is, but it really does seem to me like lots of things come back to this over and over again?)
The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.
Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.
This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today:
And I'll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.
That is all.
(Obviously there's a kinda superficial resemblance here to the phenomenon of "calling out" somebody else; I want to state outright that this is not the intention, it's just that I saw your comment right after seeing Rohin's comment, in such a way that my memory of his remark was still salient enough that the connection jumped out at me. Since salient observations tend to fade over time, I wanted to put this down before that happened.)
In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a solution that we can be justifiably confident in.
You (a human) already exhibit #2-style trying. Despite this, you are not capable of "establishing a stable governance regime that regulates AI development" or "doing alignment research better than any existing human alignment researchers" (the latter is tautologically true, even).
So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described (or, indeed, most any pivotal act that we might recognize as "pivotal"). It then follows that if a system is capable enough to enact some such pivotal act, some part of that system must have been running a stronger search than the kind of search described in "#2-style trying". And if you buy Eliezer's/Nate's argument that it's the search itself that's dangerous, rather than the fact that you (maybe) wrapped up the search in an outer shell you happen to call "oracle AI" (or something), then it's not a large jump from there to "maybe the search decides 'killing all humans' rates highly according to its search criteria".
But perhaps you're conceptualizing this whole "trying" thing differently, because you go on to say:
Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).
which actually just does not parse in my native ontology. Like, in my ontology "sufficiently scaled-up reflex-like things" stop behaving reflexively. It's not that you have this abstract label "reflex-like", that you can slap onto some system, such that if you then scale that system up the label stays stuck to it indefinitely; in my model reflexiveness is a property of actions, not of systems, and if you make a system sufficiently powerful it leaves the regime where reflex-like behavior is its default. It automatically goes from #1 to #2 to #3 to #4 in the limit of sufficient scaling; this is, from my perspective, what is meant by the claim "these things exist on a continuum" (which claim it seems like you agreed with in a parallel comment thread, which simply furthers my confusion).
From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.
To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exactly like our own, except that energy is not conserved; once they understand the connection implied by Noether's theorem, this becomes an incoherent notion: you cannot remove the conservation-of-energy property without changing deep aspects of the laws of physics.
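For concreteness, here's the standard textbook version of that connection in the simplest case. This is a sketch under stated assumptions: a system described by a Lagrangian L(q, q̇, t) whose trajectories obey the Euler-Lagrange equations; the symbols are mine and weren't part of the discussion above.

```latex
\begin{align*}
E &\equiv \sum_i \dot q_i \frac{\partial L}{\partial \dot q_i} - L, \\
\frac{dE}{dt}
  &= \sum_i \left( \ddot q_i \frac{\partial L}{\partial \dot q_i}
       + \dot q_i \frac{d}{dt}\frac{\partial L}{\partial \dot q_i} \right)
     - \left( \sum_i \frac{\partial L}{\partial q_i}\,\dot q_i
       + \sum_i \frac{\partial L}{\partial \dot q_i}\,\ddot q_i
       + \frac{\partial L}{\partial t} \right) \\
  &= \sum_i \dot q_i \left( \frac{d}{dt}\frac{\partial L}{\partial \dot q_i}
       - \frac{\partial L}{\partial q_i} \right)
     - \frac{\partial L}{\partial t}
   = -\frac{\partial L}{\partial t}.
\end{align*}
```

The bracketed term in the last line vanishes by the Euler-Lagrange equations, so if the laws have no explicit time dependence (∂L/∂t = 0, i.e. time-translation symmetry), E is constant. To get rid of energy conservation you have to give L explicit time dependence, which is exactly the "changing deep aspects of the laws of physics" part.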
The second law of thermodynamics is similarly deep: it's actually a consequence of there being a (low-entropy) boundary condition at the beginning of the universe, but no corresponding (low-entropy) boundary condition at any future state. This asymmetry in boundary conditions is what causes entropy to appear directionally increasing--and again, once someone becomes aware of this, it is no longer possible for them to imagine living in a universe which started out in a very low-entropy state, but where the second law of thermodynamics does not hold.
In other words, thermodynamics as a "deep fundamental theory" is not merely [what you characterized as] a "powerful abstraction that is useful in a lot of domains". Thermodynamics is a logically necessary consequence of existing, more primitive notions--and the fact that (historically) we arrived at our understanding of thermodynamics via a substantially longer route (involving heat engines and the like), without noticing this deep connection until much later on, does not change the fact that grasping said deep connection allows one to see "at a glance" why the laws of thermodynamics inevitably follow.
Of course, this doesn't imply infinite certainty, but it does imply a level of certainty substantially higher than what would be assigned merely to a "powerful abstraction that is useful in a lot of domains". So the relevant question would seem to be: given my above-described epistemic state, how might one convince me that the case for thermodynamics is not as airtight as I currently think it is? I think there are essentially two angles of attack: (1) convince me that the arguments for thermodynamics being a logically necessary consequence of the laws of physics are somehow flawed, or (2) convince me that the laws of physics don't have the properties I think they do.
Both of these are hard to do, however--and for good reason! And absent arguments along those lines, I don't think I am (or should be) particularly moved by [what you characterized as] philosophy-of-science-style objections about "advance predictions", "systematic biases", and the like. I think there are certain theories for which the object-level case is strong enough that it more or less screens off meta-level objections; and I think this is right, and good.
Which is to say:
The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use. (emphasis mine)
I think you could argue this, yes--but the crucial point is that you have to actually argue it. You have to (1) highlight some aspect of the evolutionary paradigm, (2) point out [what appears to you to be] an important disanalogy between that aspect and [what you expect cognition to look like in] AGI, and then (3) argue that that disanalogy directly undercuts the reliability of the conclusions you would like to contest. In other words, you have to do things the "hard way"--no shortcuts.
...and the sense I got from Richard's questions in the post (as well as the arguments you made in this subthread) is one that very much smells like a shortcut is being attempted. This is why I wrote, in my other comment, that
I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.
I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would clarify this for me would have been for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.
Here's an existing Nate!comment that I find reasonably persuasive, which argues that these two things are correlated in precisely those cases where the outcome requires routing through lots of environmental complexity:
Here's an existing Nate!response to a different-but-qualitatively-similar request that, on my model, looks like it ought to be a decent answer to yours as well:
(The original discussion that generated this example was couched in terms of value alignment, but it seems to me the general form "delete all discussion pertaining to some deep insight/set of insights from the training corpus, and see if the model can generate those insights from scratch" constitutes a decent-to-good test of the model's cognitive planning ability.)
(Also, I personally think it's somewhat obvious that current models are lacking in a bunch of ways that don't nearly require the level of firepower implied by a counterexample like "go to the moon" or "generate this here deep insight from scratch", s.t. I don't think current capabilities constitute much of an update at all as far as "want-y-ness" goes, and continue to be puzzled at what exactly causes [some] LLM enthusiasts to think otherwise.)