This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: "here's the decision tree(s), here's what DSM mandates, here's why it's untrammelled according to the OP definition, and here's why it's problematic."
I don't think I grok the DSM formalism enough to speak confidently about what it would mandate, but I think I see a (class of) decision problem where any agent (DSM or otherwise) must either pass up a certain gain, or else engage in "problematic" behavi...
My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don't think that's possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would've done according to DSM etc.
On my understanding, the argument isn’t that your DSM agent can...
I'll first flag that the results don't rely on subagents. Creating a group agent out of multiple subagents is possibly an interesting way to create an agent representable as having incomplete preferences, but this isn't the same as creating a single agent whose single preference relation happens not to satisfy completeness.
Flagging here that I don't think the subagent framing is super important and/or necessary for "collusion" to happen. Even if the "outer" agent isn't literally built from subagents, "collusion" can still occur in the sense that it [the...
Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:
You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways
[...]
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).
It certainly seems to me that e....
I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems quite comparable to future humans having control over the future (also after a lot of effective "reflection").
I think there's some validity to worrying about a future with very different values from today'...
If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.
but, bu...
The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.
Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.
I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:
I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.
It's one of those things that we'd be plainly undignified not to try.
I believe that JDP is planning to publish a post explaining his proposal in more detail soon.
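For concreteness, here's one way the fingerprinting idea might be sketched (everything below is a toy stand-in, not the actual proposal): take candidate models that agree on-distribution, then probe them with out-of-distribution inputs and compare their outputs, distinguishing internal strategies that the training data alone cannot separate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "models" that behave identically on the training range [0, 1] but
# generalize differently out of distribution (toy stand-ins for policies
# implementing different internal strategies).
model_a = lambda x: x            # identity
model_b = lambda x: np.abs(x)    # agrees with identity only for x >= 0

in_dist = rng.uniform(0.0, 1.0, size=100)   # training-like inputs
ood = rng.uniform(-1.0, 0.0, size=100)      # out-of-distribution probes

# On-distribution behavior cannot tell the two strategies apart;
# the OOD probes fingerprint the difference.
agree_in = bool(np.allclose(model_a(in_dist), model_b(in_dist)))
agree_ood = bool(np.allclose(model_a(ood), model_b(ood)))

print(agree_in, agree_ood)
```

The hoped-for signature is exactly this pattern: agreement on-distribution, divergence on the OOD probes, with the divergence pattern serving as the "fingerprint" of the underlying strategy. Whether this scales to realistic models is precisely the open empirical question.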
This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today:
And I'll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.
That is all.
(Obviously there's a kinda superficial resemblance here to the phenomenon of "calling out" somebody else; I want to state outright that this is ...
...In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a soluti
From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.
To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exa...
My objection is mostly fleshed out in my other comment. I'd just flag here that "In other words, you have to do things the "hard way"--no shortcuts" assigns the burden of proof in a way which I think is not usually helpful. You shouldn't believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you'll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the on...
Damn. I actually think you might have provided the first clear pointer I've seen about this form of knowledge production, why and how it works, and what could break it. There's a lot to chew on in this reply, but thanks a lot for the amazing food for thought!
(I especially like that you explained the physical points and put links that actually explain the specific implication)
And I agree (tentatively) that a lot of the epistemology of science stuff doesn't have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.
Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard's questions were somewhat... skewed? ... relative to the way I normally think about the topic. I'm having some difficulty articulating the source of that skewness, so I'll start by talking about how I think the skewness relates to the conversation itself:
I interpreted Eliezer's remarks as basically attempting to engage with Richard's questions on the same level they were being asked--but I think his lack of ability to come up with comp...
Strong upvote, you're pointing at something very important here. I don't think I'm defending epistemic modesty, I think I'm defending epistemic rigour, of the sort that's valuable even if you're the only person in the world.
I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc).
Yes, this is corre...
Like, there's a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.
Could I ask you to say more about what you mean by "nontrivial predictions" in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory--as a specific candidate for a "strongly generalizing theory"--produces "non...
Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.
So, a bacteria needs a handful of different metabolic resources - most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths - e.g. it's useful to be able to turn A into B but also B into A, because sometimes the environment will have lots...
[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instea...
If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?
Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you'll get deception. The solution isn't to somehow “filter out the unwanted instrumental behavio...
It still doesn't seem to me like you've sufficiently answered the objection here.
I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.
What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed...
...So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, th
Similarly, the fact that they kept at it over and over through all the big improvements of DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be over-attached to a specific framing which they had trouble discarding.
I'm... confused by this framing? Specifically, this bit (as well as other bits like these)
...I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like
(Later added disclaimer: it's a good idea to add "I feel like..." before the judgment in this comment, so that you keep in mind that I'm talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Okay, so you're completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is Impossible or not. But I feel like you don't get how one-sided this debate is, and how misrepresented it is here (and generally on the AF).
Like nobody except EY and a bunch of core ...
Eliezer Yudkowsky
Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn't scale, I'd expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.
I think this view dovetails quite strongly with the view expressed in this comment by maximkazhenkov:
...Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscienc
The output of this process is something people have taken to calling Son-of-CDT; the problem (insofar as we understand Son-of-CDT well enough to talk about its behavior) is that the resulting decision theory continues to neglect correlations that existed prior to self-modification.
(In your terms: Alice and Bob would only one-box in Newcomb variants where Omega based his prediction on them after they came up with their new decision theory; Newcomb variants where Omega's prediction occurred before they had their talk would still be met with two-boxing, even ...
One particular example of this phenomenon that comes to mind:
In (traditional) chess-playing software, moves are generally selected using a combination of search and evaluation, where the search is (usually) some form of minimax with alpha-beta pruning, and the evaluation function is used to assign a value estimate to leaf nodes in the tree, which is then propagated to the root to select a move.
Typically, the evaluation function is designed by humans (although recent developments have changed that second part somewhat) to reflect meaningful features of che...
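The search-plus-evaluation structure described above can be sketched minimally (this is illustrative, not a real chess engine: the toy game tree and leaf values are made up, and a material-count-style evaluation is reduced to a lookup table):

```python
# Minimax with alpha-beta pruning over a toy game tree, with a
# hand-written static evaluation applied at leaf nodes - the structure
# used by traditional chess engines, minus all the chess.

def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Return the minimax value of `node`, pruning with alpha/beta."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)  # leaf: fall back to the static evaluation
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the opponent will never allow this line
        return value
    else:
        value = float("inf")
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                         True, children, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cutoff
        return value

# Toy game tree: positions are strings; leaves carry made-up evaluations.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
LEAF_VALUES = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}

def best_move(root):
    """Propagate leaf evaluations up the tree and pick the best child."""
    return max(
        TREE[root],
        key=lambda child: alphabeta(
            child, 1, float("-inf"), float("inf"), False,
            lambda n: TREE.get(n, []), lambda n: LEAF_VALUES.get(n, 0),
        ),
    )

print(best_move("root"))
```

The point of the example is the division of labor: the search machinery is fully general, while all the domain knowledge lives in the evaluation function handed to it, which is exactly the part that humans traditionally designed by hand.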
If it's read moral philosophy, it should have some notion of what the words "human values" mean.
GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase "human values", since moral philosophy is written by (confused) humans, and in human-written text the phrase "human values" is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.
If there's some kind of measure of "observer weight" over the whole mathematical universe, we might be already much larger than 1/3^^^3 of it, so the total utilitarian can only gain so much.
Could you provide some intuition for this? Naively, I'd expect our "observer measure" over the space of mathematical structures to be 0.
Here's an existing Nate!comment that I find reasonably...
I don't see why you can't just ask at each point in time "Which action would maximize the expected value of X?". It seems like asking once and asking repeatedly as new things happen in reality don't have particularly different properties.
More detailed comment
Paul noted:
[...]
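The "ask at each point in time" picture can be sketched as a toy loop (everything here is a hypothetical stand-in; in particular, `expected_value_of_X` is a made-up estimator, not anything from the discussion above):

```python
# Toy sketch: at each timestep, re-ask "which action maximizes the
# expected value of X?" given everything observed so far, rather than
# asking once up front.

def expected_value_of_X(history, action):
    # Stand-in estimator: prefer the action matching the latest observation.
    return 1.0 if history and action == history[-1] else 0.0

def act_each_step(observations, actions):
    """Re-run the argmax query at every timestep as new things happen."""
    history, chosen = [], []
    for obs in observations:
        history.append(obs)
        best = max(actions, key=lambda a: expected_value_of_X(history, a))
        chosen.append(best)
    return chosen

print(act_each_step(["left", "right"], ["left", "right"]))
```

The structural point is that each step is an independent one-shot query conditioned on the history so far, so repeated asking just is the one-shot question applied at each new state.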