I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.
My life is a relentless sequence of exercises in importance sampling and counterfactual analysis
This made me laugh out loud :P
...If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?
Overall, I generally agree with the intentional stance as an...
What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.
...
- (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
- That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior...
To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
Probably something like the last one, although I think "even in principle" is doing so...
(Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")
All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.
I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like...
Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly speed up my work.
I'm not saying you can't reason under the assumption of a Cartesian boundary, I'm saying the results you obtain when doing so are of questionable relevance to reality, because...
I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p
You might find Joscha Bach's view interesting...
I didn't really take the time to try and define "mesa-objective" here. My definition would be something like this: if we took long enough, we could point to places in the big NN (or whatever) which represent goal content, similarly to how we can point to reward systems (/ motivation systems) in the human brain. Messing with these would change the apparent objective of the NN, much like messing with human motivation centers.
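To make that a bit more concrete, here's a toy sketch of the kind of intervention I have in mind (the tiny policy net, the layer choice, and the "goal units" are all hypothetical, purely for illustration): if some internal directions really do encode goal content, then perturbing them at inference time should change which outcomes the policy appears to steer toward, analogously to messing with a motivation center.

```python
# Illustrative sketch only: a made-up policy network and made-up "goal units".
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),   # hidden layer whose activations we intervene on
    nn.Linear(16, 3),              # logits over 3 candidate actions
)

obs = torch.randn(1, 4)
baseline_action = policy(obs).argmax(dim=-1)

GOAL_UNITS = [2, 5, 11]  # hypothetical units suspected of encoding goal content

def ablate_goal_units(module, inputs, output):
    patched = output.clone()
    patched[:, GOAL_UNITS] = 0.0   # knock out the suspected goal representation
    return patched                 # returning a tensor replaces the layer's output

handle = policy[1].register_forward_hook(ablate_goal_units)
ablated_action = policy(obs).argmax(dim=-1)
handle.remove()

print(baseline_action.item(), ablated_action.item())
```

For any single observation the chosen action may or may not change, of course; the interesting test is whether the apparent objective shifts systematically across many observations when you intervene on those units, the way messing with human motivation centers shifts apparent goals.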
This sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been e...
...I haven't engaged that much with the anti-EU-theory stuff, but my experience so far is that it usually involves a pretty strict idea of what is supposed to fit EU theory, and often, misunderstandings of EU theory. I have my own complaints about EU theory, but they just don't resonate at all with other people's complaints, it seems.
For example, I don't put much stock in the idea of utility functions, but I endorse a form of EU theory which avoids them. Specifically, I believe in approximately coherent expectations: you assign expected values to events, and...
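To gesture at what I mean by coherence, here is one constraint I'd want (my own formulation, a sketch of one condition rather than the full view): for an event $A$ carved into disjoint sub-events $A_i$,

$$\mathbb{E}[V \mid A] \;\approx\; \sum_i P(A_i \mid A)\,\mathbb{E}[V \mid A_i],$$

i.e. the value I expect given $A$ should roughly be the probability-weighted average of the values I expect given the ways $A$ could be realized. Nothing in that requires a global utility function over fully-specified worlds.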
I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense.
Same, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models by influencing the training objective/reward, the model's inductive biases, the environments they're trained in, some combination of these, etc.?
...These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do what you want.
Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)
This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective.
I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless...
...The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can't afford to expose our agent to every possible environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent's behavior under that subset of environments, and those count as valid behavioral objectives.
The key here is that the set of allowed m...
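Here's a toy illustration of the definition I'm gesturing at (the setup is entirely made up for this comment): treat each environment as a menu of outcomes, the agent's behavior as the outcome it picks, and count a candidate objective as a valid behavioral objective iff the observed choices maximize it on every environment in the observed subset.

```python
# Toy sketch: behavioral objectives as the set of candidate objectives
# consistent with behavior on an observed *subset* of environments.
from itertools import permutations

OUTCOMES = ["a", "b", "c"]

# Hypothetical observed behavior: in {a, b} the agent picks a;
# in {a, c} it picks a.  The environment {b, c} was never observed.
observed = {frozenset({"a", "b"}): "a",
            frozenset({"a", "c"}): "a"}

def consistent(utility):
    """Does the agent's choice maximize this utility in every observed environment?"""
    return all(choice == max(env, key=utility.get)
               for env, choice in observed.items())

# Candidate objectives: all strict rankings of the three outcomes.
candidates = [dict(zip(OUTCOMES, perm)) for perm in permutations(range(3))]
behavioral_objectives = [u for u in candidates if consistent(u)]

# Both a>b>c and a>c>b survive: the observed subset of environments
# underdetermines the objective, which is the point being made above.
print(behavioral_objectives)
```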
which stems from the assumption that you are able to carve a world up into an agent and an environment and place the "same agent" in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment.
I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agent...
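Here's a minimal sketch of what I mean (the class and function names are mine, nothing standard): the carving just is the interface. The environment exposes reset/step, the agent exposes act, and putting the "same agent" in a different environment is literally passing the same object to a different step loop.

```python
# Minimal agent/environment carving in the RL style (toy example).
import random

class RandomAgent:
    def act(self, observation):
        return random.choice([-1, +1])

class WalkEnv:
    """Toy environment: random walk on a line; reward for reaching +3."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos += action
        done = abs(self.pos) >= 3
        reward = 1.0 if self.pos >= 3 else 0.0
        return self.pos, reward, done

def rollout(agent, env, max_steps=100):
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent.act(obs))
        total += reward
        if done:
            break
    return total

# The "same agent" in a different environment is just rollout(agent, OtherEnv()).
print(rollout(RandomAgent(), WalkEnv()))
```

Whether that interface boundary tracks anything deep about the underlying physics is a separate question, but the carving itself is perfectly well-defined and is what the RL formalism runs on.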
However, we could instead define "intent alignment" as "the optimal policy of the mesa objective would be good for humans".
I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people...
So, for example, this claims that either intent alignment + objective robustness or outer alignment + robustness would be sufficient for impact alignment.
Shouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”?
Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.
To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (complex repres... (read more)