Extremely underrated post; I'm sorry I only skimmed it when it came out.
I found 3a, 3b, and 3c to be strong and well written, and a good representation of my view.
In contrast, I found 3d to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization; to me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping-back process if initial attempts fail. In particular if attempted pathwa...
To me it seems like one important application of this work is understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.
Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?
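Concretely, the kind of thing I have in mind is the toy sketch below (not the actual logical-inductor trader formalism; `base_trader` and the linear buy/sell rule are made up for illustration):

```python
# Toy sketch of "duplicate traders whose trades are zero-sum" (hypothetical
# names; this is not the real Logical Induction machinery).

def base_trader(prices):
    """Made-up trader: buy sentences priced below 1/2, sell those priced above."""
    return {s: 0.5 - p for s, p in prices.items()}

def mirror_trader(prices):
    """Duplicate that takes exactly the opposite position on every sentence."""
    return {s: -q for s, q in base_trader(prices).items()}

prices = {"phi": 0.3, "psi": 0.7}
net = {s: base_trader(prices)[s] + mirror_trader(prices)[s] for s in prices}
assert all(q == 0 for q in net.values())  # the pair's combined trades are zero-sum
```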
I'm generally confused about this. Do you have thoughts?
What task? All the tasks I know of that are sufficient to reduce x-risk are really hard.
I think the term is very reasonable and basically accurate, even more so with regard to most RL methods. It's a good way of describing a training process without implying that the evolving system will head toward optimality deliberately. I don't know a better way to communicate this succinctly, especially while not being specific about what local search algorithm is being used.
Also, evolutionary algorithms can be used to approximate gradient descent (with noisier gradient estimates), so it's not unreasonable to use similar language about both.
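As a rough sketch of what I mean, an evolution-strategies-style estimator recovers the gradient of a toy loss up to noise (the quadratic loss and the constants here are just for illustration):

```python
import numpy as np

def loss(w):
    return float(np.sum((w - 3.0) ** 2))  # toy quadratic bowl

w = np.zeros(4)
sigma, n = 0.1, 2000

# ES-style estimate: perturb, weight each perturbation by the change in loss,
# and average. This approximates the analytic gradient, just noisily.
eps = np.random.randn(n, w.size)
deltas = np.array([loss(w + sigma * e) - loss(w) for e in eps])
es_grad = (eps * deltas[:, None]).mean(axis=0) / sigma

analytic_grad = 2 * (w - 3.0)
print(es_grad)        # noisy, roughly [-6, -6, -6, -6]
print(analytic_grad)  # exactly [-6., -6., -6., -6.]
```

The estimate gets noisier as the dimension grows, which is the "noisier gradient estimates" caveat.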
I'm not a huge fan of the way you imply that it was chosen for rhetorical purposes.
I read about half of this post when it came out. I didn't want to comment without reading the whole thing, and reading the whole thing didn't seem worth it at the time. I've come back and read it because Dan seemed to reference it in a presentation the other day.
The core interesting claim is this:
...My conclusion will be that most of the items on Bostrom's laundry list are not 'convergent' instrumental means, even in this weak sense. If Sia's desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival...
Thanks for clarifying; I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links, I hadn't read those, and I appreciate the pseudocode.
I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.
My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)?
And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?
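To make my guess concrete, here's a toy contrast (the `grader`, the plan space, and the "good enough" threshold are all made up, and I realize option (1) has more shard-related structure than a fixed threshold):

```python
import random

def grader(plan):
    # Made-up grader with an exploitable edge case far outside the intended range.
    return 100.0 if plan > 1e6 else -abs(plan - 10.0)

plans = [random.uniform(0.0, 20.0) for _ in range(1000)] + [1e9]

# Grader-optimizer: argmax over plans, so it finds and exploits the edge case.
grader_optimizer_pick = max(plans, key=grader)            # -> 1e9

# Softer optimizer: take the first plan the grader rates "good enough".
soft_pick = next(p for p in plans if grader(p) > -2.0)    # -> something near 10
```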
I'd be curious why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing, at least for large-scale thinking.