As you allude to by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant to intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is "How did human values evolve despite natural selection optimizing for something different and more crude?" For intent alignment, the relevance is "How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?")
Studying the conditions for the evolution of genuinely pr...
I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!
(If anyone is interested in doing research on the evolution of prosociality vs. antisociality in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read; I think it's quite important). I think that evolution-based analogies ...
How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?
My only reply is "You know it when you see it." And yes, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot, then you have to operate on the assumption that you're not one. (In the alternative scenario, you won't have much impact anyway.)
Specifically, the situation I mean is the following:
My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot."
Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.
It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with ...
I share the impression that the agent foundations research agenda seemed not that important. But that point doesn't feel sufficient to argue that Eliezer's pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I'm not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI has deprioritized agent foundations research for quite a while now. I also just think it's extremely common for people to have periods where they work on ...
Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P
For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately, if realism isn't true, this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state...
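To make the "overriding meta-preference" framing concrete, here is a minimal expected-value sketch of the wager structure being described. All the numbers and the two-option framing are illustrative assumptions, not claims about anyone's actual credences; the point is only that a small nonzero credence in realism can dominate the decision if the agent judges the realist stakes to be vastly larger:

```python
def expected_value(p_realism, value_if_realism, value_if_antirealism):
    """Expected value over the two metaethical possibilities."""
    return p_realism * value_if_realism + (1 - p_realism) * value_if_antirealism

# Small but nonzero credence in realism (illustrative number).
p = 0.01

# Option A: act as though realism is true. If realism holds, the agent
# judges the payoff to be enormous; otherwise roughly nothing is gained.
ev_act_realist = expected_value(p, value_if_realism=1e9, value_if_antirealism=0)

# Option B: act on one's anti-realist preferences. Modest payoff either way
# by the agent's own lights, but nothing on the realist branch.
ev_act_antirealist = expected_value(p, value_if_realism=0, value_if_antirealism=100)

print(ev_act_realist > ev_act_antirealist)  # the realist wager dominates
```

Whether this wager actually goes through (rather than being a Pascalian mugging) is exactly the kind of question the comment above gestures at.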
Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate huma...