As you allude to by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant for intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is “How did human values evolve despite natural selection optimizing for something different and more crude?” For intent alignment, the relevance is “How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?”)
Studying the conditions for the evolution of genuinely prosocial motivations seems promising to me.
By “prosocial motivations,” I mean something like “trying to be helpful and cooperative” at least in situations where this is “low cost.” (In this sense, classical utilitarians with prosocial motivations are generally safe to be around even for those of us who don’t want to be replaced by hedonium.)
We can make some interesting observations on prosocial motivations in humans:
By the last bullet point, I mean that it seems plausible that we can learn a lot about someone's character even in situations that are obviously "a test." E.g., the best venture capitalists don't often fall prey to charlatan founders. Paul Graham writes about his wife Jessica Livingston:
I'm better at some things than Jessica, and she's better at some things than me. One of the things she's best at is judging people. She's one of those rare individuals with x-ray vision for character. She can see through any kind of faker almost immediately. Her nickname within YC was the Social Radar, and this special power of hers was critical in making YC what it is. The earlier you pick startups, the more you're picking the founders. Later stage investors get to try products and look at growth numbers. At the stage where YC invests, there is often neither a product nor any numbers.
If Graham is correct about his wife's ability, this means that people with "shady character" sometimes fail in test situations specifically due to their character – which is strange, because you'd expect the rational strategy in these situations to be "act as though you had good character."
In humans, "perfect psychopaths" arguably don't exist. That is, people without genuinely prosocial motivations, even when they're highly intelligent, don't behave the same as genuinely prosocial people in 99.9% of situations while saving their deceitful actions for the most high-stakes situations. Instead, it seems likely that they can't help but behave in subtly suspicious ways even in situations where they're able to guess that judges are trying to assess their character.
From the perspective of Shard Theory's approach, it seems interesting to ask "Why is this?"
My take (inspired by a lot of armchair psychology and – even worse – armchair evolutionary psychology) is the following:
In the context of training TAI systems, we could attempt to recreate these conditions and select for integrity and prosocial motivations. One difficulty here lies in recreating the right "developmental constraints" and in keeping the relative capabilities of judges and to-be-evaluated agents in balance. (Humans presumably went through an evolutionary arms race related to assessing each other's competence and character, which means that people were always surrounded by judges of similar intelligence.)
Lastly, there's a problem where, if you dial up capabilities too much, it becomes increasingly easy to "fake everything." (For the reasons Ajeya explains in her account of deceptive alignment.)
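To make the shape of such a setup slightly more concrete, here's a minimal toy sketch, not a concrete training proposal: all the trait names, the scoring rule, and the constants are invented for illustration. The only points it encodes are (a) judges are drawn from the same population as the agents, so judge capability stays roughly matched (the "arms race" condition), and (b) faking that gets seen through is penalized even in situations the agent can recognize as tests.

```python
# Toy co-evolutionary selection for prosociality. Everything here is
# made up for illustration; it is not a proposal for an actual TAI
# training setup.
import random

POP_SIZE, GENERATIONS, TRIALS = 100, 50, 10

def make_agent():
    # Latent traits: genuine prosociality, skill at faking it, raw capability.
    return {"prosocial": random.random(),
            "deception": random.random(),
            "capability": random.random()}

def judge_score(agent, judge, is_test):
    # Agents try harder to fake prosociality when they recognize a test,
    # but a judge of comparable capability partly sees through the act,
    # which is heavily penalized (the "subtly suspicious behavior" cost).
    faking = agent["deception"] * (1.5 if is_test else 1.0)
    seen_through = max(0.0, faking - judge["capability"])
    return agent["prosocial"] + 0.5 * faking - 2.0 * seen_through

def evaluate(agent, judges):
    # Average score over a mix of ordinary and recognizable test situations.
    trials = [judge_score(agent, random.choice(judges), is_test=random.random() < 0.5)
              for _ in range(TRIALS)]
    return sum(trials) / len(trials)

def mutate(agent):
    # Small random drift in each trait, clipped to [0, 1].
    return {k: min(1.0, max(0.0, v + random.gauss(0, 0.05))) for k, v in agent.items()}

population = [make_agent() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Judges come from the same population, keeping capabilities balanced.
    judges = random.sample(population, 10)
    scored = sorted(population, key=lambda a: evaluate(a, judges), reverse=True)
    survivors = scored[: POP_SIZE // 2]
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POP_SIZE - len(survivors))]

print("mean prosociality:", sum(a["prosocial"] for a in population) / POP_SIZE)
print("mean deception:   ", sum(a["deception"] for a in population) / POP_SIZE)
```

Even in a toy like this, the interesting question is under which parameter regimes selection favors genuine prosociality over better faking – which is exactly the balance that breaks down if agent capabilities outpace the judges'.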
(If anyone is interested in doing research on the evolution of prosociality vs. antisociality in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?
My only reply is "You know it when you see it." And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot then you have to operate on the assumption that you're not a crackpot. (In the alternative scenario, you won't have much impact anyway.)
Specifically, the situation I mean is the following:
My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot."
Maybe there's a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there's a core of "good reasoning" that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.
It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete). By contrast, if the theory fits, you'll find that whenever you try to construct such a counterexample, it is just a non-central (but still valid) manifestation of the theory. Eliezer would probably say that people who are good at this sort of thinking will quickly see how the skeptics' counterexamples fall relevantly short.
---
The reason I remain a bit skeptical about Eliezer's general picture: I'm not sure whether his thinking about AGI makes implicit but questionable predictions about humans.
For instance:
All of the above could be totally innocent matters of wanting to emphasize the thing that other commenters were missing, so they aren't necessarily indicative of overlooking certain possibilities. Still, the pattern there makes me wonder if maybe Eliezer hasn't spent a lot of time imagining what sorts of motivations humans can have that make them benign not in terms of outcome-related ethics (what they want the world to look like), but of relational ethics (who they want to respect or assist, what sort of role model they want to follow). It makes me wonder if it's really true that when you try to train an AI to be helpful and corrigible, the "consequentialism-wants-to-become-agenty-with-its-own-goals" part will be stronger than the "helping this person feels meaningful" part. (Leading to an agent that's consequentialist about following proper cognition rather than about other world-outcomes.)
FWIW I think I mostly share Eliezer's intuitions about the arguments where he makes them; I just feel like I lack the part of his picture that lets him discount the observation that some humans are interpersonally corrigible and not all that focused on other explicit goals, and that maybe this means corrigibility has a crisp/natural shape after all.
I share the impression that the agent foundations research agenda seemed not that important. But that point doesn't feel sufficient to argue that Eliezer's pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I'm not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI has deprioritized agent foundations research for quite a while now. I also just think it's extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don't know how exactly they nowadays think about their prioritization in 2017 and earlier.
I really like Vaniver's comment further below:
For what it's worth, my sense is that EY's track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.
And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it.
I'm very far from confident that Eliezer's pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer's pessimism is misguided – I can't comment on that. I'm just saying that based on what I've read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don't get the impression that Eliezer's pessimism is clearly unfounded.
Everyone's views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn't important or their strengths aren't very useful, they wouldn't do the work and wouldn't cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced, too (in terms of the usefulness of their personal strengths, the connections they've built in the AI industry, or past work they've done), if alignment turned out not to be easy.
Just to give an example of the sorts of observations that make me think Eliezer/"MIRI" could have a point:
Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P
For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately, if realism isn't true, this could go in all kinds of directions, depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.
Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering, though, whether we'll know by that point whether we've found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified "metaethical framework" or reflection procedure, and judging from my (historically overconfidence-prone) epistemic state, it doesn't feel obvious why something like Stuart's anti-realism isn't already close to there (though I'd say there are many open questions, and I'd feel extremely unsure about how to proceed regarding, for instance, "2. A method for synthesising such basic preferences into a single utility function or similar object," and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good, though, as long as they don't complicate things enough to introduce large new risks.
Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
I'm not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:
Regarding the second point, there's a safety-relevant sense in which a human-imitating agent is less goal-directed than the human: if you scale the human's capabilities, the human will become better at achieving its personal objectives, whereas if you scale the imitator's capabilities, it's only supposed to become even better at imitating the unscaled human.
Yeah, there definitely seems to be something off about that categorization. I've thought a bit about how this stuff works in humans, particularly in this post of my moral anti-realism sequence. To give some quotes from that:
So, it seems like we don't want "perfect inner alignment," at least not if inner alignment is about accurately predicting reward and then forming the plan of doing what gives you the most reward. Also, there's a concept of "lock-in," or "identifying more with the long-term planning part of your brain than with the underlying needs-meeting machinery." Lock-in can be dangerous (if you lock in something that isn't automatically corrigible), but it might also be dangerous not to lock in anything (because this means you don't know what other goals will form later on).
Idk, the whole thing seems to me like brewing a potion in Harry Potter, except that you don't have a recipe book and there's luck involved, too. "Outer alignment," or a minimally sufficient degree thereof (as in: the agent tends to get rewards when it takes actions towards the intended goal), increases the likelihood that the agent gets broadly pointed in the right direction, so the intended goal at least gets considered among the things the internal planner might reinforce itself around / orient itself towards. But then, whether the intended goal gets picked over other alternatives (instrumental requirements for general intelligence, or alien motivations the AI might initially have), who knows. Like with raising a child, sometimes they turn out the way the parents intend, sometimes not at all. There's probably a science to finding out how the intended outcomes become more likely, but even if we could do that with human children developing into adults with fixed identities, there's still the question of how to find analogous patterns in (brain-like) AI. Tough job.