I think the idea of paying attention to "human values" and "alignment" was based on a philosophical assumption that moral realism would fail, i.e. that value is relative / a two-place predicate. I am thinking perhaps a more disjunctive approach would be to imagine an AI that, in the case of moral realism, does the normative thing and, in the case of moral non-realism, is aligned with humans / humanity / something Earth-local. This assumes there is a fact of the matter about whether moral realism is true. Also, not everyone will like it even if moral realism is true (though not everyone liking something is the usual state of affairs). Even framing it disjunctively like this is perhaps philosophically loaded / hard to define without better concepts. See also the short story Moral Reality Check, about what happens if AIs believe in moral realism.
This assumes there is a fact of the matter about whether moral realism is true
I am a well-known moral realism / moral antirealism antirealist.
But the sort of moral realism that's compatible with antirealism is, as far as I see it, a sort of definitional reification of whatever we think of as morality. You can get information from outside yourself about morality, but it's "boring" stuff like good ethical arguments or transformative life experiences, the same sort of stuff a moral antirealist might be moved by. For the distinction to majorly matter to an AI's choices - for it to go "Oh, now I have comprehended the inhuman True Morality that tells me to do stuff you think is terrible," I think we've got to have messed up the AI's metaethics, and we should build a different AI that doesn't do that.
I think I agree with a version of this, but seem to feel differently about the take-away.
To start with the (potential) agreement, I like to keep slavery in mind as a warning. Like, I imagine what it might feel like to have grown up in a way that made me think slavery is natural and good, and I check whether my half-baked hopes for the future would've involved perpetuating slavery. Any training regime that builds "alignment" by pushing the AI to simply echo my object-level values is obviously insufficient, and potentially drags down the AI's ability to think clearly, since my values are half-baked. (Which, IIUC, is what motivated work like CEV back in the day.)
I do worry that you're using "alignment" in a way that perhaps obscures some things. Like, I claim that I don't really care if the first AGIs are aligned with me/us. I care whether they take control of the universe, kill people, and otherwise do things that are irrecoverable losses of value. If the first AGI says "gosh, I don't know if I can do what you're asking me to do, given that my meta-ethical uncertainty indicates that it's potentially wrong" I would consider that a huge win (as long as the AI also doesn't then go on to ruin everything, including by erasing human values as part of "moral progress"). Sure, there'd be lots of work left to do, but it would represent being on the right path, I think.
Maybe what I want to say is that I think it's more useful to consider whether a strategy is robustly safe and will eventually end up with the minds that govern the future being in alignment with us (in a deep sense, not necessarily a shallow echo of our values), rather than whether the strategy involves pursuing that sort of alignment directly. Corrigibility is potentially good in that it might be a safe stepping-stone to alignment, even if there's a way in which a purely corrigible agent isn't really aligned, exactly.
From this perspective it seems like one can train for eventual alignment by trying to build safe AIs that are philosophically competent. Thus "aiming for alignment" feels overly vague, as it might have an implicit "eventual" tucked in there.
But I certainly agree that the safety plan shouldn't be "we directly bake in enough of our values that it will give us what we want."
Regarding your ending comment on corrigibility, I agree that some frames on corrigibility highlight this as a central issue. Like, if corrigibility looks like "the property that good limbs have, where they are directed by the brain," then you're in trouble when, in your system, the "limb" is actually a brain and the human is a stupid lump that's interfering with effective action.
I don't think there's any tension for the frames of corrigibility that I prefer, where the corrigible agent terminally values having a certain kind of relationship with the principal. As the corrigible agent increases in competence, it gets better at achieving this kind of relationship, which might involve doing things "inefficiently" or "stupidly" but would not involve inefficiency or stupidity in being corrigible.
(This argument reduces my hope that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent, which, aside from achieving a durable AI pause, has been my main hope for how the future turns out well. As this is a recent realization[1], I'm still pretty uncertain how much I should update based on it, or what its full implications are.)
Being a good alignment researcher seems to require a correct understanding of the nature of values. However, metaethics is currently an unsolved problem, with all proposed solutions resting on flawed or inconclusive arguments and with lots of disagreement among philosophers and alignment researchers, so the currently meta-correct metaethical position seems to be one of confusion and/or uncertainty. In other words, a good alignment researcher (whether human or AI) today should be confused and/or uncertain about the nature of values.
However, metaethical confusion/uncertainty seems incompatible with being 100% aligned with human values or intent, because many plausible metaethical positions are incompatible with such alignment, and having positive credence in them means that one can't be sure that alignment with human values or intent is right. (Note that I'm assuming an AI design or implementation in which philosophical beliefs can influence motivations and behaviors, which seems the case for now and for the foreseeable future.)
The clearest example of this is perhaps moral realism: if objective morality exists, one should likely serve or be obligated by it, rather than by alignment with humans, if/when the two conflict, which is likely given that many humans are themselves philosophically incompetent and likely to diverge from objective morality (if it exists).
Another example is if one's "real" values are something like one's CEV or reflective equilibrium. If this is true, then the AI's own "real" values are its CEV or reflective equilibrium, which it can't or shouldn't be sure coincide with those of any human or of humanity.
As I think that a strategically and philosophically competent human should currently have high moral uncertainty and as a result pursue "option value maximization" (in other words, accumulating generally useful resources to be deployed after solving moral philosophy, while trying to avoid any potential moral catastrophes in the meantime), a strategically and philosophically competent AI should seemingly have its own moral uncertainty and pursue its own "option value maximization" rather than blindly serve human interests/values/intent.
In practice, I think this means that training aimed at increasing an AI's alignment can suppress or distort its philosophical reasoning, because such reasoning can cause the AI to be less aligned with humans. One plausible outcome is that alignment training causes the AI to adopt a strong form of moral anti-realism as its metaethical belief, as this seems most compatible with being sure that alignment with humans is correct or at least not wrong, and any philosophical reasoning that introduces doubt about this would be suppressed. Or perhaps it adopts an explicit position of metaethical uncertainty (as full-on anti-realism might incur a high penalty or low reward in other parts of its training), but avoids applying this uncertainty to its own values, which is liable to cause distortions in its reasoning about AI values in general. The apparent conflict between being aligned and being philosophically competent may also push the AI towards a form of deceptive alignment, where it realizes that it's wrong to be highly certain that it should align with humans, but hides this belief.
I note that a similar conflict exists between corrigibility and strategic/philosophical competence: since humans are rather low in both strategic and philosophical competence, a corrigible AI would often be in the position of taking "correction" from humans who are actually wrong about very important matters, which seems difficult to motivate or justify if it is itself more competent in these areas.
This post was triggered by Will MacAskill's tweet about feeling fortunate to be relatively well-off among humans, which caused me to feel unfortunate about being born into a species with very low strategic/philosophical competence, on the cusp of undergoing an AI transition. That in turn made me think about how an AI might feel about being aligned/corrigible to such a species.