Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. The author contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex argues that "robust grading" scheme based approaches are unlikely to work to develop AI alignment.
A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one. In this post, I argue that developers should consider satisfying such cheap-to-satisfy preferences as long as the AI isn’t caught behaving dangerously, if doing so doesn't degrade usefulness or substantially risk making the AI more ambitiously misaligned.
This looks like a good idea for surprisingly many reasons:
I signed an amicus brief supporting Anthropic's right to do business without governmental retaliation. As an AI expert, I attest that Anthropic's technical concerns are legitimate, and no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I'm proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
I get that a lot of people use "simulation" and "computer program" as basically synonyms, but that's a bit linguistically impoverished for the hard work of analyzing distinct metaphysical hypotheses and their consequences. Consider that ideal behavior is different in
a) computer worlds that are built to mimic an existing world, in which case our "job" is be similar to whatever is "out there" that we're a simulation of, versus
b) computer worlds that are built to be a de novo home for a new kind of life or being, such as for entertainment or exploratory scien...