
Orthogonality Thesis


Restating: for at least some agent architectures, it is not necessary for the agent to have an independent terminal value in its utility function for "do science" in order for it to do science effectively; it is only necessary for the agent to understand at least as well as we do why certain forms of investigation will produce knowledge that will be useful later (e.g. for paperclips). When you say, "Oh, well, it won't be interested in electromagnetism since it has no pure curiosity, it will only want to peer at paperclips in particular, so it will be at a disadvantage relative to more curious agents" you are postulating that you know a better operational policy than the agent does for producing paperclips, and an instrumentally efficient agent would know this as well as you do and be at no operational disadvantage due to its simpler utility function.
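As a toy illustration of this point (the policy names and numbers below are invented for the example, not taken from any source), a utility function that counts only paperclips can already prefer a "do science first" policy, with no curiosity term appearing anywhere:

```python
# Toy sketch: an agent whose utility counts only paperclips can still
# prefer "study electromagnetism first", because the knowledge raises
# expected paperclip output. No terminal "curiosity" term is needed.

def paperclip_utility(outcome: dict) -> float:
    """Utility that cares about expected paperclips and nothing else."""
    return outcome["expected_paperclips"]

# Hypothetical policies with hypothetical predicted outcomes.
policies = {
    "build_immediately": {"expected_paperclips": 1_000.0},
    "study_electromagnetism_then_build": {"expected_paperclips": 50_000.0},
}

best_policy = max(policies, key=lambda name: paperclip_utility(policies[name]))
print(best_policy)  # -> study_electromagnetism_then_build
```

The "study first" policy wins only because, by assumption, it leads to more paperclips; the instrumental value of investigation falls out of the paperclip count itself.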

The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.

Hume's Guillotine

Orthogonality can be seen as corresponding to a philosophical principle advocated by David Hume, whose phrasings included, "Tis not contrary to reason to prefer the destruction of the whole world to the scratching of my finger." In our terms: an agent whose preferences over outcomes score the destruction of the world more highly than the scratching of Hume's finger is not thereby impeded from forming accurate models of the world or from searching for policies that achieve various outcomes.

In modern terms, we'd say that Hume observed an apparent type distinction between is-statements and ought-statements:

"In every system of morality, which I have hitherto met with, I have always remarked, that the author proceeds for some time in the ordinary ways of reasoning, and establishes the being of a God, or makes observations concerning human affairs; when all of a sudden I am surprised to find, that instead of the usual copulations of propositions, is, and is not, I meet with no proposition that is not connected with an ought, or an ought not. This change is imperceptible; but is however, of the last consequence."

"It is sunny outside" is an is-proposition. It can potentially be deduced solely from other is-facts, like "The Sun is in the sky" plus "The Sun emits sunshine". If we now furthermore say "And therefore I ought to go outside", we've introduced a new type of sentence, which, Hume argued, cannot be deduced just from is-statements like "The Sun is in the sky" or "I am low in Vitamin D". Even if the prior ought-sentence seems to us very natural, or taken-for-granted, like "It is better to be happy than sad", there must (Hume argued) have been some prior assertion or rule which, if we write it down in words, will contain words like ought, should, better, and good.

Again translating Hume's idea into more modern form, we can see ought-sentences as special because they invoke some ordering that we'll designate <V. E.g. "It's better to go outside than stay inside" asserts "Staying inside <V going outside". Whenever we make a statement about one outcome or action being "better", "preferred", "good", "prudent", etcetera, we can see this as implicitly ordering actions and outcomes under this <V relation. We can temporarily put on hold the question of what sort of entity <V may be; but we can already go ahead and observe that some assertions, the ought-assertions, mention this <V relation; and other propositions just talk about the frequency of photons in sunlight.
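To make the type distinction concrete, here is a minimal sketch (the function names and the toy ordering are ours, purely illustrative): an is-proposition can be evaluated against a world model alone, while an ought-proposition needs the <V relation as an extra input:

```python
from typing import Callable

# ordering(a, b) means "a <V b", i.e. b is preferred to a.
Ordering = Callable[[str, str], bool]

def evaluate_is(world: dict, fact: str) -> bool:
    # An is-proposition needs only the world model (the simple facts).
    return bool(world.get(fact, False))

def evaluate_ought(better: Ordering, a: str, b: str) -> bool:
    # An ought-proposition ("b is better than a") cannot be computed
    # from the world model alone; it takes the <V relation as input.
    return better(a, b)

world = {"the Sun is in the sky": True, "it is sunny outside": True}
prefer_outside: Ordering = lambda a, b: (a, b) == ("staying inside", "going outside")

print(evaluate_is(world, "it is sunny outside"))                          # True
print(evaluate_ought(prefer_outside, "staying inside", "going outside"))  # True
```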

We could rephrase Hume's type distinction as observing that within the set of all propositions, we can separate out a core set of propositions that don't invoke <V, what we might call 'simple facts'. Furthermore, we can figure out simple facts just by making observations and considering other simple

...

The Orthogonality Thesis asserts that since these questions are not computationally intractable, it's possible to have an agent that tries to make paperclips without being paid, because paperclips are what it wants. The strong form of the Orthogonality Thesis says that there need be nothing especially complicated or twisted about such an agent.

The Orthogonality Thesis is a statement about computer science, an assertion about the logical design space of possible cognitive agents. Orthogonality says nothing about whether a human AI researcher on Earth would want to build an AI that made paperclips, or conversely, want to make a nice AI. The Orthogonality Thesis just asserts that the space of possible designs contains AIs that make paperclips. And also AIs that are nice, to the extent there's a sense of "nice" where you could say how to be nice to someone if you were paid a billion dollars to do that, and to the extent you could name something physically achievable to do.

This contrasts to inevitablist theses which might assert, for example:

The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there's no added difficulty in that cognition except whatever difficulty is inherent in the question "What policies would result in consequences with high U-scores?"
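A minimal sketch of what this claim looks like in code (the policies, predicted outcomes, and utility functions below are hypothetical placeholders): the search procedure is the same whichever U gets plugged in, so any added difficulty has to live inside predicting which outcomes score highly under U:

```python
from typing import Callable

Outcome = dict                   # predicted features of the world
U = Callable[[Outcome], float]   # an outcome-scoring function

def plan(predicted: dict, utility: U) -> str:
    """Generic U-search: pick the policy whose predicted outcome scores highest.
    Nothing in this procedure depends on which U was passed in."""
    return max(predicted, key=lambda policy: utility(predicted[policy]))

predicted_outcomes = {
    "policy_A": {"paperclips": 9.0, "human_flourishing": 1.0},
    "policy_B": {"paperclips": 1.0, "human_flourishing": 9.0},
}

paperclip_U: U = lambda o: o["paperclips"]
nice_U: U = lambda o: o["human_flourishing"]

print(plan(predicted_outcomes, paperclip_U))  # -> policy_A
print(plan(predicted_outcomes, nice_U))       # -> policy_B
```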

This could be restated as, "To whatever extent you (or a superintelligent version of you) could figure out how to get a high-U outcome if aliens offered to pay you huge amounts of resources to do it, the corresponding agent that terminally prefers high-U outcomes can be at least that good at achieving U." This assertion would be false if, for example, an intelligent agent that terminally wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic because it didn't have any such defects.

For purposes of stating Orthogonality's precondition, the "tractability" of the computational problem of U-search should be taken as including only the object-level search problem of computing external actions to achieve external goals. If there turn out to be special difficulties associated with computing "How can I make sure that I go on pursuing U?" or "What kind of successor agent would want to pursue U?" whenever U is something other than "be nice to all sapient life", then...


Orthogonality is not literally, absolutely universal because theoretically 'goals' can include such weird constructions as "Make paperclips for some terminal reason other than valuing paperclips" and similar such statements that require cognitive algorithms and not just results. To the extent that goals don't single out particular optimization methods, and just talk about paperclips, the Orthogonality claim should cover them.

The weak form of the Orthogonality Thesis says, "Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal." The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When specifying an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there's no added difficulty except whatever difficulty is inherent in the question "What policies would result in consequences with high U-scores?"

Since work on tiling agent designs hasn't halted, one may need to backpedal and modify this impossibility claim further as more efficient decision algorithms are invented. (Shouldn't one just give up at this point?)

While Orthogonality seems orthogonal to most traditional philosophical questions about metaethics, it does outright contradict some possible forms of moral internalism. For example, one could hold that by the very definition of rightness, knowledge of what is right must be inherently motivating to any entity that understands that knowledge. This is not the most common meaning of "moral internalism" held by modern philosophers, apparently, who instead seem to hold something like, "By definition, if I say that something is morally right, among my claims is that the thing is motivating to me." We haven't heard of a standard term for the position that, by definition, what is right must be universally motivating; we'll designate that here as "universalist moral internalism".


Epistemic status

Among people who've seriously delved into these issues and are aware of the more advanced arguments for Orthogonality, we're not aware of anyone who still defends "universalist moral internalism" as described above, and we're not aware of anyone who thinks that arbitrary sufficiently-real-world-capable AI systems automatically adopt human-friendly terminal values.

Paul Christiano has said (if we're quoting him correctly) that although it's not his dominant hypothesis, he thinks some significant probability should be awarded to the proposition that only some subset of tractable utility functions, potentially excluding human-friendly ones or those of high cosmopolitan value, can be stable under reflection in powerful bounded AGI systems; e.g. because only direct functions of sense data can be adequately supervised in internal retraining. (This would be bad news rather than good news for AGI alignment and long-term optimization of human values.)