
Orthogonality Thesis


Restating: for at least some agent architectures, it is not necessary for the agent to have an independent terminal value in its utility function for "do science" in order for it to do science effectively; it is only necessary for the agent to understand at least as well as we do why certain forms of investigation will produce knowledge that will be useful later (e.g. for paperclips). When you say, "Oh, well, it won't be interested in electromagnetism since it has no pure curiosity, it will only want to peer at paperclips in particular, so it will be at a disadvantage relative to more curious agents" you are postulating that you know a better operational policy than the agent does for producing paperclips, and an instrumentally efficient agent would know this as well as you do and be at no operational disadvantage due to its simpler utility function.
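As a toy illustration of this point (the policy names and numbers below are invented for the example, not taken from any source), a utility function that counts only paperclips can already prefer a "do science first" policy, with no curiosity term appearing anywhere:

```python
# Toy sketch: an agent whose utility counts only paperclips can still
# prefer "study electromagnetism first", because the knowledge raises
# expected paperclip output. No terminal "curiosity" term is needed.

def paperclip_utility(outcome: dict) -> float:
    """Utility that cares about expected paperclips and nothing else."""
    return outcome["expected_paperclips"]

# Hypothetical policies with hypothetical predicted outcomes.
policies = {
    "build_immediately": {"expected_paperclips": 1_000.0},
    "study_electromagnetism_then_build": {"expected_paperclips": 50_000.0},
}

best_policy = max(policies, key=lambda name: paperclip_utility(policies[name]))
print(best_policy)  # -> study_electromagnetism_then_build
```

The "study first" policy wins only because, by assumption, it leads to more paperclips; the instrumental value of investigation falls out of the paperclip count itself.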

The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.

Hume's Guillotine

Orthogonality can be seen as corresponding to a philosophical principle advocated by David Hume, whose phrasings included, "Tis not contrary to reason to prefer the destruction of the whole world to the scratching of my finger." In our terms: an agent whose preferences over outcomes score the destruction of the world more highly than the scratching of Hume's finger is not thereby impeded from forming accurate models of the world or from searching for policies that achieve various outcomes.

In modern terms, we'd say that Hume observed an apparent type distinction between is-statements and ought-statements:

"In every system of morality, which I have hitherto met with, I have always remarked, that the author proceeds for some time in the ordinary ways of reasoning, and establishes the being of a God, or makes observations concerning human affairs; when all of a sudden I am surprised to find, that instead of the usual copulations of propositions, is, and is not, I meet with no proposition that is not connected with an ought, or an ought not. This change is imperceptible; but is however, of the last consequence."

"It is sunny outside" is an is-proposition. It can potentially be deduced solely from other is-facts, like "The Sun is in the sky" plus "The Sun emits sunshine". If we now furthermore say "And therefore I ought to go outside", we've introduced a new type of sentence, which, Hume argued, cannot be deduced just from is-statements like "The Sun is in the sky" or "I am low in Vitamin D". Even if the prior ought-sentence seems to us very natural, or taken-for-granted, like "It is better to be happy than sad", there must (Hume argued) have been some prior assertion or rule which, if we write it down in words, will contain words like ought, should, better, and good.

Again translating Hume's idea into more modern form, we can see ought-sentences as special because they invoke some ordering that we'll designate <V. E.g. "It's better to go outside than stay inside" asserts "Staying inside <V going outside". Whenever we make a statement about one outcome or action being "better", "preferred", "good", "prudent", etcetera, we can see this as implicitly ordering actions and outcomes under this <V relation. We can temporarily put on hold the question of what sort of entity <V may be; but we can already go ahead and observe that some assertions, the ought-assertions, mention this <V relation; and other propositions just talk about the frequency of photons in sunlight.
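To make the type distinction concrete, here is a minimal sketch (the function names and the toy ordering are ours, purely illustrative): an is-proposition can be evaluated against a world model alone, while an ought-proposition needs the <V relation as an extra input:

```python
from typing import Callable

# ordering(a, b) means "a <V b", i.e. b is preferred to a.
Ordering = Callable[[str, str], bool]

def evaluate_is(world: dict, fact: str) -> bool:
    # An is-proposition needs only the world model (the simple facts).
    return bool(world.get(fact, False))

def evaluate_ought(better: Ordering, a: str, b: str) -> bool:
    # An ought-proposition ("b is better than a") cannot be computed
    # from the world model alone; it takes the <V relation as input.
    return better(a, b)

world = {"the Sun is in the sky": True, "it is sunny outside": True}
prefer_outside: Ordering = lambda a, b: (a, b) == ("staying inside", "going outside")

print(evaluate_is(world, "it is sunny outside"))                          # True
print(evaluate_ought(prefer_outside, "staying inside", "going outside"))  # True
```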

We could rephrase Hume's type distinction as observing that within the set of all propositions, we can separate out a core set of propositions that don't invoke <V, what we might call 'simple facts'. Furthermore, we can figure out simple facts just by making observations and considering other simple

...

The Orthogonality Thesis asserts that since these questions are not computationally intractable, it's possible to have an agent that tries to make paperclips without being paid, because paperclips are what it wants. The strong form of the Orthogonality Thesis says that there need be nothing especially complicated or twisted about such an agent.

The Orthogonality Thesis is a statement about computer science, an assertion about the logical design space of possible cognitive agents. Orthogonality says nothing about whether a human AI researcher on Earth would want to build an AI that made paperclips, or conversely, want to make a nice AI. The Orthogonality Thesis just asserts that the space of possible designs contains AIs that make paperclips. And also AIs that are nice, to the extent there's a sense of "nice" where you could say how to be nice to someone if you were paid a billion dollars to do that, and to the extent you could name something physically achievable to do.

This contrasts to inevitablist theses which might assert, for example:

The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there's no added difficulty in that cognition except whatever difficulty is inherent in the question "What policies would result in consequences with high U-scores?"
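A minimal sketch of what this claim looks like in code (the policies, predicted outcomes, and utility functions below are hypothetical placeholders): the search procedure is the same whichever U gets plugged in, so any added difficulty has to live inside predicting which outcomes score highly under U:

```python
from typing import Callable

Outcome = dict                   # predicted features of the world
U = Callable[[Outcome], float]   # an outcome-scoring function

def plan(predicted: dict, utility: U) -> str:
    """Generic U-search: pick the policy whose predicted outcome scores highest.
    Nothing in this procedure depends on which U was passed in."""
    return max(predicted, key=lambda policy: utility(predicted[policy]))

predicted_outcomes = {
    "policy_A": {"paperclips": 9.0, "human_flourishing": 1.0},
    "policy_B": {"paperclips": 1.0, "human_flourishing": 9.0},
}

paperclip_U: U = lambda o: o["paperclips"]
nice_U: U = lambda o: o["human_flourishing"]

print(plan(predicted_outcomes, paperclip_U))  # -> policy_A
print(plan(predicted_outcomes, nice_U))       # -> policy_B
```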

This could be restated as, "To whatever extent you (or a superintelligent version of you) could figure out how to get a high-U outcome if aliens offered to pay you huge amounts of resources to do it, the corresponding agent that terminally prefers high-U outcomes can be at least that good at achieving U." This assertion would be false if, for example, an intelligent agent that terminally wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic because it didn't have any such defects.

For purposes of stating Orthogonality's precondition, the "tractability" of the computational problem of U-search should be taken as including only the object-level search problem of computing external actions to achieve external goals. If there turn out to be special difficulties associated with computing "How can I make sure that I go on pursuing U?" or "What kind of successor agent would want to pursue U?" whenever U is something other than "be nice to all sapient life", then...


Orthogonality is not literally, absolutely universal because theoretically 'goals' can include such weird constructions as "Make paperclips for some terminal reason other than valuing paperclips" and similar such statements that require cognitive algorithms and not just results. To the extent that goals don't single out particular optimization methods, and just talk about paperclips, the Orthogonality claim should cover them.

The weak form of the Orthogonality Thesis says, "Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal." The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When specifying an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there's no added difficulty except whatever difficulty is inherent in the question "What policies would result in consequences with high U-scores?"

Since work on tiling agent designs hasn't halted, one may need to backpedal and modify this impossibility claim further as more efficient decision algorithms are invented. (Shouldn't one just give up at this point?)

While Orthogonality seems orthogonal to most traditional philosophical questions about metaethics, it does outright contradict some possible forms of moral internalism. For example, one could hold that by the very definition of rightness, knowledge of what is right must be inherently motivating to any entity that understands that knowledge. This is not the most common meaning of "moral internalism" held by modern philosophers, apparently, who instead seem to hold something like, "By definition, if I say that something is morally right, among my claims is that the thing is motivating to me." We haven't heard of a standard term for the position that, by definition, what is right must be universally motivating; we'll designate that here as "universalist moral internalism".


Epistemic status

Among people who've seriously delved into these issues and are aware of the more advanced arguments for Orthogonality, we're not aware of anyone who still defends "universalist moral internalism" as described above, and we're not aware of anyone who thinks that arbitrary sufficiently-real-world-capable AI systems automatically adopt human-friendly terminal values.

Paul Christiano has said (if we're quoting him correctly) that although it's not his dominant hypothesis, he thinks some significant probability should be awarded to the proposition that only some subset of tractable utility functions, potentially excluding human-friendly ones or those of high cosmopolitan value, can be stable under reflection in powerful bounded AGI systems; e.g. because only direct functions of sense data can be adequately supervised in internal retraining. (This would be bad news rather than good news for AGI alignment and long-term optimization of human values.)