Instrumental convergence makes differences in values hard to notice, so there can be abudant examples of misalignment that remain unobtrusive. The differences only become a glaring problem with enough inequality of power when coercing or outright overwriting others becomes feasible (Fnargl only reaches the coercing stage, but not overwriting stage). Thus even differences in values between humans and randomly orthogonal AGIs can seem non-threatening until they aren't, the same as differences in human values can remain irrelevant for average urban dwellers.
Alignment on values becomes crucial given overwriting-level power differences, since a thought experiment with putting a human in that position predicts a decent chance of non-doom, even though it's not a good plan on its own. Conversely, unfettered superintelligent paperclip maximizers or Fnargls don't leave survivors. And since the world is very far from an equilibrium of abundant superintelligence, there are going to be unprecedented power differentials while it's settling. This makes the common sense impression of irrelevance of misalignment on values (being masked by instrumental convergence) misleading when it comes to AGI.
Thinking this through.
There's a lot of ways in which speedrunning is like paperclip maximisation: speedrunning doesn't contribute to society and further paperclips after we've produced a certain amount become useless.
I'm still confused by the analogy though. Because seems like a lot of people may do speedrunning for fun - but maybe you see it as more about status - while paperclip production isn't fun. I think this makes a difference though, as even though we don't want our society to produce absurd amounts of paperclips, we probably do want lots of niche ways to have fun.
Competitive paperclip maximization in a controlled setting sounds like it might be fun. The important thing is that it's one thing that's fun out of many things, and variety is important.
The paperclip maximizer is a thought experiment about a hypothetical superintelligent AGI that is obsessed with maximizing paperclips. It can be modeled as a utility-theoretic agent whose utility function is proportional to the number of paperclips in the universe. The Orthogonality Thesis argues for the logical possibility of such an agent. It comes in weak and strong forms:
This raises a number of questions:
A basic reason to expect the far future to be controlled by utility-maximizing agents is that utility theory is the theory of making tradeoffs under uncertainty, and agents that make plans far into the future are likely to make tradeoffs, since tradeoffs are necessary for their plans to succeed. They will be motivated to make tradeoffs leading to controlling the universe almost regardless of what U is, as long as U can only be satisfied by pumping the distant future into a specific part of the possibility space. Whether an agents seeks to maximize paperclips, minimize entropy, or maximize the amount of positive conscious experience, it will be motivated to, in the short term, cause agents sharing its values to have more leverage over the far future. This is the basic instrumental convergence thesis.
One example of approximately utility-maximizing agents we know about are biological organisms. Biological organisms model the world and have goals with respect to the world, which are to some degree resistant to wireheading (thus constituting environmental goals). They make tradeoffs to achieve these goals, which have correlation with survival and reproduction. The goals that end up likely for biological organisms to have will be (a) somewhat likely to arise from pre-existing processes such as genetic mutation, (b) well-correlated enough with survival and reproduction that an agent optimizing for these goals will be likely to replicate more agents with similar goals. However, these goals need not be identical with inclusive fitness to be likely goals for biological organisms. Inclusive fitness itself may be too unlikely to arise as a goal from genetic mutation and so on, to be a more popular value function than proxies for it.
However, there are a number of goals and values in the human environment that are not well-correlated with inclusive fitness. These are generally parts of social systems. Some examples include capacity at games such as sports, progress in a research field such as mathematics, and maximization of profit (although, this one is at least related to inclusive fitness in a more direct way than the others). Corresponding institutions which incentivize (generally human) agents to optimize for these goals include gaming/sports leagues, academic departments, and corporations.
It is quite understandable that goals well-correlated with inclusive fitness would be popular, but why would goals that are not well-correlated with inclusive fitness also be popular? Molgbug's Fnargl thought experiment might shed some light on this:
One issue with the Fnargl thought experiment is that, even with the power of death, Fnargl may lack the power to rule the world, since he relies on humans around him for information, and those humans have incentives to deceive him. However, this is an aside; one could modify the thought experiment to give Fnargl extensive surveillance powers.
The main point is that, by monomaniacally optimizing for gold, Fnargl rationally implements processes for increasing overall resources and efficient conversion between different resources, coherent tradeoffs between different resources, and a coherent system (including legalistic aspects and so on) so as to make these tradeoffs in a rational manner. This leads to a Fnargl-ruled civilization "succeeding" in the sense of having a strong material economy, high population, high ability to win wars, and so on. Molgbug asserts that Fnargl's interests are well-aligned with ours, which is more speculative; due to convergent instrumentality, Fnargl will implement the sort of infrastructure that rational humans would implement, although the implied power competition would reduce the level of alignment.
By whatever "success" metric for civilizations we select, it is surely possible to do better than optimizing for gold, as it is possible for an organism to gain more inclusive fitness by having values that are more well-aligned with inclusive fitness. But even a goal as orthogonal to civilizational success as gold-maximization leads to a great deal of civilizational success, due to civilizational success being a convergent instrumental goal.
Moreover, the simplicity and legibility of gold-maximization simplifies coordination compared to a more complex proxy for civilizational success. A Fnargl-ocracy can evaluate decisions (such as decisions related to corporate governance) using a uniform gold-maximization standard, leading to a high degree of predictability, and simplicity in prioritization calculations.
What real-world processes resemble Fnargl-ocracy? One example is Bitcoin. Proof-of-work creates incentives for maximizing a certain kind of cryptographic puzzle-solving. The goal itself is rather orthogonal to human values, but Bitcoin nonetheless creates incentives for goals such as creating computing machinery, which are human-aligned due to convergent instrumentality (additional manufacturing of computing infrastructure can be deployed to other tasks that are more directly human-aligned).
As previously mentioned, sports and gaming are popular goals that are fairly orthogonal to human values. Sporting incentivizes humans and groups of humans to become more physically and mentally capable, leading to more generally-useful fitness practices such as weight training, and agency-related mental practices, which people can learn about by listening to sports athletes and coaches. Board games such as chess incentivize practical rationality and general understanding of rationality, including AI-related work such as the Minimax algorithm, Monte-Carlo Tree Search, and AlphaGo. Bayesian probability theory was developed in large part to analyze gambling games. Speedrunning has led to quite a lot of analysis of video games and practice at getting better at these games, by setting a uniform standard by which gameplay runs can be judged.
Academic fields, especially STEM-type fields such as mathematics, involve shared, evaluable goals that are not necessarily directly related to human values. For example, number theory is a major subfield of mathematics, and its results are rarely directly useful, though progress in number theory, such as the proof of Fermat's last theorem, is widely celebrated. Number theory does, along the way, produce more generally-useful work, such as Peano arithmetic (and proof theory more generally), Gödel's results, and cryptographic algorithms such as RSA.
Corporations are, in general, supposed to maximize profit conditional on legal compliance and so on. While profit-maximization comes apart from human values, corporations are, under conditions of rule of law, generally incentivized to produce valuable goods and services at minimal cost. This example is less like a paperclip maximizer than the previous examples, as the legal and economic system that regulates corporations has been in part designed around human values. The simplicity of the money-maximization goal, however, allows corporations to make internal decisions according to a measurable, legible standard, instead of dealing with more complex tradeoffs that could lead to inconsistent decisions (which may be "money-pumpable" as VNM violations tend to be).
Some systems are relatively more loaded on human values, and less like paperclip maximizers. Legal systems are designed and elaborated on in a way that takes human values into account, in terms of determining which behaviors are generally considered prosocial and antisocial. Legal decisions form precedents that formalize certain commitments including trade-offs between different considerations. Religions are also designed partially around human values, and religious goals tend to be aligned with self-replication, by for example encouraging followers to have children, to follow legalistic norms with respect to each other, and to spread the religion.
The degree to which commonly-shared social goals can be orthogonal to human values is still, however, striking. These goals are a kind of MacGuffin, as Zvi wrote about:
By organizing around a MacGuffin (such as speedrunning), humans can coordinate around a shared goal, and make uniform decisions around this shared goal, which leads to making consistent tradeoffs in the domain related to this goal. The MacGuffin can, like gold-maximization, be basically orthogonal to human values, and yet incentivize instrumental optimization that is convergent with that of other values, leading to human value satisfaction along the way.
Adopting a shared goal has the benefit of making it easy to share perspective with others. This can make it easier to find other people who think similarly to one's self, and develop practice coordinating with them, with performance judged on a common standard. Altruism can have this effect, since in being altruistic, individual agents "erase" their own index, sharing an agentic perspective with others; people meeting friends through effective altruism is an example of this.
It is still important, to human values, that the paperclip-maximizer-like processes are not superintelligent; while they aggregate compute and agency across many humans, they aren't nearly as strongly superintelligent as a post-takeoff AGI. Such an agent would be able to optimize its goal without the aid of humans, and would be motivated to limit humans' agency so as to avoid humans competing with it for resources. Job automation worries are, accordingly, in part the worry that existing paperclip-maximizer-like processes (such as profit-maximizing corporations) may become misaligned with human welfare as they no longer depend on humans to maximize their respective paperclips.
For superintelligent AGI to be aligned with human values, therefore, it is much more necessary for its goals to be directly aligned with human values, even more than the degree to which human values are aligned with inclusive evolutionary fitness. This requires overcoming preference falsification, and taking indexical (including selfish) goals into account.
To conclude, paperclip-maximizer-like processes arise in part because the ability to make consistent, legible tradeoffs is a force multiplier. The paperclip-maximization-like goals (MacGuffins) can come apart from both replicator-type objectives (such as inclusive fitness) and human values, although can be aligned in a non-superintelligent regime due to convergent instrumentality. It is hard to have a great deal of influence over the future without making consistent tradeoffs, and already-existing paperclip-maximizer-like systems provide examples of the power of legible utility functions. As automation becomes more powerful, it becomes more necessary, for human values, to design systems that optimize goals aligned with human values.