As it happens, I think this is a rather important topic. Failure to consider and mitigate the risk of assumptions creates both false negative (less concerning) and false positive (most concerning) risks when attempting to build aligned AI.
Newtonian: complex reactions
So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.
Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes? That's what the "complex reactions" and some of the references kind of point at, but then in the description you seem to be talking more about a specific case: Strong optimisation will always find a path if it exists, so patching some but not all paths isn't useful, and in fact could have weird counter-productive effects if the remaining paths that the strong optimisation takes are actually worse in some other ways than the ones you patched.
Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.
Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes?
This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.
Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.
Someone at Conjecture proposed linear too, but Newtonian physics isn't linear. Although I agree that the sort of behavior and reaction I'm pointing out fit within the "non-linear" category.
This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.
This seems to me that you want a word for whatever the opposite of complex/chaotic systems are, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?
This post is part of the work done at Conjecture.
Nothing hampers Science and Engineering like unchecked assumptions.
As a concrete example of a field ridden with hidden premises, let's look at sociology. Sociologists must deal with the feedback of their object of study (people in social situations), their own social background, as well as the myriad folk sociology notions floating in the memesphere. You might think that randomized surveys and statistics give you objective knowledge of the sociological world, but these tools also come with underlying assumptions: for example, that the phenomenon under study does not depend on the fine structure of the social network. If you don’t realize this, you will confidently misinterpret the results without considering the biases of your approach, as in asking kids to sort their play activities into three categories you defined in advance, and then seeing this as a “validation” of the classification.
How to avoid these mistakes? Epistemological vigilance, answer Pierre Bourdieu, Jean-Claude Chamboredon, and Jean-Claude Passeron in "Le métier de sociologue". They borrow the term from French philosopher of science Gaston Bachelard, to capture the attitude of always making explicit and questioning the assumptions behind notions, theories, models, and experiments. Naive sociologists err because they fail to maintain the restless epistemological vigilance that their field requires.
Alignment, like sociology, demands a perpetual questioning of unconscious assumptions. This is because the alignment problem, and the way we know about it, goes against some of our most secure, obvious, and basic principles about knowledge and problem-solving. Thus we need constant vigilance to keep these principles from sprouting again unnoticed and steering our work away from alignment.
In this post I thus make these assumptions explicit, and discuss why we have to be epistemologically vigilant about them.[1] Taken separately, none of these calls to vigilance is specific to alignment; other fields fostered it first. What makes alignment unique is the combined undermining of all these assumptions together. Alignment researchers just can't avoid the epistemological struggle.
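Here is my current list:[2]
Boundedness: the forces, pressures, or capabilities we have to deal with stay within known bounds;
Direct access: the phenomenon can be probed and studied directly;
Iterability: ideas can be tried out, and failures iterated on;
Relaxed ergodicity: the future evolves in ways predictable from the current possibilities;
Closedness: the phenomenon can be isolated from the rest of the world;
Newtonian:[3] the result of an intervention can be predicted from the external forces alone.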
What I'm highlighting here is the need for epistemological vigilance on all these fronts. You don't have to accept the issues, just to grapple with them. If you think that one of these assumptions does hold, that's a great topic for discussion and debate. The failure mode I'm tracking is not to debate the assumptions; it's to not even consider them, while they steer us unchecked.
Thanks to Connor Leahy and TJ for discussions on these ideas. Thanks to Connor Leahy and Sid Black for feedback on a draft.
Digging into the assumptions
Boundedness: never enough
Engineers work within bounds. When you design a bridge, a software security system, a building, or a data center, what matters are the reasonable constraints on what you need to deal with: how much force, how much compute in an attack, how much temperature variation. This leads to bounds on the range of pressures and forces one has to deal with.
Such bounds ease the design process tremendously, by removing the requirement to scale forever. As an example, most cryptographic guarantees come from assuming that the attacker is only using polynomial-time computations.[4]
Yet what happens when you don’t have bounds? Alignment is in such a state right now, without bounds on the amount of optimization that the AIs will be able to do — that is, on their ability to figure things out and change the world. Physics constrains them, but with the loosest bounds possible — not much to leverage.
Unboundedness overhauls the design space. Now you have to manage every possible amount of force/pressure/optimization. Just imagine designing a security system to resist arbitrary computable attacks; none of the known cryptographic primitives we love and use would survive such a challenge.
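To make the role of the bound concrete, here is a minimal sketch of a toy "lock" attacked by exhaustive search. The names `brute_force` and `make_lock`, the 16-bit key, and the budget of 1,000 guesses are all hypothetical choices for illustration; this is not a model of any real cryptosystem, only of how a guarantee can rest entirely on the attacker's budget.

```python
from itertools import product
from typing import Callable, Optional, Tuple

Key = Tuple[int, ...]

def make_lock(secret: Key) -> Callable[[Key], bool]:
    """Hypothetical lock: the checker only reveals whether a guess is exact."""
    return lambda guess: guess == secret

def brute_force(check: Callable[[Key], bool], key_bits: int,
                budget: Optional[int]) -> Optional[Key]:
    """Try every key in order; give up after `budget` guesses (None = no bound)."""
    for tries, candidate in enumerate(product((0, 1), repeat=key_bits), start=1):
        if budget is not None and tries > budget:
            return None  # the bounded attacker ran out of compute
        if check(candidate):
            return candidate
    return None

secret = (1,) * 16            # the last key tried by an in-order search
lock = make_lock(secret)

bounded = brute_force(lock, key_bits=16, budget=1_000)
unbounded = brute_force(lock, key_bits=16, budget=None)
print("bounded attacker found key:  ", bounded is not None)
print("unbounded attacker found key:", unbounded == secret)
```

The lock is "secure" only relative to the budget: nothing about the lock changes between the two calls, yet the guarantee disappears once the bound does.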
That being said, some fields study such unbounded problems. Distributed computing theory is one, where asynchronous systems lack any bound on how long a message takes, or on the relative speed of different processes. Theoretical computer science in general tackles unboundedness in a bunch of settings (asynchronous distributed algorithms, worst-case complexity…), because modeling the exact situations in which algorithms will be used is hard, and so computer scientists aim for the strongest possible guarantees.
Epistemological vigilance for boundedness requires that we either:
find a solution that works in the unbounded setting;
find relevant and small enough bounds on capabilities and solve for this bounded setting;
or enforce such a bound on capabilities and solve for this bounded setting.
A big failure mode here is to just assume a bound that lets you prove something, when it’s not the first step to one of the three approaches above. We’re not trying to find versions of the problem that are easy to solve; we’re trying to solve the problem we expect to face. It’s easy to find a nice solution for a bounded setting, and simply convince yourself that the bound will hold and you will be fine. But this is not an argument, just a wish.
Direct access: so far and yet so close
If you study fluids, their physical existence helps a lot. Similarly with heat, brains, chemical substances, institutions, and computers. Your direct access to the phenomenon you’re studying lets you probe it in myriad ways and check for yourself whether your models and theories apply. You can even amass a lot of data before making a theory.
Last time I checked, we still lacked an actual AGI, or really any way of strongly optimizing the world to the extent we worry about in alignment. So alignment research is banned from the fertile ground of interacting with the phenomenon itself. Which sucks.
It is not at all the only field of research that suffers from this problem, though: all historical sciences (evolutionary biology, geology, archaeology...) deal with it too, because their objects of study are often past events that cannot be accessed directly, witnessed, or recreated.
Most people involved in alignment acknowledge this, even when they don't agree with the rest of this list. Indeed, lack of direct access is regularly used as an argument to delay working on AGI alignment and focus instead on current systems and capabilities. That is, waiting for actual AGI or strong optimizing systems to be developed before studying them.
The problem? This proposal fails to be vigilant about the next assumption, the ability to iterate.
Iterability: don't mess it up
One thing that surprised me when reading about the Moon missions and the Apollo program is how much stuff broke all the time. The Saturn V engines pogoed, the secondary engines blew up, seams evaporated, and metal sheets warped under the (simulated) ridiculous temperature gradients of outer space. How did they manage to send people to the Moon and back alive in these conditions? Factoring out a pinch of luck, hardcore iteration. Everything was tested in as many conditions as possible, and iterated on until it didn’t break after extensive stress-tests.[5]
This incredible power of iteration can be seen in many fields where new problems need to be solved, from space engineering to drug design. When you don't know, just try out ideas and iterate. Fail faster, right?
Yet once again, alignment can’t join in on the fun, because massive misguided optimization of the world doesn’t lend itself to a second try. If you fail, you risk game over. So epistemological vigilance tells us to either solve the problem before running the system (before iterating) or find guarantees on safety when iterating with massive amounts of optimization (which is almost the same thing as actually solving the problem).
This “you can’t get it wrong” property doesn’t crop up often in science or engineering, but we can find it in the prevention of other existential risks, like nuclear war or bio-risks; or even in climate science.
The implications for alignment should be clear: we can’t just wait for the development of AGI and related technologies, and we have to work on alignment now (be it for solving the full problem or for showing that you can iterate safely), thus grappling in full with the lack of direct access.
Relaxed ergodicity: a whole new future
Imagine you’re studying gas molecules in a box. In this case and for many other systems, the dynamics behave well enough (with ergodicity for example) to let you predict relevant properties of the future states based on a deep model of the current state. Much of Boltzmann's work in statistical mechanics is based on leveraging this ability to generalize. Even without the restriction of full ergodicity, many phenomena and systems evolve in ways predictable from the current possibilities (through some sort of expectation).
Wouldn't that be nice, says epistemological vigilance. Yet strong optimization systematically shifts probability and so turns improbable world states into probable ones.[6] Thus what we observe now, with the technology available, will probably shift in non-trivial ways that need to be understood and dealt with. Ideas like instrumental convergence are qualitative predictions on this shift.
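As a rough illustration of how optimization turns improbable states into probable ones, here is a sketch that models an optimizer as nothing more than "draw many independent candidates and keep any that lands in the target region". That best-of-n model, the function name `prob_optimizer_hits`, and the 0.1% base rate are assumptions made for the example, not claims about how real systems search.

```python
def prob_optimizer_hits(p_single: float, n_candidates: int) -> float:
    """Chance that at least one of n independent candidates lands in a region
    that a single random draw hits with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_candidates

# A world state that a single undirected draw reaches 0.1% of the time.
for n in (1, 100, 10_000):
    print(f"{n:>6} candidates -> {prob_optimizer_hits(0.001, n):.4f}")
```

With one draw the state is essentially never observed; with ten thousand it is close to certain. Observations made at low search power therefore say little about what shows up at high search power.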
This is not a rare case. Even in statistical mechanics, you don’t always get ergodicity or the nice relaxations; and in the social sciences, this sort of shift is the standard, even if economic theory doesn’t seem good at addressing it. More generally, there’s a similarity with what Nassim Taleb calls Extremistan: settings where one outlier can matter more than everything that happened before (like many financial bets).
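One way to see the difference between thin-tailed and heavy-tailed settings is to ask how much of a total comes from the single largest observation. The sketch below uses idealized samples placed at evenly spaced quantiles (so it is deterministic); the normal distribution with mean 170 and standard deviation 10, and the Pareto tail index of 1.1, are hypothetical choices made only for illustration.

```python
from statistics import NormalDist

def largest_share(values):
    """Fraction of the total contributed by the single largest value."""
    return max(values) / sum(values)

n = 10_000
# Evenly spaced quantile levels stand in for an idealized sample of size n.
levels = [(i + 0.5) / n for i in range(n)]

# Thin tails: normally distributed quantities (think of heights in cm).
normal = NormalDist(mu=170, sigma=10)
thin = [normal.inv_cdf(u) for u in levels]

# Heavy tails: Pareto-distributed quantities (think of payoffs of bets),
# obtained from the inverse CDF x = (1 - u) ** (-1 / alpha) with alpha = 1.1.
heavy = [(1 - u) ** (-1 / 1.1) for u in levels]

print(f"thin tails:  largest value is {largest_share(thin):.5f} of the total")
print(f"heavy tails: largest value is {largest_share(heavy):.5f} of the total")
```

In the thin-tailed case no single observation matters; in the heavy-tailed case one observation carries a sizeable fraction of everything, which is the sense in which past data can badly mislead.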
Quoting Taleb, those who don’t realize they’re in Extremistan get “played for suckers”. In alignment that would translate to only studying what we have access to now, with little conceptual work on what will happen after the distribution shifts, or how it will shift. We would then risk destruction because we refused to follow through on all our reasons for expecting a shift.
Closedness: everything is relevant
Science thrives on reductionism. By separating one phenomenon, one effect, from the rest of the world, we gain the ability to model it, understand it, and often reinsert it into the broader picture. From physics experiments to theoretical computer science’s simplifications, through managing confounding variables in social sciences studies, such isolation is key to insight after insight in science.
On the other hand, strong optimization is the perfect example of a phenomenon that cannot be boxed (pun intended). Epistemological vigilance reminds us that the core of the alignment problem lies in the impact of optimization over the larger world, and in the ability of optimization to utilize and leverage unexpected properties of the world left out of "the box abstraction". As such, knowing which details can be safely ignored is far more fraught than might be expected.
One field with this problem jumps to mind: computer security.[7] In it, a whole class of attacks (side-channel attacks) depends on implementation and other details generally left outside of formalizations, like the power consumption of the CPU.
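Here is a small sketch of the idea, using a toy comparison routine whose number of loop iterations stands in for a physical side channel such as running time or power draw. The functions `leaky_equals` and `recover`, and the secret `"hunter"`, are hypothetical; real side-channel attacks have to deal with noise that this example leaves out.

```python
import string
from typing import Callable, Tuple

def leaky_equals(secret: str, guess: str) -> Tuple[bool, int]:
    """Early-exit comparison. Returns (match, steps); `steps` is the leak."""
    steps = 0
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            return False, steps
    return len(secret) == len(guess), steps

def recover(oracle: Callable[[str], Tuple[bool, int]],
            length: int, alphabet: str) -> str:
    """Recover the secret one character at a time from the leak alone."""
    known = ""
    for pos in range(length):
        padding = alphabet[0] * (length - pos - 1)
        # The correct character lets the comparison run further (or match).
        known += max(alphabet, key=lambda c: oracle(known + c + padding))
    return known

secret = "hunter"
oracle = lambda guess: leaky_equals(secret, guess)
print(recover(oracle, len(secret), string.ascii_lowercase))
```

A formal model that only records "the comparison returns true or false" would call for up to 26^6 guesses; the detail left out of that model brings the attack down to 6 × 26 queries.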
But really, almost all sciences and engineering disciplines have examples where isolating the phenomenon ends up distorting it or even removing it. Recall from the introduction that the use of random sampling in sociology when selecting people to survey destroys any information that could have been collected about the fine structure of the network of relationships.
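The sampling example can be made quantitative with a short calculation. Assuming a uniform random sample drawn without replacement (the function name `edge_survival` and the population and sample sizes are hypothetical), the chance that any given relationship has both of its endpoints in the sample is m(m-1) / (n(n-1)) for a sample of m people out of n.

```python
def edge_survival(population: int, sample_size: int) -> float:
    """Probability that both endpoints of a given relationship are drawn
    in a uniform random sample without replacement."""
    return (sample_size * (sample_size - 1)) / (population * (population - 1))

# Surveying 1% of a population of 10,000 people.
p = edge_survival(population=10_000, sample_size=100)
print(f"fraction of relationships observed from both sides: {p:.6f}")
```

Surveying 1% of the people keeps roughly 0.01% of the relationships, so the sample says almost nothing about the structure of the network even though it may describe individuals well.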
Examining closedness has been a focus of much of the theoretical part of conceptual alignment, from embedded agency to John's abstraction work. That being said, this epistemological vigilance is rarer in applied alignment researchers, maybe due to the prevalence of the closed system assumption in ML. As such, it's crucial to emphasize the need for vigilance here in order to avoid overconfidence in our models and experimental results.
Newtonian: complex reactions
Newton's laws of motion provide a whole ontology for thinking about how phenomena react to change: just compute the external forces, and you get a prediction of the result. Electromagnetism and thermodynamics leverage this ontology in productive ways; so do much of structural engineering and materials science, and even some productivity writers.
In alignment, on the other hand, the effect of interventions and change is far more involved, raising flags for epistemological vigilance. Beyond that, strong optimization doesn't just react to intervention by being pushed around; it instead channels itself through different paths towards the same convergent results. Deception in its many forms (for example deceptive alignment from the Risks paper) is but one generator of such highly non-Newtonian behaviors.
This is far more common than I initially expected. Social sciences in general suffer from this problem, as a lot of their predictions, analyses, and interventions alter the underlying dynamics of the social world they’re studying. Another example is cancer research, where intervening on some but not all signaling pathways might lead to adaptations towards the remaining pathways, instead of killing the cancer.
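The pattern of "patching some but not all paths" can be sketched with a toy search problem. Here the optimizer is modeled as a plain breadth-first search over a small hand-made graph; the graph, the node names, and the function `find_path` are hypothetical and only meant to show how the search reroutes until every route is closed.

```python
from collections import deque

def find_path(graph, start, goal, blocked=frozenset()):
    """Breadth-first search that avoids any edge listed in `blocked`."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in seen and (node, nxt) not in blocked:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy world with three routes from "start" to "goal".
graph = {
    "start": ["a", "b", "c"],
    "a": ["goal"],
    "b": ["goal"],
    "c": ["d"],
    "d": ["goal"],
}

patches = [
    set(),                                              # no intervention
    {("a", "goal")},                                    # patch the obvious route
    {("a", "goal"), ("b", "goal")},                     # patch the next one too
    {("a", "goal"), ("b", "goal"), ("d", "goal")},      # patch every route
]
for blocked in patches:
    print(len(blocked), "patched ->", find_path(graph, "start", "goal", blocked))
```

Each partial patch changes the route but not the outcome, and the surviving route is longer and less obvious than the one that was patched. Only closing every path changes the result.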
Keeping such a Newtonian assumption without a good model of what it's abstracting away leads to overconfidence in the applicability of interventions, and in our ability to direct the system. If we want to solve the problem and not delude ourselves, we need to grapple with the subtleties of reactions to interventions, if only to argue that they can be safely ignored.
Vicious synergies
As if the situation wasn’t difficult enough, note that there's a sort of vicious synergy between different assumptions. That is, the failure of one can undermine another.
Unboundedness undermines iterability, because we can’t bound how bad a missed first try would be.
As already discussed, lack of iterability undermines direct access, because it forces us to consider the problems before getting access.
Both openness and non-Newtonian reactions undermine relaxed ergodicity, as they allow more mechanisms leading to strong probability shifts.
Is it game over then?
Where does this leave us? My goal here is not to convince you that we are doomed; instead, I want to highlight which standard assumptions of science and research require epistemological vigilance if we are to solve the actual problem concerning us.
Such explicit deconfusion has at least three benefits:
There is not much merit in solving a harder problem than what you need to solve. On the other hand, solving a simpler problem, when it is not on a path of attack to the actual problem, leads to inadequate solutions and overconfidence in their power. Let's hone our epistemological vigilance together, and ensure that we're moving in the best available direction.[9]
Appendix: Conjecture’s Take
This post came about from discussions within Conjecture to articulate why we think alignment is hard, and why we expect many standard ML approaches to fail. As such, our take is that each of these assumptions will break by default, and that we either need to solve the problem without them or enforce some version of them.
[1] Note that most of what I discuss in this post has been mentioned, proposed, or presented elsewhere, be it by Eliezer, Bostrom, or later thinkers. My contribution lies in making the assumptions explicit and bringing them all together.
[2] Obviously it is only my current best model and is bound to change. Even during the writing of this post, I split one assumption into the last two of the final list.
[3] This is the assumption for which my naming and description feel furthest from the True Name of what I’m pointing at. So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.
[4] You also need conjectures about the hardness of reversing hash functions.
[5] Engineers also added redundancy to avoid single points of failure as much as possible, but that would have been insufficient without the improvements born of iteration.
[6] See this post for an old-school Eliezer story-explanation (and really all of Eliezer's side of the FOOM debate).
[7] Cue security mindset.
[8] From this talk by Gian-Carlo Rota: “Richard Feynman was fond of giving the following advice on how to be a genius. You have to keep a dozen of your favorite problems constantly present in your mind, although by and large they will lay in a dormant state. Every time you hear or read a new trick or a new result, test it against each of your twelve problems to see whether it helps. Every once in a while there will be a hit, and people will say: "How did he do it? He must be a genius!" ”
[9] One relevant case that I don't discuss in the post is what to do if we find good reasons to expect the problem to be impossible. In such cases, the focus should be on articulating them, checking them, and finding the best possible ways of convincing everyone of them to stop the race to extinction.