(thanks to Linda Linsefors, Artaxerxes, and jono for their feedback on drafts of this post)

as part of the refine programme, i'm trying to figure out what's actually important to work on with regard to AI risk. in the course of that, and motivated by some other factors such as increasingly wanting to write about my skepticism regarding ML-related approaches, i'm making this post to explain my current outlook on what the problem is, what the solutionspace we have so far looks like, and what i think should be focused on.

in this post, i won't argue for the significance of AI risks; i'll merely explain my current view of those risks, and my arguments will be about how i currently think those risks should be addressed.

note that i'll be describing my own outlook, which is not the consensus even within the alignment community — arguably, there is in fact no such consensus regarding many of the points i discuss here.

what is the problem?

i believe, akin to the yudkowsky-moore law of mad science, that the amount of resources it takes for the world to be destroyed — whether on purpose or by accident — keeps decreasing.

my most likely scenario for how this could happen is as follows: pretty soon (probly this decade or the next), someone builds an artificial intelligence capable of undergoing recursive self-improvement (RSI) until it becomes a singleton, and at that point the fate of at least the entire future lightcone will be determined by the goals of that AI.

the values we want are a very narrow target and we currently have no solid idea how to do alignment, so when AI does take over everything we're probly going to die. or worse, if for example we botch alignment.

(there are some other scenarios where a dumber AI helps cause the destruction of the world first — for example, someone decides to just let an AI try to print whatever molecules it wants in the hope of getting something interesting, and the AI makes grey goo or a superplague. or we get something like the flash war, akin to the flash crash. but i have reasons to think the RSI scenario is the most likely (see also intelligence explosion microeconomics). i also believe that multipolar scenarios are pretty unlikely: the increasing capabilities along the way just give a lot more chances for RSI to happen first.)

i call the task of addressing this "AI risk mitigation"; calling it "solving alignment" rules out non-alignment solutions, and "AI risk" is broad enough to encompass not just existential risk but also suffering risks, which is in my opinion an even greater cause for concern.

there may be other ways for us to be doomed, but AI seems like the largest risk factor; on top of that, if we do eventually figure out full alignment (a term i'll explain later), then we'll have a fully aligned AI which will give us utopia and take care of all other risks for us. so while full alignment might not be the only solution to this problem, if we do get it then we'll have solved pretty much all other problems.

AI risk mitigation solutions

there are roughly three ways people have thought of doing AI risk mitigation:

fully aligned singleton (FAS)

one approach is to create an AI which is powerful enough to become a singleton (an AI with enough power to make sure its goals are pursued, without anyone being able to do anything about it) but which is also fully aligned (ideally, continuously aligned), and which takes care of all other problems in one fell swoop. the difficulty of this approach is that we have to build a FAS faster than others are building existentially-or-worse risky AI, even though the task of building a FAS is expected to be harder.

by "fully aligned" i mean "so robustly aligned that with sufficient power such an AI would reliably save the world and create utopia", as opposed to "mildly aligned" systems which might experience a "sharp left turn".

despite its technical difficulty, this is the approach i believe in the most, because the other two seem overall harder, as i'll argue below.

sponge coordination

in agi ruin, yudkowsky says:

We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so. I've also in the past called this the 'safe-but-useless' tradeoff, or 'safe-vs-useful'. People keep on going "why don't we only use AIs to do X, that seems safe" and the answer is almost always either "doing X in fact takes very powerful cognition that is not passively safe" or, even more commonly, "because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later". If all you need is an object that doesn't do dangerous things, you could try a sponge; a sponge is very passively safe. Building a sponge, however, does not prevent Facebook AI Research from destroying the world six months later when they catch up to the leading actor.

what i call "sponge coordination" is getting everyone who's working on AI to only build systems that are weak and safe just like a sponge, instead of building powerful AIs that take over or destroy everything. typically this is accomplished either voluntarily or through sufficiently enforced regulation

getting everyone to stop working on AI could count as a particular case of this; but in general, the reason we'd want to tell AI companies "please use this to make your AI safe" rather than "please stop making AI" is that the former might still allow them to make some profit.

it might be the case that some AI capability organizations take AI risk seriously enough to be willing to forego some perceived expected gains by spending effort to make their AI systems weaker or safer, but not seriously enough to forego all perceived expected gains by stopping AI development altogether, even though that's what they should do. so, even though getting them to stop all AI development would be best, i see why some want to offer those organizations a middle ground between safety and profit.

the sponge coordination approach is notably the one outlined in this diagram from steve byrnes' intro to brain-like-AGI safety:

where the red box is developing a way to make AI "safe" and the blue boxes are the coordination part of the solution.

ultimately, such an approach is likely to only be a temporary one until FAS can be made; after all, we do want a superintelligent benevolent system to help us overcome other challenges we will face, such as the possibility of encountering unaligned alien superintelligences.

i think convincing and coordinating many actors — possibly even small teams, of which there are many and will be many more — sounds extremely hard. and it's not like sponge coordination stops the yudkowsky-moore law of mad science, so we'd have to coordinate increasingly many actors until FAS is created.

pivotal acts

a "pivotal act" refers in general to a way we can significantly change the expected outcome of the world for the better. pivotal acts might include developing FAS or sponge coordination, but here i use the term to talk more specifically about solutions that avoid having to either build a FAS or achieve sponge coordination. just like sponge coordination, i expect those to be temporary solutions.

these have, not unreasonably, been considered toxic because it's very hard to determine how to affect the timeline in a way that actually improves our chances of avoiding doom. for example: right now, we're at least able to work on alignment, and the largest AI capability organizations are at least somewhat interacting with the alignment community; it's not clear how that might evolve in the future if the alignment community is perceived to be trying to harm the productivity of AI capability organizations. see here for more thorough criticism.

nevertheless, pivotal acts remain a potential approach in solutionspace, and one that is being talked about, so i'm mentioning it here.

what to pursue?

because of the difficulties of sponge coordination and pivotal acts, i think the most helpful thing to work on at the moment, at least for myself, is AI alignment, typically in a form strong enough to be usable for building a FAS rather than "merely safe" sponges. this may not be the optimal thing for everyone to work on, but i believe AI safety is the optimal thing for most people who are concerned with AI risk to work on.

note that here i'm using the term "AI safety" to mean "making AIs that don't cause doom", whether they are safe in the way DALL-E or a sponge are safe, or safe because we'd trust them to satisfy our values even as a singleton. they don't cause doom (and they might happen to otherwise be useful).

it's useful to talk about AI safety in general, including FAS-building, because it seems like a lot of research that is useful to FAS would be relevant to AI safety in general. but it's not clear that we have to aim first for non-singleton aligned AIs, and only then FAS; it could be that the best way to maximize our expected utility is to aim straight for FAS.

(note that i use "AI ethics" to mean the not-particularly-related field of worrying about whether an AI — typically assumed to be safe — is at risk of causing much-smaller-than-existential harm, such as AIs with racist biases. i'm not interested in AI ethics; my priority is to avoid doom and hell, ideally by going straight for a utopia.)

furthermore, i'm thoroughly unqualified to talk about the coordination and politics involved.

for those reasons, the rest of this post is largely about technical approaches to AI safety, including alignment.

AI safety approaches

i like to think of the work of AI safety/alignment as a rectangle (or really an n-dimensional cylinder) of work to be done, bridging the gap between what we know how to do and what a full solution would consist of. proposing that a particular idea be used has the effect of constraining solutionspace and guiding what to work on next. in particular, adding an idea splits the work area it's inserted in into two new areas: one below, for how we are to implement that idea, and one above, for how to use that idea to get to the solution.

as new compatible ideas are added, a plan comes into shape where hopefully the work to be done in-between ideas becomes clearer. depending on where they fit in vertically, ideas have different degrees of tractability (we know how to do it) and relevance (it helps get to a solution), and there is generally a tradeoff between the two — that is what it means for the problem to be difficult.
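
to make this metaphor slightly more concrete, here's a tiny toy sketch of it in code (the names and numbers are purely illustrative, not any existing tool): a plan is a stack of ideas between "what we know how to do" and "a full solution", and inserting an idea splits the gap it lands in into two smaller gaps of work.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    height: float  # 0.0 = things we already know how to do, 1.0 = a full solution

class Plan:
    def __init__(self):
        # sentinels: the floor (current know-how) and the ceiling (a full solution)
        self.ideas = [Idea("what we know how to do", 0.0), Idea("a full solution", 1.0)]

    def insert(self, idea: Idea):
        # adding an idea splits the gap it lands in into two smaller gaps:
        # "how do we implement this?" below it, "how does this get us to a solution?" above it
        self.ideas.append(idea)
        self.ideas.sort(key=lambda i: i.height)

    def gaps(self):
        # the remaining stretches of work between adjacent ideas
        return [(lo.name, hi.name, round(hi.height - lo.height, 2))
                for lo, hi in zip(self.ideas, self.ideas[1:])]

plan = Plan()
plan.insert(Idea("some proposed alignment idea", 0.6))
print(plan.gaps())
# [('what we know how to do', 'some proposed alignment idea', 0.6),
#  ('some proposed alignment idea', 'a full solution', 0.4)]
```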

in light of this now-formalized view of the work at hand, a lot of my criticisms of approaches to AI alignment i've seen are that they're either:

  • handwavey: they have pretty bad relevance. it's unclear how they are expected to robustly solve alignment in the face of RSI and other sharp left turns. and if they're expected to continue being highly-inefficient non-RSI neural nets, then they're not expected to achieve singletonhood before some other RSI AI. sometimes it's unclear how, even without any competition, they would lead to good worlds at all.
  • uncompetitive: they have an altogether uncompetitive combination of relevance, tractability, and how good the worlds they would lead to are — they're not at the pareto frontier, and would take too much work compared to other proposals.

prioritizing relevance over tractability

in general, the reason i gravitate towards solutions that try to have strong formal guarantees is that i'm thinking of a FAS, which needs its alignment to be robust to significant changes in its paradigm; it probly needs to be an AI which can reason about its goals, and which applies goal-content integrity and other desired instrumentally convergent goals to them. anything weaker risks failing to ensure its next RSI step has the same goals as itself (after all, we're kind of the first step in that RSI, and we sure are having a very hard time figuring out how to make AI aligned to us!), and anything non-RSI probly gets beaten by another RSI AI that comes along.

another reason to focus on relevance before tractability is that if we work first on something tractable but not very relevant, we're increasing the chances that someone uses that alignment plan for an AI which will turn out to become a singleton. this increases both

  • X-risks caused by a false sense of security — the person goes "looks like something i can implement!" without realizing ways in which it might fail
  • S-risks caused by botched alignment; this could happen if the alignment plan only partially targets what we want, such as a goal that specifies humans should be kept alive but not under what conditions, or a plan that averages the values of humanity in a way that satisfies the values of people who want there to be a hell, a meat industry like ours, or other horrible things of that kind.
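
as a deliberately silly illustration of that second failure mode (the code and names below are made up for illustration, not an actual proposal anyone has made): an objective that only rewards keeping humans alive is silent about the conditions they're kept alive under, so a hellish world can score exactly as well as a flourishing one.

```python
from dataclasses import dataclass

@dataclass
class Person:
    alive: bool
    wellbeing: float  # completely ignored by the botched objective below

@dataclass
class World:
    humans: list

def botched_utility(world: World) -> float:
    # rewards keeping humans alive, but says nothing about the conditions they live under
    return sum(1 for person in world.humans if person.alive)

flourishing = World([Person(alive=True, wellbeing=1.0) for _ in range(3)])
hellish = World([Person(alive=True, wellbeing=-1.0) for _ in range(3)])

# both worlds score 3, so a maximizer of this objective has no reason to prefer the first
print(botched_utility(flourishing), botched_utility(hellish))  # -> 3 3
```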

finally, tractable solutions — by virtue of being easier to implement — risk boosting AI capability, and when you're causing damage you really want to know that it's helping alignment enough to be an overall expected net good. you don't want to discover that the work you've been doing, which is supposed to "help capability but at least it's helping alignment too!" is actually not relevant to the nearest alignment solution and just helped capability. i'm starting to increasingly have this concern for interpretability and some other kinds of ML-centric approaches to alignment.

the importance of formalism

for an AI to robustly take decisions that actually steer the world into the set of target configurations we want, we need some way to know that it actually cares about our actual values.

something which can be determined to do argmax_{a ∈ A} U(a) can be somewhat expected — at least eventually — to maximize U. this relies on a bunch of assumptions, getting from eventual alignment to continuous alignment seems potentially hard, and designing a formal objective function U that actually leads to good worlds when maximized is potentially hard — though one can make attempts at it (1, 2) — but at least there's some notion of trying to steer the AI so that what gets maximized is something good.
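
as a minimal sketch of what "determined to do argmax" means here (toy code; the action set and utility function are made up for illustration):

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")

def argmax_agent(actions: Iterable[A], utility: Callable[[A], float]) -> A:
    # picks argmax_{a in A} U(a); all the alignment content now lives in whether U
    # (and whatever connects actions to world-states) actually points at good worlds
    return max(actions, key=utility)

# made-up example: a naive "more resources is better" U predictably picks the extreme,
# which is exactly why what gets maximized had better be something good
print(argmax_agent([0, 1, 10, 100], utility=lambda amount: amount))  # -> 100
```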

but in many of the other approaches i see talked about, we don't even get there; we get AIs that are merely heuristics, from which we expect to get some useful results in the short term. but as soon as their capabilities generalize enough, we have a hard time predicting that they'll continue to do what we want, and we should guess that they won't, because there is usually no thing which keeps pointing to, or storing, the values we want, and there's no reason to expect the AI to generalize its goals in any way that is meaningful to us.

i suspect that making an AI useful pre-singletonhood, and making an AI eventually aligned in the long term (never mind continuously aligned), are pretty different tasks from one another — what an AI can do when it has started taking over the world and gotten to truly superintelligent levels of capability is radically different from what it will do in limited environments where it hasn't realized things like embedded agency, instrumental convergence, or better decision theories, and is running on — compared to what it could get — not just very small but also constant amounts of compute, with no RSI.

this is why we need strong guarantees, or at least some idea as to why an AI will continue pursuing goals that lead to desirable outcomes when it gets superintelligent. we need them to start with, or get to, a goal which represents or at least points to desirable worlds, before the AI has crystallized goal-content integrity within itself.

if the AI is in some sense "corrigible", we should expect it to just hack its correction mechanism unless we get a good explanation as to why it wouldn't.

if the AI doesn't reach superintelligent capabilities through RSI, we should expect it to get outcompeted by something else which will.

note that i'm partly talking in ignorance; it's possible that the alignment approaches i've heard of are thinking about, or even have, solutions to the problems i'm bringing up here. but my impression at the moment is that most approaches start out with a wonky proposal and then throw ordinary paranoia at it, instead of starting out with a security mindset. on the other hand, more formal guarantees about eventual alignment can get closer to being generally robust; see for example vanessa kosoy's response to steve byrnes's take on her desiderata-first approach:

Let me give an analogy. In cryptography, we have theorems saying that if such-and-such mathematical assumption holds (e.g. X is a one-way function) then a particular protocol is sound. We don't need to list all possible ways an attacker might try to break the protocol: we get safety from any possible attack! (within the assumptions of the model) Is this "too easy"? I don't think so: it requires a lot of hard work to get there, and we're still left with assumptions we don't know how to prove (but we do have high confidence in). Similarly, we're going to need a lot of hard work to get safe protocols for AGI.
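
to make the shape of such a guarantee concrete, here's roughly the form those cryptographic theorems take (a schematic with illustrative symbols, not anything from vanessa's actual work):

```latex
% schematic: one assumption buys robustness against every attacker inside the model,
% rather than a list of individually patched failure modes
\[
  \text{if } f \text{ is one-way, then } \quad
  \forall \mathcal{A} \in \mathrm{PPT}:\;
  \Pr\bigl[\mathcal{A}\text{ breaks the protocol}\bigr] \le \varepsilon(n),
  \quad \varepsilon \text{ negligible.}
\]
```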

this is the sort of general robustness i think we'll be needing, to trust an AI with singletonhood. and without singletonhood, because there are no pivotal weak acts, facebook AI still destroys the world six months later.

unsurprisingly, the surest way to have our values realized is to give capability to an AI which is aligned to begin with, and then make sure it sufficiently understands what realizing those values entails, in order to not fuck up too much on the way there. building something very capable and then trying to patch alignment onto it is the wrong way to go about things; you don't know that you'll be able to patch something onto it, and it might destroy everything before you even get to start trying.

finally, starting from the values might save us huge amounts of time. if aligning current ML models is impossible or would take 50 years, and if aligning something different could take as little as 5 years, then we need to align something else. nothing guarantees that, just because ML is how we got to highly capable unaligned systems, it's also the shortest route to highly capable aligned systems; a much more reasonable approach is to first figure out what "aligned" means, and then figure out how to build something which from the ground up is designed to have that property — possibly, but not necessarily, using ML to accomplish it. it might be that ML is very applicable to building a FAS, in which case, great! but it might be that the way to go about it is not clear at all without knowing what alignment desiderata would look like, or it could be that ML is in fact not the best way to build a FAS at all.

conclusion

the problem at hand is profound, the stakes astronomical, and a lot of the work done to address it is thoroughly unable to justify why it's going about it in the way it is.

in my opinion we should figure out what alignment means, what desiderata would formalize it, and then build something that has those.

in retrospect, this approach is straightforward: figure out the solution, then build it. instead, a lot of approaches are committing to building the solution out of the same materials (modern ML technology) that the problem is made of, and then trying to figure out how to arrange those materials in a vaguely solution-shaped way. the problem doesn't care; it won't give us more respect for building a solution that looks like itself. the utmost priority is determining what would make something aligned to a cosmically robust extent. anything weaker than that, and everything dies, everywhere, forever.

or worse.

comments

This is one of the clearest top-to-bottom accounts of the alignment problem and related world situation that I've seen here in a while. Thank you for writing it.

i believe, akin to the yudkowsky-moore law of mad science, that the amount of resources it takes for the world to be destroyed — whether on purpose or by accident — keeps decreasing.

Yes, it seems that in this particular way the world is becoming more and more unstable.

pretty soon (probly this decade or the next), someone builds an artificial intelligence capable of undergoing recursive self-improvement (RSI) until it becomes a singleton, and at that point the fate of at least the entire future lightcone will be determined by the goals of that AI.

I think the risk is that one way or another we lock in some mostly-worthless goal to a powerful optimization process. I don't actually think RSI is necessary for that. Beyond that, in practical ML work, we keep seeing systems whose world models are very far from being able to make sense of their own implementation, and actually we seem to be moving further in this direction over time. Google's SayCan, for example, seems to be further from understanding its own implementation than some much more old-fashioned robotics tech from the 1990s (which wasn't exactly close to being able to reason about its own implementation within its world model, either).

the values we want are a very narrow target and we currently have no solid idea how to do alignment, so when AI does take over everything we're probly going to die. or worse, if for example we botch alignment.

Don't assume that the correct solution to the alignment problem consists of alignment to a utility function defined over physical world states. We don't know that for sure and many schools of moral philosophy have formulated ethics in a way that doesn't supervene on physical world states. It's not really clear to me that even hedonistic utilitarianism can be formulated as a utility function over physical world states.

what i call "sponge coordination" is getting everyone who's working on AI to only build systems that are weak and safe just like a sponge, instead of building powerful AIs that take over or destroy everything

I really like the term "sponge coordination" and the definition you've given! And I agree that it's not viable. The basic problem is that we humans are ourselves rapidly becoming unbounded optimizers and so the current world situation is fundamentally not an equilibrium, and we can't just make it an equilibrium by trying to. A solution looks like a deeper change than just asking everyone to keep doing mostly what they're currently doing except not building super-powerful AIs.

[pivotal acts] right now, we're at least able to work on alignment, and the largest AI capability organizations are at least somewhat interacting with the alignment community; it's not clear how that might evolve in the future if the alignment community is perceived to be trying to harm the productivity of AI capability organizations

There are two basic responses possible: either don't perform pivotal acts, or move to a situation where we can perform pivotal acts. It is very difficult to resolve the world situation with AI if one is trying to do so while not harming the productivity of any AI companies. That would be like trying to upgrade a bird to a jumbo jet while keeping it alive throughout.

it's not clear that we have to aim first for non-singleton aligned AIs, and only then FAS; it could be that the best way to maximize our expected utility is to aim straight for FAS.

This is true. Thank you for having the courage to say it.

in light of this now-formalized view of the work at hand, a lot of my criticisms of approaches to AI alignment i've seen are that they're either [hand-wavey or uncompetitive]

Indeed. It's a difficult problem and it's okay to formulate and discuss hand-wavey or uncompetitive plans as a stepping stone to formulating precise and competitive plans.

finally, tractable solutions — by virtue of being easier to implement — risk boosting AI capability, and when you're causing damage you really want to know that it's helping alignment enough to be an overall expected net good

This is a very good sentence.

this is the sort of general robustness i think we'll be needing, to trust an AI with singletonhood. and without singletonhood

I think you're right to focus on both formalism and trustworthiness, though please do also investigate whether and in what way the former actually leads to the latter.

nothing guarantees that, just because ML is how we got to highly capable unaligned systems, it's also the shortest route to highly capable aligned systems

Yeah well said.

a much more reasonable approach is to first figure out what "aligned" means, and then figure out how to build something which from the ground up is designed to have that property

This is my favorite sentence in your essay.

in my opinion we should figure out what alignment means, what desiderata would formalize it, and then build something that has those.

Again, just notice that you are assuming a kind of pre-baked blueprint here, simply by putting in the middle step "what desiderata would formalize it". Formal systems are a tool like any other tool: powerful, useful, important to understand the contexts in which they are effective and the contexts in which they are ineffective.