I particularly like your "Logical vs. physical risk aversion" distinction, and agree that we should prioritize reducing logical risk. I think acausal trade makes this particularly concrete. If we make a misaligned superintelligence that "plays nice" in the acausal bargaining community, I'd think that's better than making an aligned superintelligence that doesn't, because overall it matters far more that the community is nice than that it has a high population of people with our values.
I also really like your point about how providing evidence that AI safety is difficult may be one of the most important reasons to do AI safety research. I guess I'd like to see some empirically grounded analysis of how likely it is that the relevant policymakers and so forth will be swayed by such things. So far it seems like they've been swayed by direct arguments that the problem is hard, and not so much by our failures to make progress. If anything, the failure of AI safety researchers to make progress seems to encourage their critics.
I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was something I kind of wanted to point to before this post came out but felt confused about, and this post finally gave me a pointer to it.
I think that section was substantially more novel and valuable to me than the rest of this post, but that is also evidence that others might not have had some of the other ideas on their map either, and so might have found the post similarly valuable because of a different section.
Minor point: I think asteroid strikes are probably very highly correlated between Everett branches (though maybe the timing of spotting an asteroid on a collision course is variable).
I think if we could look at all the Everett branches that contain some version of you, we'd see "bundles" where the asteroid locations are the same within each bundle but different between bundles, because different bundles evolved from different starting conditions (and then converged in terms of having produced someone who is subjectively indistinguishable from you). So a big asteroid strike would wipe out humanity in an entire bundle but that would only constitute a small fraction of all the Everett branches that contain a version of you.
Hopefully that makes sense?
Multiple simultaneous DSAs under CAIS
Taking over the world is a big enough prize, compared to the wealth of a typical agent, that even a small chance of achieving it should already be enough to act. And waiting is dangerous if there's a chance of other agents outrunning you. So multiple agents having DSA but not acting for uncertainty reasons seems unlikely.
Logical vs physical risk aversion
Imagine you care about the welfare of two koalas living in separate rooms. Given a choice between both koalas dying with probability 1/2 or a randomly chosen koala dying with probability 1, why is the latter preferable?
You could say our situation is different because we're the koala. Fine. Imagine you're choosing between a 1/2 physical risk and a 1/2 logical risk to all humanity, but both of them will happen in 100 years when you're already dead, so the welfare of your copies isn't in question. Why is the physical risk preferable? How is that different from the koala situation?
Taking over the world is a big enough prize, compared to the wealth of a typical agent, that even a small chance of achieving it should already be enough to act.
In CAIS, AI services aren't agents themselves, especially the lower-level ones. If they're controlled by humans, their owners/operators could well be risk-averse enough (equivalently, not assign high enough utility to taking over the world) to not take advantage of a DSA given their uncertainty.
Imagine you're choosing between a 1/2 physical risk and a 1/2 logical risk to all humanity, but both of them will happen in 100 years when you're already dead, so the welfare of your copies isn't in question. Why is the physical risk preferable?
I don't think it's possible for the welfare of my copies to not be at question. See this comment.
Another line of argument is that suppose we'll end up getting most of our utility from escaping simulations and taking over much bigger/richer universes. In those bigger universes we might eventually meet up with copies of us from other Everett branches and have to divide up the universe with them. So physical risk isn't as concerning in that scenario because the surviving branches will end up with larger shares of the base universes.
A similar line of thought is that in an acausal trade scenario, each surviving branch of a physical risk could get a better deal because whatever thing of value they have to offer has become more scarce in the multiverse economy.
Many such intuitions seem to rely on "doors" between worlds. That makes sense - if we have two rooms of animals connected by a door, then killing all animals in one room will just lead to it getting repopulated from the other room, which is better than killing all animals in both rooms with probability 1/2. So in that case there's indeed a difference between the two kinds of risk.
The question is, how likely is a door between two Everett branches, vs. a door connecting a possible world with an impossible world? With current tech, both are impossible. With sci-fi tech, both could be possible, and based on the same principle (simulating whatever is on the other side of the door). But maybe "quantum doors" are more likely than "logical doors" for some reason?
Another argument that definitely doesn't rely on any sort of "doors" for why physical risk might be preferable to logical risk is just if you have diminishing returns on the total number of happy humans. As long as your returns to happy humans are sublinear (logarithmic is a standard approximation, though anything sublinear works), then you should prefer a guaranteed shot at half of the Everett branches having lots of happy humans to a 1/2 chance of all the Everett branches having happy humans.

To see this, suppose $U$ measures your returns to the total number of happy humans across all Everett branches, with $U(0) = 0$. Let $H$ be the total number of happy humans in a good Everett branch and $N$ the total number of Everett branches. Then, in the physical risk situation, you get $U\left(\frac{HN}{2}\right)$, whereas, in the logical risk situation, you get $\frac{1}{2} U(HN)$, which are only equal if $U$ is linear.

Personally, I think my returns are sublinear, since I pretty strongly want there to at least be some humans, more strongly than I want there to be more humans (though I want that as well). Furthermore, if you believe there's a chance that the universe is infinite, then you should probably be using some sort of measure over happy humans rather than just counting the number, and my best guess for what such a measure might look like seems to be at least somewhat locally sublinear.
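Here's a minimal numerical sketch of the comparison above, assuming logarithmic returns and made-up values for $H$ and $N$ (nothing in the argument depends on the particular numbers):

```python
import math

def U(x):
    # Illustrative sublinear (logarithmic) returns to total happy humans, with U(0) = 0.
    return math.log(1 + x)

H = 10**10  # happy humans per good Everett branch (made-up number)
N = 10**6   # number of Everett branches (made-up number)

# Physical risk: half of the branches survive for sure.
physical = U(H * N / 2)

# Logical risk: 1/2 chance all branches survive, 1/2 chance none do.
logical = 0.5 * U(H * N) + 0.5 * U(0)

print(physical, logical)   # physical > logical for any strictly concave U
assert physical >= logical
```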
So you're saying that (for example) there could be a very large universe that is running simulations of both possible worlds and impossible worlds, and therefore even if we go extinct in all possible worlds, versions of us that live in the impossible worlds could escape into the base universe so the effect of a logical risk would be similar to a physical risk of equal magnitude (if we get most of our utility from controlling/influencing such base universes). Am I understanding you correctly?
If so, I have two objections to this. 1) Some impossible worlds seem impossible to simulate. For example suppose in the actual world AI safety requires solving metaphilosophy. How would you simulate an impossible world in which AI safety doesn't require solving metaphilosophy? 2) Even for the impossible worlds that maybe can be simulated (e.g., where the trillionth digit of pi is different from what it actually is) it seems that only a subset of reasons for running simulations of possible worlds would apply to impossible worlds, so I'm a lot less sure that "logical doors" exist than I am that "quantum doors" exist.
It seems to me that AI will need to think about impossible worlds anyway - for counterfactuals, logical uncertainty, and logical updatelessness/trade. That includes worlds that are hard to simulate, e.g. "what if I try researching theory X and it turns out to be useless for goal Y?" So "logical doors" aren't that unlikely.
Planned summary for the Alignment Newsletter:
This post briefly presents three ways that power can become centralized in a world with <@Comprehensive AI Services@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@), argues that under risk aversion "logical" risks can be more concerning than physical risks because they are more correlated, proposes combining human imitations and oracles to remove the human in the loop and become competitive, and suggests doing research to generate evidence of difficulty of a particular strand of research.
The first three examples here have been pretty helpful to me in considering how DSAs and takeoffs will go and why they may be dangerous.
I've referred specifically to the "'Generate evidence of difficulty' as a research purpose" section many times since this post came out, and while I have disagreements with it, I do really like it as a handle for a consideration that I hadn't previously seen written up, and it does strike me as quite important.
AI risk ideas are piling up in my head (and in my notebook) faster than I can write them down as full posts, so I'm going to condense multiple posts into one again. I may expand some or all of these into full posts in the future. References to prior art are also welcome as I haven't done an extensive search myself yet.
The "search engine" model of AGI development
The current OpenAI/DeepMind model of AGI development (i.e., fund research using only investor / parent company money, without making significant profits) isn't likely to be sustainable, assuming a soft takeoff, but the "search engine" model very well could be. In the "search engine" model, a company (and eventually the AGI itself) funds AGI research and development by selling AI services, while keeping its technology secret. At some point it achieves DSA either by accumulating a big enough lead in AGI technology and other resources to win an open war against the rest of the world, or by being able to simultaneously subvert a large fraction of all cognition done on Earth (i.e., all the AI services that it is offering), causing that cognition to suddenly optimize for its own interests. (This was inspired by / a reply to Daniel Kokotajlo's Soft takeoff can still lead to decisive strategic advantage.)
Coordination as an AGI service
As a refinement of the above, to build a more impregnable monopoly via network effects, the AGI company could offer "coordination as a service", where it promises that any company that hires its AGI as CEO will efficiently coordinate in some fair way with all other companies that also hire its AGI as CEO. See also my AGI will drastically increase economies of scale.
Multiple simultaneous DSAs under CAIS
Suppose CAIS turns out to be a better model than AGI. Many AI services may be natural monopolies and have a large market share in their niches. If many high-level AI services all use one particular low-level AI service, that lower-level AI service (or rather the humans or higher-level AI services that have write access to it) could achieve a decisive strategic advantage by subverting the service in a way that causes a large fraction of all cognition on Earth (i.e., all the higher-level services that depend on it) to start optimizing for its own interests. Multiple different lower-level services could simultaneously have this option. (This was inspired by a comment from ryan_b.)
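To illustrate how several services could hold this option at once, here is a toy sketch with a made-up dependency map (all service names and numbers are hypothetical); any low-level service that sits upstream of a large fraction of high-level services is a candidate for this kind of DSA:

```python
from collections import Counter

# Hypothetical dependency map: which low-level AI services each high-level service relies on.
dependencies = {
    "translation":    ["language-model", "web-index"],
    "drug-discovery": ["protein-folding", "language-model"],
    "logistics":      ["route-planning", "language-model", "web-index"],
    "tutoring":       ["language-model"],
    "market-making":  ["web-index", "route-planning"],
}

usage = Counter(dep for deps in dependencies.values() for dep in deps)
total = len(dependencies)

for service, count in usage.most_common():
    print(f"{service}: upstream of {count}/{total} high-level services ({count/total:.0%})")
```

In this toy economy both "language-model" and "web-index" sit upstream of most high-level services, so whoever controls either one has a simultaneous shot at subverting a large fraction of the cognition built on top of them.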
Logical vs physical risk aversion
Some types of risks may be more concerning than others because they are "logical risks" or highly correlated between Everett branches. Suppose Omega appears and says he is appearing in all Everett branches where some version of you exists and offering you the same choice: If you choose option A he will destroy the universe if the trillionth digit of pi equals the trillionth digit of e, and if you choose option B he will destroy the universe if a quantum RNG returns 0 when generating a random digit. It seems to me that option B is better because it ensures that there's no risk of all Everett branches being wiped out. See The Moral Status of Independent Identical Copies for my intuitions behind this. (How much more risk should we accept under option B until we're indifferent between the two options?)
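To make the parenthetical question concrete, here is a sketch under an assumed utility function (not something argued for above): suppose utility is a concave function $U$ of the fraction of Everett branches that survive, with $U(0) = 0$. Option A gives expected utility $\frac{9}{10} U(1)$ (a 1/10 logical chance that every branch is destroyed), while option B with per-branch risk $q$ gives $U(1 - q)$ (a fraction $q$ of branches destroyed for sure). Indifference is at

$$U(1 - q) = \tfrac{9}{10}\, U(1).$$

For example, with the illustrative choice $U(x) = \sqrt{x}$, this gives $1 - q = \left(\tfrac{9}{10}\right)^2 = 0.81$, i.e. $q = 0.19$, so under that utility function you'd accept nearly twice the nominal physical risk before becoming indifferent.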
More realistic examples of logical risks:
Examples of physical risk:
Combining oracles with human imitations
It seems very plausible that oracles/predictors and human imitations (which can be thought of as a specific kind of predictor) are safer (or more easily made safe) than utility maximizers or other kinds of artificial agents. Each of them has disadvantages though: oracles need a human in the loop to perform actions, which is slow and costly and puts them at a competitive disadvantage versus AGI agents; human imitations can be faster and cheaper than humans but not smarter, which is also a competitive disadvantage versus AGI agents. Combining the two ideas can result in a more competitive (and still relatively easy to make safe) agent. (See this comment for an example.) This is not a particularly novel idea, since arguably quantilizers and IDA already combine oracles/predictors and human imitations to achieve superintelligent agency, but it still seems worth writing down explicitly.
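As a minimal sketch of one way such a combination could work (my own illustration, not the specific construction in the linked comment), assume we already have a human-imitation model that proposes actions the way a human overseer would, and an oracle/predictor that scores candidate actions by predicted outcome:

```python
from typing import Callable, List

def combined_agent_step(
    observation: str,
    imitate_human: Callable[[str], List[str]],  # human-imitation model: proposes candidate actions
    oracle: Callable[[str, str], float],        # predictor: scores (observation, action) pairs
) -> str:
    """One step of a hypothetical oracle + human-imitation agent.

    The human imitation replaces the slow/costly human in the loop by proposing
    the kinds of actions a human overseer plausibly would, while the oracle
    supplies the superhuman part by predicting how well each proposal turns out.
    """
    proposals = imitate_human(observation)
    scored = [(oracle(observation, action), action) for action in proposals]
    best_score, best_action = max(scored)
    return best_action
```

Whether something this simple remains as safe as its components is of course a further question, but the sketch shows why the combination can act without a human in the loop while staying anchored to human-like proposals.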
"Generate evidence of difficulty" as a research purpose
How to handle the problem of AI risk is one of, if not the most important and consequential strategic decisions facing humanity. If we err in the direction of too much caution, in the short run resources are diverted into AI safety projects that could instead go to other x-risk efforts, and in the long run, billions of people could unnecessarily die while we hold off on building "dangerous" AGI and wait for "safe" algorithms to come along. If we err in the opposite direction, well, presumably everyone here already knows the downside there.
A crucial input into this decision is the difficulty of AI safety, and the obvious place for decision makers to obtain evidence about the difficulty of AI safety is from technical AI safety researchers (and AI researchers in general), but it seems that not many people have given much thought on how to optimize for the production and communication of such evidence (leading to communication gaps like this one). (As another example, many people do not seem to consider that doing research on a seemingly intractably difficult problem can be valuable because it can at least generate evidence of difficulty of that particular line of research.)
The evidence can be in the form of:
Bias about the difficulty of AI safety is costly/dangerous, so we should think about how to minimize this bias while producing evidence of difficulty. Some possible sources of bias: