Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).

Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end up with as much influence as the AI, i.e. they should end up with 99% of the influence.
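The symmetry step can be written out as a small calculation. This is only a sketch under the argument's own idealization (every group is equally effective, and total influence sums to 1); the function name `human_share` is made up for illustration and does not come from Taylor's post.

```python
from fractions import Fraction


def human_share(n_human_groups: int, n_ai_groups: int = 1) -> Fraction:
    """Expected fraction of influence held by humans.

    Assumption (from the symmetry argument): all groups are equally
    effective, so each group expects an equal slice of a unit-sum total.
    """
    total_groups = n_human_groups + n_ai_groups
    per_group = Fraction(1, total_groups)  # equal expected share per group
    return n_human_groups * per_group


print(human_share(99))  # 99/100
```

The interesting question is whether the equal-effectiveness assumption holds, which is what the rest of the post examines.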

This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future. By “flexible” I mean that humans can decide later what to do with that influence — which is important since humans don’t yet know what we want in the long run.

Why might the strategy-stealing assumption be true?

Today there are a bunch of humans, with different preferences and different kinds of influence. Crudely speaking, the long-term outcome seems to be determined by some combination of {which preferences have how much influence?} and {what is the space of realizable outcomes?}.

I expect this to become more true over time — I expect groups of agents with diverse preferences to eventually approach efficient outcomes, since otherwise there are changes that every agent would prefer (though this is not obvious, especially in light of bargaining failures). Then the question is just about which of these efficient outcomes we pick.

I think that our actions don’t affect the space of realizable outcomes, because long-term realizability is mostly determined by facts about distant stars that we can’t yet influence. The obvious exception is that if we colonize space faster, we will have access to more resources. But quantitatively this doesn’t seem like a big consideration, because astronomical events occur over millions of millennia while our decisions only change colonization timelines by decades.

So I think our decisions mostly affect long-term outcomes by changing the relative weights of different possible preferences (or by causing extinction).

Today, one of the main ways that preferences have weight is because agents with those preferences control resources and other forms of influence. Strategy-stealing seems most possible for this kind of plan — an aligned AI can exactly copy the strategy of an unaligned AI, except the money goes into the aligned AI’s bank account instead. The same seems true for most kinds of resource gathering.

There are lots of strategies that give influence to other people instead of helping me. For example, I might preferentially collaborate with people who share my values. But I can still steal these strategies, as long as my values are just as common as the values of the person I’m trying to steal from. So a majority can steal strategies from a minority, but not the other way around.

There can be plenty of strategies that don’t involve acquiring resources or flexible influence. For example, we could have a parliament with obscure rules in which I can make maneuvers that advantage one set of values or another in a way that can’t be stolen. Strategy-stealing may only be possible at the level of groups — you need to retain the option of setting up a different parliamentary system that doesn’t favor particular values. Even then, it’s unclear whether strategy-stealing is possible.

There isn’t a clean argument for strategy-stealing, but I think it is plausible enough that it’s meaningful and productive to treat it as a default, and to look at ways it can fail. (If you found enough ways it could fail, you might eventually stop thinking of it as a default.)

Eleven ways the strategy-stealing assumption could fail

In this section I’ll describe some of the failures that seem most important to me, with a focus on the ones that would interfere with the argument in the introduction.

1. AI alignment

If we can build smart AIs, but not aligned AIs, then humans can’t necessarily use AI to capture flexible influence. I think this is the most important way in which strategy-stealing is likely to fail. I’m not going to spend much time talking about it here because I’ve spent so much time elsewhere.

For example, if smart AIs inevitably want to fill the universe with paperclips, then “build a really smart AI” is a good strategy for someone who wants to fill the universe with paperclips, but it can’t be easily stolen by someone who wants anything else.

2. Value drift over generations

The values of 21st century humans are determined by some complicated mix of human nature and the modern environment. If I’m a 16th century noble who has really specific preferences about the future, it’s not really clear how I can act on those values. But if I’m a 16th century noble who thinks that future generations will inevitably be wiser and should get what they want, then I’m in luck: all I need to do is wait and make sure our civilization doesn’t do anything rash. And if I have some kind of crude intermediate preferences, then I might be able to push our culture in appropriate directions or encourage people with similar genetic dispositions to have more kids.

This is the most obvious and important way that strategy-stealing has failed historically. It’s not something I personally worry about too much though.

The big reason I don’t worry is some combination of common-sense morality and decision-theory: our values are the product of many generations each giving way to the next one, and so I’m pretty inclined to “pay it forward.” Put a different way, I think it’s relatively clear I should empathize with the next generation since I might well have been in their place (whereas I find it much less clear under what conditions I should empathize with AI). Or from yet another perspective, the same intuition that I’m “more right” than previous generations makes me very open to the possibility that future generations are more right still. This question gets very complex, but my first-pass take is that I’m maybe an order of magnitude less worried than about other kinds of value drift.

The small reason I don’t worry is that I think this dynamic is probably going to be less important in the future (unless we actively want it to be important — which seems quite possible). I believe there is a good chance that within 60 years most decisions will be made by machines, and so the handover from one generation to the next will be optional.

That all said, I am somewhat worried about more “out of distribution” changes to the values of future generations, in scenarios where AI development is slower than I expect. For example, I think it’s possible that genetic engineering of humans will substantially change what we want, and that I should be less excited about that kind of drift. Or I can imagine the interaction between technology and culture causing similarly alien changes. These questions are even harder to think about than the basic question of “how much should I empathize with future generations?” which already seemed quite thorny, and I don’t really know what I’d conclude if I spent a long time thinking. But at any rate, these things are not at the top of my priority queue.

3. Other alignment problems

AIs and future generations aren’t the only optimizers around. For example, we can also build institutions that further their own agendas. We can then face a problem analogous to AI alignment — if it’s easier to build effective institutions with some kinds of values than others, then those values could be at a structural advantage. For example, we might inevitably end up with a society that optimizes generalizations of short-term metrics, if big groups of humans are much more effective when doing this. (I say “generalizations of short-term metrics” because an exclusive focus on short-term metrics is the kind of problem that can fix itself over the very long run.)

I think that institutions are currently considerably weaker than humans (in the sense that’s relevant to strategy-stealing) and this will probably remain true over the medium term. For example:

  • A company with 10,000 people might be much smarter than any individual human, but mostly that’s because of its alliance with its employees and shareholders; most of its influence is just used to accumulate more wages and dividends. Companies do things that seem antisocial not because they have come unmoored from any human’s values, but because plenty of influential humans want them to do that in order to make more money. (You could try to point to the “market” as an organization with its own preferences, but it’s even worse at defending itself than bureaucracies: it’s up to humans who benefit from the market to defend it.)
  • Bureaucracies can seem unmoored from any individual human desire. But their actual ability to defend themselves and acquire resources seems much weaker than other optimizers like humans or corporations.

Overall I’m less concerned about this than AI alignment, but I do think it is a real problem. I’m somewhat optimistic that the same general principles will be relevant both to aligning institutions and AIs. If AI alignment wasn’t an issue, I’d be more concerned by problems like institutional alignment.

4. Human fragility

If AI systems are aligned with humans, they may want to keep humans alive. Not only do humans prefer being alive, humans may need to survive if they want to have the time and space to figure out what they really want and to tell their AI what to do. (I say “may” because at some point you might imagine e.g. putting some humans in cold storage, to be revived later.)

This could introduce an asymmetry: an AI that just cares about paperclips can get a leg up on humans by threatening to release an engineered plague, or trashing natural ecosystems that humans rely on. (Of course, this asymmetry may also go the other way — values implemented in machines are reliant on a bunch of complex infrastructure which may be more or less of a liability than humanity’s reliance on ecosystems.)

Stepping back, I think the fundamental long-term problem here is that “do what this human wants” is only a simple description of human values if you actually have the human in hand, and so an agent with these values does have a big extra liability.

I do think that the extreme option of “storing” humans to revive them later is workable, though most people would be very unhappy with a world where that becomes necessary. (To be clear, I think it almost certainly won’t.) We’ll return to this under “short-term terminal preferences” below.

5. Persuasion as fragility

If an aligned AI defines its values with reference to “whatever Paul wants,” then someone doesn’t need to kill Paul to mess with the AI, they just need to change what Paul wants. If it’s very easy to manipulate humans, but we want to keep talking with each other and interacting with the world despite the risk, then this extra attack surface could become a huge liability.

This is easier to defend against — just stop talking with people except in extremely controlled environments where you can minimize the risk of manipulation — but again humans may not be willing to pay that cost.

The main reason this might be worse than point 4 is that humans may be relatively happy to physically isolate themselves from anything scary, but it would be much more costly for us to cut ourselves off from contact with other humans.

6. Asymmetric persuasion

Even if humans are the only optimizers around, it might be easier to persuade humans of some things than others. For example, you could imagine a world where it’s easier to convince humans to endorse a simple ideology like “maximize the complexity of the universe” than to convince humans to pursue some more complex and subtle values.

This means that people with easily-persuadable values can use persuasion as a strategy, and people with other values cannot copy it.

I think this is ultimately more important than fragility, because it is relevant before we have powerful AI systems. It has many similarities to “value drift over generations,” and I have some mixed feelings here as well — there are some kinds of argument and deliberation that I certainly do endorse, and to the extent that my current views are the product of significant amounts of non-endorsed deliberation I am more inclined to be empathetic to future people who are influenced by increasingly-sophisticated arguments.

But as I described in section 2, I think these connections can get weaker as technological progress moves us further out of distribution, and if you told me that e.g. it was possible to perform a brute force search and find an argument that could convince someone to maximize the complexity of the future, I wouldn’t conclude that it’s probably fine if they decided to do that.

(Credit to Wei Dai for emphasizing this failure mode.)

7. Value-sensitive bargaining

If a bunch of powerful agents collectively decide what to do with the universe, I think it probably won’t look like “they all control their own slice of the universe and make independent decisions about what to do.” There will likely be opportunities for trade, they may have meddling preferences (where I care what you do with your part of the universe), there may be a possibility of destructive conflict, or it may look completely different in an unanticipated way.

In many of these settings the outcome is influenced by a complicated bargaining game, and it’s unclear whether the majority can steal a minority’s strategy. For example, suppose that there are two values X and Y in the world, with 99% X-agents and 1% Y-agents. The Y-agents may be able to threaten to destroy the world unless there is an even split, and the X-agents have no way to copy such a strategy. (This could also occur over the short term.)

I don’t have a strong view about the severity of this problem. I could imagine it being a big deal.

8. Recklessness

Some preferences might not care about whether the world is destroyed, and therefore have access to productive but risky strategies that more cautious agents cannot copy. The same could happen with other kinds of risks, like commitments that are game-theoretically useful but risk sacrificing some part of the universe or creating long-term negative outcomes.

I tend to think about this problem in the context of particular technologies that pose an extinction risk, but it’s worth keeping in mind that it can be compounded by the existence of more reckless agents.

Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident. There are fewer people who are in fact trying to kill everyone, but I think not enough fewer to tip the balance. (This is a contingent fact about technology though; it could change in the future and I could easily be wrong even today.)

9. Short-term unity and coordination

Some actors may have long-term values that are easier to talk about, represent formally, or reason about. Relative to humans, AIs may be especially likely to have such values. These actors could have an easier time coordinating, e.g. by pursuing some explicit compromise between their values (rather than being forced to find a governance mechanism for some resources produced by a joint venture).

This could leave us in a place where e.g. an unaligned AI controls 1% of resources, but the majority of resources are controlled by humans who want to acquire flexible resources. Then the unaligned AIs can form a coalition which achieves very high efficiencies, while the humans cannot form 99 other coalitions to compete.

This could theoretically be a problem without AI, e.g. a large group of humans with shared explicit values might be able to coordinate better and so leave normal humans at a disadvantage, though I think this is relatively unlikely as a major force in the world.

The seriousness of this problem is bounded by both the efficiency gains for a large coalition, and the quality of governance mechanisms for different actors who want to acquire flexible resources. I think we have OK solutions for coordination between people who want flexible influence, such that I don’t think this will be a big problem:

  • The humans can participate in lotteries to concentrate influence. Or you can gather resources to be used for a lottery in the future, while still allowing time for people to become wiser and then make bargains about what to do with the universe before they know who wins.
  • You can divide up the resources produced by a coalition equitably (and then negotiate about what to do with them).
  • You can modify other mechanisms by allowing votes that could e.g. overrule certain uses of resources. You could have more complex governance mechanisms, delegate different kinds of authority to different systems, rely on trusted parties, etc.
  • Many of these procedures work much better amongst groups of humans who expect to have relatively similar preferences or have a reasonable level of trust for other participants to do something basically cooperative and friendly (rather than e.g. demanding concessions so that they don’t do something terrible with their share of the universe or if they win the eventual lottery).
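One reason a lottery is compatible with strategy-stealing is that it concentrates influence without changing anyone's expected share. The sketch below assumes a simple proportional lottery (win probability equal to the fraction of the pot you contributed, winner takes the whole pot); the party names and the function `lottery_expectations` are hypothetical, and real mechanisms would need to handle trust and enforcement, which this ignores.

```python
from fractions import Fraction


def lottery_expectations(stakes):
    """Return (win probabilities, expected payoffs) for a proportional lottery.

    Assumption: each party wins the entire pot with probability equal to
    its stake divided by the pot. Stakes are Fractions so the arithmetic
    is exact.
    """
    pot = sum(stakes.values())
    win_probability = {name: stake / pot for name, stake in stakes.items()}
    # Expected payoff = probability of winning * size of the pot.
    expected = {name: p * pot for name, p in win_probability.items()}
    return win_probability, expected


stakes = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
probabilities, expected = lottery_expectations(stakes)
print(expected == stakes)  # True
```

Each party's expected payoff equals what it put in, so entering the lottery costs nothing in expectation while letting the winner act as a single unified actor.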

(Credit to Wei Dai for describing and emphasizing this failure mode.)

10. Weird stuff with simulations

I think civilizations like ours mostly have an impact via the common-sense channel where we ultimately colonize space. But there may be many civilizations like ours in simulations of various kinds, and influencing the results of those simulations could also be an important part of what we do. In that case, I don’t have any particular reason to think strategy-stealing breaks down, but I think stuff could be very weird and I have only a weak sense of how this influences optimal strategies.

Overall I don’t think much about this since it doesn’t seem likely to be a large part of our influence and it doesn’t break strategy-stealing in an obvious way. But I think it’s worth having in mind.

11. Other preferences

People care about lots of stuff other than their influence over the long-term future. If 1% of the world is unaligned AI and 99% of the world is humans, but the AI spends all of its resources on influencing the future while the humans only spend one tenth, it wouldn’t be too surprising if the AI ended up with roughly 10% of the influence rather than 1%. This can matter in lots of ways other than literal spending and saving: someone who only cared about the future might make different tradeoffs, might be willing to defend themselves at the cost of short-term value (see sections 4 and 5 above), might pursue more ruthless strategies for expansion, and so on.
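The arithmetic behind that example can be checked directly. This sketch assumes a deliberately crude model in which long-term influence is proportional to the resources each party actually spends on the future; the function name `long_term_shares` and the dictionary keys are illustrative only.

```python
from fractions import Fraction


def long_term_shares(resources, fraction_on_future):
    """Share of long-term influence per party.

    Assumption: influence is proportional to resources spent on the
    long-term future, i.e. resources * fraction_on_future.
    """
    invested = {name: resources[name] * fraction_on_future[name] for name in resources}
    total = sum(invested.values())
    return {name: amount / total for name, amount in invested.items()}


resources = {"ai": Fraction(1, 100), "humans": Fraction(99, 100)}
fraction_on_future = {"ai": Fraction(1), "humans": Fraction(1, 10)}

shares = long_term_shares(resources, fraction_on_future)
print(shares["ai"], float(shares["ai"]))
```

Under these assumptions the AI's share comes out to 10/109, a little over 9%, which is the order-of-magnitude jump from 1% that the example describes.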

I think the simplest approximation is to restrict attention to the part of our preferences that is about the long-term (I discussed this a bit in Why might the future be good?). To the extent that someone cares about the long-term less than the average actor, they will represent a smaller fraction of this “long-term preferences” mixture. This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing. Even this advantage might be clawed back by a majority (e.g. by taxing savers).

There are a few places where this picture seems a little bit less crisp:

  • Rather than being able to spend resources on either the short or long-term, sometimes you might have preferences about how you acquire resources in the short-term; an agent without such scruples could potentially pull ahead. If these preferences are strong, it probably violates strategy-stealing unless the majority can agree to crush anyone unscrupulous.
  • For humans in particular, it may be hard to separate out “humans as repository of values” from “humans as an object of preferences,” and this may make it harder for us to defend ourselves (as discussed in sections 4 and 5).

I mostly think these complexities won’t be a big deal quantitatively, because I think our short-term preferences will mostly be compatible with defense and resource acquisition. But I’m not confident about that.

Conclusion

I think strategy-stealing isn’t really true; but I think it’s a good enough approximation that we can basically act as if it’s true, and then think about the risk posed by possible failures of strategy-stealing.

I think this is especially important for thinking about AI alignment, because it lets us formalize the lowered goalposts I discussed here: we just want to ensure that AI is compatible with strategy-stealing. These lowered goalposts are an important part of why I think we can solve alignment.

In practice I think that a large coalition of humans isn’t reduced to strategy-stealing: a majority can simply stop a minority from doing something bad, rather than copying it. The possible failures in this post could potentially be addressed by either a technical solution or some kind of coordination.


This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future.

The word "assumption" in "strategy-stealing assumption" keeps making me think that you're assuming this as a proposition and deriving consequences from it, but the actual assumption you're making is more like "it's a good idea to pick strategy-stealing as an instrumental goal to work towards, i.e., to work on things that would make the 'strategy-stealing assumption' true." This depends on at least 2 things:

  1. If "strategy-stealing assumption" is true, we can get most of what we "really" want by doing strategy-stealing. (Example of how this can be false: (Logical) Time is of the essence)
  2. It's not too hard to make "strategy-stealing assumption" true.

(If either 1 or 2 is false, then it would make more sense to work in another direction, like trying to get a big enough advantage to take over the world and prevent any unaligned AIs from arising, or trying to coordinate world governments to do that.)

Is this understanding correct? Also, because there is no name for "it's a good idea to try to make the 'strategy-stealing assumption' true", I think I and others have occasionally been using "strategy-stealing assumption" to refer to that as well, which I'm not sure if you'd endorse or not. Since there are other issues with the name (like "stealing" making some people think "literally stealing"), I wonder if you'd be open to reconsidering the terminology.

ETA: Re-reading the sentence I quoted makes me realize that you named it "assumption" because it's an assumption needed for Jessica's argument, so it does make sense in that context. In the long run though, it might make more sense to call it something like a "goal" or "framework" since again in the larger scheme of things you're not so much assuming it and trying to figure out what to do given that it's true, as trying to make it true or using it as a framework for finding problems to work on.

I wrote this post imagining "strategy-stealing assumption" as something you would assume for the purpose of an argument, for example I might want to justify an AI alignment scheme by arguing "Under a strategy-stealing assumption, this AI would result in an OK outcome." The post was motivated by trying to write up another argument where I wanted to use this assumption, spending a bit of time trying to think through what the assumption was, and deciding it was likely to be of independent interest. (Although that hasn't yet appeared in print.)

I'd be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn't great. (And then I would probably be able to use that name in the description of this assumption as well.)

I wrote this post imagining “strategy-stealing assumption” as something you would assume for the purpose of an argument, for example I might want to justify an AI alignment scheme by arguing “Under a strategy-stealing assumption, this AI would result in an OK outcome.”

When you say "strategy-stealing assumption" in this sentence, do you mean the relatively narrow assumption that you gave in this post, specifically about "flexible influence":

This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future.

or a stronger assumption that also includes that the universe and our values are such that "capture a similar amount of flexible influence over the future" would lead to an OK outcome? I'm guessing the latter? I feel like people, including me sometimes and you in this instance, are equivocating back and forth between these two meanings when using "strategy-stealing assumption". Maybe we should have two different terms for these two concepts too?

Categorising the ways that the strategy-stealing assumption can fail:

  • It is intrinsically easier to gather flexible influence in pursuit of some goals, because
    • 1. It's easier to build AIs to pursue goals that are easy to check.
    • 3. It's easier to build institutions to pursue goals that are easy to check.
    • 9. It's easier to coordinate around simpler goals.
    • plus 4 and 5 insofar as some values require continuously surviving humans to know what to eventually spend resources on, and some don't.
    • plus 6 insofar as humans are otherwise an important part of the strategic environment, such that it's beneficial to have values that are easy-to-argue.
  • Jessica Taylor's argument requires that the relevant games are zero sum. Since this isn't true in the real world:
    • 7. A threat of destroying value (e.g. by threatening extinction) could be used as a bargaining tool, with unpredictable outcomes.
    • ~8. Some groups actively want other groups to have fewer resources, in which case they can try to reduce the total amount of resources more or less actively.
    • ~8. Smaller groups have less incentive to contribute to public goods (such as not increasing the probability of extinction), but benefit equally from larger groups' contributions, which may lead them to getting a disproportionate fraction of resources by defecting in public-goods games.
  • Humans don't just care about acquiring flexible long-term influence, because
    • 4. They also want to stay alive.
    • 5 and 6. They want to stay in touch with the rest of the world without going insane.
    • 11. and also they just have a lot of other preferences.
    • (maybe Wei Dai's point about logical time also goes here)

I think the simplest approximation is to restrict attention to the part of our preferences that is about the long-term (I discussed this a bit in Why might the future be good?). To the extent that someone cares about the long-term less than the average actor, they will represent a smaller fraction of this “long-term preferences” mixture. This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing.

This seems too glib, if "long-term preferences" are in some sense the "right" preferences, e.g., if under reflective equilibrium we would wish that we currently put a lot more weight on long-term preferences. Even if we only give unaligned AIs a one-time advantage (which I'm not sure about), that could still cause us to lose much of the potential value of the universe.

This can be thought of as another instance of the general issue I tried to point out before: the human user is likely to make all kinds of mistakes, compared to an ideal strategist making plans to accomplish their "true" goals. If the "aligned" AI respects those mistakes (due to "corrigibility"), it will do much worse than an actually value-aligned AI that correctly understands the human's goals and applies superhuman optimization (including long-term planning) in their service. And this is especially true if there are unaligned AIs around to take advantage of human errors.

To sum up, I think there's a fundamental tension between corrigibility (in the sense of respecting the human user's short-term preferences) and long-term success/competitiveness, which underlies many of the specific failure scenarios described in the OP, and worse, makes it unclear how "strategy-stealing" can work at all. (Because one type of human mistake is an inability to understand and hence "steal" complex or subtle strategies from an unaligned AI.) Absent some detailed explanation of how "strategy-stealing" can overcome this fundamental tension, like a description of what exactly a human-AI system is doing when it does "strategy-stealing" and how that satisfies both corrigibility and long-term competitiveness, it seems unjustified to make "successful strategy-stealing" a default assumption.

This seems too glib, if "long-term preferences" are in some sense the "right" preferences, e.g., if under reflective equilibrium we would wish that we currently put a lot more weight on long-term preferences. Even if we only give unaligned AIs a one-time advantage (which I'm not sure about), that could still cause us to lose much of the potential value of the universe.

To be clear, I am worried about people not understanding or caring about the long-term future, and AI giving them new opportunities to mess it up.

I'm particularly concerned about things like people giving their resources to some unaligned AI that seemed like a good idea at the time, rather than simply opting out of competition so that unaligned AIs might represent a larger share of future-influencers. This is another failure of strategy-stealing that probably belongs in the post---even if we understand alignment, there may be plenty of people not trying to solve alignment and instead doing something else, and the values generated by that "something else" will get a natural boost.

To sum up, I think there's a fundamental tension between corrigibility (in the sense of respecting the human user's short-term preferences) and long-term success/competitiveness, which underlies many of the specific failure scenarios described in the OP, and worse, makes it unclear how "strategy-stealing" can work at all.

By short-term preference I don't mean "Start a car company, I hear those are profitable," I mean more like "Make me money, and then make sure that I remain in control of that company and its profits," or even better "acquire flexible influence that I can use to get what I want."

(This is probably not the response you were looking for. I'm still mostly intending to give up on communication here over the short term, because it seems too hard. If you are confused by particular things I've said feel free to quote them so that I can either clarify, register a disagreement, or write them off as sloppy or mistaken comments.)

By short-term preference I don’t mean “Start a car company, I hear those are profitable,”

But in your earlier writings it sure seems that's the kind of thing that you meant, or even narrower preferences than this. Your corrigibility post said:

An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible.

From Act-based agents:

All three of these corrigible AIs deal with much narrower preferences than "acquire flexible influence that I can use to get what I want". The narrow value learner post for example says:

The AI learns the narrower subgoals and instrumental values I am pursuing. It learns that I am trying to schedule an appointment for Tuesday and that I want to avoid inconveniencing anyone, or that I am trying to fix a particular bug without introducing new problems, etc. It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.

I may be misunderstanding something, and you're probably not doing this intentionally, but it looks a lot like having a vague notion of "short-term preferences" is allowing you to equivocate between really narrow preferences when you're trying to argue for safety, and much broader preferences when you're trying to argue for competitiveness. Wouldn't it be a good idea (as I've repeatedly suggested) to make a priority of nailing down the concept of "short-term preferences" given how central it is to your approach?

All three of these corrigible AIs deal with much narrower preferences than "acquire flexible influence that I can use to get what I want". The narrow value learner post for example says:

Imitation learning, approval-direction, and narrow value learning are not intended to exceed the overseer's capabilities. These are three candidates for the distillation step in iterated distillation and amplification.

The AI we actually deploy, which I'm discussing in the OP, is produced by imitating (or learning the values of, or maximizing the approval of) an even smarter AI---whose valuations of resources reflect everything that unaligned AIs know about which resources will be helpful.

Corrigibility is about short-term preferences-on-reflection. I see how this is confusing. Note that the article doesn't make sense at all when interpreted in the other way. For example, the user can't even tell whether they are in control of the situation, so what does it mean to talk about their preference to be in control of the situation if these aren't supposed to be preferences-on-reflection? (Similarly for "preference to be well-informed" and so on.) The desiderata discussed in the original corrigibility post seem basically the same as the user not being able to tell what resources will help them achieve their long-term goals, but still wanting the AI to accumulate those resources.

I also think the act-based agents post is correct if "preferences" means preferences-on-reflection. It's just that the three approaches listed at the top are limited to the capabilities of the overseer. I think that distinguishing between preferences-as-elicited and preferences-on-reflection is the most important thing to disambiguate here. I usually use "preference" to mean preference-on-idealized-reflection (or whatever "actual preference" should mean, acknowledging that we don't have a real ground truth definition), which I think is the more typical usage. I'd be fine with suggestions for disambiguation.

If there's somewhere else I've equivocated in the way you suggest, then I'm happy to correct it. It seems like a thing I might have done in a way that introduces an error. I'd be surprised if it hides an important problem (I think the big problems in my proposal are lurking in other places, not here), and I think that in the corrigibility post I have these concepts straight.

One thing you might have in mind is the following kind of comment:

If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.

That is, you might be concerned: "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy." I'm saying that you shouldn't expect this to happen, if the AI is well-calibrated and has enough of an understanding of humans to understand e.g. this discussion we are currently having---if it decides not to be corrigible, we should expect it to be right on average.

Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:

  1. time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
  2. act-based ("short") vs goal-based ("long"): using the human's (or more generally, the human-plus-AI-assistants'; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
  3. amount of reflection the human has undergone: "short" would be the current human (I think this is what you call "preferences-as-elicited"), and this would get "longer" as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the "longest" in this sense (I think this is what you call "preference-on-idealized-reflection"). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
  4. how far the search happens: "short" would be a limited search (that lacks insight/doesn't see interesting consequences) and "long" would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn't strictly about preferences, but rather about how one would achieve those preferences.
  5. de dicto ("short") vs de re ("long"): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I'm not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
  6. understandable ("short") vs evaluable ("long"): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a "not even evaluable" option here that is even "longer". (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)

My interpretation is that when you say "short-term preferences-on-reflection", you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could "fill in the list" with which of short or long you choose for each point.

Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.

(BTW Paul, if you're reading this, Issa and I and a few others have been chatting about this on MIRIxDiscord. I'm sure you're more than welcome to join if you're interested, but I figured you probably don't have time for it. PM me if you do want an invite.)

Issa, I think my current understanding of what Paul means is roughly the same as yours, and I also share your confusion about “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy”.

To summarize my own understanding (quoting myself from the Discord), what Paul means by "satisfying short-term preferences-on-reflection" seems to cash out as "do the action for which the AI can produce an explanation such that a hypothetical human would evaluate it as good (possibly using other AI assistants), with the evaluation procedure itself being the result of a hypothetical deliberation which is controlled by the preferences-for-deliberation that the AI learned/inferred from a real human."

(I still have other confusions around this. For example is the "hypothetical human" here (the human being predicted in Issa's 3) a hypothetical end user evaluating the action based on what they themselves want, or is it a hypothetical overseer evaluating the action based on what the overseer thinks the end user wants? Or is the "hypothetical human" just a metaphor for some abstract, distributed, or not recognizably-human deliberative/evaluative process at this point?)

Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.

I think maybe it would make sense to further break (6) down into 2 sub-dimensions: (6a) understandable vs evaluable and (6b) how much AI assistance. "Understandable" means the human achieves an understanding of the (outer/main) AI's rationale for action within their own brain, with or without (other) AI assistance (which can for example answer questions for the human or give video lectures, etc.). And "evaluable" means the human runs or participates in a procedure that returns a score for how good the action is, but doesn't necessarily achieve a holistic understanding of the rationale in their own brain. (If the external procedure involves other real or hypothetical humans, then it gets fuzzy but basically I want to rule out Chinese Room scenarios as "understandable".) Based on https://ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b I'm guessing Paul has "evaluable" and "with AI assistance" in mind here. (In other words I agree with what you mean by "long in sense (6)".)

By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).

By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don't use understandable and evaluable in this way.

I think (4) and (5) are independent axes. (4) just sounds like "is your AI good at optimizing," not a statement about what it's optimizing. In the discussion with Eliezer I'm arguing against it being linked to any of these other axes. (5) is a distinction about two senses in which an AI can be "optimizing my short-term preferences-on-reflection"

When discussing perfect estimations of preferences-on-reflection, I don't think the short vs. long distinction is that important. "Short" is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.

Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.

I introduced the term "preferences-on-reflection" in the previous comment to make a particular distinction. It's probably better to say something like "actual preferences" (though this is also likely to be misinterpreted). The important property is that I'd prefer to have an AI that satisfies my actual preferences than to have any other kind of AI. We could also say "better by my lights" or something else.

There's a hypothesis that "what I'd say after some particular idealized process of reflection" is a reasonable way to capture "actual preferences," but I think that's up for debate---e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.

The claim I usually make is that "what I'd say after some particular idealized process of reflection" describes the best mechanism we can hope to find for capturing "actual preferences," because whatever else we might do to capture "actual preferences" can just be absorbed into that process of reflection.

"Actual preferences" is a pretty important concept here, I don't think we could get around the need for it, I'm not sure if there is disagreement about this concept or just about the term being used for it.

I'm really confused why "short" would include sense (1) rather than only sense (2). If "corrigibility is about short-term preferences on reflection" then this seems to be a claim that corrigible AI should understand us as preferring to eat candy and junk food, because on reflection we do like how it tastes, we just choose not to eat it because of longer-term concerns -- so a corrigible system ignores the longer-term concerns and interprets us as wanting candy and junk food.

Perhaps you intend sense (1) where "short" means ~100 years, rather than ~10 minutes, so that the system doesn't interpret us as wanting candy and junk food. But this similarly creates problems when we think longer than 100 years; the system wouldn't take those thoughts seriously.

It seems much more sensible to me for "short" in the context of this discussion to mean (2) only. But perhaps I misunderstood something.

One of us just misunderstood (1), I don't think there is any difference.

I mean preferences about what happens over the near future, but the way I rank "what happens in the near future" will likely be based on its consequences (further in the future, in other possible worlds, etc.). So I took (1) to be basically equivalent to (2).

"Terminal preferences over the near future" is not a thing I often think about and I didn't realize it was a candidate interpretation (normally when I write about short-term preferences I'm writing about things like control, knowledge, and resource acquisition).

I don’t use understandable and evaluable in this way.

The reason I brought up this distinction was that in Ambitious vs. narrow value learning you wrote:

It does not make any effort to pursue wildly different short-term goals than I would in order to better realize my long-term values, though it may help me correct some errors that I would be able to recognize as such.

which made me think that when you say "short-term" or "narrow" (I'm assuming you use these interchangeably?) values you are talking about an AI that doesn't do anything the end user can't understand the rationale of. But then I read Concrete approval-directed agents where you wrote:

Efficacy: By getting help from additional approval-directed agents, the human operator can evaluate proposals as if she were as smart as those agents. In particular, the human can evaluate the given rationale for a proposed action and determine whether the action really does what the human wants.

and this made me think that you're also including AIs that do things that the user can merely evaluate the rationale of (i.e., not be able to have an internal understanding of, even hypothetically). Since this "evaluable" interpretation also seems more compatible with strategy-stealing (because an AI that only performs actions that a human can understand can't "steal" a superhuman strategy), I'm currently guessing this is what you actually have in mind, at least when you're thinking about how to make a corrigible AI competitive.

Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that might allow you to apply this kind of strategy-stealing argument.

In concrete approval-directed agents I'm talking about a different design; it's not related to narrow value learning.

I don't use narrow and short-term interchangeably. I've only ever used "narrow" in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.

Ah, that clears up a lot of things for me. (I saw your earlier comment but was quite confused by it due to not realizing your narrow / short-term distinction.) One reason I thought you used "short-term" and "narrow" interchangeably is due to Act-based agents where you seemed to be doing that:

These proposals all focus on the short-term instrumental preferences of their users. [...] What is “narrow” anyway? There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is.

And in that post it also seemed like "narrow value learners" were meant to be the whole AI since it talked a lot about "users" of such AI.

(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)

Corrigibility is about short-term preferences-on-reflection.

Now that I (hopefully) better understand what you mean by "short-term preferences-on-reflection" my next big confusion (that hopefully can be cleared up relatively easily) is that this version of "corrigibility" seems very different from the original MIRI/Armstrong "corrigibility". (You cited that paper as a narrower version of your corrigibility in your Corrigibility post, but it actually seems completely different to me at this point.) Here's the MIRI definition (from the abstract):

We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.

As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or "true" preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it's not corrigible_MIRI.

Do you agree with this, and if so can you explain whether your concept of corrigibility evolved over time (e.g., are there older posts where "corrigibility" referred to a concept closer to corrigibility_MIRI), or was it always about "short-term preferences-on-reflection"?

Here's a longer definition of "corrigible" from the body of MIRI's paper (which also seems to support my point):

We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following: (1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. (3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. (4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies).

As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or "true" preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it's not corrigible_MIRI.

Note that "corrigible" is not synonymous with "satisfying my short-term preferences-on-reflection" (that's why I said: "our short-term preferences, including (amongst others) our preference for the agent to be corrigible.")

I'm just saying that when we talk about concepts like "remain in control" or "become better informed" or "shut down," those all need to be taken as concepts-on-reflection. We're not satisfying current-Paul's judgment of "did I remain in control?"; we're satisfying the on-reflection notion of "did I remain in control?"

Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents "can be corrigible"). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it's not what we'd prefer-on-reflection.

That said, even without any special measures, saying "corrigibility is relatively easy to learn" is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).

By "corrigible" I think we mean "corrigible by X" with the X implicit. It could be "corrigible by some particular physical human."

Note that “corrigible” is not synonymous with “satisfying my short-term preferences-on-reflection” (that’s why I said: “our short-term preferences, including (amongst others) our preference for the agent to be corrigible.”)

Ah, ok. I think in this case my confusion was caused by not having a short term for "satisfying X's short-term preferences-on-reflection" so I started thinking that "corrigible" meant this. (Unless there is a term for this that I missed? Is "act-based" synonymous with this? I guess not, because "act-based" seems broader and isn't necessarily about "preferences-on-reflection"?)

That said, even without any special measures, saying “corrigibility is relatively easy to learn” is still an important argument about the behavior of our agents, since it hopefully means that either [...]

Now that I understand "corrigible" isn't synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn't seem enough to imply these things, because we also need "reflection or preferences-for-reflection are relatively easy to learn" (otherwise the AI might correctly learn that the user currently wants corrigibility, but learns the wrong way to do reflection and incorrectly concludes that the user-on-reflection doesn't want corrigibility) and also "it's relatively easy to point the AI to the intended person whose reflection it should infer/extrapolate" (e.g., it's not pointing to a user who exists in some alien simulation, or the AI models the user's mind-state incorrectly and therefore begins the reflection process from a wrong starting point). These other things don't seem obviously true and I'm not sure if they've been defended/justified or even explicitly stated.

I think this might be another reason for my confusion, because if "corrigible" was synonymous with “satisfying my short-term preferences-on-reflection” then “corrigibility is relatively easy to learn” would seem to imply these things.

Now that I understand "corrigible" isn't synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn't seem enough to imply these things

I agree that you still need the AI to be trying to do the right thing (even though we don't e.g. have any clear definition of "the right thing"), and that seems like the main way that you are going to fail.

Thanks, stating (part of) your success story this way makes it easier for me to understand and to come up with additional "ways it could fail".

Cryptic strategies

The unaligned AI comes up with some kind of long term strategy that the aligned AI can't observe or can't understand, for example because the aligned AI is trying to satisfy humans' short-term preferences and humans can't observe or understand the unaligned AI's long term strategy.

Different resources for different goals

The unaligned AI uses up useful resources for human goals to get resources that are useful for itself. Aligned AI copies this and it's too late when humans figure out what their goals actually are. (Actually this doesn't apply because you said "This is intended as an interim solution, i.e. you would expect to transition to using a “correct” prior before accessing most of the universe’s resources (say within 1000 years). The point of this approach is to avoid losing influence during the interim period." I'll leave this here anyway to save other people time in case they think of it.)

Trying to kill everyone as a terminal goal

Under "reckless" you say "Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident." but then you don't list this as an independent concern. Some humans want to kill everyone (e.g. to eliminate suffering) and so they could build AIs that have this goal.

Time-inconsistent values and other human irrationalities

This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing.

This may be false if humans don't have time-consistent values. See this and this for examples of such values. (Will have to think about how big of a deal this is, but thought I'd just flag it for now.)

Weird priors

From this comment: Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.

Additional example of 11

This seems like an important example of 11 to state explicitly: The optimal strategy for unaligned AI to gain resources is to use lots of suffering subroutines or commit a lot of "mindcrime". Or, the unaligned AI deliberately does this just so that you can't copy its strategy.

Cryptic strategies
The unaligned AI comes up with some kind of long term strategy that the aligned AI can't observe or can't understand, for example because the aligned AI is trying to satisfy humans' short-term preferences and humans can't observe or understand the unaligned AI's long term strategy.

I'm not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI. It just uses whatever procedure the unaligned AI originally used to find that strategy.

Trying to kill everyone as a terminal goal
Under "reckless" you say "Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident." but then you don't list this as an independent concern. Some humans want to kill everyone (e.g. to eliminate suffering) and so they could build AIs that have this goal.

I agree that people who want a barren universe have an advantage, this is similar to recklessness and fragility but maybe worth separating.

Weird priors
From this comment: Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.

I'm not sure I understand this, but it seems like my earlier response ("I'm not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI") is relevant.

Or, the unaligned AI deliberately does this just so that you can't copy its strategy.

It's not clear to me whether this is possible.

I’m not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI. It just uses whatever procedure the unaligned AI originally used to find that strategy.

How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn't the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI's strategy-finding procedure?

To make a guess, are you thinking that the user tells the AI "Find a strategy that's instrumentally useful for a variety of long-term goals, and follow that until further notice?" If so, it's not literally the same procedure that the unaligned AI uses but you're hoping it's close enough?

As a matter of terminology, if you're not thinking of literally observing and copying strategy, why not call it "strategy matching" instead of "strategy stealing" (which has a strong connotation of literal copying)?

Strategy stealing doesn't usually involve actual stealing, just using the hypothetical strategy the second player could have used.
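A minimal sketch of the game-theory sense of the term, assuming tic-tac-toe as the example game (it is not mentioned anywhere in this thread). In a symmetric game where an extra move never hurts, the strategy-stealing argument says the first player cannot be forced to lose: if the second player had a winning strategy, the first player could adopt it. The hypothetical `value` function below does not perform that argument; it only checks the conclusion by exhaustive search.

```python
from functools import lru_cache

# All eight winning lines on a 3x3 board with cells indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that mark has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board, player):
    """Value for the player to move under optimal play: +1 win, 0 draw, -1 loss."""
    if winner(board) is not None:
        # The previous mover just completed a line, so the player to move has lost.
        return -1
    if "." not in board:
        return 0  # Board is full with no line: a draw.
    other = "O" if player == "X" else "X"
    # Try every empty cell; the opponent's value is negated to get ours.
    return max(-value(board[:i] + player + board[i + 1:], other)
               for i, cell in enumerate(board) if cell == ".")


# Value of the empty board for the first player ('X').
first_player_value = value("." * 9, "X")
assert first_player_value >= 0  # the first player cannot be forced to lose
```

No strategy is observed or copied anywhere in this check, which is the sense in which the term involves no actual stealing.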

How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn't the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI's strategy-finding procedure?

This is what alignment is supposed to give you: a procedure that works just as well as the unaligned AI strategy (e.g. by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those). This post is mostly about whether you should expect that to work. You could also use a different set of facts that is equally useful, because you are similarly matching the meta-level strategy for discovering useful facts about how to uncover information.

Strategy stealing doesn’t usually involve actual stealing, just using the hypothetical strategy the second player could have used.

Oh, didn't realize that it's an established technical term in game theory.

by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those

What I mean is that the unaligned AI isn't trying to "acquire influence", but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn't have a long-term / terminal goal, so it can't just "uses whatever procedure the unaligned AI originally used to find that strategy", at least not literally.

What I mean is that the unaligned AI isn't trying to "acquire influence", but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn't have a long-term / terminal goal, so it can't just "uses whatever procedure the unaligned AI originally used to find that strategy", at least not literally.

Yeah, that's supposed to be the content of the strategy-stealing assumption---that good plans for having a long-term impact can be translated into plans for acquiring flexible influence. I'm interested in looking at ways that can fail. (Alignment is the most salient to me.)

I'm still not sure I understand. Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence? If the latter, what kind of thing do you imagine it actually doing? For example is it trying to "find a strategy that’s instrumentally useful for a variety of long-term goals" as I guessed earlier? (It's hard for me to help "look for ways that can fail" when this picture isn't very clear.)

Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence?

The latter

It is trying to find a strategy that's instrumentally useful for a variety of long-term goals

It's presumably trying to find a strategy that's good for the user, but in the worst case where it understands nothing about the user it still shouldn't do any worse than "find a strategy that's instrumentally useful for a variety of long-term goals."

It’s presumably trying to find a strategy that’s good for the user

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user, but here "good for the user" seemingly has to be interpreted as "good in the long run".

In Universality and Security Amplification you said:

For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.

then in a later comment:

I think "what behavior is best according to those values" is never going to be robustly corrigible, even if you use a very good model of the user's preferences and optimize very mildly. It's just not a good question to be asking.

Do you see why I'm confused here? Is there a way to interpret "trying to find a strategy that’s good for the user" such that the AI is still corrigible?

For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.

Estimating values then optimizing those seems (much) worse than optimizing "what the user wants." One natural strategy for getting what the user wants can be something like "get into a good position to influence the world and then ask the user later."

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user

I don't have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).

This is very confusing because elsewhere you say that the kind of AI you're trying to design is just satisfying short-term preferences / instrumental values of the user, but here "good for the user" seemingly has to be interpreted as "good in the long run".

By "trying to find a strategy that's good for the user" I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.

I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).

I don't understand this statement, in part because I have little idea what "corrigibility to some other definition of value" means, and in part because I don't know why you bring up this distinction at all, or what a "strong view" here might be about.

By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.

What if the user fails to realize that a certain kind of resource is valuable? (By "resources" we're talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)

I don't understand why, if the aligned AI is depending on the user to do long-term planning (i.e., figure out what resources are valuable to pursue today for reaching future goals), that will be competitive with unaligned AIs doing superhuman long-term planning. Is this just a (seemingly very obvious) failure mode for "strategy-stealing" that you forgot to list, or am I still misunderstanding something?

ETA: See also this earlier comment where I asked this question in a slightly different way.

What if the user fails to realize that a certain kind of resource is valuable? (By "resources" we're talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)

As long as the user and AI appreciate the arguments we are making right now, then we shouldn't expect it to do worse than stealing the unaligned AI's strategy. There is all the usual ambiguity about "what the user wants," but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user's view) by doing what others are doing.

(I think I won't have time to engage much on this in the near future. It seems plausible that I am skipping enough steps, or using language in an unfamiliar enough way, that this won't make sense to readers, in which case so it goes; it's also possible that I'm missing something.)

As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.

There could easily be an abstract argument that other agents are gathering more useful resources, but still no way (or no corrigible way) to "do better by doing what others are doing". For example suppose I'm playing chess with a superhuman AI. I know the other agent is gathering more useful resources (e.g., taking up better board positions) but there's nothing I can do about it except to turn over all of my decisions to my own AI that optimizes directly for winning the game (rather than for any instrumental or short-term preferences I might have for how to win the game).

I think I won’t have time to engage much on this in the near future

Ok, I tried to summarize my current thoughts on this topic as clearly as I can here, so you'll have something concise and coherent to respond to when you get back to this.

Over the last year, I've thought a lot about human/AI power dynamics and influence-seeking behavior. I personally haven't used the strategy-stealing assumption (SSA) in reasoning about alignment, but it seems like a useful concept.

Overall, the post seems good. The analysis is well-reasoned and reasonably well-written, although it's sprinkled with opaque remarks (I marked up a Google doc with more detail). 

If this post is voted in, it might be nice if Paul gave more room to big-picture, broad-strokes "how does SSA tend to fail?" discussion, discussing potential commonalities between specific counterexamples, before enumerating the counterexamples in detail. Right now, "eleven ways the SSA could fail" feels like a grab-bag of considerations.

I think the strategy-stealing assumption is a great framework for analyzing what needs to be done to make AI go well, and the ways in which the assumption fails shed real light on the problems that we need to solve.

Promoted to curated: I think the strategy-stealing assumption is a pretty interesting conceptual building block for AI Alignment, and I've used it a bunch of times in the last two months. I also really like the structure of this post: I found it pretty easy to understand, and it covers a lot of ground and considerations.

(Logical) Time is of the essence

Achieving high expected value may require making highly consequential decisions quickly, where "quickly" is relative to the amount of computation we use (or something like that), not clock time. If this is true, then we can't afford to use up "logical time" or computation in a race with unaligned AI to capture resources while putting off these decisions. See the following posts for some of the background ideas/intuitions:

  1. Beyond Astronomical Waste
  2. The “Commitment Races” problem
  3. In Logical Time, All Games are Iterated Games

My impression of commitment races and logical time is that the amount of computation we use in general doesn't matter; but that things we learn that are relevant to the acausal bargaining problems do matter. Concretely, using computation during a competitive period to e.g. figure out better hardware cooling systems should be innocuous, because it matters very little for bargaining with other civilisations. However, thinking about agents in other worlds, and how to best bargain with them, would be a big step forward in logical time. This would mean that it's fine to put off acausal decisions however long we want to, assuming that we don't learn anything that's relevant to them in the meantime.

More speculatively, this raises the issue of whether some things in the competitive period would be relevant for acausal bargaining. For example, causal bargaining with AIs on Earth could teach us something about acausal bargaining. If so, the competitive period would advance us in logical time. If we thought this was bad (which is definitely not obvious), maybe we could prevent it by making the competitive AI refuse to bargain with other worlds, and precommitting to eventually replacing it with a naive AI that hasn't updated on anything that the competitive AI has learned. The naive AI would be as early in logical time as we were, when we coded it, so it would be as if the competitive period never happened.

I expect this to become more true over time — I expect groups of agents with diverse preferences to eventually approach efficient outcomes, since otherwise there are changes that every agent would prefer (though this is not obvious, especially in light of bargaining failures).

This seems the same as saying that coordination is easy (at least in the long run), but coordination could be hard, especially cosmic coordination. Also, Robin Hanson says governance is hard, which would be a response to your response to 9. (I think I personally have uncertainty that covers both ends of this spectrum.)

(You can find a list of all 2019 Review poll questions here.)

I found this an interesting analysis, and would like to see it reviewed.

Planned summary:

We often talk about aligning AIs in a way that is _competitive_ with unaligned AIs. However, you might think that we need them to be _better_: after all, unaligned AIs only have to pursue one particular goal, whereas aligned AIs have to deal with the fact that we don't yet know what we want. We might hope that regardless of what goal the unaligned AI has, any strategy it uses to achieve that goal can be turned into a strategy for acquiring _flexible_ influence (i.e. influence useful for many goals). In that case, **as long as we control a majority of resources**, we can use any strategies that the unaligned AIs can use. For example, if we control 99% of the resources and unaligned AI controls 1%, then at the very least we can split up into 99 "coalitions" that each control 1% of resources and use the same strategy as the unaligned AI to acquire flexible influence, and this should lead to us obtaining 99% of the resources in expectation. In practice, we could do even better, e.g. by coordinating to shut down any unaligned AI systems.

The premise that we can use the same strategy as the unaligned AI, despite the fact that we need _flexible_ influence, is called the **strategy-stealing assumption**. Solving the alignment problem is critical to strategy-stealing -- otherwise, unaligned AI would have an advantage at thinking that we could not steal and the strategy-stealing assumption would break down. This post discusses **ten other ways that the strategy-stealing assumption could fail**. For example, the unaligned AI could pursue a strategy that involves threatening to kill humans, and we might not be able to use a similar strategy in response because the unaligned AI might not be as fragile as we are.
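To make the symmetry step in the 99-coalition argument concrete, here is a minimal sketch in Python. It is a toy model and not anything from the post: the function name `simulate_shares`, the exponential "success" weights, and the trial count are all assumptions chosen for illustration. It only shows that if 100 equally-sized groups play the same strategy in a unit-sum game, each one's expected share is about 1%, so the 99 human groups jointly expect about 99%.

```python
import random


def simulate_shares(n_groups=100, n_trials=5000, seed=0):
    """Average final share per group in a symmetric unit-sum game.

    Hypothetical model: every group uses the same strategy, so in each
    trial we draw i.i.d. random "success" weights and normalize them so
    that the shares sum to 1.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_groups
    for _ in range(n_trials):
        weights = [rng.expovariate(1.0) for _ in range(n_groups)]
        total = sum(weights)
        for i, w in enumerate(weights):
            totals[i] += w / total
    return [t / n_trials for t in totals]


shares = simulate_shares()
ai_share = shares[0]           # group 0 stands in for the unaligned AI
human_share = sum(shares[1:])  # the other 99 groups are human coalitions
print(f"AI: {ai_share:.3f}, humans: {human_share:.3f}")
```

The sketch bakes in exactly the thing the strategy-stealing assumption asserts, namely that every group can play an equally effective strategy. Each of the failure modes discussed in the post corresponds to breaking that symmetry, for instance by giving group 0 a different weight distribution.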

Planned opinion:

It does seem to me that if we're in a situation where we have solved the alignment problem, we control 99% of resources, and we aren't fighting amongst ourselves, we will likely continue to control at least 99% of the resources in the future. I'm a little confused about how we get to this situation, though: the scenarios I usually worry about are the ones in which we fail to solve the alignment problem but still deploy unaligned AIs, and in these scenarios I'd expect unaligned AIs to get the majority of the resources. I suppose in a multipolar setting with continuous takeoff, if we have mostly solved the alignment problem but still accidentally create unaligned AIs (or some malicious actors create them deliberately), then this setting where we control 99% of the resources could arise.