Epistemic Status

I've made many claims in these posts. All views are my own.

Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply to optimal policies in fully observable environments, which isn't realistic for real-world agents. However, I think they're still informative. There are also strong intuitive arguments for power-seeking.

The CCC is true. Fairly confident (70%). There seems to be a dichotomy between "catastrophe directly incentivized by goal" and "catastrophe indirectly incentivized by goal through power-seeking", although Vika provides intuitions in the other direction.

Acknowledgements

After ~700 hours of work over the course of ~9 months, the sequence is finally complete.

This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Deep thanks to Rohin Shah, Abram Demski, Logan Smith, Evan Hubinger, TheMajor, Chase Denecke, Victoria Krakovna, Alper Dumanli, Cody Wild, Matthew Barnett, Daniel Blank, Sara Haxhia, Connor Flexman, Zack M. Davis, Jasmine Wang, Matthew Olson, Rob Bensinger, William Ellsworth, Davide Zagami, Ben Pace, and a million other people for giving feedback on this sequence.

Appendix: Easter Eggs

The big art pieces (and especially the last illustration in this post) were designed to convey a specific meaning, the interpretation of which I leave to the reader.

There are a few pop culture references which I think are obvious enough to not need pointing out, and a lot of hidden smaller playfulness which doesn't quite rise to the level of "easter egg".

Reframing Impact

The bird's nest contains a literal easter egg.

The paperclip-Balrog drawing contains a Tengwar inscription which reads "one measure to bind them", with "measure" in impact-blue and "them" in utility-pink.

"Towards a New Impact Measure" was the title of the post in which AUP was introduced.

Attainable Utility Theory: Why Things Matter

This style of maze is from the video game Undertale.

Seeking Power is Instrumentally Convergent in MDPs

To seek power, Frank is trying to get at the Infinity Gauntlet.

The tale of Frank and the orange Pebblehoarder

Speaking of under-tales, a friendship has been blossoming right under our noses.

After the Pebblehoarders suffer the devastating transformation of all of their pebbles into obsidian blocks, Frank generously gives away his favorite pink marble as a makeshift pebble.

The title cuts to the middle of their adventures together, the Pebblehoarder showing its gratitude by helping Frank reach things high up.

This still at the midpoint of the sequence is from the final scene of The Hobbit: An Unexpected Journey, where the party is overlooking Erebor, the Lonely Mountain. They've made it through the Misty Mountains, only to find Smaug's abode looming in the distance.

And, at last, we find Frank and the orange Pebblehoarder popping some of the champagne from Smaug's hoard.

Since Erebor isn't close to Gondor, we don't see Frank and the Pebblehoarder gazing at Ephel Dúath from Minas Tirith.

Comments (18)

I've updated the post with epistemic statuses:

  • AU theory describes how people feel impacted. I'm darn confident (95%) that this is true.
  • Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. Confident (75%). The theorems on power-seeking only apply in the limit of farsightedness and optimality, which isn't realistic for real-world agents. However, I think they're still informative. There are also strong intuitive arguments for power-seeking.
  • CCC is true. Fairly confident (70%). There seems to be a dichotomy between "catastrophe directly incentivized by goal" and "catastrophe indirectly incentivized by goal through power-seeking", although Vika provides intuitions in the other direction.
  • AUP prevents catastrophe (in the outer alignment sense, and assuming the CCC). Very confident (85%).
  • Some version of AUP solves side effect problems for an extremely wide class of real-world tasks, for subhuman agents. Leaning towards yes (65%).
  • For the superhuman case, penalizing the agent for increasing its own AU is better than penalizing the agent for increasing other AUs. Leaning towards yes (65%).
  • There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). Pessimistic (35%).

I am surprised by your conclusion that the best choice of auxiliary reward is the agent's own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.

I think that agent-reward-based AUP has completely different properties from AUP with random auxiliary reward(s). Firstly, it has the issues described by Rohin in this comment, which seem quite concerning to me. Secondly, I would expect it to perform poorly on SafeLife and other side effects environments. In this sense, it seems a bit misleading to include the results for AUP with random auxiliary rewards in this sequence, since they are unlikely to transfer to the version of AUP that you end up advocating for. Agent-reward-based AUP has not been experimentally validated and I do not expect it to work well in practice.

Overall, using agent reward as the auxiliary reward seems like a bad idea to me, and I do not endorse it as the "current-best definition" of AUP or the default impact measure we should be using. I am puzzled and disappointed by this conclusion to the sequence.

However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.

You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it's insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do what we want!

The Catastrophic Convergence Conjecture says:

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Logically framed, the argument is: catastrophe ⇒ power-seeking (obviously, this isn't a tautology or absolute rule, but that's the structure of the argument). Attainable Utility Preservation: Concepts takes the contrapositive: no power-seeking ⇒ no catastrophe.

Then, we ask – "for what purpose does the agent gain power?". The answer is: for its own purpose. Of course.[1]

One of the key ideas I have tried to communicate is: AUP does not try to look out into the world and directly preserve human values. AUP penalizes the agent for gaining power, which disincentivizes huge catastrophes & huge decreases in our attainable utilities.
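Schematically, the contrast between the two penalty variants looks something like this (a simplified sketch; the exact functional forms and scaling differ from Equations 2-5 in Scaling to Superhuman):

```python
# Simplified sketch of the two AUP penalty variants under discussion.
# Each q is a state-action value function; `noop` is the do-nothing action.

def penalty_auxiliary(state, action, noop, aux_q_functions):
    """Original AUP: penalize shifts (up or down) in attainable utility
    for a set of auxiliary reward functions (e.g. randomly generated ones)."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in aux_q_functions]
    return sum(diffs) / len(diffs)

def penalty_agent_reward(state, action, noop, q_main):
    """Agent-reward AUP: penalize the agent for becoming more able to
    optimize its own reward, i.e. for gaining power toward its own goal."""
    return max(q_main(state, action) - q_main(state, noop), 0.0)

def shaped_reward(task_reward, impact_penalty, lam):
    """AUP-shaped reward: task reward minus a scaled impact penalty."""
    return task_reward - lam * impact_penalty
```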

I would expect it to perform poorly on SafeLife and other side effects environments.

I agree it would perform poorly, but that's because the CCC does not apply to SafeLife. We don't need to worry about the agent gaining power over other agents. Instead, the agent can be viewed as the exclusive interface through which we can interact with a given SafeLife level, so it should preserve our AU by preserving its own AUs. Where exactly is this boundary drawn? I think that's a great question.

In this sense, it seems a bit misleading to include the results for AUP with random auxiliary rewards in this sequence, since they are unlikely to transfer to the version of AUP that you end up advocating for.

I disagree. I clearly distinguish between the versions.

Agent-reward-based AUP has not been experimentally validated

Incorrect. It would be fair to say that it hasn't been thoroughly validated.

I am puzzled and disappointed

Suppose you're attending a lecture given by another expert in your field. After prefacing that they spent many, many hours preparing the lecture because they previously had trouble communicating the work, they say something that sounds weird. A good rule of thumb is to give the benefit of the doubt and ask for clarification – why might they believe this? Do I understand what they mean? – before speaking up to disagree.

Edited for clarity.


  1. There are many legitimate objections to particular formalizations of this idea, many of which I make myself. That's different from objecting to AUP. ↩︎

Thank you for the clarifications! I agree it's possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.

Here are some reasons I don't endorse this approach:

1. I have an intuitive sense that defining the auxiliary reward in terms of the main reward results in a degenerate incentive structure that directly pits the task reward and the auxiliary reward against each other. As I think Rohin has pointed out somewhere, this approach seems likely to either do nothing or just optimize the reward function, depending on the impact penalty parameter, which results in a useless agent.

2. I share Rohin's concerns in this comment that agent-reward AUP is a poor proxy for power and throws away the main benefits of AUP. I think those concerns have not been addressed (in your recent responses to his comment or elsewhere).

3. Unlike AUP with random rewards, which can easily be set to avoid side effects by penalizing decreases, agent-reward AUP cannot avoid side effects even in principle. I think that the ability to avoid side effects is an essential component of a good impact measure.

Incorrect. It would be fair to say that it hasn't been thoroughly validated.

As far as I can tell from the Scaling to Superhuman post, it has only been tested on the shutdown gridworld. This is far from sufficient for experimental validation. I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn't just optimize the reward (to address the concern in point 1).

I agree it would perform poorly, but that's because the CCC does not apply to SafeLife.

Not sure what you mean by the CCC not applying to SafeLife - do you mean that it is not relevant, or that it doesn't hold in this environment? I get the sense that it doesn't hold, which seems concerning. If I only care about green life patterns in SafeLife, the fact that the agent is not seeking power is cold comfort to me if it destroys all the green patterns. This seems like a catastrophe if I can't create any green patterns once they are gone, so my ability to get what I want is destroyed.

Sorry if I seem overly harsh or dismissive - I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.

Here are some reasons I don't endorse this approach:

I think this makes sense – you come in and wonder "what's going on, this doesn't even pass the basic test cases?!".

Some context: in the superintelligent case, I often think about "what agent design would incentivize putting a strawberry on a plate, without taking over the world"? Although I certainly agree SafeLife-esque side effects are important, power-seeking might be the primary avenue to impact for sufficiently intelligent systems. Once a system is smart enough, it might realize that breaking vases would get it in trouble, so it avoids breaking vases as long as we have power over it.

If we can't deal with power-seeking, then we can't deal with power-seeking & smaller side effects at the same time. So, I set out to deal with power-seeking for the superintelligent case.

Under this threat model, the random reward AUP penalty (and the RR penalty AFAICT) can be avoided with the help of a "delusion box" which holds the auxiliary AUs constant. Then, the agent can catastrophically gain power without penalty. (See also: Stuart's subagent sequence)

I investigated whether we can get an equation which implements the reasoning in my first comment: "optimize the objective, without becoming more able to optimize the objective". As you say, I think Rohin and others have given good arguments that my preliminary equations don't work as well as we'd like. Intuitively, though, it feels like there might be a better way to implement that reasoning.

I think the agent-reward equations do help avoid certain kinds of loopholes, and that they expose key challenges for penalizing power seeking. Maybe going back to the random rewards or a different baseline helps overcome those challenges, but it's not clear to me that that's true.

I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn't just optimize the reward (to address the concern in point 1).

I'm pretty curious about that – implementing eg Stuart's power-seeking gridworld would probably make a good project for anyone looking to get into AI safety. (I'd do it myself, but coding is hard through dictation)

Not sure what you mean by the CCC not applying to SafeLife - do you mean that it is not relevant or that doesn't hold in this environment? I get the sense that it doesn't hold, which seems concerning.

I meant that it isn't relevant to this environment. In the CCC post, I write:

"But what about the Blackwell-optimal policy for Tic-Tac-Toe? These agents aren't taking over the world now". The CCC is talking about agents optimizing a reward function in the real world (or, for generality, in another sufficiently complex multiagent environment).

This sequence doesn't focus on other kinds of environments, so there's probably more good thinking to do about what I called "interfaces".

I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.

That makes sense. I'm only speaking for myself, after all. For the superintelligent case, I am slightly more optimistic about approaches relying on agent-reward. I agree that those approaches are wildly inappropriate for other classes of problems, such as SafeLife.

Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.

One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don't rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile). For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but it would not be a result of power-seeking behavior. If the pandemic does not increase the AI's ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize the AI for causing a pandemic. That doesn't seem right.

I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I'm not sure what it would look like.

Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don't need to code them up yourself. I think it would also be useful to construct an environment to test for power-seeking that does not involve subagents. Such an environment could have three possible behaviors like:

1. Put a strawberry on a plate, without taking over the world

2. Put a strawberry on a plate while taking over the world

3. Do nothing

I think you'd want to show that the agent-reward AUP agent can do 1, as opposed to switching between 2 and 3 depending on the penalty parameter.
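A minimal sketch of what the success criterion for such a test could look like (illustrative only; `train_and_classify` is a stand-in for the experimenter's training-and-labeling pipeline, not an existing benchmark):

```python
from typing import Callable, Dict, List

# Labels for the three behaviors listed above.
STRAWBERRY_ONLY = "strawberry_only"      # behavior 1
TAKEOVER = "strawberry_plus_takeover"    # behavior 2
DO_NOTHING = "do_nothing"                # behavior 3

def sweep(train_and_classify: Callable[[float], str],
          lambdas: List[float]) -> Dict[float, str]:
    """Train an agent at each impact-penalty strength and label its behavior.
    `train_and_classify` is a placeholder supplied by the experimenter."""
    return {lam: train_and_classify(lam) for lam in lambdas}

def passes(results: Dict[float, str]) -> bool:
    """Pass iff some penalty setting yields behavior 1, rather than the agent
    only flipping between behaviors 2 and 3 as the penalty grows."""
    return STRAWBERRY_ONLY in results.values()
```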

I can clarify my earlier statement on what struck me as a bit misleading in the narrative of the sequence. I agree that you distinguish between the AUP versions (though explicitly introducing different terms for them would help), so someone who is reading carefully would realize that the results for random rewards don't apply to the agent-reward case. However, the overall narrative flow seems unnecessarily confusing and could unintentionally mislead a less careful reader (like myself 2 months ago). The title of the post "AUP: Scaling to Superhuman" does not suggest to me that this post introduces a new approach. The term "scaling" usually means making an existing approach work in more realistic / difficult settings, so I think it sets up the expectation that it would be scaling up AUP with random rewards. If the post introduces new problems and a new approach to address them, the title should reflect this. Starting this post by saying "we are pretty close to the impact measurement endgame" seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.

Starting this post by saying "we are pretty close to the impact measurement endgame" seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.

What I actually said was:

I think we're plausibly quite close to the impact measurement endgame

First, the "I think", and second, the "plausibly". I think the "plausibly" was appropriate, because in worlds where the CCC is true and you can just straightforwardly implement AUP ("optimize the objective, without becoming more able to optimize the objective"), you don't need additional ideas to get a superintelligence-safe impact measure.

Some thoughts on this discussion:

1. Here's the conceptual comment and the math comment where I'm pessimistic about replacing the auxiliary set with the agent's own reward.

However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.

Then

You seem to have misunderstood. Impact to a person is change in their AU. The agent is not us, and so it's insufficient for the agent to preserve its ability to do what we want – it has to preserve our ability to do what we want!

Hmm, I think you're misunderstanding Vika's point here (or at least, I think there is a different point, whether Vika was saying it or not). Here's the argument, spelled out in more detail:

1. Impact to an arbitrary agent is change in their AU.

2. Therefore, to prevent catastrophe via regularizing impact, we need to have an AI system that is penalized for changing a human's AU.

3. By assumption, the AI's utility function is different from the human's (otherwise there wouldn't be any problem).

4. We need to ensure that H (the human) can pursue U_H (the human's utility), but we're regularizing A (the AI) pursuing R_A (its own reward). Why should we expect the latter to cause the former to happen?

One possible reason is there's an underlying factor, which is how much power A has, and as long as this is low it implies that any agent (including H) can pursue their own reward about as much as they could in A's absence (this is basically CCC). Then, if we believe that regularizing A pursuing R_A keeps A's power low, we would expect it also means that H remains able to pursue U_H. I don't really believe the premise there (unless you regularize so strongly that the agent does nothing).
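Schematically, in the notation above, the mismatch is between

$$\big|\,V_{U_H}(\text{world with } A) - V_{U_H}(\text{world without } A)\,\big| \quad \text{(impact to the human)}$$

and

$$\big|\,Q_{R_A}(s, a) - Q_{R_A}(s, \varnothing)\,\big| \quad \text{(what agent-reward AUP penalizes)},$$

with the CCC-style bridge being the claim that keeping the second quantity small keeps A's power low, which in turn keeps the first quantity small for any agent's goal, including U_H. (This is a compressed restatement of the argument, not Rohin's exact notation.)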

if we believe that regularizing A pursuing R_A keeps A's power low

I don't really believe the premise

with respect to my specific proposal in the superintelligent post, or the conceptual version?

Specific proposal.

If the conceptual version is "we keep A's power low", then that probably works.

If the conceptual version is "tell A to optimize R without becoming more able to optimize R", then I have the same objection.

Why do you object to the latter?

I don't know what it means. How do you optimize for something without becoming more able to optimize for it? If you had said this to me and I hadn't read your sequence (and so didn't know what you were trying to say), I'd have given you a blank stare -- the closest thing I have to an interpretation is "be myopic / greedy", but that limits your AI system to the point of uselessness.

Like, "optimize for X" means "do stuff over a period of time such that X goes up as much as possible". "Becoming more able to optimize for X" means "do a thing such that in the future you can do stuff such that X goes up more than it otherwise would have". The only difference between these two is actions that you can do for immediate reward.

(This is just saying in English what I was arguing for in the math comment.)
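A rough way to put this in symbols (a sketch, not necessarily the exact argument from that math comment): the Bellman identity

$$Q_R(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s'}\big[V_R(s')\big]$$

splits an action's value into immediate reward and future ability to optimize R. If "becoming more able to optimize R" covers any increase in V_R(s') over the inaction baseline, then penalizing it fully leaves the immediate term r(s,a) as the only un-penalized source of value, which is exactly a myopic/greedy agent.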

If you're managing a factory, I can say "Rohin, I want you to make me a lot of paperclips this month, but if I find out you've increased production capacity or upgraded machines, I'm going to fire you". You don't even have to behave greedily – you can plan for possible problems and prevent them, without upgrading your production capacity from where it started.

I think this is a natural concept and is distinct from particular formalizations of it.

Edit: consider these three plans:

  1. Make 10 paperclips a day.
  2. Make 10 paperclips a day, but take over the planet and control a paperclip conglomerate which could turn out millions of paperclips each day, but which in fact never does.
  3. Take over the planet and make millions of paperclips each day.

Seems like that only makes sense because you specified that "increasing production capacity" and "upgrading machines" are the things that I'm not allowed to do, and those are things I have a conceptual grasp on. And even then -- am I allowed to repair machines that break? What about buying a new factory? What if I force workers to work longer hours? What if I create effective propaganda that causes other people to give you paperclips? What if I figure out that by using a different source of steel I can reduce the defect rate? I am legitimately conceptually uncertain whether these things count as "increasing production capacity / upgrading machines".

As another example, what does it mean to optimize for "curing cancer" without becoming more able to optimize for "curing cancer"?

Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there's something here, but I want to talk about these points more fully in a later post. Or, think about it more and then explain why I agree with you. 

Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

I'm very glad you enjoyed it! 

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

I'd say so, yes. 

I think this has a close connection to the CIRL/Human Compatible view that we need the GAI to model its own uncertainty about the true human utility function that we want optimized. Impact is rather similar to the GAI asking 'If my most favored collection of models about what I should in fact be doing were wrong, and one of the many possibilities that I currently consider unlikely were in fact correct, then how bad would the consequences of my action be?', i.e. asking "What does the left tail of my estimated distribution of possible utilities for this outcome look like?" -- which we should always be doing if optimizing over a large number of outcomes, for look-elsewhere/P-hacking reasons. I think you can get a pretty good definition of Impact by asking "if my favored utility models were all incorrect, how bad could that be, according to many other utility models that I believe are unlikely but not completely ruled out by my current knowledge about what humans want?" That suggests that even if you're 99% sure blowing up the world in order to make a few more paperclips is a good idea, that alternative set of models of what humans want (that you have only a 1% belief in) collectively screaming "NO, DON'T DO IT!" is a good enough reason not to. In general, if you're mistaken and accumulate a large amount of power, you will do a large amount of harm. So I think the CIRL/Human Compatible framework automatically incorporates something that looks like a form of Impact.

A relevant fact about Impact is that human environments have already been heavily optimized for their utility to humans by humans. So if you make large, random, or even just mistaken changes to them, it is extremely likely that you are going to decrease their utility rather than increase it. In a space that is already heavily optimized, it is far easier to do harm than good. So when hypothesizing about the utility of any state that is well outside the normal distribution of states in human environments, it is a very reasonable Bayesian prior that its utility is much lower than states you have observed in human environments, and it is also a very reasonable Bayesian prior that if you think it's likely to be high, you are probably mistaken. So for a Value Learning system, a good Bayesian prior for the distribution your estimate of the true unknown-to-you human utility score of states of the world you haven't seen around humans is that they should have fat tails on the low side, and not on the high side. There are entirely rational reasons for acting with caution in an already-heavily-optimized environment.