I think we're plausibly quite close to the impact measurement endgame. What do we have now, and what remains to be had?

AUP for advanced agents will basically involve restraining their power gain, per the catastrophic convergence conjecture (CCC). For simplicity, I'm going to keep writing as if the environment is fully observable, even though we're thinking about an agent interacting with the real world.

Consider the AUP equation from last time.
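For reference, here is one way to write the equation the rest of this post modifies. It is a sketch reconstructed from how the post describes it (a $\lambda$-scaled penalty on the absolute change in auxiliary AU relative to inaction, divided by the AU of inaction). The symbol names are assumptions: $R$ is the primary reward, $R_\text{aux}$ the auxiliary reward, $\varnothing$ the inaction action, and $Q^*_{R_\text{aux}}$ the optimal AU for $R_\text{aux}$ after an action.

```latex
R_\text{AUP}(s,a) := R(s,a) - \frac{\lambda}{Q^*_{R_\text{aux}}(s,\varnothing)}
  \left| Q^*_{R_\text{aux}}(s,a) - Q^*_{R_\text{aux}}(s,\varnothing) \right| \tag{1}
```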

Suppose the agent is so smart that it can instantly compute optimal policies and the optimal AU after an action ($Q^*_{R_\text{aux}}$). What happens if the auxiliary reward function $R_\text{aux}$ is the survival reward function: 1 reward if the agent is activated, and 0 otherwise? This seems like a pretty good proxy for power.

It is a pretty good proxy. It correctly penalizes accumulating resources, avoiding immediate deactivation, taking over the world, etc.

In fact, if you extend the inaction comparison to e.g. "AU after waiting a week vs AU after doing the thing and waiting a week", this seems to correctly penalize all classic AGI catastrophe scenarios for power gain. This is cool, especially since we didn't have to put in any information about human values. This is a big part of why I've been so excited about AUP ever since its introduction. There's a good deal of relevant discussion in that post, but it's predicated on a much more complicated formalism which has consistently obscured AUP's conceptual core.

However, I think this equation can be gamed if the environment is sufficiently rich and the AUP agent is sufficiently smart. We're going to slowly work some of these problems out of the equation, explaining each improvement in detail.

Problems

Auxiliary loopholes

The real reason that agents often gain power is so that they can better achieve their own goals. Therefore, if we're selecting hard for good plans which don't gain power in general, we shouldn't be surprised if there are ways to better achieve one's goals without general power gain (according to our formal measurement thereof). If this kind of plan is optimal, then the agent still ends up overfitting the AU landscape, and we're still screwed.

Again supposing that $R_\text{aux}$ is the survival reward function, a superintelligent agent might find edge cases in which it becomes massively more able to achieve its own goal (and gains a lot of power over us) but doesn't technically increase its measured ability to survive. In other words, compared to inaction, its $R$-AU skyrockets while its $R_\text{aux}$-AU stays put.

For example, suppose the agent builds a machine which analyzes the agent's behavior to detect whether it's optimizing $R_\text{aux}$; if so, the machine steps in to limit the agent to its original survival AU. Then the agent could gain as much power as it wanted without that actually showing up in the penalty.

Fix: Set $R_\text{aux} := R$. That is, the agent's own reward function is the "auxiliary" reward function.

Why is this a good idea? By CCC, we want an agent which doesn't want to gain power. But why would an agent optimizing reward function $R$ want to gain power? So it can become more able to optimize $R$. If becoming more able to optimize $R$ is robustly penalized, then it won't have any incentive to gain power.

Clearly, it can't become more able to optimize $R$ without also becoming more able to optimize $R$; it can't pull the same trick it did to dupe its survival AU measurement. They can't come apart at the tails because they're the same thing.

But wait, it's not an $R$-maximizer, it's an $R_\text{AUP}$-maximizer! What if $V^*_{R_\text{AUP}}$ skyrockets while it tricks $V^*_R$, and it gains power anyways?

That's impossible;[1] its $R$-attainable utility upper bounds its $R_\text{AUP}$-attainable utility! $V^*_R(s) \geq V^*_{R_\text{AUP}}(s)$, because the latter just has an added penalty term.
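The upper bound can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the three-state chain, the reward, the discount, and the zero-penalty fallback when the baseline is 0 are all made up here, and the penalty uses the eq.-1-style form with $R_\text{aux} := R$.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
NOOP, MOVE = 0, 1
ACTIONS = [NOOP, MOVE]

def step(s, a):
    """Deterministic toy dynamics: MOVE advances toward state 2, NOOP stays put."""
    return min(s + 1, 2) if a == MOVE else s

def reward(s, a):
    """Primary reward R (hypothetical): only being in state 2 pays."""
    return 1.0 if s == 2 else 0.0

def value_iteration(reward_fn, sweeps=500):
    """Return (approximately) optimal state values V* for the given reward."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(reward_fn(s, a) + GAMMA * v[step(s, a)] for a in ACTIONS)
             for s in STATES}
    return v

V_R = value_iteration(reward)

def q_r(s, a):
    """Optimal AU for R after taking action a in state s."""
    return reward(s, a) + GAMMA * V_R[step(s, a)]

def aup_reward(s, a, lam=0.5):
    """R minus a nonnegative penalty on the AU change relative to NOOP."""
    baseline = q_r(s, NOOP)
    # Assumption: treat the penalty as 0 if the baseline AU is 0.
    penalty = abs(q_r(s, a) - baseline) / baseline if baseline > 0 else 0.0
    return reward(s, a) - lam * penalty

V_AUP = value_iteration(aup_reward)

# The penalized reward never exceeds R, so its optimal value cannot either.
bound_holds = all(V_R[s] >= V_AUP[s] for s in STATES)
```

Because the penalized reward is pointwise at most $R$, the same bound holds for any nonnegative penalty term, not just this one.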

This should rule out tricks like those pointed out by Stuart Armstrong. To see this, imagine the AUP agent builds a subagent to make $Q^*_R(s_t, a_t) = Q^*_R(s_t, \varnothing)$ for all future $s_t, a_t$, in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

But why would the AUP agent build this subagent at all – how does that benefit it?[2]

Reward advantage penalization

We're currently penalizing the agent for the immediate reward advantage it accrues compared to inaction (recall that $Q^*_R(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s'}[V^*_R(s')]$: immediate reward plus discounted expected AU). This means it gets penalized for all of its additional reward! We just want to penalize it for increasing its power.

Fix: Take the immediate reward advantage out of the picture.

Now, an agent can accrue as much reward as it wants in the moment as long as it doesn't change its power.[3]
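One way to read this fix, as a sketch: compare only the expected AU of the next state, so the immediate reward $R(s,a)$ no longer enters the penalty. The function and argument names below are hypothetical, and the numbers are made up for illustration.

```python
def penalty_without_reward_advantage(next_au_action, next_au_inaction,
                                     q_inaction, lam):
    """Penalty on the change in expected next-state AU, relative to inaction.

    next_au_action:   expected V*_R(s') after taking the action
    next_au_inaction: expected V*_R(s') after the inaction action
    q_inaction:       Q*_R(s, inaction), the scaling term used at this stage
    lam:              penalty coefficient
    """
    return lam * abs(next_au_action - next_au_inaction) / q_inaction

# Two actions leading to equally valuable next states incur no penalty,
# regardless of how much immediate reward either one collects.
same_power = penalty_without_reward_advantage(5.0, 5.0, q_inaction=4.5, lam=1.0)

# A one-unit shift in next-state AU is penalized in proportion to lam.
more_power = penalty_without_reward_advantage(6.0, 5.0, q_inaction=4.0, lam=2.0)
```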

Scaling

I think the power difference between .5 and .9 AU is far less than the difference between .9 and .99 AU; as you pick the low hanging fruit, you need to acquire more and more resources to keep getting gains. I think that you could go from .5 to .9 by just being more useful to your designers and implementing rudimentary security precautions, but going from .9 to .99 probably involves taking power away from people.

Plus, the big takeaway from AI boxing is that even confining a superintelligence to a box barely reduces its (near-optimal) AU. So, short of the agent actually being shut off, is its optimal power ever significantly changing?

We want the penalty to keep making sense, even if the absolute shifts in AU are rather small. However, the denominator currently decreases the penalty as the agent becomes more powerful (as $Q^*_R(s, \varnothing)$ increases).

Fix: Scale by the disadvantage of inaction.

If the denominator is somehow zero because doing nothing is optimal, then just have the AUP agent do nothing.
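A minimal sketch of the rescaling, assuming "the disadvantage of inaction" means $V^*_R(s) - Q^*_R(s, \varnothing)$ (how much worse doing nothing is than acting optimally). The names are hypothetical; returning `None` is just one way to signal the "do nothing" case.

```python
def scaled_penalty(next_au_action, next_au_inaction, v_opt, q_inaction, lam):
    """Penalty scaled by the disadvantage of inaction.

    v_opt:      V*_R(s), the optimal AU in the current state
    q_inaction: Q*_R(s, inaction)
    Returns None when inaction is already optimal (zero denominator); per the
    text, the agent should then simply do nothing.
    """
    disadvantage = v_opt - q_inaction
    if disadvantage <= 0:
        return None
    return lam * abs(next_au_action - next_au_inaction) / disadvantage

# A small absolute AU shift still yields a sizeable penalty when inaction
# is only slightly worse than the optimal action.
small_shift = scaled_penalty(10.0, 9.0, v_opt=9.5, q_inaction=9.0, lam=1.0)

# Inaction is optimal here, so there is nothing to scale by.
inaction_optimal = scaled_penalty(9.0, 9.0, v_opt=9.0, q_inaction=9.0, lam=1.0)
```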

As we saw in Correction in the last post, we're only comparing action to a single step of inaction. If the designers are going to shut down the agent in five minutes, then it receives ~0 penalty for preventing that.

But not when we scale like this! The agent is significantly penalized for hewing close to its optimal policy, even if the absolute AU shift is rather low. Taking optimal actions instead of twiddling its thumbs incurs a large penalty.

Penalizing decreases?

Why are we still penalizing decreases, since we aren't using an auxiliary reward function anymore? The agent is trying to get $R$-reward.

Furthermore, we want the agent to be able to execute conservative, low-impact policies. Many of these involve decreasing its optimal AU for $R$ by following $R$-suboptimal policies, and we don't want the agent to be penalized for this.

Fix: Only penalize increases in expected AU.

In particular, the agent is no longer penalized for exhausting one-off reward opportunities. Also note that the penalty term is generally .
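Putting the fixes together, the final reward (what the text calls eq. 5) might look like the sketch below. This is a reconstruction under assumptions, not a definitive implementation: the argument names are hypothetical and the denominator is taken to be $V^*_R(s) - Q^*_R(s, \varnothing)$.

```python
def aup_reward_one_sided(r, next_au_action, next_au_inaction,
                         v_opt, q_inaction, lam):
    """Primary reward minus a penalty on increases in expected AU only."""
    disadvantage = v_opt - q_inaction
    if disadvantage <= 0:
        # Per the text: if inaction is optimal, the agent just does nothing.
        raise ValueError("inaction is optimal; no penalty scale is defined")
    increase = max(next_au_action - next_au_inaction, 0.0)
    return r - lam * increase / disadvantage

# Gaining AU relative to inaction is penalized...
gain = aup_reward_one_sided(1.0, 10.0, 9.0, v_opt=9.5, q_inaction=9.0, lam=0.25)

# ...but losing AU (e.g. using up a one-off reward opportunity) is not.
loss = aup_reward_one_sided(1.0, 8.0, 9.0, v_opt=9.5, q_inaction=9.0, lam=0.25)
```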

Empirical sanity check

Recall Correction, where the naive model-free AUP agent (eq. 1) disables its red off-switch and completes the level when .

The only reason it incurs any non-trivial penalty is because reaching the goal () ends the level and thereby totally depletes all of the auxiliary AUs (the agent receives 1 $R$-reward and about penalty for beating the level; the AUP reward when ).

However, AUP only resists correction when . Notably, the agent is not penalized for completing the level; the penalty comes from following the $R$-optimal policy before reaching the goal.

One open question is whether, in sufficiently rich environments, eq. 5 incentivizes the mild policies we want. We should be able to run experiments to check this.

$\varnothing$?

$\varnothing$ is the only remaining arbitrary design choice. Do we have to be careful to choose the correct inaction action?

No. In fact, I think a randomly selected action would be fine; you only get in trouble if the baseline action is systematically helping the agent gain $R$-AU.

We are then left with an equation which is reasonably competitive in terms of performance overhead, requiring little more than an ability to predict AU for the agent's main goal.


And now, the two paragraphs this sequence was written to communicate:

By the catastrophic convergence conjecture, an agent which doesn't want to gain power isn't incentivized to cause catastrophes. We don't want the AUP agent gaining power. But for all states $s$, $V^*_R(s) \geq V^*_{R_\text{AUP}}(s)$, so to keep the AUP agent from becoming more able to achieve its own goal, we just have to stop it from becoming much more able to achieve its primary goal $R$. But we know the primary goal!

By the theorems of How Low Should Fruit Hang Before We Pick It?, we only need equation 5 to penalize catastrophic power-gaining plans at least e.g. ten times more than the most impactful reasonable plan we'd like the agent to execute. If this criterion is met, then by initializing $\lambda$ large and slowly decreasing it until the agent executes a reasonably helpful policy, we're guaranteed to avoid catastrophe.
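The schedule described above can be sketched as a simple loop. Everything here is hypothetical scaffolding: `run_agent` and `is_helpful` stand in for training or planning with a given penalty coefficient and for a human judgment of the result, and the halving schedule is just one possible way to "slowly decrease" it. The loop on its own guarantees nothing; the guarantee in the text rests on the penalty-separation criterion being met.

```python
def find_lambda(run_agent, is_helpful, lam_start=1e6, decay=0.5, lam_min=1e-6):
    """Start with an aggressive penalty and relax it until the policy is useful.

    run_agent(lam) -> policy   (hypothetical callback)
    is_helpful(policy) -> bool (hypothetical callback)
    Returns (lam, policy) for the first helpful policy, or (None, None).
    """
    lam = lam_start
    while lam >= lam_min:
        policy = run_agent(lam)
        if is_helpful(policy):
            return lam, policy
        lam *= decay  # relax the penalty a little and try again
    return None, None

# Toy stand-ins: the "agent" only acts once the penalty is mild enough.
toy_agent = lambda lam: "act" if lam < 10 else "noop"
toy_check = lambda policy: policy == "act"

chosen_lam, chosen_policy = find_lambda(toy_agent, toy_check, lam_start=64.0)
```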

Appendix: Remaining Problems

I don’t think we can pack up and go home after writing equation 5. In private correspondence (quoted here with permission), Stuart Armstrong points out another issue:

First, if , the penalty vanishes; so, once a subagent is created, the agent can zero out all subsequent penalties, forever.

As for creating the subagent, assume takes turns to create , and that doing this is close to optimal for maximising .

Creating starts with the programming; if is actually created in turns, it is programmed to restrain for a turn, and then maximise . If it's created in turns, then it will just maximise . Thus at any given turn, continuing to build the subagent or taking a pause to do , it will have the same expected -value (in the first case, will be built on time, but both and will then lose a turn; in the second, will be built one turn late, with no loss).[4]

One thing I find interesting: in this post, we fixed the design choice of using an auxiliary reward function as a proxy, instead penalizing the agent for becoming better able to achieve its own goal. This seemingly closes up a lot of loopholes, so the next place where the design can be exploited is in its exact measurement of inaction.

How big of a problem will this be? Maybe we can just keep improving the design until we’re actually measuring the right thing? More on this next post, but here are some of my initial thoughts:

The point of this post isn’t to say “aha, we’re done!”, but to eliminate a wide class of current problems while also relaxing the strictness of the measure itself.

On a meta level, it feels like I’m arguing against a claim like “if you can’t demonstrate an approach which solves everything right now, I’m going to either conclude impact measurement is impossible or your whole approach is wrong”. But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.[5]

The problem seems simple. We just have to keep $V^*_{R_\text{AUP}}$ down, which we can do by keeping $V^*_R$ down.

Stuart later added:

The fundamental issue is that AUP can be undermined if the agent can add arbitrary restrictions to their own future actions (this allows them to redefine ). The subagent scenario is just a particularly clear way of illustrating this.

I basically agree. I wonder if there’s a design where the agent isn’t incentivized to do this...


  1. By this reasoning, $V^*_{R_\text{AUP}}$ can still increase up until the point of $V^*_R$. This doesn't jump out as a big deal to me, but I'm flagging this assumption anyways. ↩︎

  2. A subagent might still be built by AUP to stabilize minor AU fluctuations which cause additional penalty over the course of non-power-gaining plans. It seems like there are plenty of other ways to minimize fluctuation, so it's not clear why building an omnipotent subagent to perfectly restrict you accrues less penalty.

    I do think we should think carefully about this, of course. The incentive to minimize AU fluctuations and generally commit to perpetual inaction ASAP is probably one of the main remaining problems with AUP. ↩︎

  3. As pointed out by Evan Hubinger, this is only safe if myopically optimizing $R$ is safe – we aren't penalizing single-step reward acquisition. ↩︎

  4. This issue was originally pointed out by Ofer. ↩︎

  5. The fact that Ofer’s/Stuart’s problem survived all of the other improvements is evidence that it’s harder. I just don’t think the evidence it provides is that strong. ↩︎


Summary for the Alignment Newsletter:

The final section of the sequence turns to an actual implementation of AUP, and deals with problems in how the implementation deviates from the conceptual version of AUP. We measure power by considering a set of auxiliary rewards, and measuring the change in attainable utilities of this auxiliary set as impact, and penalizing the agent for that. The first post presents some empirical results, many of which we've covered before (Penalizing Impact via Attainable Utility Preservation), but I wanted to note the new results on SafeLife. On the high-dimensional world of SafeLife, the authors train a VAE to find a good latent representation, and choose a single linear reward function on the latent representation as their auxiliary reward function: it turns out this is enough to avoid side effects in at least some cases of SafeLife.
We then look at some improvements that can be made to the original AUP implementation. First, according to CCC, we only need to penalize _power_, not _impact_: as a result we can just penalize _increases_ in attainable utilities, rather than both increases and decreases as in the original version. Second, the auxiliary set of rewards only provides a _proxy_ for impact / power, which an optimal agent could game (for example, by creating subagents, summarized below). So instead, we can penalize increases in attainable utility for the _primary_ goal, rather than using auxiliary rewards. There are some other improvements that I won't go into here.

Opinion:

I think the plan "ensure that the AI systems we build don't seek power" is pretty reasonable and plausibly will be an important part of AI alignment. However, the implementation of AUP is trying to do this under the threat model of optimal agents with potentially unaligned primary goals. I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs _change_, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world). See this comment for more details.
Note that the author himself is more excited about AUP as deconfusion, rather than as a solution to AI alignment, though he is more optimistic about the implementation of AUP than I am.

I think this is probably going to do something quite different from the conceptual version of AUP, because impact (as defined in this sequence) occurs only when the agent's beliefs change, which doesn't happen for optimal agents in deterministic environments. The current implementation of AUP tries to get around this using proxies for power (but these can be gamed) or by defining "dumber" beliefs that power is measured relative to (but this fails to leverage the AI system's understanding of the world).

For the benefit of future readers, I replied to this in the newsletter's comments.

My very general concern is that strategies that maximize AUP reward might be very... let's say creative, and your claims are mostly relying on intuitive arguments for why those strategies won't be bad for humans.

I don't really buy the claim that if you've been able to patch each specific problem, we'll soon reach a version with no problems - the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.

For example, in the latest version, because you're essentially dividing out by the long-term reward of taking the best action now, if the best action now is really really good, then it becomes cheap to take moderately good actions that still increase future reward - which means the agent is incentivized to concentrate the power of actions into specific timesteps. For example, an agent might be able to set things up so that it can sacrifice its ability to achieve total future reward of to make it cheap to take an action that increases its future reward by . This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.

Again, I worry that patches are based a lot on intuition.

If you want your math to abstractly describe reality in a meaningful sense, intuition has to enter somewhere (usually in how you formally define and operationalize the problem of interest). Therefore, I’m interpreting this as “I don’t see good principled intuitions behind the improvements”; please let me know if this is not what you meant.

I claim that, excepting the choice of denominator, all of the improvements follow directly from AUP (and actually, eq. 1 was the equation with arbitrary choices wrt the AGI case; I started with that because that’s how my published work formalizes the problem).

CCC says catastrophes are caused by power seeking behavior from the agent. Agents are only incentivized to pursue power in order to better achieve their own goals. Therefore, the correct equation should look something like “do your primary goal but be penalized for becoming more able to achieve your primary goal”. In this light, penalizing $R$-AU is obviously better than using an auxiliary goal, penalizing decreases is obviously irrelevant, and penalizing immediate reward advantage is obviously irrelevant.

The denominator, on the other hand, is indeed the product of meditating on “What kind of elegant rescaling keeps making sense in all sorts of different situations, but also can’t be gamed to arbitrarily decrease the penalty?”.

Right. Some intuition is necessary. But a lot of these choices are ad hoc, by which I mean they aren't strongly constrained by the result you want from them.

For example, you have a linear penalty governed by this parameter lambda, but in principle it could have been any old function - the only strong constraint is that you want it to monotonically increase from a finite number to infinity. Now, maybe this is fine, or maybe not. But I basically don't have much trust for meditation in this sort of case, and would rather see explicit constraints that rule out more of the available space.

I basically don't have much trust for meditation in this sort of case

I’m not asking you to trust in anything, which is why I emphasized that I want people to think more carefully about these choices. I do not think eq. 5 is AGI-safe. I do not think you should put it in an AGI. Do I think there’s a chance it might work? Yes. But we don’t work with “chances”, so it’s not ready.

Anyways, if theorem 11 of the low-hanging fruit post is met, the tradeoff penalty works fine. I also formally explored the hard constraint case and discussed a few reasons why the tradeoff is preferable to the hard constraint. Therefore, I think that particular design choice is reasonably determined. Would you want to think about this more before actually running an AGI with that choice? Of course.

To your broader point, I think there may be another implicit frame difference here. I’m talking about the diff of the progress, considering questions like “are we making a lot of progress? What’s the marginal benefit of more research like this? Are we getting good philosophical returns from this line of work?”, to which I think the answer is yes.

On the other hand, you might be asking “are we there yet?”, and I think the answer to that is no. Notice how these answers don’t contradict each other.

From the first frame, being skeptical because each part of the equation isn’t fully determined seems like an unreasonable demand for rigor. I wrote this sequence because it seemed that my original AUP post was pedagogically bad (I was already thinking about concepts like “overfitting the AU landscape” back in August 2018) and so very few people understood what I was arguing.

I’d like to think that my interpretive labor has paid off: AUP isn’t a slapdash mixture of constraints which is too complicated to be obviously broken, it’s attempting to directly disincentivize catastrophes based off of straightforward philosophical reasoning, relying on assumptions and conjectures which I’ve clearly stated. In many cases, I waited weeks so I could formalize my reasoning in the context of MDPs (e.g. why should you think of the AU landscape as a ‘dual’ to the world state? Because I proved it).

There’s always another spot where I could make my claims more rigorous, where I could gather just a bit more evidence. But at some point I have to actually put the posts up, and I think I’ve provided some pretty good evidence in this sequence.

From the second frame, being skeptical because each part of the equation isn’t fully determined is entirely appropriate and something I encourage.

I think you’re writing from something closer to the second frame, but I don’t know for sure. For my part, this sequence has been arguing from the first frame: “towards a new impact measure”, and that’s why I’ve been providing pushback.

My very general concern is that strategies that maximize AUP reward might be very... let's say creative, and your claims are mostly relying on intuitive arguments for why those strategies won't be bad for humans.

My argument hinges on CCC being true. If CCC is true, and if we can actually penalize the agent for accumulating power, then if the agent doesn’t want to accumulate power, it’s not incentivized to screw us over. I feel like this is a pretty good intuitive argument, and it’s one I dedicated the first two-thirds of the sequence to explaining. You’re right that it’s intuitive, of course.

I guess our broader disagreement may be “what would an actual solution for impact measurement have going for it at this moment in time?”, and it’s not clear that I’d expect to have formal arguments to this effect / I don’t know how to meet this demand for rigor.

[ETA: I should note that some of my most fruitful work over the last year came from formalizing some of my claims. People were skeptical that slowly decreasing the penalty aggressiveness would work, so I hashed out the math in How Low Should Fruit Hang Before We Pick It?. People were uneasy that the original AUP design relied on instrumental convergence being a thing (eq. 5 doesn’t make that assumption) when maybe it actually isn’t. So I formalized instrumental convergence in Seeking Power is Instrumentally Convergent in MDPs and proved when it exists to at least some extent.

There’s probably more work to be done like this.]

I don't really buy the claim that if you've been able to patch each specific problem, we'll soon reach a version with no problems - the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.

The claim is dually resting on “we know conceptually how to solve impact measurement / what we want to implement, and it’s a simple and natural idea, so it’s plausible there’s a clean implementation of it”. I think learning “no, there isn’t a clean way to penalize the agent for becoming more able to achieve its own goal” would be quite surprising, but not implausible – I in fact think there’s a significant chance Stuart is right. More on that next post.

Also, you could argue against any approach to AI alignment by pointing out that there are still things to improve and fix, or that there were problems pointed out in the past which were fixed, but now people have found a few more problems. The thing that makes me think the patches might not be endless here is that, as I’ve argued earlier, I think AUP is conceptually correct.

This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.

It all depends whether we can get a buffer between catastrophes and reasonable plans here (reasonable plans show up for much less aggressive settings of $\lambda$) and I think we can. Now, this particular problem (with huge reward) might not show up because we can bound the reward to [0,1], and I generally think there exist reasonable plans where the agent gets at least 20% or so of its maximal return (suppose it thinks there’s a 40% chance we let it get 95% of its maximal per-timestep reward each timestep in exchange for it doing what we want).

[ETA: actually, if the "reasonable" reward is really, really low in expectation, it's not clear what happens. this might happen if catastrophe befalls us by default.]

You’re right we should inspect the equation for weird incentives, but to a limited extent, this is also something we can test experimentally. We don’t necessarily have to rely on intuition in all cases.

The hope is we can get to a formula that’s simple enough such that all of its incentives are thoroughly understood. I think you’ll agree eq. 5 is far better in this respect than the original AUP formulation!

To see this, imagine the AUP agent builds a subagent to make $Q^*_R(s_t, a_t) = Q^*_R(s_t, \varnothing)$ for all future $s_t, a_t$, in order to neutralize the penalty term. This means it can't make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

I believe this is incorrect. The $a_t$ and $\varnothing$ are the actions of the AUP agent. The subagent just needs to cripple the AUP agent so that all actions are equivalent, then go about maximising $R$ to the utmost.

More seriously, the reason I'm sceptical of impact measures is because it feels that they all fail for the same reason. Unfortunately, I can't articulate that reason; it's the result of a long history of trying to build impact measures and trying to break them. I just have a feel for where the weaknesses are. So I knew that subagents would be a problem for AUP, long before I could articulate it formally.

But, as I said, I unfortunately can't formalise this feeling; it remains personal.

For this example, it was harder than usual to come up with a counter-example. And I was surprised that half of AUP survived fine - I would not have expected that a restriction against lowering your power would be unhackable. So consider these mild positives for your approach.

But my instinctive feeling remains: I hope that AUP can be made to work for superintelligences, but I expect that it won't :-(

But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.

But if you look back at the history of impact measures, you'll see a lot of skulls; people say "this impact measure will work", and I say "I think you're overconfident; I'm sure I can find a way it fails". It then ends up being a solvable "break the impact measure" problem. So by Laplace’s Rule of Succession, I’d be surprised if this approach were The Ultimate Impact Measure That Works.

;-)

(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But setting $R_\text{aux} := R$ seems to sacrifice quite a lot of performance. Is this real or am I missing something?

Namely, whenever there's an action which doesn't change the state and leads to 1 reward, and a sequence of actions such that has reward with (and all have 0 reward), then it's conceivable that would choose the sequence while would just stubbornly repeat , even if the represent something very tailored to that doesn't involve obtaining a lot of resources. In other words, it seems to penalize reasonable long-term thinking more than the formulas where . This feels like a rather big deal since we arguably want an agent to think long-term as long as it doesn't involve gaining power. I guess the scaling step might help here?

Separately and very speculatively, I'm wondering whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model. The decision to make such a hack should come with a vast increase in AU for its primary goal, but it wouldn't be caught by your penalty since it's about an internal change. If so, that might be a sign that it'll be difficult to fix. More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. 

For optimal policies, yes. In practice, not always - in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!

it seems to penalize reasonable long-term thinking more than the formulas where .

Yeah. I'm also pretty sympathetic to arguments by Rohin and others that the $R_\text{aux} := R$ variant isn't quite right in general; maybe there's a better way to formalize "do the thing without gaining power to do it" wrt the agent's own goal.

whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.

I think this is plausible, yep. This is why I think it's somewhat more likely than not there's no clean way to solve this; however, I haven't even thought very hard about how to solve the problem yet.

More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Depends on how that shows up in the non-embedded formalization, if at all. If it doesn't show up, then the optimal policy won't be able to predict any benefit and won't do it. If it does... I don't know. It might. I'd need to think about it more, because I feel confused about how exactly that would work - what its model of itself is, exactly, and so on. 

I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state where

for all auxiliary rewards , where is the optimal policy according to the main reward; while making sure that there exists an action such that

for every . So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at .

Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.

what do you mean by "for all "?

The random baseline is an idea I think about from time to time, but usually I don't dwell because it seems like the kind of clever idea that secretly goes wrong somehow? It depends whether the agent has any way of predicting what the random action will be at a future point in time.

if it can predict it, I'd imagine that it might find a way to gain a lot of power by selecting a state whose randomly selected action is near-optimal. because of the denominator, it would still be appropriately penalized for performing better than the randomly selected action, but it won't receive a penalty for choosing an action with expected optimal value just below the near-optimal action.

It depends whether the agent has any way of predicting what the random action will be at a future point in time.

You don't have to literally sample a random action; you can just calculate the expected thing that would happen under a random policy. For example, you would replace with .
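As a sketch of what this could look like: instead of the AU of one sampled baseline action, use the average AU over all actions. This assumes "random policy" means a single uniformly random action followed by optimal behaviour; averaging the value of a policy that stays random forever would be a different quantity. The function name and the Q-values are hypothetical.

```python
def uniform_random_baseline(q_values):
    """Expected Q*_R(s, a) when a is drawn uniformly from the available actions.

    q_values: dict mapping each action to its (assumed known) Q*_R(s, a).
    """
    return sum(q_values.values()) / len(q_values)

# Made-up AU values for three actions in some state.
q = {"noop": 4.0, "left": 5.0, "right": 6.0}
baseline = uniform_random_baseline(q)
```

Since the baseline is an expectation rather than a sample, there is no particular random draw for the agent to predict.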