Previously: Seeking Power is Provably Instrumentally Convergent in MDPs

Rohin Shah and Vanessa Kosoy pointed out a subtle problem with my interpretation of the power-seeking theorem from the last post. To understand the distinction, we first need to shore up some intuitions.

Correcting pre-formal intuitions about instrumental convergence

Imagine you're able to either attend college and then choose one of two careers, or attend a trade school for a third career. If you wanted, you could also attend college after trade school.

If every way of rewarding careers is equally likely, then 2/3 of the time, you just go to college straight away. This is true even though going to trade school increases your power (your ability to achieve goals in general) compared to just going to college. That is, Power(trade school) > Power(college), but going to college is instrumentally convergent.
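To make this concrete, here is a minimal Monte Carlo sketch of the career example under some simplifying assumptions: rewards for the three careers are iid uniform on [0, 1], only the eventual career matters, and POWER is proxied by the expected best achievable reward in the farsighted limit. (The paper's definitions include intermediate-state rewards and a normalization, so this is only an illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Sample iid uniform rewards for the three careers:
# careers 0 and 1 are reached via college, career 2 via trade school.
r = rng.uniform(size=(n, 3))

# In this simplified model, for any discount rate gamma in (0, 1), going
# straight to college is strictly optimal exactly when the best college
# career beats the trade career (the detour through trade school only
# costs discounted time).
p_college_first = np.mean(np.maximum(r[:, 0], r[:, 1]) > r[:, 2])
print(f"P(optimal policy goes straight to college) ~ {p_college_first:.3f}")  # ~2/3

# Farsighted POWER proxy: expected best reward still reachable from each state.
# Trade school keeps all three careers reachable; college only keeps two.
power_college = np.mean(np.max(r[:, :2], axis=1))  # ~2/3
power_trade = np.mean(np.max(r, axis=1))           # ~3/4
print(f"POWER(college) ~ {power_college:.3f}, POWER(trade school) ~ {power_trade:.3f}")
```

So in this toy setting, trade school has higher power, yet most reward functions send the optimal agent straight to college.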

We define instrumental convergence as optimal agents being more likely to take one action than another at some point in the future.

I think this captures what we really meant when we talked about instrumental convergence. Recently, however, an alignment researcher objected that instrumental convergence shouldn't depend on what state the world is in. I think the intuition was that Basic AI Drives-esque power-seeking means the agent should always seek out the powerful states, no matter their starting point.

I think this is usually true, but it isn't literally true. Sometimes states with high power are just too out-of-the-way! If you buy my formalization of power, then in what way is going to trade school "instrumentally convergent"? It isn't optimal for most goals!

This suggests that naive intuitions about instrumental convergence are subtly wrong. To figure out where optimal policies tend to go, you must condition on where they come from. In other words, the best course of action depends on where you start out.

Correcting the last post's implications

Unfortunately, the above example kills any hope of a theorem like "the agent seeks out the states in the future with the most resources / power". The nice thing about such a theorem would have been that we'd only need to know that a state has more resources in order to conclude that the agent tends to pursue it.

Everything should add up to normalcy, though: we should still be able to make statements like "starting from a given state, the agent tends to seek out states which give it more control over the future". This isn't quite what my current results show. For involved technical reasons,[1] one of the relationships I showed between power and instrumental convergence is a bit tautological, with both sides of the equation implicitly depending on the same variable. Accordingly, I'll be softening the language in the previous post for the moment.

I think there's a pretty good chance the theorem we're looking for exists in full generality ("starting from a given state, the agent tends to seek out states which give it more control over the future"). However, maybe it doesn't, and the relationships I gave are the best we can get in general. I do think the Tic-Tac-Toe reasoning from the last post is a strong conceptual argument for power-seeking being instrumentally convergent, but a few technicalities stop it from being directly formalized.

Failure to prove power-seeking in full generality would mostly affect the presentation to the broader AI community; we'd just be a little less aggressive in the claims. I think a reasonable reader can understand how and why power-seeking tends to happen, and why it doesn't go away just because some of the cycles aren't self-loops, or something silly like that.

In summary, the power-seeking theorem wasn't as suggestive as I thought. I'm still excited about this line of inquiry. We can still say things like "most agents stay alive in Pac-Man and postpone ending a Tic-Tac-Toe game", but only in the limit of farsightedness (γ→1), by taking advantage of the distribution of terminal states. The theory does still (IMO) meaningfully deconfuse us about power and instrumental convergence. None of the proofs are known to me to be incorrect, and similar implications can be drawn (albeit slightly more cautiously or differently worded).

After the holidays, I'll see if we can't get a more appropriate theorem.

Thanks to Rohin Shah and Vanessa Kosoy for pointing out the interpretive mistake. Rohin suggested the college example as a non-abstract story for that environmental structure.


  1. For those of you who have read the paper, I'm talking about the last theorem. The problem: saying the POWER contribution of some possibilities relates to their optimality measure doesn't tell us anything without already knowing that optimality measure. ↩︎

Comments

Here's my explanation of what's going on with that last theorem:

Consider some state s in a deterministic finite MDP with a perfectly optimal agent, where the rewards for each state are sampled uniformly and iid from the interval [0, 1]. We can "divide up" POWER(s) into contributions from all of the possibilities that are optimal for at least one reward, with the contributions weighted by the optimality measure for each possibility. (This is why POWER contribution depends on the optimality measure.) The paper proves that if one set of paths contributes 2K times as much power as another set, the first set must be at least K times more likely.

I was initially confused why this notion of power doesn't directly correspond to instrumental convergence, but instead only puts a bound on instrumental convergence. This is because expected reward can vary across possibilities. In particular, if you have two non-dominated possibilities f1 and f2, and you choose a random reward r1 (respectively, r2) that f1 (respectively, f2) is optimal for, then expected reward of f1 under r1 can be different from expected reward of f2 under r2. This changes the relative balance of power between them but doesn’t change the relative balance of the probability of each possibility.
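Here is a rough numerical illustration of this decomposition, assuming three terminal possibilities with iid uniform rewards and using expected best achievable reward as a stand-in for POWER in the farsighted limit (the paper's actual definitions are more general):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
r = rng.uniform(size=(n, 3))         # iid uniform rewards for three terminal possibilities

best = np.argmax(r, axis=1)          # which possibility is optimal for each sampled reward
power = np.mean(np.max(r, axis=1))   # farsighted POWER proxy for the starting state

# Per-possibility POWER contribution: P(i is optimal) * E[reward of i | i is optimal].
contributions = [np.mean(np.where(best == i, r[:, i], 0.0)) for i in range(3)]
optimality = [np.mean(best == i) for i in range(3)]  # optimality measure of each possibility

print(power, sum(contributions))     # the contributions sum back up to POWER
print(optimality)
```

In this symmetric toy case the conditional expected rewards (contribution divided by optimality measure) happen to coincide; in general they can differ across possibilities, which is why the POWER contributions only bound the optimality measure rather than determining it.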

One idea for how this formalism might be improved, maybe. Consider a random directed graph, sampled from some "reasonable" (in some sense that needs to be defined) distribution. We can then define "powerful" vertices as vertices from which there are paths to most other vertices. Claim: with high probability over graphs, powerful vertices are connected "robustly" to most vertices. By "robustly" I mean that small changes in the graph don't disrupt the connection. This is because, if your vertex is connected to everything, then disconnecting some edges should still leave plenty of room for rerouting through other vertices. We can then interpret this as saying that gaining power is more robust to inaccuracies of the model or changes in the circumstances than pursuing more "direct" paths to objectives.
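A quick empirical check of this claim (not a proof) might look like the following, using an Erdős–Rényi G(n, p) digraph via networkx; the graph model, sizes, and thresholds are arbitrary choices for illustration.

```python
import random
import networkx as nx

random.seed(0)
n, p = 60, 0.08
G = nx.gnp_random_graph(n, p, seed=0, directed=True)

def reach_fraction(g, v):
    """Fraction of the other vertices reachable from v."""
    return len(nx.descendants(g, v)) / (g.number_of_nodes() - 1)

# "Powerful" vertices: those with paths to most other vertices.
powerful = [v for v in G if reach_fraction(G, v) > 0.8]

# Perturb the graph by deleting a random ~10% of edges, then check that the
# powerful vertices still reach most of the graph.
H = G.copy()
H.remove_edges_from(random.sample(list(G.edges()), k=G.number_of_edges() // 10))
print(sorted(round(reach_fraction(H, v), 2) for v in powerful))
```

With parameters in this regime, the reach fractions typically stay high after the perturbation, matching the intuition that well-connected vertices have many alternative routes.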

If an agent is randomly placed in a given distribution of randomly connected points, I see why there are diminishing returns on seeking more power, but that return is never 0, is it?

This gives me pause.

Can you expand? Also, what's the distribution of reward functions in this scenario – uniform?

You say:

"most agents stay alive in Pac-Man and postpone ending a Tic-Tac-Toe game", but only in the limit of farsightedness (γ→1)

I think there are two separable concepts at work in these examples: the success of an agent, and the agent's choices as determined by the reward functions and farsightedness.

If we compare two agents, one in the limit of farsightedness (γ→1) and the other with half that (γ = 1/2), then I expect the first agent to be more successful across a uniform distribution of reward functions and to skip over doing things like trade school, but the second agent, in light of its more limited farsightedness, would be more successful if it were seeking power. As Vanessa Kosoy said above,

... gaining power is more robust to inaccuracies of the model or changes in the circumstances than pursuing more "direct" paths to objectives.

What I meant originally is that if an agent doesn't know if γ→1, then is it not true that an agent "seeks out the states in the future with the most resources or power"? Now, certainly the agent can get stuck at a local maximum because of shortsightedness, and an agent can forgo certain options as a result of its farsightedness.

So I am interpreting the theorem like so:

An agent seeks out states in the future that have more power at the limit of its farsightedness, but not states that, while they have more power, are below its farsightedness "rating."

Note: assuming a uniform distribution over reward functions.

By farsightedness, I mean the value of the discount factor γ, with which the agent geometrically discounts rewards at future time steps. That is, the reward received k steps in the future is discounted as γ^k. My theorems assume that, given the reward function R, the agent computes the optimal policy (set) for R at discount rate γ.
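For reference, this is the standard discounted objective under the usual MDP conventions (not notation specific to the paper): the optimal policy for R at discount rate γ maximizes

$$V^{\pi}_{R}(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R(s_k) \;\middle|\; s_0 = s,\ \pi\right].$$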

There's a different (intuitive) notion of farsightedness, in which the agent can only compute policies within a bounded neighborhood of the current state. I think this is the notion you're referring to. In this case, gaining power is a good heuristic, as you say.