I think AI agents (trained end-to-end) might intrinsically prefer power-seeking, in addition to whatever instrumental drives they gain. 

The logical structure of the argument

Premises

  1. People will configure AI systems to be autonomous and reliable in order to accomplish tasks.
  2. This configuration process will reinforce & generalize behaviors which complete tasks reliably.
  3. Many tasks involve power-seeking.
  4. The AI will complete these tasks by seeking power.
  5. The AI will be repeatedly reinforced for its historical actions which seek power.
  6. There is a decent chance the reinforced circuits (“subshards”) prioritize gaining power for the AI’s own sake, not just for the user’s benefit.

Conclusion: There is a decent chance the AI seeks power for itself, when possible.

 

Read the full post at turntrout.com/intrinsic-power-seeking

Find out when I post more content: newsletter & RSS

Note that I don't generally read or reply to comments on LessWrong. To contact me, email alex@turntrout.com.

New Comment
1 comment, sorted by Click to highlight new comments since:
[-]gwern2323

I'm not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of 'intrinsic' drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:

Relative to other goals, agentic systems are easy to steer to seek power.

Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.

Both of these seem like I would expect from a flexible, intelligent agent which is capable of handling many complicated changing domains, like a LLM: they are easy to steer to seek power (see: all the work on RLHF and the superficiality of alignment and ease of steering and low-dimensional embeddings), and they can execute useful heuristics even if those cannot be easily explained as part of a larger plan. (Arguably, that's most of what they do currently.) In the hypotheticals you give, the actions seem just like a convergent instrumental drive of the sort that an agent will rationally develop in order to handle all the possible tasks which might be thrown at it in a bewildering variety of scenarios by billions of crazy humans and also other AIs. Trying to have 'savings' or 'buying a bit of compute to be safe', even if the agent cannot say exactly what it would use those for in the current scenario, seems like convergent, and desirable, behavior. Like buying insurance or adding validation checks to some new code, usually it won't help, but sometimes the prudence will pay off. As humans say, "shit happens". Agents which won't do that and just helplessly succumb to hardware they know is flaky or give up the moment something is a little more than expensive than average or write code that explodes the instant you look at it funny because you didn't say "make sure to check for X Y & Z" - those agents are not good agents for any purpose.

If there are 'subshards' which achieve this desirable behavior because they, from their own perspective, 'intrinsically' desire power (whatever that sort of distinction makes when you've broken things down that far), and it is these subshards which implement the instrumental drive... so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles "intrinsically desire" to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can't have 'instrumental desire' homunculuses all the way down to the individual transistor or ReLU neuron.