[Written for Blog Post Day. Not super happy with it, it’s too rambly and long, but I’m glad it exists.]

Here are some questions I think this theory of agency can answer:

  • What are agents?
  • Why should we expect AI agents to be useful for various important tasks?
  • Why should we think agentic mesa-optimizers may arise for some data+reward-signals that weren’t designed explicitly to produce them?
  • Why should we think humans+AI tools will eventually be outcompeted by AI agents?
  • Why should we expect, on priors, mildly superhuman AI agents to be powerful enough to take over the world, if they wanted to?

What are agents?

Earlier, in “Agents as P2B chain reactions,” I defined agents as successful P2B feedback loops: a learner algorithm and a planner algorithm hooked up together, where the planner plans to plan better and thus outputs actions that result in getting more useful data into the learner and more resources for the planner to plan with… then the process repeats… Like how “fire” is a successful feedback loop in which heat+plasma+fuel come together to produce more heat+plasma, and envelop more fuel, if any is nearby.
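To make the shape of that loop concrete, here is a deliberately toy sketch in Python. Everything in it (the ToyLearner and ToyPlanner classes, the resources counter, the made-up payoff formulas) is my own illustrative scaffolding, not anything defined in these posts, and real learners/planners are of course nothing like this simple:

```python
# A deliberately toy sketch of a P2B (plan-to-plan-better) feedback loop.
# All names and formulas here are illustrative inventions, not real APIs.

class ToyLearner:
    """Accumulates data and turns it into a (crude) world model."""
    def __init__(self):
        self.data = []

    def update(self, observations):
        self.data.extend(observations)

    def model_quality(self):
        # More data -> better model, with diminishing returns.
        return len(self.data) ** 0.5


class ToyPlanner:
    """Uses the learner's model plus on-hand resources to pick actions
    that (among other things) improve future planning."""
    def plan(self, model_quality, resources):
        # Stand-in for "plan to plan better": the chosen action yields
        # both new data and new resources, and yields more of each the
        # better the model and the larger the resource pile.
        return {"new_observations": int(model_quality) + 1,
                "new_resources": 0.1 * resources + 1}


learner, planner = ToyLearner(), ToyPlanner()
resources = 1.0
for step in range(10):
    action = planner.plan(learner.model_quality(), resources)
    # Acting closes the loop: more data flows into the learner and
    # more resources flow to the planner for the next pass.
    learner.update(range(action["new_observations"]))
    resources += action["new_resources"]
    print(step, len(learner.data), round(resources, 1))
```

The only point of the sketch is the loop structure: each pass leaves the system with more data and more resources with which to run the next pass, which is what makes it a chain reaction rather than a one-shot optimization.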

Then I asked: Is the planner algorithm strictly necessary here? Could you get something similar to a P2B feedback loop, but built around something other than a planner? I explored this question indirectly in “Gradations of Agency.” The answer seems to be “sorta.” You can have e.g. a Level 3 system that doesn’t do any planning and yet behaves remarkably similarly to a P2B feedback loop in many contexts, because it imitates the success of others. Such a system would tend to underperform P2B in contexts where there aren’t good examples to imitate.

We could stick to the original definition and say: Agents are P2B feedback loops, so, Level 4 and above.

But I think it’s probably better to be a bit more galaxy-brained and say: the key thing is convergent instrumental resource feedback loops, and thinking about P2B is a stepping stone that helps us see why such phenomena exist, why the really big ones tend to have planners sitting in the middle directing them, and why the really, really big ones tend to have Level 6+ algorithms sitting in the middle directing them…

By analogy, heat+plasma+wood+oxygen chain reactions aren’t the key thing. The key thing is chain reactions that convert some sort of fuel into thermal and kinetic energy plus more of the chain reaction. Heat+plasma+wood+oxygen is a particularly important one, due to the ease of creating it and the abundance of fuel for it on Earth, but there’s also e.g. baking soda + vinegar and neutrons+uranium. And also self-replicating nanobots + pretty much anything.

Similarly: The important thing for understanding agency is understanding that the world contains various self-sustaining chain reactions that feed off instrumental resources like data, money, political power, etc. and spread to acquire more such resources. Level 4 P2B loops are a particularly important, powerful, and widespread instance of this type, analogous to fire, but there are less powerful/important things (such as Level 3, Level 2…) and also more powerful/important things (Level 6 and above?).

Why should we expect AI agents to be useful for various important tasks? 

Why should we think agentic mesa-optimizers may arise for some data+reward-signals that weren’t designed explicitly to produce them?

Both questions have the same answer: agency works; it’s powerful and general. Recalling this graph and the conjectures from the last post, I claim that as you crank up the “how fast does this thing learn, how high a score does it get, in how large and diverse a range of environments?” dial, you ascend the hierarchy until you get agents. You can get powerful non-agents, but only by doing something clever and costly, only by selecting against agency or something similar. If you just select/optimize for power/generality/etc., then you get agents. (In other words, for sufficiently large x, the most powerful/general/etc. algorithm with x complexity and compute cost is an agent.)
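One hypothetical way to render that parenthetical claim semi-formally, purely as a restatement (the symbols cost, generality, and Agents are my own placeholder labels, not defined anywhere in this sequence):

$$\exists\, x_0 \ \text{such that}\ \forall x \ge x_0:\quad \operatorname*{arg\,max}_{A\,:\,\mathrm{cost}(A)\le x} \ \mathrm{generality}(A) \;\in\; \mathrm{Agents}$$

where cost(A) lumps together A’s complexity and compute cost, and generality(A) stands for the informal “how fast does it learn, how high does it score, across how large and diverse a range of environments?” dial.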

The reason agency works basically boils down to the claims made way back in P₂B: Plan to P₂B Better. Accumulating knowledge, power, resources, etc. and using them to accumulate even more such things, in an epic feedback loop, until you can convert that pile of knowledge+power+resources into X, is a great way to achieve X, for a very broad range of X. In fact it’s the best/easiest way, for a broad range of definitions of best/easiest. These piles of resources don’t accumulate by themselves; they accrue around algorithms, algorithms such as the learner (Level 2), the imitator (Level 3), the planner (Level 4) … and they accrue faster around higher-level algorithms.

(Insofar as there is a better strategy for achieving X than the above, a level 4+ algorithm sitting on top of a sufficiently large pile of knowledge+power+etc. will realize this and pursue that better strategy instead, self-modifying if necessary.)

The analogy to automobiles is relevant here. Let X be some transportation job you want to automate—such-and-such needs to be moved from A to B, quickly, cheaply, and safely. For a very broad range of X, the best strategy is to use some sort of auto-mobile: a machine that moves itself. As opposed to e.g. a series of conveyor belts, or vacuum pipes, or a bucket brigade of cranes. While there are gains from specialization, it’s much better to build “general automobiles” that are modular — they have a compartment for holding the passengers/cargo, and then the rest of the machine is dedicated to quickly, cheaply, and safely moving the entire assemblage. The strategy is to move the entire assemblage from pretty-much-A to pretty-much-B, with a short step at the beginning and end for loading and unloading. Some kinds of automobiles rely on the cargo for motive power, e.g. a dirtbike relies on the physical strength of the rider and their shifting center of mass to steer and pop wheelies over curbs etc. But most, and especially the more powerful ones (the ones required for the more difficult X), treat the cargo/passengers as useless dead weight, because they simply aren’t able to contribute much except at the beginning and end (loading and unloading).

Similarly, for a wide range of goals (metrics X that you want to go up), if X is sufficiently challenging, the optimal strategy for achieving X is to build a modular, agentic system that has a compartment for holding X, and that mostly ignores X “during the journey.” X is stored away safely somewhere while the agent accumulates more data, resources, power, etc. so that it can accumulate even more of the same… and then eventually it has so much of all these things that it can just “directly achieve the goal” slash “switch to exploitation of X.” Then X is “unloaded” and starts actually influencing how the system behaves.

All of this is an empirical claim about the structure of our world. There are possible worlds where it isn’t true, just as there are possible worlds where building automobiles was more expensive and less effective than putting conveyor belts everywhere. But it seems clear that we aren’t in one of those worlds.

Why should we think humans+AI tools will eventually be outcompeted by AI agents?

So we’ve got these AI agents, algorithms that repeatedly P2B, accumulating piles of knowledge+power+money around them. Great. Humans and human corporations also do that. And as AI capabilities advance, and AI agents get more powerful, AI tools will also get more powerful. Perhaps humans + AI tools will always be able to stay one step ahead of AI agents.

I don’t think so. AGI agents can use the AI tools too. 

Recall the automobiles analogy: For sufficiently powerful automobiles doing sufficiently difficult transportation jobs, it’s counterproductive to try to make use of the human passenger’s muscles. The passenger is dead weight. Similarly, for agency: Agents are feedback loops; an agent without a human in the loop is an AI agent, while an agent with a human in the loop may or may not be a human agent, depending on which part of the loop the human does. Whatever it is that the human is doing can be done faster and better by AI. The only way for a human-in-the-loop system to be competitive, in the limit of increasing power, is for the human to stay out of the way and just come along for the ride. Such a system is a human-aligned AGI agent, not a human+tool hybrid agent.

(Continuing the analogy, we could argue that human-aligned AGI agents will eventually be outcompeted by unaligned AGI agents in the same way that aircraft steered by computers but still with a human pilot present in the cockpit will eventually be outcompeted by drones that don’t need a cockpit. However, hopefully we won’t ever let competition get that fierce; hopefully once we have human-aligned AGI agents they will be able to prevent the creation of unaligned AGI thenceforth.)

Why should we expect, on priors, mildly superhuman AI agents to be powerful enough to take over the world, if they wanted to?

I have a lot to say on the subject of how easy it would be for mildly superhuman AI agents to take over the world, and I’ve said some of it already. In this section I’ll give a brief, abstract argument from analogy:

Suppose there was a different and exotic chemical reaction called “Magefyre.” Magefyre is purple instead of orange. Whereas regular fire takes wood, oil, etc. plus oxygen and produces more of itself, sending out sparks in all directions, etc., Magefyre does the same thing but is slightly “better” … more heat, more sparks, the sparks fly farther, more diverse kinds of fuel can catch Magefyre than catch fire, the fuel burns more quickly…

If a city is in the process of burning down due to regular fire, and a wizard starts a small Magefyre in some corner of the city, we should expect the Magefyre to eventually “take over” and end up burning down more of the city than the regular fire. The exceptions would be if the whole city burns down quickly before Magefyre can grow big, or if the Magefyre gets an unlucky start and finds itself surrounded by large amounts of regular fire that use up all the fuel in the vicinity. (The analogy to COVID strains is also helpful here; consider how “better” strains came along and quickly became dominant, even though they started with orders of magnitude fewer hosts.)

Thus, my answer: There is some core metric of “how good are you at quickly and robustly accumulating convergent instrumental resources you can use to repeat this process…” In other words, how powerful of an agent are you? There is an analogous metric for different kinds of fire, and for different strains of a virus.

Exceptions aside, in the long run, small differences between agents on this metric will translate into large differences in outcome; the most powerful agent will win. And the long run here isn’t actually that long; it will feel surprisingly short, just as the Omicron strain intuitively seemed to take over from Delta very quickly.
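A toy calculation shows how fast a small edge compounds. The numbers below are invented purely for illustration: one process starts a thousand times bigger, the other grows 20% faster per step.

```python
# Two exponential growth processes: "fire" starts 1000x bigger,
# "magefyre" grows 20% faster per step. Numbers are purely illustrative.
fire, magefyre = 1_000_000.0, 1_000.0
fire_rate, magefyre_rate = 1.5, 1.8   # growth factor per time step

for step in range(1, 61):
    fire *= fire_rate
    magefyre *= magefyre_rate
    if magefyre > fire:
        print(f"magefyre overtakes fire at step {step}")
        break
```

With these particular made-up rates the smaller process overtakes at step 38; the exact number doesn’t matter, only that a modest per-step advantage erases a 1000x head start within a few dozen steps, which is why the “long run” feels so short.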

Presumably this metric is an aggregate of different sub-metrics. For example if Magefyre is sufficiently superior to fire in some ways, it can still be overall better than fire even if it is worse in other ways. With agency, we should note that we could have mildly superhuman AI agents (in the relevant metric) that are still clearly subhuman in some important skills.

Example: Maybe the first “mildly superhuman” AGI agent uses 100x more compute than the human brain per second, and also requires 100x more experience/data to learn anything. Also, it can’t do image processing or motor control. So it’s significantly subhuman in these ways. But the AI lab that built it has enough compute to run 100,000 copies in parallel, and they have the key advantage that they can share their experience/learnings with each other, and so they end up quickly accumulating more know-how in relevant areas like reasoning, strategy, persuasion, etc. than any human ever, and in particular more than the groups of humans who currently oppose them; they then leverage this know-how to neutralize their immediate opponents, consolidate their power, and accumulate more compute + human patsies…
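As a crude back-of-envelope on that story, using only the numbers stated in the example, plus one added assumption of mine (that each copy gathers experience at roughly human speed, which the example doesn’t actually specify):

```python
# Back-of-envelope using the numbers from the example above.
# Units are "human-equivalents of useful experience per unit time";
# only the ratio matters, not the absolute values.

copies = 100_000          # parallel copies the lab can run
data_penalty = 100        # each copy needs 100x the experience a human does
experience_per_copy = 1   # ASSUMPTION: each copy gathers experience at ~human speed

# Because the copies share what they learn, their experience pools:
pooled_learning_rate = copies * experience_per_copy / data_penalty
print(pooled_learning_rate)   # 1000.0 -> learns like ~a thousand humans
```

Even on that crude accounting the collective learns like roughly a thousand humans, with the further advantage that the resulting know-how sits in one shared pool instead of being scattered across a thousand separate heads.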

Wrapping up

What do I think of this theory, overall? Ehh, it’s definitely a beta version at best. I’ve rambled for many pages now and given lots of analogies and intuitions and models but nothing like a proof or decisive empirical study. 

However, I think it’s on the right track. I also think that writing and thinking through this theory has helped me to better understand what various other people have said about agency (e.g. Yudkowsky/Veedrac), and perhaps reading it will have the same effect in others. I’d be interested to hear if so; otherwise I’ll assume not…

I might end up adding more posts to this sequence, e.g. “Here’s a criticism of the theory that seems plausible to me” or “Here’s an elegant restatement” or “here’s an important implication I hadn’t noticed before.” But for now this is the last one.

Thanks again to those who helped me with these ideas, especially Ramana Kumar.
