I think agentic systems represent most of AI extinction risk. I want to think clearly about what training procedures produce agentic systems. Unfortunately, the field of reinforcement learning has a convention of calling its trained artifacts "agents." This terminology is loaded and inappropriate for my purposes. I advocate instead calling the trained system a "policy." This name is standard, accurate, and neutral. 

Don't assume the conclusion by calling a policy an "agent"

The real-world systems we want to think about and align are very large neural networks like GPT-4. These networks are trained and finetuned via different kinds of self-supervised and reinforcement learning.

When a policy network is updated using a learning process, its parameters are changed via weight updates. Eventually, the process ends (assuming no online learning for simplicity). We are then left with a policy network (e.g. GPT-4). To actually use the network, we need to use some sampling procedure on its logits (e.g. top-p with a given temperature). Once we fix the policy network and sampling procedure, we get a mapping from observations (e.g. sequences of embeddings, like those for [I love dogs]) to probability distributions over outputs (e.g. tokens). This mapping is the policy.
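As a minimal sketch of that construction, the Python below fixes a stand-in network and a sampling procedure (temperature plus top-p) and returns the resulting mapping from observations to output distributions. Everything here is an illustrative assumption: `toy_network`, `make_policy`, and the four-output vocabulary are hypothetical names, not part of any real model or library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely outputs whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def make_policy(policy_network, temperature=1.0, top_p=0.9):
    """Fix a network and a sampling procedure; return observation -> distribution over outputs."""
    def policy(observation):
        logits = policy_network(observation)
        return top_p_filter(softmax(logits, temperature), top_p)
    return policy

def toy_network(observation):
    """Hypothetical stand-in for a trained policy network: returns logits over 4 outputs."""
    return [2.0, 1.0, 0.0, -1.0] if observation == "I love" else [0.0, 0.0, 0.0, 0.0]

policy = make_policy(toy_network, temperature=1.0, top_p=0.9)
dist = policy("I love")  # a probability distribution over the 4 outputs
```

Note that `policy` is just a function. Nothing in its type says whether the network behind it plans, pursues goals, or does neither.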

I want to carefully consider whether a trained policy will exhibit agentic cognition of various forms, including planning, goal-directedness, and situational awareness. While considering this question, we should not start calling the trained policy an "agent"! That's like a detective randomly calling one of the suspects "criminal." I prefer just calling the trained artifact a "policy." This neutrally describes the artifact's function, without connoting agentic or dangerous cognition.

Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present). I think it's appropriate to call that kind of computation "agentic." But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
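To illustrate why the designation needs information about internals, here is a toy sketch of two policies with the same interface. One is a lookup table; the other runs a depth-3 search toward an internally represented goal. The line-world setting, the goal position, and both function names are hypothetical, chosen only for illustration.

```python
def reactive_policy(position):
    """Lookup-table policy: a fixed reaction per observation, with no search."""
    table = {0: "right", 1: "right", 2: "right", 3: "stay"}
    return {table.get(position, "left"): 1.0}

def planning_policy(position, goal=3, depth=3):
    """Same interface, but computed by depth-limited search toward an internal goal."""
    moves = {"left": -1, "stay": 0, "right": 1}

    def best_value(pos, remaining):
        # Heuristic: negative distance to the goal, accumulated along the best path
        here = -abs(goal - pos)
        if remaining == 0:
            return here
        return here + max(best_value(pos + d, remaining - 1) for d in moves.values())

    # Score each first action by the best path that starts with it (depth steps in total)
    scores = {a: best_value(position + d, depth - 1) for a, d in moves.items()}
    best = max(scores, key=scores.get)
    return {best: 1.0}

# On these observations the two policies produce identical output distributions
same_outputs = all(reactive_policy(p) == planning_policy(p) for p in range(6))
```

On these six observations the two mappings agree exactly, so input-output behavior here does not settle which one deserves the label "agentic." That requires looking at how the policy is computed.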

There's no deep reason why trained policies are called "agents"

Throughout my PhD in RL theory, I accepted the idea that RL tends to create agents, and supervised learning doesn't. Well-cited papers use the term "agents", as do textbooks and Wikipedia. I also hadn't seen anyone give the pushback I give in this post. 

Question: Given a fixed architecture (e.g. a 48-layer decoder-only transformer), what kinds of learning processes are more likely to train policy networks which use internal planning? 

If you're like I was in early 2022, you might answer "RL trains agents." But why? In what ways do PPO's weight updates tend to accumulate into agentic circuitry, while unsupervised pretraining on OpenWebText does not? 

Claim: People are tempted to answer "RL" because the field adopted the "agent" terminology for reasons unrelated to the above question. Everyone keeps using the loaded terminology because no one questions it.

Let's be clear. RL researchers did not deliberate carefully about the inductive biases of deep learning, and then decide that a certain family of algorithms was especially likely to train agentic cognition. Researchers called policies "agents" as early as 1995, before the era of deep learning (e.g. see Artificial Intelligence: A Modern Approach, 1st edition).

Does RL actually produce agents?

The fact that "agents" was chosen for reasons unrelated to agentic cognition doesn't mean the name is inappropriate. I can think of a few pieces of evidence for RL entraining agentic cognition.

  1. RL methods are often used to train networks on tasks like video games and robotics. These methods are used because they work, and these tasks seem to have an "autonomous" and "action-directed" nature. This is weak evidence of RL being appropriate for producing agentic cognition. Not strong evidence.
  2. RL allows reinforcing behavior[1] which we couldn't have demonstrated ourselves. For example, actuating a simulated robot to perform a backflip. If we could do this ourselves and had the time to spare, we could have just provided supervised feedback. But this seems just like a question of providing training signal in more situations. Not strong evidence.
  3. Many practical RL algorithms are on-policy, in that the policy's current behavior affects its future training data. This may lead to policies which "chain into themselves over time." This seems related to "nonmyopic training objectives." I have more thoughts here, but they're still vague and heuristic. Not strong evidence.
  4. There's some empirical evidence from Discovering Language Model Behaviors with Model-Written Evaluations, which I've only skimmed. They claim to present evidence that RLHF increases e.g. power-seeking. I might end up finding this persuasive.
  5. There's good evidence that humans and other animals do something akin to RL. For example, something like TD learning may be present in the brain. Since some humans are agentic sometimes, and my guess is that RL is one of the main learning processes in the brain, this is some evidence for RL producing agentic cognition. 

Overall, I do lean towards "RL is a way of tying together pretrained cognition into agentic goal pursuit." I don't think this conclusion is slam-dunk or automatic, and don't currently think RL is much more dangerous than other ways of computing weight updates. I'm still trying to roll back the invalid updates I made due to the RL field's inappropriate "agents" terminology. (My current guesses here should be taken strictly separately from the main point of the post.)

Conclusions

  1. Use neutral, non-loaded terminology like "policy" instead of "agent", unless you have specific reason to think the policy is agentic.
    1. Yes, it'll be hard to kick the habit. I've been working on it for about a month.
    2. Don't wait for everyone to coordinate on saying "policy." You can switch to "policy" right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I've enjoyed these benefits for a month. The switch didn't cause communication difficulties.
  2. Strongly downweight the memes around RL "creating agents." 
    1. "RLHF boosts agentic cognition" seems like a contingent empirical fact, and not trivially deducible from "PPO is an RL algorithm." Even if RLHF in fact boosts agentic cognition, you've probably overupdated towards this conclusion due to loaded terminology.
    2. However, only using unsupervised pretraining doesn't mean you're safe. E.g., base GPT-5 can totally seek power, whether or not some human researchers in the 1970s decided to call their trained artifacts "agents."

Thanks to Aryan Bhatt for clarifying the distinction between policies and policy networks.

Appendix: Other bad RL terminology

"Reward" (bad) -> "Reinforcement" (better)

"Reward" has absurd and inappropriate pleasurable connotations which suggest that the policy will seek out this "rewarding" quantity.

I prefer "reinforcement" because it's more accurate (at least for the policy gradient algorithms I care about) and is overall a neutral word. The cost is that "reinforcement function" is somewhat nonstandard, requiring extra explanation. I think this is often worth it in personal and blog-post communication, and maybe also in conference papers.

"Optimal policy" -> "Reinforcement-maximizing policy"

Saying "optimal" makes the policy sound good and smart, and suggests that the reinforcement function is something which should be optimized over. As I discussed in a recent comment, I think that's muddying and misleading. In my internal language, "optimal policy" translates to "reinforcement-maximizing policy." I will probably adopt this for some communication. 

[1] Technically, we aren't just reinforcing behavior. A policy gradient will upweight certain logits in certain situations. This parameter update generally affects the generalization properties of the network in all situations.
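A small sketch of the footnote's point, assuming a linear-softmax policy with shared weights and a single REINFORCE-style update. The feature vectors, learning rate, and function names are hypothetical. The update is computed only in situation A, yet it also shifts the action probabilities in situation B, because the two situations share a feature.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_probs(weights, features):
    """Linear-softmax policy: logit[a] = sum_j weights[a][j] * features[j]."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)

def policy_gradient_step(weights, features, action, reinforcement, lr=0.5):
    """One update: weights += lr * reinforcement * grad log pi(action | features)."""
    probs = action_probs(weights, features)
    new_weights = []
    for a, row in enumerate(weights):
        # d log pi(action) / d logit[a] = 1[a == action] - pi(a)
        dlogit = (1.0 if a == action else 0.0) - probs[a]
        new_weights.append([w + lr * reinforcement * dlogit * f for w, f in zip(row, features)])
    return new_weights

situation_a = [1.0, 1.0, 0.0]
situation_b = [0.0, 1.0, 1.0]  # shares the middle feature with situation A
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # two actions, three features

before_a = action_probs(weights, situation_a)[0]
before_b = action_probs(weights, situation_b)[0]

# Reinforce action 0 in situation A only
weights = policy_gradient_step(weights, situation_a, action=0, reinforcement=1.0)

after_a = action_probs(weights, situation_a)[0]
after_b = action_probs(weights, situation_b)[0]  # also changed, although B was never visited
```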

Comments

How do you think "agent" should be defined?

I think I see where you’re coming from but I generally have mixed feelings, and am going back and forth but leaning towards sticking with textbook terminology for my part.

Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping is the policy.…

Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present.) I think it's appropriate to call that kind of computation "agentic." But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).

I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?

If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.

Don't assume the conclusion by calling a policy an "agent"

The word “agent” invokes a bundle of intuitions / associations, and you think many of those are misleading in general. So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.

Neither option is great; this is obviously a judgment call.

For my part, I think that if I say:

“An RL agent isn’t necessarily planning ahead towards goals, in many cases it’s better to think of it as a bundle of situation-dependent reactions…”

…then that strikes me as a normal kind of thing to say as part of a healthy & productive conversation.

So maybe I see pushing-back-on-the-intuitions-while-keeping-the-word as a more viable approach than you do.

(And separately, I see editing widely-used terminology as a very very big cost, probably more so than you.)

Ditto for “reward”.

“Reinforcement-maximizing policy”

This kinda sounds slightly weird in my mind because I seem to be intuitively associating “reinforcement” with “updates” and the policy in question is a fixed point that stops getting updated altogether.

I…don't currently think RL is much more dangerous than other ways of computing weight updates

You mention that this is off-topic so maybe you don’t want to discuss it, but I probably disagree with that—with the caveat that it’s very difficult to do an other-things-equal comparison. (I.e., we’re presumably interested in RL-safety-versus-SSL-safety holding capabilities fixed, but switching from RL to SSL does have an effect on capabilities.)

Then later you say “only using unsupervised pretraining doesn't mean you're safe” which is a much weaker statement, and I agree with it.

So then one approach is to ask everyone to avoid the word “agent” in cases where those intuitions don’t apply, and the other is to ask everyone to constantly remind each other that the “agents” produced by RL don’t necessarily have thus-and-such properties.

I think there's a way better third alternative: asking each reader to unilaterally switch to "policy." No coordination, no constant reminders, no communication difficulties (in my experience). I therefore don't see a case for using "agent" in the mentioned cases. 

I added to the post:

Don't wait for everyone to coordinate on saying "policy." You can switch to "policy" right now and thereby improve your private thoughts about alignment, whether or not anyone else gets on board. I've enjoyed these benefits for a month. The switch didn't cause communication difficulties.

I was just trying to replace “reward” by “reinforcement”, but hit the problem that “negative reward” makes sense, but behaviorist terminology is such that “reinforcement” is always after a good thing happens, including “negative reinforcement”, which would be a kind of positive reward that entails removing something aversive. The behaviorists use the word “punishment” for “negative reward”. But “punishment” has all the same downsides as “reward”, so I assume you’re also opposed to that. Unfortunately, if I avoid both “punishment” and “reward”, then it seems I have no way to unambiguously express the concept “negative reward”.

So “negative reward” it is. ¯\_(ツ)_/¯

Yeah, seems tough to avoid "reward" in that situation. Thanks for pointing this out.

Strong agree with the need for nuance. 'Model' is another word that gets horribly mangled a lot recently.

I think the more sensible uses of the word 'agent' I've come across are usually referring to the assemblage of a policy-under-training plus the rest of the shebang: learning method, exploration tricks of one kind or another, environment modelling (if any), planning algorithm (if any) etc. This seems more legit to me, though I still avoid using the word 'agent' as far as possible for similar reasons (discussed here (footnote 6) and here).

Similarly to Daniel's response to 'reward is not the optimization target', I think you can be more generous in your interpretation of RL experts' words and read less error in. That doesn't mean that more care in communication and terminology wouldn't be preferable, which is a takeaway I strongly endorse.

I think you can be more generous in your interpretation of RL experts' words and read less error in.

What other, more favorable interpretations might I consider?

Max H:

Once you have a policy network and a sampling procedure, you can embody it in a system which samples the network repeatedly, and hooks up the I/O to the proper environment and actuators. Usually this involves hooking the policy into a simulation of a game environment (e.g. in a Gym), but sometimes the embodiment is an actual robot in the real world.
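A minimal sketch of such an embodiment, assuming a toy environment rather than a real Gym one. `LineWorld`, `run_embodied_system`, and the hand-written `policy` are hypothetical stand-ins; the only point is that the loop, not the policy, is what samples repeatedly and acts on the environment.

```python
import random

def policy(observation):
    """Hypothetical fixed policy: maps an observation (a position) to a distribution over actions."""
    if observation < 3:
        return {"left": 0.1, "stay": 0.1, "right": 0.8}
    return {"left": 0.1, "stay": 0.8, "right": 0.1}

class LineWorld:
    """Toy environment: an integer position on a line from 0 to 5."""
    def __init__(self):
        self.position = 0

    def observe(self):
        return self.position

    def step(self, action):
        delta = {"left": -1, "stay": 0, "right": 1}[action]
        self.position = min(5, max(0, self.position + delta))

def run_embodied_system(policy, env, steps, rng):
    """Sample the policy repeatedly and feed each sampled action back into the environment."""
    trajectory = []
    for _ in range(steps):
        obs = env.observe()
        dist = policy(obs)  # the policy only returns a distribution
        actions = list(dist)
        action = rng.choices(actions, weights=[dist[a] for a in actions])[0]
        env.step(action)  # the surrounding loop is what acts
        trajectory.append((obs, action))
    return trajectory

trajectory = run_embodied_system(policy, LineWorld(), steps=20, rng=random.Random(0))
```

Here `policy` has type observation -> distribution, while `run_embodied_system` is the thing that produces a trajectory in an environment, which is one way to see the type difference.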

I think using the term "agent" for the policy itself is actually a type error, and not just misleading. I think using the term to refer to the embodied system has the correct type signature, but I agree it can be misleading, for the reasons you describe.

OTOH, I do think modelling the outward behavior of such systems by regarding them as agents with black-box internals is often useful as a predictor, and I would guess that this modelling is the origin of the use of the term in RL.

But modelling outward behavior is very different from attributing that behavior to agentic cognition within the policy itself. I think it is unlikely that any current policy networks are doing (much) agentic cognition at runtime, but I wouldn't necessarily count on that trend continuing. So moving away from the term "agent" proactively seems like a good idea.

Anyway, I appreciate posts like this which clarify / improve standard terminology. Curious if you agree with my distinction about embodiment, and if so, if you have any better suggested term for the embodied system than "agent" or "embodiment".

I think the embodiment distinction is interesting and hadn't thought of it before (note that I didn't understand your point until reading the replies to your comment). I'm not yet sure if I find this distinction worth making, though. I'd refer to the embodied system as a "trained system" or -- after reading your suggestion -- an "embodiment." Neither feels quite right to me, though.