Why almost every RL agent does learned optimization

TL;DR: This post discusses the blurred conceptual boundary between RL and RL $^{2}$ (also known as meta-RL). RL $^{2}$ is an instance of learned optimization. Far from being a special case, I point out that the conditions under which RL $^{2}$ emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence for why learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL $^{2}$ agents.

I've found myself telling this story about the relationship between RL and RL $^{2}$ numerous times in conversation. When that happens, it's usually time to write a post about it.

Most of the first half of the post (which points out that RL $^{2}$ is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment.

The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled up RL $^{2}$ systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some.

Background on RL $^{2}$

RL $^{2}$ (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', which is a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models it's often called 'in-context learning' (Olsson et al. 2022; Garg et al. 2022).

RL $^{2}$ is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): the RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate, inner RL algorithm (the inner optimization algorithm).

The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents trained to exhibit RL $^{2}$ show rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent team et al. 2023), hypothesis-driven exploration/experimentation (DeepMind Open Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL $^{2}$ may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (which is the area of the brain most associated with those capabilities) implements an RL $^{2}$ system (Wang et al. 2018). These cognitive capabilities are the kind of things that we're concerned about in powerful AI systems. RL $^{2}$ is therefore a phenomenon that seems likely to underlie some major safety risks.

The conditions under which RL $^{2}$ emerges are the default RL training conditions

Ingredients for an RL $^{2}$ cake

The four 'ingredients' required for RL $^{2}$ to emerge are:

1. The agent must have observations that correlate with reward.
2. The agent must have observations that correlate with its history of actions.
3. The agent must have a memory state that persists through time in which the RL $^{2}$ algorithm can be implemented.
4. The agent must be trained on a distribution of tasks.

These conditions let the agent learn an RL $^{2}$ algorithm because they let the agent learn to adapt its actions to a particular task according to what led to reward. Here's a more detailed picture of the mechanism by which these ingredients lead to RL $^{2}$ :

Thanks to (1), agents tend to learn representations that identify if the agent is getting closer to valuable states.
Thanks to (2), it can learn representations that evaluate whether or not past actions have brought the agent closer to valuable states. To evaluate this, the agent must represent the key task variables that define its current task instance, since states that are valuable in one task instance may not be valuable in another.
Thanks to (3), this information can persist through time such that the agent can gradually refine its representations of what task instance it is in and which are the best actions to take in it.
Only if (4) holds is it even useful to learn representations of task structure, rather than learning a fixed sequence of actions that work in one particular task.
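To make the mechanism concrete, here is a minimal sketch in which the 'weights' of the policy are fixed and all adaptation happens in a memory state. Everything in it is an illustrative assumption rather than anything from the cited papers: the task distribution is a toy two-armed bandit, the previous action and reward are fed back to the agent as observations, and the inner algorithm (win-stay, lose-shift) is hand-written as a stand-in for whatever an outer RL loop would actually learn.

```python
import random


def sample_task(rng):
    """Ingredient 4: a task distribution. Each task instance is a two-armed
    bandit in which a randomly chosen arm pays off with probability 0.9 and
    the other with probability 0.1 (toy values, chosen for illustration)."""
    good_arm = rng.randrange(2)
    return [0.9 if arm == good_arm else 0.1 for arm in range(2)]


def memoryless_policy(rng):
    """Fixed behaviour with no memory: it cannot tell task instances apart."""
    return rng.randrange(2)


def adaptive_policy(memory, rng):
    """Fixed behaviour plus a memory state (ingredient 3).

    `memory` holds the previous action and reward (ingredients 1 and 2),
    or None on the first step. The rule is win-stay, lose-shift: a
    hand-coded stand-in for a learned inner algorithm."""
    if memory is None:
        return rng.randrange(2)
    prev_action, prev_reward = memory
    return prev_action if prev_reward == 1 else 1 - prev_action


def average_reward(use_memory, n_tasks=2000, n_steps=20, seed=0):
    """Average per-step reward over many sampled task instances."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_tasks):
        arm_probs = sample_task(rng)
        memory = None  # memory is reset at the start of each task instance
        for _ in range(n_steps):
            if use_memory:
                action = adaptive_policy(memory, rng)
            else:
                action = memoryless_policy(rng)
            reward = 1 if rng.random() < arm_probs[action] else 0
            memory = (action, reward)
            total += reward
    return total / (n_tasks * n_steps)


if __name__ == "__main__":
    print("memoryless:", average_reward(use_memory=False))
    print("adaptive:  ", average_reward(use_memory=True))
```

The point of the sketch is only that nothing about the policy changes between task instances, yet the agent with a memory state behaves differently in each one because its memory comes to encode which instance it is in. With a single fixed task instead of a distribution, a fixed choice of arm would do just as well, which is why ingredient (4) matters.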

Why these ingredients are the default conditions

This set of conditions is more common than it might initially appear:

Reward- and action-correlated observations:
It's pretty typical that agents observe their environment and that reward tends to come from particular environment states. It's also pretty typical that agents get to observe how the environment changes after they take actions in it. Most games and robotics environments, for instance, have these properties. Cases where this doesn't happen, such as N-armed bandit tasks, are rarer or less interesting from an alignment perspective.
Persistent memory state:
Neural networks that have some sort of memory state that persists through time include recurrent neural networks and Transformer-XL (both of which have been used to train RL $^{2}$ agents). Having a memory state makes it easier for RL agents to solve tasks that require memory, which include most tasks in partially observable environments (such as the real world). We should therefore expect memory to be used in the most capable and useful RL systems.

But even agents that are purely feedforward often have access to an external memory system: the environment. Even simple feedforward RL agents can and do learn to use the environment as an external memory system when they don't have an internal one (Deverett et al. 2019). The trouble with using the environment as a memory system instead of learning one internally is that the externally represented memories must be learned non-differentiably, which is harder. But it's still possible in principle. Speculatively, large enough agents may be able to learn sophisticated RL $^{2}$ algorithms that use an expressive-enough environment as their memory system.
Distribution of tasks:
Most 'individual' tasks are actually a narrow distribution of tasks. Here is a non-exhaustive list of reasons why the task 'Solve a particular maze' is actually a distribution of tasks:
- If an agent has to solve a maze but starts from a random starting position, that's a task distribution; the agent must learn a policy that works across the whole distribution of initial states.
- If an agent has to solve a maze but its recurrent state is randomly initialized, then that's a task distribution; the agent must learn a policy that works across the initialization distribution of its recurrent state. From location X, if the agent has used a different set of actions to arrive there, its memory state may be different, and thus there will be a distribution over memory states for the task defined as 'Get to the end of the maze from location X'.
- If an agent has to solve a maze but uses a stochastic policy, then that's a task distribution; the agent must learn a policy that works across the distribution of past action sequences.

The (admittedly somewhat pedantic) argument that most tasks are, in fact, distributions of tasks points toward a blurred boundary between 'RL $^{2}$ ' and 'the agent merely adapting during a task'. Some previous debate on the forum about what should count as 'learning' vs. 'adaptation' can be found in comments here and here.

So what? (Planning from RL $^{2}$ ?)

I'm making a pretty narrow, technical point in this post. The above indicates that RL $^{2}$ is pretty much inevitable in most interesting settings. But that's not necessarily dangerous; RL $^{2}$ itself isn't the thing that we should be worried about. We're mostly concerned about agents that have learned how to search or plan (as discussed in Hubinger et al. 2019 and Demski, 2020).

Unfortunately, I think there are a few indications that learned planning will probably emerge from scaled-up RL $^{2}$ :

RL agents with some weak inductive biases show behavioural signs of learned planning (Guez et al. 2019). Being only behavioural, the evidence of planning is currently pretty weak. I'd like to see interpretations of similar agents to show that they have actually learned a planning algorithm (This project idea is on Neel Nanda's 200 Concrete Open Problems in Mechanistic Interpretability).
Other empirical evidence, which is more speculative, comes from what we know about search/planning in humans. I mentioned above that there is evidence that the human prefrontal cortex implements an RL $^{2}$ algorithm (Wang et al. 2018). The PFC is the brain region most heavily implicated in planning, decision-making, etc. This weakly suggests that scaling up RL $^{2}$ might lead to a system that does planning.
The final, and in my opinion most convincing, reason to suspect learned planning might emerge naturally in advanced RL $^{2}$ systems is theoretical. The argument is based on results from Ortega et al. (2019). I go into a little bit more detail in the footnotes, but briefly: The Bayesian optimization objective that RL $^{2}$ agents implicitly optimize has a structure that resembles planning; the objective demands consideration of multiple possible world states and demands that the agent chooses actions based on which action is best given those possible worlds. ^[1]

These are, of course, only weak indications that scaling up RL $^{2}$ will yield learned planning. It's still unclear what else, if anything, is required for it to emerge.

^[1]
Footnote: The Bayesian optimization objective that RL $^{2}$ agents implicitly optimize has a structure that resembles planning
Ortega et al. (2019) show that the policy of an RL $^{2}$ agent, $\pi(a_{t})$, is trained to approximate the following distribution:

$P(a_{t} \mid \underline{ao}_{<t}) = \int_{\psi} P(a_{t} \mid \psi, \underline{ao}_{<t}) \, P(\psi \mid \underline{ao}_{<t}) \, d\psi$

where:
$a_{t}$ is the optimal action at timestep $t$,
$\underline{ao}_{<t}$ is the action-observation history up to timestep $t$,
$P(a_{t} \mid \underline{ao}_{<t})$ is the probability of choosing the optimal action given the action-observation history, and
$\psi$ is the set of latent (inaccessible) task parameters that define the task instance. They are sampled from the task distribution. $\psi$ effectively defines the current world state.

How might scaled-up RL $^{2}$ agents approximate this integral? Perhaps the easiest method to approximate complicated distributions is a Monte Carlo estimate (i.e. take a bunch of samples and take their average). It seems plausible that agents would learn to take a Monte Carlo estimate of this distribution within their learned algorithms. Here's a sketch of what this might look like on an intuitive level:

- The agent has uncertainty over latent task variables/world state given its observation history. It can't consider all the possible configurations of the world state, so it just considers a small sample set of the most likely states of the world according to an internal model of $P(\psi \mid \underline{ao}_{<t})$.
- For each world state in that small sample set, the agent considers what the optimal action would be, i.e. $P(a_{t} \mid \psi, \underline{ao}_{<t})$. Generally, it's useful to predict the consequences of actions to evaluate how good they are. So the agent might consider the consequences of different actions given different world states and action-observation histories.
- After considering each of the possible worlds, it chooses the action that works best across those worlds, weighted according to how likely each world state is, i.e. $\int_{\psi} P(a_{t} \mid \psi, \underline{ao}_{<t}) \, P(\psi \mid \underline{ao}_{<t}) \, d\psi$.

Those steps resemble a planning algorithm.

It's not clear whether agents would actually learn to plan (i.e. learning approximations of each term in the integral that unroll serially, as sketched above) vs. something else (such as learning heuristics that, in parallel, approximate the whole integral). But the structure of the Bayesian optimization objective is suggestive of an optimization pressure in the direction of learning a planning algorithm.
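The three steps sketched in this footnote can be written out explicitly for a toy case. The following is a minimal sketch under loudly stated assumptions, none of which come from Ortega et al. (2019): the task is a two-armed Bernoulli bandit, the latent task parameters are the two payoff probabilities, the agent's internal model of the posterior is an independent Beta(1, 1) prior per arm updated on the history, and 'the optimal action in a known world' is simply the arm with the higher payoff probability. The function names are hypothetical.

```python
import random


def sample_posterior(history, rng):
    """Step 1: draw one candidate world state from the posterior over
    latent task parameters, given a history of (action, reward) pairs.

    Assumes independent Beta(1, 1) priors over each arm's payoff probability."""
    wins = [1, 1]
    losses = [1, 1]
    for action, reward in history:
        if reward:
            wins[action] += 1
        else:
            losses[action] += 1
    return [rng.betavariate(wins[arm], losses[arm]) for arm in range(2)]


def optimal_action_given_world(psi):
    """Step 2: the optimal action if the world state were known.

    In this toy bandit that is just the arm with the higher payoff
    probability; in richer tasks this step would itself involve
    predicting the consequences of actions."""
    return max(range(len(psi)), key=lambda arm: psi[arm])


def monte_carlo_policy(history, n_samples=1000, seed=0):
    """Step 3: average the per-world optimal actions over sampled worlds.

    Returns a Monte Carlo estimate of the probability that each action
    is optimal given the history."""
    rng = random.Random(seed)
    counts = [0, 0]
    for _ in range(n_samples):
        psi = sample_posterior(history, rng)
        counts[optimal_action_given_world(psi)] += 1
    return [count / n_samples for count in counts]


if __name__ == "__main__":
    # Hypothetical history: arm 0 paid off 8 times, arm 1 failed 8 times.
    history = [(0, 1)] * 8 + [(1, 0)] * 8
    print(monte_carlo_policy(history))
```

Because posterior sampling is exact and the per-world optimal action is trivial here, this is a best case for the serial, sample-then-evaluate strategy. Whether a trained agent would implement anything with this structure internally is exactly the open question raised above.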


3 comments

Steve Byrnes (2y)

My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models. But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.

Maybe what you’re thinking is: “Maybe the learned planning algorithm will have some weird and dangerous goal”. My hunch is: (1) if the original RL agent lacks an affordance for planning in the human-written source code, then it won’t work very well, and in particular, it won’t be up to the task of building a sophisticated dangerous planner with a misaligned goal; (2) if the original RL agent has an affordance for planning in the human-written source code, then it could make a dangerous misaligned planner, but it would be a “mistake” analogous to how future humans might unintentionally make misaligned AGIs, and this problem might be solvable by making the AI read about the alignment problem and murphyjitsu and red-teaming etc., and cranking up its risk-aversion etc.

Sorry if I’m misunderstanding. RL² stuff has never made much sense to me.

Lee Sharkey (2y)

> My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Hm, I don't think this quite captures what I view the post as saying.

> Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models.

As far as there is a safety-related claim in the post, this captures it much better than the previous quote.

> But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.

I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that's a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1).

I can also imagine a middle ground between our hunches that looks something like "We gave our agent a pretty strong inductive bias toward learning a planning algorithm, but still didn't force it to learn one, yet it did."

Steve Byrnes (2y)

Thanks!

> One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system.

See Section 3 here for why I think it would be a lot worse.
