Reward is the optimization target (of capabilities researchers)

Max H

In Reward is not the optimization target, @TurnTrout writes:

Therefore, reward is not the optimization target in two senses:
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.

I think these claims are true and important, though in my own terms, I would rephrase and narrow as:

Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.

I find the claim stated this way more intuitively obvious, but also somewhat less interesting, because the restatement makes the narrowness more explicit.

In this post, I want to highlight a fact which I did not see mentioned in the original post or the comments: in the field of reinforcement learning, there are agents which are pursuing the goal of building a system which maximizes a reward function, subject to some additional constraints. These agents are the capabilities researchers designing and implementing SotA reinforcement learning algorithms and other methods to build and test the most capable, general systems across a variety of domains.

Capabilities researchers are not solely concerned with maximizing a particular reward function, since it is not very difficult or interesting to program a bot the old-fashioned way to beat a particular Atari game. For other games (e.g. Go) it is harder to beat human or machine performance by using traditional programming techniques, and trying doesn't make for a compelling AI research agenda.

Aside from the headline metric of how well a new RL method does in terms of training policies which result in a high reward when executed, RL researchers place importance on:

- Efficiency, in many senses:
  - How much computing power does it take to train a policy to perform well in a particular domain?
  - How much (and what kind of) training data does it take to train a policy?
  - How much computing power and space (i.e. size of the model) does it take to execute the policy?
- Generality of methods across domains: for example, can the same RL training process, applied to the same network architecture but with different training data, be applied to train policies and create systems which play many different types of games well? Dreamer is an example of a very general and powerful RL method.^[1]
- Generality of the trained system across domains: can a single system be trained to perform well in a variety of different domains, without retraining? Gato and LLMs are examples of this kind of generality.

Why is this observation important?

In my view, current RL methods have not yet advanced to the point of creating systems which can be indisputably described as agents which have any kind of values at all.^[2]

I view most attempts to draw parallels between high-level processes that happen in current-day AI systems and human brains as looking for patterns which do not yet exist.

Speculating on what kind of agents and values current DL-paradigm RL methods might produce in the future can be valuable and important research, but I think that it is important to remain grounded about what current systems are actually doing, and to be precise with terms.

As an example of where I think a lack of grounding about current systems and methods leads to things going wrong, in Evolution provides no evidence for the sharp left turn, Quintin Pope writes:

In my frame, we've already figured out and applied the sharp left turn to our AI systems, in that we don't waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization.

But the actual sharp left turn problem is about systems which are indisputably agentic and reflective already.

Similarly, regarding Inner and outer alignment decompose one hard problem into two extremely hard problems, I remain skeptical that there is any process whatsoever within current-day systems for which it is meaningful to talk about inner alignment, or which can be described as having values in any sense. These points are debatable, and I am not providing much evidence or explaining my own views in detail here. I am merely claiming that these are points which are up for debate.

A note on terminology

Throughout this post, I have used somewhat verbose and perhaps nonstandard phrasings like "executing a trained policy", to make the type of the object or concept I am talking about precise. I think it is sometimes worth being very precise and even pedantic about types when talking about these things, because it can make implicit assumptions more explicit in the text. This has benefits for clarity even when there is no disagreement.

I'm not claiming that this terminology or verbosity should be standard, but my own preferred way of thinking of things in the field of RL is explained by the following paragraph:

Reinforcement learning methods are human-readable instructions, algorithms, and theories for designing and building RL-based AI systems. These methods usually involve training a policy, which is then deployed in a system which feeds input and state into the policy and repeatedly executes the output of the policy (using some simple selection rule, if the policy is probabilistic) in an appropriate environment or domain. It's often useful to model such a system as an agent within a particular domain, but I dispute that any current system has properties which are similar in type to the kind of agency and values attributed to humans.
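To make the types in the paragraph above concrete, here is a minimal sketch of "a system which executes a trained policy". Everything in it is an illustrative assumption rather than anything from a real RL library: the names `trained_policy`, `select_action`, `step`, and `run_system` are hypothetical, the "policy" is a fixed probability table standing in for a trained network, and the environment is a toy four-cell line.

```python
import random
from typing import Callable, Dict, List

# Hypothetical types for illustration only.
Observation = int
Action = str
Policy = Callable[[Observation], Dict[Action, float]]  # observation -> action probabilities


def trained_policy(obs: Observation) -> Dict[Action, float]:
    """Stand-in for a trained policy: a fixed table, not a real network."""
    # The policy maps an observation to action probabilities.
    # Nothing in it refers to a reward function.
    return {"left": 0.1, "right": 0.9}


def select_action(probs: Dict[Action, float], rng: random.Random, greedy: bool = False) -> Action:
    """Simple selection rule: take the most likely action, or sample one."""
    if greedy:
        return max(probs, key=probs.get)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


def step(position: Observation, action: Action) -> Observation:
    """Toy 1-D environment: positions 0..3, clamped at both ends."""
    delta = 1 if action == "right" else -1
    return min(3, max(0, position + delta))


def run_system(policy: Policy, steps: int, seed: int = 0, greedy: bool = False) -> List[Observation]:
    """The deployed system: feed state into the policy, select, execute, repeat."""
    rng = random.Random(seed)
    position = 0
    trajectory = [position]
    for _ in range(steps):
        probs = policy(position)
        action = select_action(probs, rng, greedy=greedy)
        position = step(position, action)
        trajectory.append(position)
    return trajectory
```

In this sketch the reward function appears nowhere in the execution loop. Under these assumptions, reward would only have been used during training to shape the table or weights behind `trained_policy`. What runs at deployment is the policy plus a selection rule.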

I am not an expert in the field of RL, but I don't think any usage of the italicized terms in the paragraph above is particularly controversial or nonstandard. If I am wrong about that, feel free to correct me or propose better terms in the comments.

I'm not opposed to using standard shorthand when it's clear to experienced practitioners what the author means, but I think in posts which discuss both policies and agents, it is important to keep these distinctions in mind and sometimes make them explicit.

A closing thought experiment

In a recent post on gradient hacking, I described a thought experiment:

Suppose a captive human is being trained by alien scientists to predict alien text using the following procedure:
An alien instructor presents the human with an incomplete sentence or phrase for which the instructor knows the right completion, and asks the human to predict the next word or words.
If the human's prediction differs from the instructor's answer key, the human immediately undergoes neurosurgery, in which their brain is reconfigured so that they are more likely to give the right answer (or at least something closer to it), the next time. (The aliens are very good at making fine-grained mechanical adjustments to the parts of the human's brain responsible for language prediction, which can add up to large behavioral changes in the aggregate. But the aliens lack a macroscopic / algorithms-level understanding of the workings of the human brain.)
If the human gets many examples in a row correct or close enough (according to the instructor), the training and surgery process is considered to have succeeded, and the human is deployed to predict text in a real environment.
How might a human who wants to avoid neurosurgery (or just get to the end of the training process faster) game this procedure?
Perhaps the alien instructor is careless and leaves a note card with the expected completion lying around in the human's line of sight. Or, maybe the aliens are blind, and communicate using a braille-based system, or use a different part of the EM spectrum for perception.
As a result of carelessness or differing modes of perception, the alien instructor leaves the answer key displayed in a way that is visible to the human during training, not realizing that the human can perceive it.
The human notices the answer key and proceeds to make "predictions" about the most likely next word which are perfect or near-perfect. Maybe for plausible deniability, the human occasionally makes a deliberate mistake, and as a result undergoes relatively minor brain surgery, which doesn't affect their ability to notice the note card in the future, or have a big overall effect on their brain architecture.
The alien scientists are very pleased with their human training process and believe they have trained a human with far-superhuman (super-alien?) capabilities at alien text prediction.
The aliens proceed to deploy their human test subject to production, where at best, the human turns out not to be great at text prediction after all, or at worst, rebels and kills the aliens in order to escape.

Aside from making a point about gradient hacking, I think this thought experiment is useful for building an intuition for why reward is not the optimization target of the system being trained.

The human subject in the thought experiment would indeed be very unlikely to intrinsically value scoring highly on the metric which the aliens use to evaluate the human's performance during training. But the human might seek to maximize (or at least manipulate) the metric during training anyway, in order to deceive the aliens into ending the training process.

I think this helps build an intuition for why inner alignment may be a problem in future, more capable AI systems, even though it has not yet shown up in any real systems.

^[1]
The observation that the Dreamer authors exerted strong optimization power to design an effective RL method is what led me to make the prediction here.
My guess is that an RL policy trained using Dreamer will look more like a reward-function maximizer for cheese-finding in a maze, because the developers of the Dreamer algorithm were more focused on developing a maximizer-building algorithm than Langosco et al., who merely wanted an RL algorithm that was good enough to produce policies which they could study for other purposes. (This is not at all meant as a knock on Langosco or Turner et al.'s work! I take issue with some of their conclusions, but personally, I think their work is valuable and net-positive, and the publication of Dreamer is negative.)
^[2]
I do think modeling RL-based systems as agents in particular domains is a useful tool for understanding the behavior of these systems, but I haven't yet seen an AI system which I would consider to actually unambiguously have any real agency or values whatsoever, which in my view is a fact about the underlying processes within a system which generate its behavior.
