(Brief pitch for a general audience, based on a 5-minute talk I gave.)
My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play Atari games, and Go, at superhuman level. Then LLMs came along in the 2020s, and everyone kinda forgot that RL agents existed. But I’m part of a small group of researchers who still think that the field will pivot back to RL agents, one of these days. (Others in this category include Yann LeCun and Rich Sutton & David Silver.)
Why do I think that? Well, LLMs are very impressive, but we don’t have AGI (artificial general intelligence) yet—not as I use the term. Humans can found and run companies, LLMs can’t. If you want a human to drive a car, you take an off-the-shelf human brain, the same human brain that was designed 100,000 years before cars existed, and give it minimal instructions and a week to mess around, and now that person is driving the car. If you want an AI to drive a car, it’s … not that.
| Teaching a human to drive a car / teleoperate a robot | Teaching an AI to drive a car / teleoperate a robot |
| --- | --- |
| Minimal instruction, a week of messing around | Dozens of experts, 15 years, $5,000,000,000 |
Anyway, human brains are the only known example of “general intelligence”, and they are “RL agents” in the relevant sense (more on which below). Additionally, as mentioned above, people are working in this direction as we speak. So, seems like there’s plenty of reason to take RL agents seriously.
So the upshot is: we should contingency-plan for real RL agent AGIs—for better or worse.
If we’re talking about RL agents, then we need to talk about reward functions. Reward functions are a tiny part of the source code, with a massive influence on what the AI winds up doing.
For example, take an RL agent like AlphaZero, and give it a reward of +1 for winning at a board game and –1 for losing. As you train it, it will get better and better at winning. Alternatively, give it a reward of –1 for winning and +1 for losing. It will get better and better at losing. So if the former winds up superhuman at Reversi / Othello, then the latter would wind up superhuman at “Anti-Reversi”—an entirely different game! Again, tiny code change, wildly different eventual behavior.
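To make the “tiny code change” point concrete, here’s a minimal sketch (my own illustration, not AlphaZero’s actual code) of what such a board-game reward function might look like:

```python
# Minimal sketch of a board-game reward function (illustrative only, not
# AlphaZero's actual code). The entire "tiny code change" is two sign flips.

def reward(game_outcome: str) -> float:
    """Reward delivered at the end of a game of Reversi / Othello."""
    if game_outcome == "win":
        return +1.0   # flip this to -1.0 ...
    if game_outcome == "loss":
        return -1.0   # ... and this to +1.0, and the very same training loop
                      # now produces a superhuman "Anti-Reversi" player
    return 0.0        # draw
```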
I claim that if you give a powerful RL agent AGI the wrong reward function, then it winds up with callous indifference to whether people live or die, including its own programmers and users.
But what’s the right reward function? No one knows. It’s an open question.
Why is that such a hard problem? It’s a long story, but just as one hint, try comparing these two reward functions:

- Negative reward whenever the AI behaves deceptively.
- Negative reward whenever the AI gets caught behaving deceptively.
The first one seems like a good idea. The second one seems like a bad idea. But these are actually the same thing, because obviously the reward function will only trigger if the AI gets caught.
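To spell that out in code, here’s a minimal sketch (my own illustration, with a hypothetical `deception_detector` and log format, not anyone’s real oversight system) of why those two reward functions are, in practice, the same program:

```python
# Toy sketch: a reward function is just code, so it can only respond to
# misbehavior that some monitoring process actually detects.

def deception_detector(behavior_log) -> bool:
    """Hypothetical monitor: returns True only for deception it catches.
    Undetected deception looks exactly like honest behavior from here."""
    return any(event.get("flagged_as_deceptive", False) for event in behavior_log)

def reward(behavior_log) -> float:
    # We *intend* this to mean "negative reward for deception", but what it
    # actually computes is "negative reward for getting caught" -- the two
    # are indistinguishable at the level of code.
    return -1.0 if deception_detector(behavior_log) else 0.0
```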
As it turns out, if you pick up a 300-page RL textbook, you’ll probably find that it spends a few paragraphs on what the reward function should be, while the other 299½ pages are ultimately about how to maximize that reward function—how reward signals update the trained model, how the trained model is queried, and sometimes also how predictive learning fits in, etc.
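You can see that imbalance even in a toy tabular Q-learning loop (a standard textbook algorithm, sketched here from memory with a hypothetical `env` object providing `reset()`, `step()`, and an `actions` list): the reward function shows up as a single call, and everything else is the machinery the other 299½ pages are about.

```python
# Toy tabular Q-learning (standard textbook RL, sketched for illustration;
# `env` is a hypothetical environment with reset(), step(), and `actions`).

import random
from collections import defaultdict

def q_learning(env, reward_fn, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # the "trained model": a table of state-action values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Querying the trained model (epsilon-greedy action selection)
            if random.random() < eps:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, done = env.step(state, action)

            # The reward function's entire appearance in the algorithm:
            r = reward_fn(state, action, next_state)

            # The update rule -- this, plus everything around it, is what
            # fills the rest of the textbook.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```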
…And it turns out that there’s a similar imbalance in neuroscience:
The human brain also has an RL reward function. It’s sometimes referred to as “innate drives”, “primary rewards”, “primary punishers”, etc.—things like ‘pain is bad’ and ‘eating when you’re hungry is good’. And just like in RL, the overwhelming majority of effort in AI-adjacent neuroscience concerns how the reward function updates the trained models, how those trained models get updated in other ways, how they are queried, and so on. This part involves the cortex, basal ganglia, and other brain areas. Meanwhile, approximately nobody in NeuroAI cares about the reward function itself, which mainly involves the hypothalamus and brainstem.
So here’s the upshot: let’s learn from biology, let’s innovate in AI, let’s focus on AI Alignment, and maybe we can get into this Venn diagram intersection, where we can make headway on the question of what kind of reward function would lead to an AGI that intrinsically cares about our welfare. As opposed to callous sociopath AGI. (Or if no such reward function exists, then that would also be good to know!)
You might hope that the people working most furiously to make RL agent AGI—and claiming that they’ll get there in as little as 10 or 20 years—are thinking very hard about this reward function question.
Nope!
For example, see:
…And those are good ones, by the standards of this field! Their proposals are fundamentally doomed, but at least it occurred to them to have a proposal at all. So hats off to them—because most researchers in RL and NeuroAI don’t even get that far.
Let’s all try to do better! Going back to that Venn diagram above…
For the “reward functions in biology” part, a key observation is that the human brain reward function leads to compassion, norm-following, and so on—at least, sometimes. How does that work?
If we can answer that question, it might be a jumping-off point for AGI reward functions.
I worked on this neuroscience problem for years, and wound up with some hypotheses. See “Neuroscience of human social instincts: a sketch” for where I’m at. But it needs much more work, especially connectomic and other experimental data to ground the armchair hypothesizing.
Meanwhile on the AI side, there’s been some good work clarifying the problem—for example people talk about inner and outer misalignment and so on—but there’s no good solution. I think we need new ideas. I think people are thinking too narrowly about what reward functions can even look like.
For a snapshot of my own latest thinking on that topic, see my companion post “Reward Function Design: a starter pack”.
To close out, here’s the bigger picture as I see it.
Aligning “RL agent AGI” is different from (and much harder than) aligning the LLMs of today. And the failures will be more like “SkyNet” from Terminator, than like “jailbreaks”. (See Foom & Doom 2: Technical alignment is hard.)
…But people are trying to make those agents anyway.
We can understand why they’d want to do that. Imagine unlimited copies of Jeff Bezos for $1/hour. You tell one of them to go write a business plan, and found and grow and run a new company, and it goes and does it, very successfully. Then tell the next one, and the next one. This is a quadrillion-dollar proposition. So that’s what people want.
But instead of “Jeff Bezos for $1/hour”, I claim that what they’re gonna get is “a recipe for summoning demons”.
Unless, of course, we solve the alignment problem!
I think things will snowball very quickly, so we need advance planning. (See Foom & Doom 1.) Building this field of “Reward Function Design” is an essential piece of that puzzle, but there are a great many other things that could go wrong too. We have our work cut out for us.