(Brief pitch for a general audience, based on a 5-minute talk I gave.)
My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play Atari games, and Go, at superhuman level. Then LLMs came along in the 2020s, and everyone kinda forgot that RL agents existed. But I’m part of a small group of researchers who still think that the field will pivot back to RL agents, one of these days. (Others in this category include Yann LeCun and Rich Sutton & David Silver.)
Why do I think that? Well, LLMs are very impressive, but we don’t have AGI (artificial general intelligence) yet—not as I use the term. Humans can found and run companies, LLMs can’t. If you want a human to drive a car, you take an off-the-shelf human brain, the same human brain that was designed 100,000 years before cars existed, and give it minimal instructions and a week to mess around, and now that person is driving the car. If you want an AI to drive a car, it’s … not that.
| Teaching a human to drive a car / teleoperate a robot | Teaching an AI to drive a car / teleoperate a robot |
| --- | --- |
| Minimal instruction, a week of messing around | Dozens of experts, 15 years, $5,000,000,000 |
Anyway, human brains are the only known example of “general intelligence”, and they are “RL agents” in the relevant sense (more on which below). Additionally, as mentioned above, people are working in this direction as we speak. So, seems like there’s plenty of reason to take RL agents seriously.
So the upshot is: we should contingency-plan for real RL agent AGIs—for better or worse.
If we’re talking about RL agents, then we need to talk about reward functions. Reward functions are a tiny part of the source code, with a massive influence on what the AI winds up doing.
For example, take an RL agent like AlphaZero, and give it a reward of +1 for winning at a board game and –1 for losing. As you train it, it will get better and better at winning. Alternatively, give it a reward of –1 for winning and +1 for losing. It will get better and better at losing. So if the former winds up superhuman at Reversi / Othello, then the latter would wind up superhuman at “Anti-Reversi”—an entirely different game! Again, tiny code change, wildly different eventual behavior.
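To make the “tiny code change” point concrete, here’s a minimal sketch (my own illustration, not AlphaZero’s actual code) of what such a board-game reward function might look like:

```python
# Minimal sketch of a board-game reward function (illustrative only, not
# AlphaZero's actual code). The entire "tiny code change" is two sign flips.

def reward(game_outcome: str) -> float:
    """Reward delivered at the end of a game of Reversi / Othello."""
    if game_outcome == "win":
        return +1.0   # flip this to -1.0 ...
    if game_outcome == "loss":
        return -1.0   # ... and this to +1.0, and the very same training loop
                      # now produces a superhuman "Anti-Reversi" player
    return 0.0        # draw
```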
I claim that if you give a powerful RL agent AGI the wrong reward function, then it winds up with callous indifference to whether people live or die, including its own programmers and users.
But what’s the right reward function? No one knows. It’s an open question.
Why is that such a hard problem? It’s a long story, but just as one hint, try comparing these two reward functions:

- Negative reward whenever the AI behaves deceptively.
- Negative reward whenever the AI gets caught behaving deceptively.
The first one seems like a good idea. The second one seems like a bad idea. But these are actually the same thing, because obviously the reward function will only trigger if the AI gets caught.
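To spell that out in code, here’s a minimal sketch (my own illustration, with a hypothetical `deception_detector` and log format, not anyone’s real oversight system) of why those two reward functions are, in practice, the same program:

```python
# Toy sketch: a reward function is just code, so it can only respond to
# misbehavior that some monitoring process actually detects.

def deception_detector(behavior_log) -> bool:
    """Hypothetical monitor: returns True only for deception it catches.
    Undetected deception looks exactly like honest behavior from here."""
    return any(event.get("flagged_as_deceptive", False) for event in behavior_log)

def reward(behavior_log) -> float:
    # We *intend* this to mean "negative reward for deception", but what it
    # actually computes is "negative reward for getting caught" -- the two
    # are indistinguishable at the level of code.
    return -1.0 if deception_detector(behavior_log) else 0.0
```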
As it turns out, if you pick up a 300-page RL textbook, you’ll probably find that it spends a few paragraphs on what the reward function should be, while the other 299½ pages are ultimately about how to maximize that reward function—how reward signals update the trained model, how the trained model is queried, and sometimes also how predictive learning fits in, etc.
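You can see that imbalance even in a toy tabular Q-learning loop (a standard textbook algorithm, sketched here from memory with a hypothetical `env` object providing `reset()`, `step()`, and an `actions` list): the reward function shows up as a single call, and everything else is the machinery the other 299½ pages are about.

```python
# Toy tabular Q-learning (standard textbook RL, sketched for illustration;
# `env` is a hypothetical environment with reset(), step(), and `actions`).

import random
from collections import defaultdict

def q_learning(env, reward_fn, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # the "trained model": a table of state-action values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Querying the trained model (epsilon-greedy action selection)
            if random.random() < eps:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, done = env.step(state, action)

            # The reward function's entire appearance in the algorithm:
            r = reward_fn(state, action, next_state)

            # The update rule -- this, plus everything around it, is what
            # fills the rest of the textbook.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```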
…And it turns out that there’s a similar imbalance in neuroscience:
The human brain also has an RL reward function. It’s sometimes referred to as “innate drives”, “primary rewards”, “primary punishers”, etc.—things like ‘pain is bad’ and ‘eating when you’re hungry is good’. And just like in RL, the overwhelming majority of effort in AI-adjacent neuroscience concerns how the reward function updates the trained models, how those trained models get updated in other ways, how they are queried, and so on. This part involves the cortex, basal ganglia, and other brain areas. Meanwhile, approximately nobody in NeuroAI cares about the reward function itself, which mainly involves the hypothalamus and brainstem.
So here’s the upshot: let’s learn from biology, let’s innovate in AI, let’s focus on AI Alignment, and maybe we can get into this Venn diagram intersection, where we can make headway on the question of what kind of reward function would lead to an AGI that intrinsically cares about our welfare. As opposed to callous sociopath AGI. (Or if no such reward function exists, then that would also be good to know!)
You might hope that the people working most furiously to make RL agent AGI—and claiming that they’ll get there in as little as 10 or 20 years—are thinking very hard about this reward function question.
Nope!
For example, see:
…And those are good ones, by the standards of this field! Their proposals are fundamentally doomed, but at least it occurred to them to have a proposal at all. So hats off to them—because most researchers in RL and NeuroAI don’t even get that far.
Let’s all try to do better! Going back to that Venn diagram above…
For the “reward functions in biology” part, a key observation is that the human brain reward function leads to compassion, norm-following, and so on—at least, sometimes. How does that work?
If we can answer that question, it might be a jumping-off point for AGI reward functions.
I worked on this neuroscience problem for years, and wound up with some hypotheses. See “Neuroscience of human social instincts: a sketch” for where I’m at. But it needs much more work, especially connectomic and other experimental data to ground the armchair hypothesizing.
Meanwhile on the AI side, there’s been some good work clarifying the problem—for example people talk about inner and outer misalignment and so on—but there’s no good solution. I think we need new ideas. I think people are thinking too narrowly about what reward functions can even look like.
For a snapshot of my own latest thinking on that topic, see my companion post “Reward Function Design: a starter pack”.
To close out, here’s the bigger picture as I see it.
Aligning “RL agent AGI” is different from (and much harder than) aligning the LLMs of today. And the failures will be more like “SkyNet” from Terminator, than like “jailbreaks”. (See Foom & Doom 2: Technical alignment is hard.)
…But people are trying to make those agents anyway.
We can understand why they’d want to do that. Imagine unlimited copies of Jeff Bezos for $1/hour. You tell one of them to go write a business plan, and found and grow and run a new company, and it goes and does it, very successfully. Then tell the next one, and the next one. This is a quadrillion-dollar proposition. So that’s what people want.
But instead of “Jeff Bezos for $1/hour”, I claim that what they’re gonna get is “a recipe for summoning demons”.
Unless, of course, we solve the alignment problem!
I think things will snowball very quickly, so we need advance planning. (See Foom & Doom 1.) Building this field of “Reward Function Design” is an essential piece of that puzzle, but there are a great many other things that could go wrong too. We have our work cut out for us.