Reward Is Not Enough

Nice post!

I'm generally bullish on multiple objectives, and this post is another independent arrow pointing in that direction. Some other signs which I think point that way:

The argument from Why Subagents?. This is about utility maximizers rather than reward maximizers, but it points in a similar qualitative direction. Summary: once we allow internal state, utility-maximizers are not the only inexploitable systems; markets/committees of utility-maximizers also work.
The argument from Fixing The Good Regulator Theorem. That post uses some incoming information to "choose" between many different objectives, but that's essentially emulating multiple objectives. If we have multiple objectives explicitly, then the argument should simplify. Summary: if we need to keep around information relevant to many different objectives, but have limited space, that forces the use of a map/model in a certain sense.

One criticism: at a few points I think this post doesn't cleanly distinguish between reward-maximization and utility-maximization. For instance, optimizing for "the abstract concept of ‘I want to be able to sing well’" definitely sounds like utility-maximization.

[-]Steven Byrnes4y40

Thanks!

I had totally forgotten about your subagents post.

this post doesn't cleanly distinguish between reward-maximization and utility-maximization

I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model". Then "the reward as currently understood by my predictive model" is basically a utility function. But at the same time, there's a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors).

In other words: At any given time, the part of the agent that's making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent.

Please tell me if that makes any sense or not; I've been planning to write pretty much exactly this comment (but with a diagram) into a short post.
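
A minimal sketch of the split described above, under toy assumptions: the planner only ever consults a learned reward-prediction table (so, taken alone, it looks like a utility maximizer), while a separate update step nudges that table toward the true reward using reward prediction errors. The three actions, the `TRUE_REWARD` values, the learning rate, and the fixed exploration schedule are all made up for illustration; nothing here is meant as a model of a brain or of any particular RL library.

```python
# Hypothetical "true" reward signal; the planner never reads this directly.
TRUE_REWARD = {"a": 0.0, "b": 1.0, "c": 0.5}


class Agent:
    def __init__(self, actions, learning_rate=0.5):
        self.actions = list(actions)
        # Reward-prediction model: plays the role of a utility function.
        self.predicted_reward = {a: 0.0 for a in self.actions}
        self.learning_rate = learning_rate

    def plan(self):
        # Planner: maximizes "the reward as currently understood by the model".
        return max(self.actions, key=lambda a: self.predicted_reward[a])

    def update_model(self, action, reward):
        # Separate subroutine: edits the model using the reward prediction error.
        error = reward - self.predicted_reward[action]
        self.predicted_reward[action] += self.learning_rate * error
        return error


def run(steps=60):
    agent = Agent(TRUE_REWARD)
    for t in range(steps):
        if t % 5 == 0:
            # Deterministic exploration (an assumption): cycle through actions.
            action = agent.actions[(t // 5) % len(agent.actions)]
        else:
            action = agent.plan()
        agent.update_model(action, TRUE_REWARD[action])
    return agent
```

Calling `run().plan()` exercises only the utility-maximizer-like part; the loop inside `run` is the whole system, planner plus model-editing subroutine, which is the thing that behaves like a reward-maximizing RL agent.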

[-]johnswentworth4y30

Good explanation, conceptually.

Not sure how all the details play out - in particular, my big question for any RL setup is "how does it avoid wireheading?". In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.

[-]Steven Byrnes4y30

how does it avoid wireheading

Um, unreliably, at least by default. Like, some humans are hedonists, others aren't.

I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.)

Anyway, insofar as "the reward signal itself" is part of the world-model, it's possible that reward-prediction / value will wind up attached to that concept. And then that's a desire to wirehead. But it's not inevitable. Some of the relevant dynamics are:

Timing—if credit goes mainly to signals that slightly precede the reward prediction error, then the reward signal itself is not a great fit.
Explaining away—once you have a way to accurately predict some set of reward signals, it makes the reward prediction errors go away, so the credit assignment algorithm stops running for those signals. So the first good reward-predicting model gets to stick around by default. Example: we learn early in life that the "eating candy" concept predicts certain reward signals, and then we get older and learn that the "certain neural signals in my brain" concept predicts those same reward signals too. But just learning that fact doesn't automatically translate into "I really want those certain neural signals in my brain". Only the credit assignment algorithm can make a thought appealing, and if the rewards are already being predicted then the credit assignment algorithm is inactive. (This is kinda like the behaviorism concept of blocking.)
There may be some kind of bias to assign credit to predictive models that are simple functions of sensory inputs, when such a model exists, other things equal. (I'm thinking here of the relation between amygdala predictions, which I think are restricted to relatively simple functions of sensory input, versus mPFC predictions, which I think can involve more abstract situational knowledge. I'm still kinda confused about how this works though.)
There's a difference between hedonism-lite ("I want to feel good, although it's not the only thing I care about") and hedonism-level-10 ("I care about nothing whatsoever except feeling good"). My model would suggest that hedonism-lite is widespread, but hedonism-level-10 is vanishingly rare or nonexistent, because it requires that somehow all value gets removed from absolutely everything in the world-model except that one concept of the reward signal.
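
The "explaining away" dynamic can be illustrated with the textbook Rescorla-Wagner update, which is the standard toy model of blocking. This is a swapped-in stand-in, not the credit assignment algorithm hypothesized above: it ignores timing entirely and spreads each prediction error equally over whatever concepts are active. The concept names and the learning rate are assumptions for the example.

```python
def rescorla_wagner(trials, learning_rate=0.3):
    """Learn a value per concept from (active_concepts, reward) trials."""
    values = {}
    for active, reward in trials:
        # The prediction is the summed value of all currently active concepts.
        prediction = sum(values.get(c, 0.0) for c in active)
        error = reward - prediction  # reward prediction error
        # Credit goes only to active concepts, and only if there is an error.
        for c in active:
            values[c] = values.get(c, 0.0) + learning_rate * error
    return values


# Phase 1: "eating candy" alone predicts the reward.
phase1 = [({"eating_candy"}, 1.0)] * 50
# Phase 2: a second concept is now also active alongside the first.
phase2 = [({"eating_candy", "neural_signal"}, 1.0)] * 50

blocked = rescorla_wagner(phase1 + phase2)  # candy learned first
control = rescorla_wagner(phase2)           # both learned together
```

In `blocked`, the later concept ends up with almost no value because the reward was already predicted by the time it appeared; in `control`, where neither concept had a head start, the two split the credit.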

For AGIs we would probably want to do other things too, like (somehow) use transparency to find "the reward signal itself" in the world-model and manually fix its reward-prediction / value at zero, or whatever else we can think of. Also, I think the more likely failure mode is "wireheading-lite", where the desire to wirehead is trading off against other things it cares about, and then hopefully conservatism (section 2 here) can help prevent catastrophe.

[-]Donald Hobson4y50

I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely.

In particular, I think this is the sort of trick that usually produces fairly good results when you have a fixed environment, and can optimize the parameters and settings for that environment. Evolution can try millions of birds, tweaking the strengths of desire, to get something that kind of works. When the environment is changing rapidly, when the relative capabilities of cognitive modules are highly uncertain, and when self-modification is on the table, these tricks will tend to fail (I think).

Use the same brain architecture in a moderately different environment, and you get people freezing their credit card in blocks of ice so they can't spend it, and other self-defeating behaviour. I suspect the tricks will fail much worse with any change to mental architecture.

On your equivalence to an AI with an interpretability/oversight module. Data shouldn't be flowing back from the oversight into the AI.

[-]Steven Byrnes4y30

On your equivalence to an AI with an interpretability/oversight module. Data shouldn't be flowing back from the oversight into the AI.

Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup.

I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely.

Sure, that's possible.

My "negative" response is: There's no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about "subagent"-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there's no way to safely deal with that kind of situation, then I think we're doomed. Why do I think that? For one thing, as I wrote in the text, it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". If we draw a box around the AGI to also include things like gradient updates, or online feedback from humans, then we're definitely in that situation, because these are subsystems that are manipulating the AGI and don't share the AGI's (current) goals. For another thing: it's a complicated world and the AGI is not omniscient. If you think about logical induction, the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn't expect nice well-formed self-consistent hypotheses attached to probabilities, you should expect a pile of partial patterns (i.e. hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions, and "bid against each other". Now just apply exactly that same reasoning to "having desires about the state of the (complicated) world", and you wind up concluding that "subagents working against each other" is a default expectation and maybe even inevitable.

My "positive" response is: I certainly wouldn't propose to set up a promising-sounding reward system and then crack a beer and declare that we solved AGI safety. First we need a plan that might work (and we don't even have that yet, IMO!) and then we think about how it might fail, and how to modify the plan so that we can reason more rigorously about how it would work, and add in extra layers of safety (like testing, transparency, conservatism, boxing) in case even our seemingly-rigorous reasoning missed something, and so on.

[-]sebastiankosch4y10

I found myself struggling a bit to really wrap my mind around your phasic dopamine post, and the concrete examples in here helped me get a much clearer understanding. So thank you for sharing them!

Do you have a sense of how these very basic, low-level instances of reward-based learning (birds learning to sing, campers learning to avoid huge spiders, etc.) map to more complex behaviours, particularly in the context of human social interaction? For instance, would it be reasonable to associate psychological "parts", to borrow from Internal Family Systems terminology, with spaghetti towers of submodules that each have their own reward signal (or perhaps some of them might share particular types of reward signals)?

[-]Steven Byrnes4y*30

Thanks!

My current working theory of human social interactions does not involve multiple reward signals. Instead it's a bunch of rules like "If you're in state X, and you empathetically simulate someone in state Y, then send reward R and switch to state Z". See my post "Little glimpses of empathy" as the foundation of social emotions. These rules would be implemented in the hypothalamus and/or brainstem.

(Plus some involvement from brainstem sensory-processing circuits that can run hardcoded classifiers that return information about things like whether a person is present right now, and maybe some aspects of their tone of voice and facial expressions, etc. Then those data can also be inputs to the "bunch of rules".)
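
As a purely illustrative sketch, a "bunch of rules" of this shape could be written as a lookup table keyed on (own state, empathetically simulated state) that returns a reward and a next state. Every state name and reward value below is hypothetical, chosen only to show the data structure; none of them come from the post or the literature.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    reward: float
    next_state: str


# Hypothetical rules: (own state X, simulated state Y) -> (reward R, new state Z).
RULES = {
    ("neutral", "distress"): Outcome(-1.0, "concerned"),
    ("neutral", "joy"): Outcome(0.5, "content"),
}


def apply_rule(own_state, simulated_state, rules=RULES):
    # If no rule matches, send no reward and stay in the current state.
    return rules.get((own_state, simulated_state), Outcome(0.0, own_state))
```

For example, `apply_rule("neutral", "distress")` returns the reward and next state from the first rule, while an unmatched pair falls through to zero reward and no state change. Outputs of hardcoded classifiers could be added as extra fields in the key.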

I haven't thought it through in any level of detail or read the literature (except superficially). Maybe ask me again in a few months… :-)

[-]M. Y. Zuo4y10

Interesting post!

Query: How do you define ‘feasibly’, as in ‘Incentive landscapes that can’t feasibly be induced by a reward function’?

From my perspective, all possible incentive landscapes can be induced by reward, given sufficient time and energy. Of course, a large set of these are beyond the capacity of present human civilization.

[-]Steven Byrnes4y00

Right, the word "feasibly" is referring to the bullet point that starts "Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?”". Here's a little toy example we can run with: teaching an AGI "don't kill all humans". So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):

The agent learns by experiencing the reward. This doesn't work for "don't kill all humans" because when the reward happens it's too late.
The reward calculator is sophisticated enough to understand what the agent is thinking, and issue rewards proportionate to the probability that the current thoughts and plans will eventually lead to the result-in-question happening. So the AGI thinks "hmm, maybe I'll blow up the sun", and the reward calculator recognizes that merely thinking that thought just now incrementally increased the probability that the AGI will kill all humans, and so it issues a negative reward. This is tricky because the reward calculator needs to have an intelligent understanding of the world, and of the AGI's thoughts. So basically the reward calculator is itself an AGI, and now we need to figure out its rewards. I'm personally quite pessimistic about approaches that involve towers-of-AGIs-supervising-other-AGIs, for reasons in section 3.2 here, although other people would disagree with me on that (partly because they are assuming different AGI development paths and architectures than I am).
Same as above, but instead of a separate reward calculator estimating the probability that a thought or plan will lead to the result-in-question, we allow the AGI itself to do that estimation, by flagging a concept in its world-model called "I will kill all humans", and marking it as "very bad and important" somehow. (The inspiration here is a human who somehow winds up with the strong desire "I want to get out of debt". Having assigned value to that abstract concept, the human can assess for themselves the probabilities that different thoughts will increase or decrease the probability of that thing happening, and sorta issue themselves a reward accordingly.) The tricky part is (A) making sure that the AGI does in fact have that concept in its world-model (I think that's a reasonable assumption, at least after some training), (B) finding that concept in the massive complicated opaque world-model, in order to flag it. So this is the symbol-grounding problem I mentioned in the text. I can imagine solving it if we had really good interpretability techniques (techniques that don't currently exist), or maybe there are other methods, but it's an unsolved problem as of now.
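
The third approach can be sketched using the debt example, assuming away the hard part: the code below simply takes for granted that the relevant concept has already been located and flagged, which is exactly the unsolved step. The concept name, its value, the plans, and the probabilities are all hypothetical.

```python
# Hypothetical flagged concepts and the value attached to each.
# Locating these concepts in a real world-model is the unsolved part.
FLAGGED_VALUES = {"out_of_debt": 10.0}


def self_issued_reward(outcome_probs, flagged=FLAGGED_VALUES):
    """Score a plan by the agent's own probability estimates for flagged concepts."""
    return sum(
        flagged.get(concept, 0.0) * prob
        for concept, prob in outcome_probs.items()
    )


def choose_plan(plans):
    """Pick the plan whose estimated outcomes score highest."""
    return max(plans, key=lambda name: self_issued_reward(plans[name]))


# The agent's own (made-up) estimates of how each plan affects the flagged concept.
plans = {
    "take_extra_shift": {"out_of_debt": 0.6},
    "buy_new_car": {"out_of_debt": 0.1},
}
best = choose_plan(plans)
```

Here the agent, not an external reward calculator, supplies the probabilities; a strongly negative value on a flagged concept would work the same way, steering plans away from it.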

[-]M. Y. Zuo4y00

Could the hypothetical AGI be developed in a simulated environment and trained with proportionally lower consequences?

[-]Steven Byrnes4y20

I'm all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.

So, maybe, but for now I would call that "an intriguing research direction" rather than "a solution".

AI ALIGNMENT FORUM

Three case studies

1. Incentive landscapes that can’t feasibly be induced by a reward function

2. Wishful thinking

3. Deceptive AGIs

Does an agent need to be "unified" to be reflectively stable?

The Fraught Valley

Finally we get to the paper “Reward Is Enough” by Silver, Sutton, et al.