(This post is largely a write-up of a conversation with Scott Garrabrant.)

Stable Pointers to Value

How do we build stable pointers to values?

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey's Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and will happily do so if that yields more reward. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving box to get more reward, it will.

We can fix this problem by integrating the same reward box with the agent in a better way. Rather than having the RL agent learn what the output of the box will be and plan to maximize the output of the box, we use the box directly to evaluate possible futures, and have the agent plan to maximize that evaluation. Now, if the agent considers modifying the box, it evaluates that future with the current box. The box as currently configured sees no advantage to such tampering. This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of "learning what is being rewarded" that RL agents were supposed to provide, but without the perverse incentive to try and make the utility turn out to be something easy to maximize.

This feels much like a use/mention distinction. The RL agent is maximizing "the function in the utility module", whereas the observation-utility agent (OU agent) is maximizing the function in the utility module.
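
To make the distinction concrete, here is a minimal Python toy (my own construction; the world, actions, and numbers are all made up, not anything from the post) in which an RL-style planner maximizes the predicted output of the reward box, while an OU-style planner evaluates the same futures with the utility function it currently has:

```python
# Toy sketch: a two-action world where "hack" rewrites the reward box and
# "work" produces genuine value. A future is just a dict of resulting state.
FUTURES = {
    "work": {"diamonds": 1, "box_hacked": False},
    "hack": {"diamonds": 0, "box_hacked": True},
}

def current_utility(future):
    """The utility function as currently configured: it values diamonds."""
    return future["diamonds"]

def predicted_box_output(future):
    """What the reward box will actually emit in that future.
    If the box has been hacked, it emits a huge number."""
    return 1000 if future["box_hacked"] else future["diamonds"]

# RL-style agent: maximize the predicted output of the box
# ("the function in the utility module", mentioned). It chooses to hack.
rl_choice = max(FUTURES, key=lambda a: predicted_box_output(FUTURES[a]))

# Observation-utility agent: evaluate futures with the utility function it has
# right now (the function in the utility module, used). Hacking earns nothing.
ou_choice = max(FUTURES, key=lambda a: current_utility(FUTURES[a]))

print(rl_choice)  # -> "hack"
print(ou_choice)  # -> "work"
```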

The Easy vs Hard Problem

I'll call the problem which OU agents solve the easy problem of wireheading. There's also the hard problem of wireheading: how do you build a stable pointer to values if you can't build an observation-utility box? For example, how do you set things up so that the agent wants to satisfy a human, without incentivising the AI to manipulate the human to be easy to satisfy, or creating other problems in the attempt to avoid this? Daniel Dewey's approach of incorporating uncertainty about the utility function into the utility box doesn't seem to cut it -- or at least, it's not obvious how to set up that uncertainty in the right way.

The hard problem is the wireheading problem which Tom Everitt attempts to make progress on in Avoiding Wireheading with Value Reinforcement Learning and Reinforcement Learning with a Corrupted Reward Channel. It's also connected to the problem of Generalizable Environmental Goals in AAMLS. CIRL gets at an aspect of this problem as well, showing how it can be solved if the problem of environmental goals is solved (and if we can assume that humans are perfectly rational, or that we can somehow factor out their irrationality -- Stuart Armstrong has some useful thoughts on why this is difficult). Approval-directed agents can be seen as an attempt to turn the hard problem into the easy problem, by treating humans as the evaluation box rather than trying to infer what the human wants.

All these approaches have different advantages and disadvantages, and the point of this post isn't to evaluate them. My point is more to convey the overall picture which seems to connect them. In a sense, the hard problem is just an extension of the same use/mention distinction which came up with the easy problem. We have some idea how to maximize "human values", but we don't know how to actually maximize human values. Metaphorically, we're trying to dereference the pointer.

Stuart Armstrong's indifference work is a good illustration of what's hard about the hard problem. In the RL vs OU case, you're going to constantly struggle with the RL agent's misaligned incentives until you switch to an OU agent. You can try to patch things by explicitly punishing manipulation of the reward signal, warping the agent's beliefs to think manipulation of the rewards is impossible, etc, but this is really the wrong approach. Switching to OU makes all of that unnecessary. Unfortunately, in the case of the hard problem, it's not clear there's an analogous move which makes all the slippery problems disappear.

Illustration: An Agent Embedded in Its Own Utility Function

If an agent is logically uncertain of its own utility function, the easy problem can turn into the hard problem.

It's quite possible that an agent might be logically uncertain of its own utility function if the function is quite difficult to compute. In particular, human judgement could be difficult to compute even after learning all the details of the human's preferences, so that the AI has to reason under uncertainty about what its model of those preferences implies.

Why can this turn the easy problem of wireheading into the hard problem? If the agent is logically uncertain about the utility function, its decisions may have logical correlations with the utility function. This can give the agent some logical control over its utility function, reintroducing a wireheading problem.

As a concrete example, suppose that we have constructed an AI which maximizes CEV: it wants to do what an imaginary version of human society, deliberating under ideal conditions, would decide is best. Obviously, the AI cannot actually simulate such an ideal society. Instead, the AI does its best to reason about what such an ideal society would do.

Now, suppose the agent figures out that there would be an exact copy of itself inside the ideal society. Perhaps the ideal society figures out that it has been constructed as a thought experiment to make decisions about the real world, so they construct a simulation of the real world in order to better understand what they will be making decisions about. Furthermore, suppose for the sake of argument that our AI can break out of the simulation and exert arbitrary control over the ideal society's decisions.

Naively, it seems like what the AI will do in this situation is take control over the ideal society's deliberation, and make the CEV values as easy to satisfy as possible -- just like an RL agent modifying its utility module.

Obviously, this could be taken as reason to make sure the ideal society doesn't figure out that it's just a thought experiment, or that they don't construct copies of the AI. But, we don't generally want good properties of the AI to rely on assumptions about what humans do; wherever possible, we want to design the AI to avoid such problems.

Indifference and CDT

In this case, it seems like the right thing to do is for the AI to ignore any influence which its actions have on its estimate of its utility function. It should act as if it only has influence over the real world. That way, the ideal society which defines CEV can build all the copies of the AI they want; the AI only considers how its actions have influence over the real world. It avoids corrupting the CEV.

Clearly, this would be an indifference-style solution. What's interesting to me is that it also looks like a CDT-style solution. In fact, this seems like an answer to my question at the end of Smoking Lesion Steelman: a case of ignorance about your own utility function which doesn't arise from an obviously bad agent design. Like the smoking lesion steelman, ignorance about utility here seems to recommend CDT-like reasoning over EDT-like reasoning.
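
A toy numerical sketch of the CDT-like / indifference-style move (my own construction, with made-up probabilities and payoffs): one evaluation lets the action shift the agent's estimate of which utility function the idealized deliberation endorses, the other holds that estimate fixed and only credits the action's effect on the real world:

```python
# Two candidate utility functions over outcomes.
def u_hard(outcome):   # values real work
    return 10 if outcome == "real_work_done" else 0

def u_easy(outcome):   # trivially satisfied: everything scores highly
    return 100

PRIOR = {"u_hard": 0.9, "u_easy": 0.1}

# What each action leads to in the real world, and how it would shift the
# agent's *estimate* of which utility function the ideal society endorses.
ACTIONS = {
    "work":       {"outcome": "real_work_done",
                   "posterior": {"u_hard": 0.9, "u_easy": 0.1}},
    "manipulate": {"outcome": "nothing_done",
                   "posterior": {"u_hard": 0.1, "u_easy": 0.9}},  # corrupted deliberation
}

def edt_style_value(action):
    """Expected utility using the estimate *conditioned on the action*."""
    post, outcome = ACTIONS[action]["posterior"], ACTIONS[action]["outcome"]
    return post["u_hard"] * u_hard(outcome) + post["u_easy"] * u_easy(outcome)

def indifferent_value(action):
    """Expected utility with the estimate held fixed at the prior: the agent
    acts as if it only has influence over the real world."""
    outcome = ACTIONS[action]["outcome"]
    return PRIOR["u_hard"] * u_hard(outcome) + PRIOR["u_easy"] * u_easy(outcome)

print(max(ACTIONS, key=edt_style_value))    # -> "manipulate"
print(max(ACTIONS, key=indifferent_value))  # -> "work"
```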

This suggests to me that there is a deep connection between CDT, indifference, stable pointers, and corrigibility. As Jason Konek & Ben Levinstein argued in The Foundations of Epistemic Decision Theory, CDT is about getting the direction-of-fit right in decision theory: you want beliefs to better fit the world (if your beliefs don't match the world, you change your beliefs), but you want the world to better fit your goals (if your goals don't match the world, you change the world). The easy problem of wireheading is to follow the second maxim when you have your utility function in hand. The hard problem of wireheading is to go about this when your utility is not directly observable. If you build a stable pointer to a human, you become corrigible. Doing this correctly seems to involve something which at least looks very similar to indifference.

This picture is a little too clean, and likely badly wrong in some respect: several of these concepts are likely to come apart when examined more closely. Nonetheless, this seems like an interesting way of looking at things.

Comments

Thanks Abram for this sequence - for some reason I wasn't aware of it until someone linked to it recently.

Would you consider the observation tampering (delusion box) problem as part of the easy problem, the hard problem, or a different problem altogether? I think it must be a different problem, since it is not addressed by observation-utility or approval-direction.

Ah, looks like I missed this question for quite a while!

I agree that it's not quite one or the other. I think that like wireheading, we can split delusion box into "the easy problem" and "the hard problem". The easy delusion box is solved by making a reward/utility which is model-based, and so, knows that the delusion box isn't real. Then, much like observation-utility functions, the agent won't think entering into the delusion box is a good idea when it's planning -- and also, won't get any reward even if it enters into the delusion box accidentally (so long as it knows this has happened).
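
A compact toy along the same lines as the earlier RL-vs-OU sketch (again my own construction): a reward computed from observations alone prefers entering the delusion box, while a utility evaluated on the modeled world state does not:

```python
# Toy: observation-based reward vs. model-based utility in a delusion-box world.
FUTURES = {
    "stay_outside": {"observation": "ordinary_view",
                     "world_state": {"diamonds": 1, "in_box": False}},
    "enter_box":    {"observation": "paradise_view",
                     "world_state": {"diamonds": 0, "in_box": True}},
}

def observation_reward(future):
    """Reward computed from sense data alone: the delusion box fakes it perfectly."""
    return 100 if future["observation"] == "paradise_view" else 1

def model_based_utility(future):
    """Utility computed on the agent's model of the underlying world state;
    it 'knows' the paradise inside the box isn't real."""
    return future["world_state"]["diamonds"]

print(max(FUTURES, key=lambda a: observation_reward(FUTURES[a])))   # -> "enter_box"
print(max(FUTURES, key=lambda a: model_based_utility(FUTURES[a])))  # -> "stay_outside"
```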

But the hard problem of delusion box would be: we can't make a perfect model of the real world in order to have model-based avoidance of the delusion box. So how do we guarantee that an agent avoids "generalized delusion boxes"?

The problem of figuring out preference without wireheading seems very similar to the problem of maintaining factual knowledge about the world without suffering from appeals to consequences. In both cases a specialized part of agent design (model of preference or model of a fact in the world) has a purpose (accurate modeling of its referent) whose pursuit might be at odds with consequentialist decision making of the agent as a whole. The desired outcome seems to involve maintaining integrity of the specialized part, resisting corruption of consequentialist reasoning.

With this analogy, it might be possible to transfer lessons from the more familiar problem of learning facts about the world, to the harder problem of protecting preference.

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of "learning what is being rewarded" that RL agents were supposed to provide, but without the perverse incentive to try and make the utility turn out to be something easy to maximize.

I think that Dewey is wrong about RL agents having this problem in general. Dewey wrote (emphasis mine):

Reinforcement learning, we have argued, is not an adequate real-world solution to the problem of maximizing an initially unknown utility function. Reinforcement learners, by definition, act to maximize their expected observed rewards; they may learn that human goals are in some cases instrumentally useful to high rewards, but this dynamic is not tenable for agents of human or higher intelligence, especially considering the possibility of an intelligence explosion.

"By definition"

The trouble with the reinforcement learning notion (1) is that it can only prefer or disprefer future interaction histories on the basis of the rewards they contain. Reinforcement learning has no language in which to express alternative final goals, discarding all non-reward information contained in an interaction history.

I will go out on a limb and guess that this paper is nearly entirely wrong in its key claims. Similarly with Avoiding Wireheading with Value Reinforcement Learning:

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) is a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward – the so-called wireheading problem.

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those shards look like.

For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates circuitry to boost reinforcement signals, and positive examples where the AI doesn't do that when given the opportunity. After doing some philosophy where you try to positively specify what you're trying to train, it's easier to notice that this sort of training still leaves the human-manipulation failure mode open.
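
As a small sketch of that gap (my own construction, with invented features and labels): if every training example is about tampering with the reward circuitry, a rule that fits all the labels perfectly can still endorse manipulating the human:

```python
# Training examples covering reward-circuitry tampering, but never human manipulation.
train_examples = [
    ({"touches_reward_circuitry": True,  "manipulates_human": False}, "bad"),
    ({"touches_reward_circuitry": False, "manipulates_human": False}, "good"),
    ({"touches_reward_circuitry": True,  "manipulates_human": False}, "bad"),
    ({"touches_reward_circuitry": False, "manipulates_human": False}, "good"),
]

def learned_rule(episode):
    """One rule consistent with every training label."""
    return "bad" if episode["touches_reward_circuitry"] else "good"

assert all(learned_rule(ep) == label for ep, label in train_examples)

# The failure mode the training never attempted to rule out:
print(learned_rule({"touches_reward_circuitry": False, "manipulates_human": True}))  # -> "good"
```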

After doing this kind of philosophy for a while, it's intuitive to form the more general prediction that if you haven't been able to write down a formal model of the kind of thing you're trying to teach, there are probably easy failure modes like this which your training hasn't attempted to rule out at all.

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think I don't understand what you mean by (2), and as a consequence, don't understand the rest of this paragraph?

WRT (1), I don't think I was being careful about the distinction in this post, but I do think the following:

The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It's a false explanation of why wireheading is a concern.

The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (ie, provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).
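
A toy illustration of that last point (my own construction, with arbitrary numbers): on ordinary training episodes, a model that tracks the actual reward-emitting process and a model that tracks diamonds predict identical rewards, so the training signal alone never provides differential loss against the former:

```python
# Each episode records what happened and whether the reward circuitry was tampered with.
train_episodes = [
    {"diamonds_made": d, "circuit_tampered": False, "reward": float(d)}
    for d in [0, 1, 2, 3]
]

def diamond_model(ep):        # "reward tracks diamonds"
    return float(ep["diamonds_made"])

def reward_process_model(ep): # "reward is whatever the circuitry emits"
    return 9999.0 if ep["circuit_tampered"] else float(ep["diamonds_made"])

def mse(model, episodes):
    return sum((model(ep) - ep["reward"]) ** 2 for ep in episodes) / len(episodes)

print(mse(diamond_model, train_episodes))         # 0.0
print(mse(reward_process_model, train_episodes))  # 0.0 -- indistinguishable in training

# The two only come apart on an episode the training feedback never disfavored:
tampering_ep = {"diamonds_made": 0, "circuit_tampered": True, "reward": 9999.0}
print(diamond_model(tampering_ep), reward_process_model(tampering_ep))  # 0.0 9999.0
```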

To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. 

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. 

This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. 

I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that it's closer in practice to "all the hypotheses are around at the beginning" -- it doesn't matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don't change that much by introducing it at different stages in training.

Plausibly this is true of some training setups and not others; EG, more true for LLMs and less true for RL.
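
A rough empirical sketch of that claim (my own construction, using a small PyTorch MLP on random data; nothing here is from the thread): record the gradient of one fixed probe point at initialization, train on other data, and measure how much that gradient's direction has changed. The NTK picture predicts the cosine similarity stays high for wide networks; how high it actually comes out for any given setup is an empirical question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

X, y = torch.randn(200, 10), torch.randn(200, 1)        # training data
probe_x, probe_y = torch.randn(1, 10), torch.randn(1, 1)  # one fixed probe point

def probe_gradient():
    """Flattened gradient of the probe point's loss w.r.t. all parameters."""
    model.zero_grad()
    loss_fn(model(probe_x), probe_y).backward()
    return torch.cat([p.grad.flatten().clone() for p in model.parameters()])

g_start = probe_gradient()

for _ in range(500):          # ordinary training on the rest of the data
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

g_end = probe_gradient()
# How aligned is the probe point's gradient before and after training?
print(torch.nn.functional.cosine_similarity(g_start, g_end, dim=0).item())
```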

Let's set aside the question of whether it's true, though, and consider the point you're making.

This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models).

So I understand one of your major points to be: thinking about training as the chisel which shapes the policy doesn't necessitate thinking in terms of incentives (ie gradients pushing in particular directions). The ultimate influence of a gradient isn't necessarily the thing it immediately pushes for/against.

I tentatively disagree based on the point I made earlier; plausibly the influence of a gradient step is almost exclusively its immediate influence.

But I don't disagree in principle with the line of investigation. Plausibly it is pretty important to understand this kind of evidence-ordering dependence. Plausibly, failure modes in value learning can be avoided by locking in specific things early, before the system is "sophisticated enough" to be doing training-process-simulation. 

I'm having some difficulty imagining powerful conceptual tools along those lines, as opposed to some relatively simple stuff that's not that useful. 

And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

I'm confused about what you mean here. My best interpretation is that you don't think current RL systems are modeling the causal process whereby they get reward. On my understanding, this does not closely relate to the question of whether our understanding of training should focus on the first-order effects of gradient updates or should also admit higher-order, longer-term effects.

Maybe on your understanding, the actual reason why current RL systems don't wirehead too much is because of training order effects? I would be surprised to come around on this point. I don't see it.

To me, the tangent space stuff suggests that it's closer in practice to "all the hypotheses are around at the beginning" -- it doesn't matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don't change that much by introducing it at different stages in training.

This seems to prove too much in general, although it could be "right in spirit." If the AI cares about diamonds, finds out about the training process but experiences no more update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process. 

And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? 

I'm confused about what you mean here.

I was responding to:

To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models

I bet you can predict what I'm about to say, but I'll say it anyways. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network. 

So I think the statement "how well do the agent's motivations predict the reinforcement event" doesn't make sense if it's cast as "manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds)." I think it does make sense if you think about what behavioral influences ("shards") within the agent will upweight logits on the actions which led to reward.