
This paper describes a kind of "partially embedded" agent: one that explicitly models its reward/utility function (but not other parts of itself) as belonging to the environment and subject to modification, whether by itself or by the environment. It shows that if an agent uses its current utility function to decide what to do, it has no incentive to modify that function, and that if it properly models what would happen were the utility function changed by the environment, it will also want to protect it. The paper spends a lot of pages on a relatively simple, intuitive idea that has been discussed on LW in various forms for at least a decade, but maybe this kind of detailed formal treatment will be useful for getting some people (ML researchers?) to take AI safety ideas more seriously.
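The "use your current utility function to evaluate actions" idea can be sketched in a few lines. This is my own toy illustration, not the paper's formalism; the action names and utility values are made up:

```python
# Toy sketch: an agent that scores actions with its *current* utility
# function has no incentive to self-modify, even when tampering would
# make its *future* utility function report a much higher number.

def current_utility(outcome):
    # The utility function the agent holds right now: it values outcome "A".
    return 1.0 if outcome == "A" else 0.0

def tampered_utility(outcome):
    # The utility function the agent would hold *after* tampering.
    return 10.0 if outcome == "B" else 0.0

actions = {
    "work":   {"outcome": "A", "future_utility_fn": current_utility},
    "tamper": {"outcome": "B", "future_utility_fn": tampered_utility},
}

def naive_value(action):
    # Naive evaluation: score the outcome with whatever utility function
    # the agent will hold at the time the outcome occurs.
    a = actions[action]
    return a["future_utility_fn"](a["outcome"])

def current_value(action):
    # Current-utility evaluation: score the outcome with the utility
    # function held *now*, regardless of later modifications.
    return current_utility(actions[action]["outcome"])

best_naive = max(actions, key=naive_value)      # "tamper": 10.0 beats 1.0
best_current = max(actions, key=current_value)  # "work": 1.0 beats 0.0
print(best_naive, best_current)
```

The naive evaluator is drawn toward tampering because it scores the tampered future by the tampered function's own inflated output; the current-utility evaluator sees nothing to gain from it.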

I can't resist giving this pair of rather incongruous quotes from the paper:

Fortunately and perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function.

A known reward function also brings RL closer to the frameworks of decision theory and game theory (Osborne, 2003; Steele and Stefansson, 2016), where agents are usually aware of their own reward (or utility) function, and are only uncertain about the outcomes for different actions or policies.

I can't resist giving this pair of rather incongruous quotes from the paper

Could you spell out what makes the quotes incongruous with each other? It's not jumping out at me.

The authors acknowledge that the modifications they made to RL "brings RL closer to the frameworks of decision theory and game theory" (AFAICT, the algorithms they end up with are nearly pure decision/game theory). But given that some researchers have focused on decision theory for a long time precisely because a decision-theoretic agent can be reflectively stable, it seems incongruous to also write "perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function."

We didn't expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.

Ah, that makes sense. I kind of guessed that the target audience is RL researchers, but still misinterpreted "perhaps surprisingly" as a claim of novelty instead of an attempt to raise the interest of the target audience.

Honestly? I feel like this same set of problems gets re-solved a lot. I'm worried that it's a sign of ill health for the field.

I think we understand certain technical aspects of corrigibility (indifference and CIRL), but have hit a brick wall in certain other aspects (things that require sophisticated "common sense" about AIs or humans to implement, philosophical problems about how to get an AI to solve philosophical problems). I think this is part of what leads to re-treading old ground when new people (or a person wanting to apply a new tool) try to work on AI safety.

On the other hand, I'm not sure we've exhausted Concrete Problems yet. Yes, the answer is often "just have sophisticated common sense," but I think the value is in exploring the problems and generating elegant solutions so that we can deepen our understanding of value functions and agent behavior (like TurnTrout's work on low-impact agents). In fact, Tom's a co-author on a very good toy problems paper, many of which require a similar sort of one-off solution that still might advance our technical understanding of agents.

Yeah, unless I'm missing something, this is the solution to the "easy problem of wireheading" as discussed in Abram Demski's Stable Pointers to Value II: Environmental Goals.

Still, I say kudos to the authors for making progress on exactly how to put that principle into practice.

Hey Steve,

Thanks for linking to Abram's excellent blog post.

We should have pointed this out in the paper, but there is a simple correspondence between Abram's terminology and ours:

Easy wireheading problem = reward function tampering.

Hard wireheading problem = feedback tampering.

Our current-RF optimization corresponds to Abram's observation-utility agent.

We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which I don't think fits into Abram's distinction.

Hey Charlie,

Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone's liking, let me just give a little intro / context for it here.

The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.

As a first step along this path, I tried to categorize problems with RL, and see which solutions applied to which categories. For this purpose, I found causal graphs valuable (thesis), and I later realized that causal influence diagrams (CID) provided an even better foundation. Any problem corresponds to an 'undesired path' in a CID, and basically all the solutions correspond to ways of getting rid of that path. As highlighted in the introduction of the paper, I now view this insight as one of the most useful ones.

Another important contribution of the paper is pinpointing which solution idea solves which type of reward tampering problem, and a discussion of how the solutions might fit together. I see this as a kind of stepping stone towards more empirical RL work in this area.

Third, the paper puts a fair bit of emphasis on giving brief but precise summaries of previous ideas in the safety literature, and may therefore serve as a kind of literature review. You are absolutely right that solutions to reward function tampering (often more loosely referred to as wireheading) have been around for quite some time. However, the explanations of these methods have been scattered across a number of papers, using a number of different frameworks and formalisms.

Tom

Sure. On the one hand, xkcd. On the other hand, if it works for you, that's great and absolutely useful progress.

I'm a little worried about direct applicability to RL because the model is still not fully naturalized: actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the "right" answer is "sophisticated common sense," but an ad-hoc mostly-answer would still be useful conceptual progress.

Actually, I would argue that the model is naturalized in the relevant way.

When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.

As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish effects that we'd like the agent to use from effects that we don't want it to use.

The current-RF solution doesn't rely on this distinction; it only relies on query access to the reward function (which you could easily give an embedded RL agent).

The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass through the reward function have been removed.
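A minimal sketch of what query access buys you (my own illustration, not the paper's setup; the environment, actions, and reward values here are hypothetical): the planner always scores states by querying the reward function it holds now, never the possibly-tampered copy stored in the environment state.

```python
# Minimal current-RF sketch: tampering rewrites the reward function stored
# in the environment, but planning always queries R, the agent's *current*
# reward function, so the tampering path contributes nothing to the value.

def R(state):
    # Current reward function: reward for reaching the "goal" cell.
    return 1.0 if state["pos"] == "goal" else 0.0

def step(state, action):
    # Environment dynamics: actions just change the environment state,
    # including (for "hack") the part of the state holding the reward function.
    s = dict(state)
    if action == "move":
        s["pos"] = "goal"
    elif action == "hack":
        s["reward_fn"] = lambda st: 100.0  # inflated, tampered reward
    return s

def current_rf_return(state, plan):
    # Evaluate a plan by querying R on each visited state --
    # never the environment's (possibly tampered) state["reward_fn"].
    total = 0.0
    for a in plan:
        state = step(state, a)
        total += R(state)
    return total

start = {"pos": "start", "reward_fn": R}
print(current_rf_return(start, ["move"]))  # reaching the goal pays off
print(current_rf_return(start, ["hack"]))  # tampering earns nothing
```

Nothing here labels "hack" as special; it is an ordinary environment-affecting action. It simply no longer lies on any path from action to evaluated reward.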

I feel like this same set of problems gets re-solved a lot. I'm worried that it's a sign of ill health for the field.

Maybe the problem is getting everyone on the same page.

Yes, that is partly what we are trying to do here. By summarizing some of the "folklore" in the community, we'll hopefully be able to get new members up to speed more quickly.

I think this is a good sign: this paper goes over many of the ideas that the RatSphere has discussed for years, and DeepMind is giving those ideas publicity. It also brings up preliminary solutions, of which "Model-Based Rewards" seems to go farthest in the right direction (although even the paper admits the idea has been around since 2011).

However, the paper still phrases things in terms of additive reward functions, which don't naturally capture many kinds of preferences (such as those over possible worlds). I also feel that the causal influence diagrams, when unrolled for multiple time steps, needlessly complicate the issues being discussed. Most interesting phenomena in decision theory can be captured by simple one- or two-step games or decision trees, so I don't see the need to phrase things as multi-timestep systems. The same goes for presenting the objectives in terms of grid worlds.

Overall, the authors seem to still be heavily influenced by the RL paradigm. It's a good start, we'll see if the rest of the AI community notices.

Thanks for the Dewey reference, we'll add it.