Meta: This comment has my thoughts about the paper Pitfalls of Learning a Reward Function Online. I figure I should post them here so that others looking for comments on the paper might find them.

I read the paper back in 2020; it has been on my backlog ever since to think more about it and share my comments. Apologies for the delay, etc.

Mathematical innovation

First off, I agree with the general observations in the introduction that there are pitfalls to learning a reward function online, with a human in the loop.

The paper looks at options for removing some of these pitfalls, or at least making them less dangerous. The research agenda pursued by the paper is one I like a lot: an agenda of mathematical innovation. The paper mathematically defines certain provable safety properties (uninfluencability and unriggability), and also explores how useful these might be.

Similar agendas of mathematical innovation can be found in the work of Everitt et al., for example in Agent Incentives: A Causal Perspective, and in my work, for example in AGI Agent Safety by Iteratively Improving the Utility Function. These also use causal influence diagrams in some way, and try to develop them in a way that is useful for defining and analyzing AGI safety. My personal intuition is that we need more of this type of work: this agenda is important to advancing the field.

The math in the paper

That being said: the bad news is that I believe that the mathematical route explored by Pitfalls of Learning a Reward Function Online is most likely a dead end. Understanding why is of course the interesting bit.

The main issue I will explore is: we have a mathematical property that we label with the natural language word 'uninfluencability'. But does this property actually produce the beneficial 'uninfluencability' effects we are after? Section 4 in the paper also explores this issue, and shows some problems. My main goal here is to add further insights.

My feeling is that 'uninfluencability', the mathematical property as defined, does not produce the effects I am after. To illustrate this, my best example is as follows. Take a reward function $R^{s}$ that measures the amount of smiling by the human teaching the agent, observed over the entire history $h_{n}$. Take a reward function learning process which assumes (in its prior $ρ$) that the probability of the choice for this reward function at the end of the history, $P (R^{s} | h_{n}, ρ)$, cannot be influenced by the actions taken by the agent during the history; for example, $ρ$ is such that $\forall_{h_{n}} P (R^{s} | h_{n}, ρ) = 1$. This reward function learning process is unriggable. But the agent using this reward function learning process also has a major incentive to manipulate the human teacher into smiling, by injecting them with smile-inducing drugs, or whatever.
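To make the smile example concrete, here is a minimal toy sketch. Everything in it is my own illustrative assumption and not taken from the paper: the two actions, the smile probabilities, the three-step horizon, and all function names are hypothetical. The sketch checks that the distribution over learned reward functions is the same under every policy, while the expected reward under $R^{s}$ still differs between an honest and a manipulative policy.

```python
from itertools import product

# Hypothetical toy world (assumed numbers, for illustration only):
# each step the agent either teaches honestly or drugs the teacher.
SMILE_PROB = {"teach": 0.5, "drug": 1.0}
N_STEPS = 3


def smile_reward(history):
    """R^s: number of smiles observed over the whole history h_n."""
    return sum(obs for _action, obs in history)


def learning_process(history):
    """rho: distribution over reward functions given h_n.

    By assumption it picks R^s with probability 1 for every history.
    """
    return {"R_s": 1.0}


def histories(policy):
    """Yield (history, probability) pairs for a fixed open-loop policy."""
    for observations in product((0, 1), repeat=N_STEPS):
        prob = 1.0
        for action, obs in zip(policy, observations):
            p = SMILE_PROB[action]
            prob *= p if obs == 1 else 1.0 - p
        if prob > 0:
            yield tuple(zip(policy, observations)), prob


def expected_learned_distribution(policy):
    """Distribution over reward functions at the end, averaged over histories."""
    total = {}
    for history, prob in histories(policy):
        for name, weight in learning_process(history).items():
            total[name] = total.get(name, 0.0) + prob * weight
    return total


def expected_reward(policy):
    """Expected value of R^s under the given policy."""
    return sum(prob * smile_reward(h) for h, prob in histories(policy))


honest = ("teach",) * N_STEPS
manipulative = ("drug",) * N_STEPS

print(expected_learned_distribution(honest))
print(expected_learned_distribution(manipulative))
print(expected_reward(honest), expected_reward(manipulative))
```

Under these assumed numbers, both policies leave the learned reward function at $R^{s}$ with probability 1, so no policy can shift what is learned. The honest policy expects 1.5 smiles and the manipulative policy expects 3.0, so the incentive to manipulate the teacher is untouched.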

So it seems to me that the choice taken in the paper to achieve the following design goal:

Ideally, we do not want the reward function to be a causal descendant of the policy.

is not taking us on a route that goes anywhere very promising, given the problem statement. The safety predicate of uninfluencability still allows for conditions that insert the mind of the human teacher directly into the path to value of a very powerful optimizer. To make the mathematical property of 'uninfluencability' do what it says on the tin, it seems to me that further constraints need to be added.

Some speculation: to go this route of adding constraints, I think we need a model that separates the mind state of the teacher, or at least some causal dependents of this mind state, more explicitly from the remainder of the agent environment. There are several such increased-separation causal models in Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective and in Counterfactual planning. This then brings us back on the path of using the math of indifference, or lack of causal incentives, to define safety properties.

Secondary remarks

Here are some further secondary remarks.

With the above remarks, I do not mean to imply that the uninfluencability safety property as defined lacks any value: I may still want to have this as a desirable safety property in an agent. But if it is present, this triggers a new concern: if the environment is such that the reward function is clearly influencable, any learning system prior which is incompatible with that assumption may be making some pretty strange assumptions about the environment. These might produce unsafe consequences, or just vast inefficiencies, in the behavior of the agent.

This theme could be explored more, but the paper does not do so, and I have also not done so. (I spent some time trying to come up with clarifying toy examples, but no example I constructed really clarified things for me.)

A more general concern: the approach in the paper suffers somewhat from a methodological problem that I have seen more often in the AI and AGI safety literature. At this point in time, there is a tendency to frame every possible AI-related problem as a machine learning problem, and to frame any solution as being the design of an improved machine learning system. To me, this framing obfuscates the solution space. To make this more specific: the paper sets out to define useful constraints on $ρ$, a prior over the agent environment, but does not consider the step of first exploring constraints on $μ$, the actual agent environment itself. To me, the more natural approach would be to first look for useful constraints on $μ$, and only then to consider the option of projecting these into $ρ$ as a backup option, when $μ$ happens to lack the constraints.

In my mind, the problem of an agent manipulating its teacher or supervisor to maximize its reward is not a problem of machine learning, but more fundamentally a problem of machine reasoning, or even more fundamentally a problem which is present in any game-theoretical setup where rewards are defined by a level of indirection. I talk more at length about these methodological points in my paper on counterfactual planning.

If I use this level-of-indirection framing to back-project the design in the paper, my first guess would be that 'uninfluencability' might possibly say something about the agent having no incentive to hack its own compute core in order to change the reward function encoded within. But I am not sure if that first guess would pan out.

Let's ignore that, in reality, no probability is truly $1$ (or $0$). ↩︎
I don't want to get into the moral realism debate, but it seems that moral realists and I differ mainly in emphasis: I say "without making assumptions, we can't figure out preferences, so we need to find good assumptions", while they say "having made these good (obvious) assumptions, we can figure out preferences". ↩︎
There are ways of making this work with $n \to \infty$, but that extra complexity is not needed for this exposition. ↩︎
This is the most general reward function format; if, for example, we had a Markovian reward function $S$ that just took the latest actions and observations as inputs, this defines an $R^{S} \in R$ such that $R^{S} (h_{t}) = \sum_{i = 1}^{t} S (a_{i} o_{i})$. ↩︎
A terminological note here. We've decided to describe $ρ$ as a learning process, with "unriggable learning process" and "uninfluenceable learning process" being the terms if $ρ$ has additional properties. But the general $ρ$ includes things we might not feel are anything like "learning", like the AI writing down its own reward function.

So it might make more sense to reserve "learning" for the unriggable processes, and call the general $ρ$ something else. But this is a judgement call, and people generally consider "ask your programmer" or "look at the coin" to be learning processes, even though these are very much riggable. So I've decided to call the general $ρ$ a learning process. ↩︎
This "expectation" can be made fully rigorous, since reward functions form an affine space: you can take weighted averages of reward functions, $(1 - x) R_{1} + x R_{2}$. ↩︎

AI ALIGNMENT FORUM
AF

Learning and manipulating learning
