It's not clear to me why the BFO would converge to a fixed point of μ. If we've solved the problem of embedded agency and the AI system knows that yt can depend on its prediction zt, then it would tend to find a fixed point, but it could also do the sort of counterfactual reasoning you say it can't do. If we haven't solved embedded agency, then it seems like the function that best explains the data is to posit the existence of some other classifier that works the same way the AI did in past timesteps, and whose outputs determine the yi. Intuitively, this is saying that the past data is explained by a hypothetical other classifier that worked the same way as the AI used to, and now the AI thinks one level higher than that. This probably does converge to a fixed point eventually, but at any given timestep the best hypothesis would be something that is a finite number of applications of this "one level higher" step.
The BFO can generally cope with humans observing
Should this be ?
Unrelated to the effects of the outputs, I think the BFO approach might have the following limitation (this is inspired by your comment on a setup I once described):
It might be the case that in turn t, when the programmers try to use the BFO to get answers to some hard questions, the "training distribution" at that point would be "easier" (or otherwise different) than the "test distribution" in some important sense, such that the BFO won't provide useful answers. For example, perhaps a BFO couldn't be used, soon after development, to answer: "If we give $10,000,000 to research lab X for their project Y, will the first FDA-approved cure for Alzheimer's arrive at least twice as soon?".
But even if that's the case, a BFO might be sufficiently useful, even if it could only ever answer questions that are not very different from the previous ones. A gradual process of answering harder and harder questions might provide sufficient value sufficiently quickly.
The BFO can generally cope with humans observing and modifying our behaviour because of it (ie does not need a counterfactual approach).
Stuart, what's your current opinion on backward-facing oracles? I ask because it seems like in a later post you changed your mind and went back to thinking that a counterfactual approach is needed to avoid manipulation after all. Is that right?
However, it is not a fixed point that the BFO is likely to find, because zi+ϵ would not be such an encoding for almost all ϵ, so the basin of attraction for this zi is tiny (basically only those ϵ sufficiently small to preserve all the digits of zi). Thus the BFO is very unlikely to stumble upon it.
This seems to constitute a form of overfitting or failure to generalize that is likely to be removed by future advances in ML or AI in general (if there aren't already proposed regularizers that can do so). (A more capable AI wouldn't need to have actually "stumbled upon" the basin of attraction of this fixed point in its past data in order to extrapolate that it exists.) If you stick to using the "standard ML" BFO in order to be safe from such manipulation, wouldn't you lose out in competitiveness against other AIs due to this kind of overfitting / failure to generalize?
Stuart, what's your current opinion on backward-facing oracles? I ask because it seems like in a later post you changed your mind and went back to thinking that a counterfactual approach is needed to avoid manipulation after all. Is that right?
Basically, yes. I'm no longer sure the terminology and concept of BFO is that useful, and I think all self-confirming oracles have problems. I also believe that these problems need not be cleanly of a "manipulation" or "non-manipulation" type.
I also believe there are smoother ways to reach "manipulation", so my point starting "However, it is not a fixed point that the BFO is likely to find..." is wrong.
I'll add a comment at the beginning of the post, clarifying this post is no longer my current best model.
EDIT: This post has been superseded by these four posts.
A few weeks ago, I had a conversation with Paul Christiano about my counterfactual Oracle design.
We didn't get as far as criticising the method. Paul's question is far more fundamental - was my design even necessary?
Forwards or backwards looking
The counterfactual oracle design was set up to reward the oracle for correctly guessing the ultimate value of a random variable (an outcome). This guess was conditional on humans not actually getting to see the guess.
If we ignore the counterfactual aspect for the moment, assume that the oracle has seen a series of past data: pairs (xi,yi). The xi are the background data the Oracle uses to estimate the values of the target variables yi.
Then my design is a forward-looking oracle. If we use a quadratic loss function, at turn t, after seeing xt, it seeks to output zt, defined as:

zt = argmin_z E[ ||z − yt||^2 | x1:t, y1:t−1 ]

where x1:t, y1:t−1 is the previous data (all xi including xt, all previous yi up to yt−1).
Paul instead recommended using a more traditional machine-learning approach, where you attempt to fit a function f that explains the past data, getting something like this:

f = argmin_g ∑i<t ||g(xi) − yi||^2 + reg(g)

where the reg are regularising terms that prevent overfitting. The oracle will then output zt = f(xt). Call such an oracle a backwards-facing oracle (BFO), because it only cares about the past.
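The BFO recipe can be sketched in a few lines. The following Python snippet is my own illustration (not code from the post); the polynomial features, ridge penalty, and degree are arbitrary stand-ins for whatever function class and regulariser a real system would use:

```python
import numpy as np

def fit_bfo(xs, ys, degree=3, reg=1e-3):
    """Fit a regularised polynomial f to the past (x_i, y_i) pairs.

    A stand-in for generic ML curve fitting: minimise
    sum_i (f(x_i) - y_i)^2 + reg * ||coefficients||^2
    (ridge regression on polynomial features).
    """
    X = np.vander(np.asarray(xs, dtype=float), degree + 1)  # polynomial features
    y = np.asarray(ys, dtype=float)
    # Closed-form ridge solution: (X^T X + reg*I)^{-1} X^T y
    coeffs = np.linalg.solve(X.T @ X + reg * np.eye(degree + 1), X.T @ y)
    return lambda x: float(np.vander(np.array([float(x)]), degree + 1) @ coeffs)

# At turn t, the BFO simply outputs z_t = f(x_t):
f = fit_bfo([0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 1.9, 3.0])
z_t = f(4.0)
```

Note that nothing in the fitting step models the effect of the oracle's own outputs; f only ever looks backwards at the recorded pairs.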
The key fact about the BFO is that it can generally cope with humans observing and modifying our behaviour because of it (ie it does not need a counterfactual approach).
Observer perturbation effects
What happens if we observe the zt - and if this observation changes our behaviour so that yt itself is modified?
Assume the oracle knows our plans and its actions
First of all, note that unless we set up the system very carefully, the oracle will "know" whether we are planning to observe zt or not. The counterfactual oracle was deliberately designed so that observing zt depends on a purely stochastic process that is under the control of neither the human nor the oracle.
But in general, the Oracle will be able to deduce whether we expect to see the zt; it can glean this information from the xt or from observing patterns in the past xi and yi. For the rest of this post, I'll focus on the situation where we do indeed observe the zt; if we don't, then the BFO and the counterfactual oracle have similar behaviours to each other.
I'll similarly assume that the BFO knows its own past outputs - the zi. It is possible for us to carefully set up the system so that it doesn't track these, but I'll assume that it has access to this information - either implicitly, through whatever process updates its estimate of the function f, or explicitly, by deducing zi from xi and yi.
Example
Assume for the moment that the xi are null and irrelevant, and that each yi is computed independently from all the others, but with identical distribution.
Ignoring the regularising terms in the cost function above, the best f for the BFO is simply a constant, equal to the mean of the previous yi: f(xt) = (y0 + y1 + … + yt−1)/t.
If we observe zi (which, in this example, we assume we always do), then the value of yi is given by μ(zi) + v(zi), where μ is a function R→R and v(zi) is a random variable with mean zero for all zi. If μ is continuous (and the possible values of yi are in a bounded interval), there will be fixed points where zi = μ(zi). We'd expect the behaviour of the BFO to eventually converge to one of these fixed points. Note that, in the limit, if the BFO converges to the fixed point zi, then the cost function is equal to E[(v(zi))^2].
For example, assume μ(zi) = zi^2 and v ≡ 0 - the relationship between the observed zi and the actual yi is deterministic. This has two fixed points: one at 0 and one at 1.
The point at 0 is an attractive fixed point, but the point at 1 is unstable. So unless all the data the BFO starts with has zi=1, the BFO will either fail to converge, or converge to zi=0.
We can add a non-trivial v and this result still holds (indeed, a non-trivial v(1) just makes the zi = 1 point more unstable). What this means is that, even if zi = 1 results in a lower cost function - ie E[(v(1))^2] < E[(v(0))^2] - the BFO will still not converge to it, converging instead to the higher-cost point zi = 0.
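These dynamics are easy to see numerically. Here is a minimal Python sketch of my own (the starting value 0.99 and step count are arbitrary choices) of the running-mean BFO with μ(zi) = zi^2 and v ≡ 0; even starting just below the unstable fixed point at 1, the process drifts towards the attractive fixed point at 0:

```python
def simulate_mean_bfo(mu, y0, steps=200):
    """Running-mean BFO: f is the best constant fit to the past y_i,
    and each observed prediction z changes the next outcome via mu."""
    ys = [y0]
    z = y0
    for _ in range(steps):
        z = sum(ys) / len(ys)   # BFO's output: mean of observed y_i
        ys.append(mu(z))        # we observe z, so the next y is mu(z)
    return z

z_final = simulate_mean_bfo(lambda z: z * z, y0=0.99)
# z_final ends up well below 1, drifting towards the attractive fixed point 0.
```

The drift is slow, because each new observation is averaged in with all the old ones, but it is one-directional: once the mean slips below 1, it can never recover.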
Multiple attractive fixed points
Let μ(zi) = −zi^3 + (3/2)zi. This has three fixed points, at zi = 0 and zi = ±1/√2. The last two are attractive fixed points.

If v ≡ 0, then the BFO will converge to either 1/√2 or −1/√2, depending on the initial data it starts with. If we add a non-trivial v, we get a slight tendency to converge towards the fixed point with lower E[(v(zi))^2], everything else being equal. That's because a large E[(v(zi))^2] means that the next point sampled is likely to fall outside the "basin of attraction" of zi.

But everything else is not generally equal. The initial data strongly skews the process. The value of the derivative of μ around the fixed points (if it exists) affects the convergence process. The value of E[(v(zi))^2] close to the fixed point (rather than at the fixed point) can have a large impact. Luck will play a role as well, if v is non-trivial.
Thus, though there is a tendency to converge to fixed points with lower cost functions, this tendency is very weak. The BFO is not a cost function minimiser.
Manipulation?
It is technically a manipulative act to output a zt such that we change the value of yt in consequence.
But it's not a manipulative act in most of the ways we fear. For example, if zi were the encoding of a program which, if run, would set yi = zi, along with a convincing message to run that program, then zi is technically a fixed point of μ. However, it is not a fixed point that the BFO is likely to find, because zi + ϵ would not be such an encoding for almost all ϵ, so the basin of attraction for this zi is tiny (basically only those ϵ sufficiently small to preserve all the digits of zi). Thus the BFO is very unlikely to stumble upon it.
So this design does succeed in overcoming many of the problems with oracles: it will not try to manipulate us to obsessively converge on the point with lowest cost function.
Problems with causality
So, what's wrong with the BFO? I've written this much, pointing out its properties without endorsing it, so there must be something wrong.
Actually, I think it works very well for solving some problems, and, for those, is superior to my oracle designs. But it has certain limitations.
First of all, there's the question of what happens if μ has no fixed points (for example, if μ is not continuous in various annoying ways) or no attractive fixed points (for example, if μ(zi)=4zi(1−zi)). Then the BFO may not settle down in any stable way, as it continues to try and find the fixed point of a function without one.
But this problem is just a side effect of the more central point, which is that the BFO has a fundamentally wrong "understanding" of the causality of the situation (this is, of course, an informal phrasing - the BFO has no true "understanding" of the situation, in the ways we mean).
Spurious beliefs
Let us go back to the μ(zi) = −zi^3 + (3/2)zi and v ≡ 0 situation. Assume the xi are not empty, but are not relevant - for example, each could just be the time of day. Suppose the BFO has made three pairs of observations: (x0 = an even hour, y0 = 1/√2), (x1 = an odd hour, y1 = −1/√2), and (x2 = an even hour, y2 = 1/√2).
If the BFO treated the xi as irrelevant (which they indeed are), its z3 guess would be 1/(3√2), and it would then eventually converge towards the fixed point zi = 1/√2.
But the BFO could also conclude that the parity of the time is relevant, and that it should output 1/√2 during even hours, and −1/√2 during odd hours.
If it does so, it will find confirmation for its theory. As it outputs 1/√2, during even hours, it will observe that that guess was entirely correct, and similarly for odd hours.
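This self-confirmation is easy to reproduce. A minimal sketch of my own (initial data chosen to sit at the two fixed points, with hour parity as the irrelevant feature): a BFO that fits a separate running mean per parity finds each class pinned exactly to "its" fixed point, so the data it generates keeps confirming the spurious theory:

```python
import math

mu = lambda z: -z**3 + 1.5 * z   # attractive fixed points at +-1/sqrt(2)

# Separate running mean per parity of the hour (an irrelevant feature).
history = {0: [1 / math.sqrt(2)], 1: [-1 / math.sqrt(2)]}
for step in range(100):
    parity = step % 2
    z = sum(history[parity]) / len(history[parity])  # prediction for this parity
    history[parity].append(mu(z))                    # outcome caused by observing z

z_even = sum(history[0]) / len(history[0])
z_odd = sum(history[1]) / len(history[1])
# z_even stays at +1/sqrt(2) and z_odd at -1/sqrt(2): the parity "theory"
# is perfectly confirmed by the very data it generates.
```

The guesses are exactly right in every round, so no amount of further data ever gives the BFO a reason to drop the parity hypothesis.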
In situations where μ has multiple attractive fixed points, and the BFO has rich and varied previous data for any reason, the BFO will find, and then confirm, a spurious causal explanation for these different fixed points.
False models
The problem is that the BFO's greatest strength has become its weakness: it is trying to explain, through function fitting, why the yt seems to have different behaviours. We know that it is because of the zt, the output of the BFO itself; but the BFO cannot take this fact into account. Its function fitting is trying to implicitly take into account the output of its own calculations without explicitly doing so; no wonder that it ends up in Löbian situations.
This might result in the BFO being unable to converge to anything, in situations where the E[ ||zt − yt||^2 | x1:t, y1:t−1 ]-minimising oracle could.
For example, we could imagine an environment where the xi and the yi are very varied, but the underlying causal structure is quite simple. An oracle that tracked zi as an input to the environment could deduce this causal structure quite easily.
But the BFO might struggle, as it tries to fit a function where it can't explicitly take a key fact into account. Because the xi and yi are so varied, it should stumble upon many different spurious beliefs, making the job of fitting a single function to all the data extremely difficult. It would be interesting to explore the extent to which this might become a problem.
In conclusion
The BFO/traditional machine learning approach held up better than I supposed, and I'm thankful to Paul for bringing it to my attention. It has interesting advantages and drawbacks, and could be a useful oracle design in many situations.