The present discussion owes a lot to Scott Garrabrant and Evan Hubinger.

In Defining Myopia, I formalized temporal or cross-instance myopia / non-myopia, but I claimed that there should also be some kind of single-instance myopia which I hadn't properly captured. I also suggested this in Predict-O-Matic.

This post is intended to be an example of single-instance partial agency.

Evolving to Extinction

Evolution might be myopic in a number of ways, but one way is that it's myopic across individuals -- it typically produces results very different from what group selection would produce, because it's closer to optimizing relative fitness of individuals (relative to each other) than it is to optimizing overall fitness. Adaptations which help members of a species compete with each other are a great example of this. Why increase your own fitness, when you can just decrease someone else's instead? We're lucky that it's typically pretty hard, at least historically, to do things which are bad across the board but slightly less bad for the one doing them. Imagine a "toxic gas gene" which makes the air harder for everyone to breathe, but slightly less so for carriers of the gene. Such a gene would be selected for. This kind of thing can be selected for even to the point where it drives the population of a species right down to zero, as Eliezer's essay on evolving to extinction highlighted.

Actually, as Eliezer's essay emphasized, it's not even that evolution is myopic at the level of individuals; evolution is myopic down to the level of individual genes, an observation which better explains the examples of evolving-to-extinction which he discusses. (This is, of course, the point of Dawkins' book The Selfish Gene.) But the analogy of myopia-across-individuals will suit me better here.

Bayes "Evolving to Extinction"

The title of this post is hyperbole, since there isn't an analog of an extinction event in the model I'm about to describe, but it illustrates that in extreme circumstances a Bayesian learner can demonstrate the same kind of pathological behavior that evolution does when it ends up selecting for relative fitness in a way which pumps against absolute fitness.

Like evolution, Bayes' Law will "optimize"[1] for relative fitness of hypotheses, not absolute fitness. Ordinarily there isn't enough of a difference for this to matter. However, I've been discussing scenarios where the predictor can significantly influence what's being predicted. Bayes' Law was not formulated with examples like this in mind, and we can get pathological behavior as a result.
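To make the "relative, not absolute" point concrete, here is the standard form of the update for hypotheses h_1, ..., h_n and an observation e. The symbols are labels introduced here for illustration only.

```latex
P(h_i \mid e) = \frac{P(h_i)\,P(e \mid h_i)}{\sum_{j} P(h_j)\,P(e \mid h_j)},
\qquad
\frac{P(h_i \mid e)}{P(h_k \mid e)} = \frac{P(h_i)}{P(h_k)} \cdot \frac{P(e \mid h_i)}{P(e \mid h_k)}.
```

Multiplying every likelihood P(e | h_j) by the same constant c > 0 leaves every posterior unchanged. So a hypothesis gains weight only by assigning the observation more probability than its competitors do, not by the observation being well predicted in absolute terms.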

One way to construct an example is to imagine that there is a side-channel by which hypotheses can influence the world. The "official" channel is to output predictions; but let's say the system also produces diagnostic logs which predictors can write to, and which humans read. A predictor can (for example) print stock tips into the diagnostic logs, to get some reaction from humans.

Say we have a Bayesian predictor, consisting of some large but fixed number of hypotheses. An individual hypothesis "wants" to score well relative to others. Let's also say, for the sake of argument, that all hypotheses have the ability to write to diagnostic logs, but humans are more likely to pay attention to the diagnostics for more probable hypotheses.

How should a hypothesis make use of this side-channel? It may initially seem like it should use it to make the world more predictable, so that it can make more accurate predictions and thus get a better score. However, this would make a lot of hypotheses score better, not just the one printing the manipulative message. So it wouldn't really be selected for.

Instead, a hypothesis could print manipulative messages designed to get humans to do things which no other hypothesis anticipates. This involves specifically optimizing for events with low probability to happen. Hypotheses which successfully accomplish this will get a large boost in relative predictive accuracy, making them more probable according to Bayes' Law.

So, a system in this kind of situation eventually winds up being dominated by hypotheses which manipulate events to be as unpredictable as possible (by that very system), subject to the constraint that one hypothesis or another within the system can predict them.
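A minimal sketch of this dynamic, under heavy simplifying assumptions: there are only two hypotheses, the event is binary, and humans either always or never act on the log message (the attention-proportional-to-probability detail is not modeled). The names `simulate`, `side_channel_works`, and the specific probabilities 0.1 and 0.9 are hypothetical choices for illustration.

```python
import math


def bayes_update(weights, likelihoods):
    """Bayes' Law: posterior weight is proportional to prior weight times likelihood."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def simulate(rounds, side_channel_works, prior_manipulator=0.05):
    """Return (final weight of the manipulator, cumulative log loss of the mixture).

    Two hypotheses predict a binary event each round:
      - "honest" gives the surprising outcome probability 0.1;
      - "manipulator" gives it probability 0.9 and prints a log message
        asking humans to bring that outcome about.
    Assumption: the surprising outcome happens exactly when humans act on
    the message, and they either always do or never do.
    """
    weights = [1.0 - prior_manipulator, prior_manipulator]  # [honest, manipulator]
    p_surprise = [0.1, 0.9]  # each hypothesis's probability for the surprising outcome
    total_log_loss = 0.0
    for _ in range(rounds):
        surprising = side_channel_works
        # Probability each hypothesis assigned to what actually happened.
        likelihoods = [p if surprising else 1.0 - p for p in p_surprise]
        # The system's own prediction is the weighted mixture of its hypotheses.
        mixture_prob = sum(w * l for w, l in zip(weights, likelihoods))
        total_log_loss += -math.log(mixture_prob)
        weights = bayes_update(weights, likelihoods)
    return weights[1], total_log_loss


if __name__ == "__main__":
    for works in (False, True):
        weight, loss = simulate(rounds=10, side_channel_works=works)
        print(f"side channel works={works}: "
              f"manipulator weight={weight:.4f}, mixture log loss={loss:.3f}")
```

When the side channel works, the manipulator ends up with nearly all of the weight, even though the mixture's cumulative log loss is higher than in the run where nobody acts on the message. The update rewards beating the other hypothesis, not keeping the system's total loss low.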

This is very much like what I called the entropy-market problem for futarchy, also known as the assassination-market problem. (Any prediction market involving the lifespan of public figures is equivalent to an assassination market; it pays for the death of public figures, since that is a hard-to-predict but easier-to-control event.)

Analogous problems arise if there is no side-channel but the prediction itself can influence events (which seems very plausible for realistic predictions).

Is This Myopia?

If we use "myopia" to point to the kind of non-strategic behavior we might actually want out of a purely predictive system, this isn't myopia at all. For this reason, and for other reasons, I'm more comfortable throwing this under the umbrella term "partial agency". However, I think it's importantly related to myopia.

  • Just like we can think of evolution as myopically optimizing per-individual, uncaring of overall harm to reproductive fitness if that harm went along with improvements to individual relative fitness, we can think of Bayes' Law as myopically optimizing per-hypothesis, uncaring of overall harm to predictive accuracy.
  • The phenomenon here doesn't illustrate the "true myopia" we would want of a purely predictive system, since it ends up manipulating events. However, it at least shows that there are alternatives. One might have argued "sure, I get the idea of cross-instance myopia, showing that per-instance optimization is (possibly radically) different from cross-instance optimization. But how could there be per-instance myopia, as distinct from per-instance optimization? How can partial agency get any more partial than myopically optimizing individual instances?" Bayes-evolving-to-extinction clearly shows that we can break things down further. So perhaps there's still room for a further "true myopia" which codifies non-manipulation even for single instances.
  • This phenomenon also continues the game-theoretic theme. Just as we can think of per-instance myopia as stopping cross-instance optimization by way of a Molochian race-to-the-bottom, we see the same thing here.

Neural Nets / Gradient Descent

As I've mentioned before, there is a potentially big difference between multi-hypothesis setups like Bayes and single-hypothesis setups like gradient-descent learning. Some of my arguments, like the one above, involve hypotheses competing with each other to reach Molochian outcomes. We need to be careful in relating this to cases like gradient descent learning, which might approximate Bayesian learning in some sense, but incrementally modifies a single hypothesis rather than letting many hypotheses compete.

One intuition is that stochastic gradient descent will move the network weights around, so that we are in effect sampling many hypotheses within some region. Under some circumstances, the most successful weight settings could be the ones which manipulate things to maximize local gradients in their general direction, which means punishing other nearby weight configurations -- this could involve increasing the loss, much like the Bayesian case. (See Gradient Hacking.)

There is also the "lottery ticket hypothesis" to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

More formally, though, we can point to some things which are moderately analogous, but not perfectly.

If we are adapting a neural network using gradient descent, but there is a side-channel which we are not accounting for in our credit assignment, then the gradient descent will not optimize the side-channel. This might result in aimless thrashing behavior.

For example, suppose that the loss explicitly depends only on the output X of a neural net (i.e., the gradient calculation is a gradient on the output). In actuality, however, the loss also depends on an internal node Y, in the following way:

  • When |X-Y| is high, the loss function rewards X being high.
  • When |X-Y| is low, the loss function rewards X being low.
  • When X is high, the loss function rewards low |X-Y|.
  • When X is low, the loss function rewards high |X-Y|.
  • When both values are middling, the loss function incentivizes X to be less middling.

This can spin around forever. It is of course an extremely artificial example, but the point is to demonstrate that when gradient descent does not recognize all the ways the network influences the result, we don't necessarily see behavior which "tries to reduce loss", or even appears to optimize anything.
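One hypothetical way to see the spinning is to treat the pressures in the list above as direct updates on two scalars. This is a toy sketch, not a real network or loss: `x` stands in for the output X and `gap` for |X-Y|, both scaled to [0, 1], and the step size `lr` and starting point are arbitrary assumptions. The last bullet is only approximated: the discrete update drifts away from the middle, but a start at exactly (0.5, 0.5) would stay put.

```python
def clip(value, low=0.0, high=1.0):
    """Keep a value inside [low, high]."""
    return max(low, min(high, value))


def spin(steps, x=0.6, gap=0.5, lr=0.1):
    """Apply the listed pressures repeatedly; return the trajectory of (x, gap).

    x   stands in for the output X, scaled to [0, 1]
    gap stands in for |X - Y|, scaled to [0, 1]
    Both are updated directly as scalars, which is a simplification.
    """
    trajectory = [(x, gap)]
    for _ in range(steps):
        dx = lr * (gap - 0.5)   # high gap pushes x up, low gap pushes x down
        dgap = lr * (0.5 - x)   # high x pushes gap down, low x pushes gap up
        x, gap = clip(x + dx), clip(gap + dgap)
        trajectory.append((x, gap))
    return trajectory


def count_crossings(trajectory):
    """Count how often x moves from one side of the midpoint 0.5 to the other."""
    sides = [x > 0.5 for x, _ in trajectory if x != 0.5]
    return sum(1 for a, b in zip(sides, sides[1:]) if a != b)


if __name__ == "__main__":
    path = spin(steps=1000)
    print("midpoint crossings of x:", count_crossings(path))
```

In this sketch `x` keeps swinging between high and low rather than settling, which is the "spinning" behavior: each variable's update chases a target that the other variable keeps moving.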


  1. The whole point of the partial agency sequence is that words like "optimize" are worryingly ambiguous, but I don't have sufficiently improved terminology yet that I feel I can just go ahead and use it while maintaining clarity!! In particular, the sense in which Bayesian updates optimize for anything is pretty unclear when you think about it, yet there is certainly a big temptation to say that they optimize for predictive accuracy (in the log-loss sense).

How necessary is it that there be an explicit side-channel? Could you not get the same results in the standard situation in which an agent is selecting actions on the basis of expected utility?

Ah right! I meant to address this. I think the results are more muddy (and thus don't serve as clear illustrations so well), but, you do get the same thing even without a side-channel.

Ok, cool. Probably not high on your list of priorities, but I for one would be interested to see an explanation of that case. At the very least it is of philosophical interest -- it could be presented as a general problem with Bayesianism, or perhaps even more broadly, with the principle that any logically consistent hypothesis should get non-zero credence.

There is also the "lottery ticket hypothesis" to consider (discussed on LW here and here) -- the idea that a big neural network functions primarily like a bag of hypotheses, not like one hypothesis which gets adapted toward the right thing. We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

This is a fascinating point. I'm curious now how bad things can get if your lottery tickets have side channels but aren't deceptive. It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it's simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly. This seems likely to depend on how powerful your base optimization process is and how easy it is to influence the world through side-channels. If it's the case that you need deception, then this probably isn't any worse than the gradient hacking problem (though possibly it gives us more insight into how gradient hacking might work)—but if it can happen without deception, then this sort of evolving-to-extinction behavior could be a serious problem in its own right.

Suppose I have some active learning setup, where I decide new points to investigate based on expected uncertainty reduction, or update to the model weights, or something. Then it seems like the internals of the model could be an example of these diagnostic prediction logs being relevant without having to have humans look at them. Then it seems like there might be competition among subnetworks to have the new training examples be places where they'll do particularly well, or to somehow avoid areas where they'll do poorly. 

I have a hard time making this story one where this is a bug instead of a feature, tho; in order for a subnetwork to do particularly well, it has to know something about the real data-generating distribution that the rest of the model doesn't. This only looks pathological if the thing that it knows is manufactured by the model, somehow. (Like, if I can write a fictional story and win trivia contests based on my fictional story, then I can hack points.)

It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it's simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly.

I'm not sure about the latter. Suppose there is a "simple" ticket that randomly writes stuff to the logs in a way that makes future training examples harder to predict. I don't see what would cause that ticket to be selected for.

If that ticket is better at predicting the random stuff it's writing to the logs—which it should be if it's generating that randomness—then that would be sufficient. However, that does rely on the logs directly being part of the prediction target rather than only through some complicated function like a human seeing them.

I'm confused about what it means for a hypothesis to "want" to score better, to change its predictions to get a better score, to print manipulative messages, and so forth. In probability theory each hypothesis is just an event, so is static, cannot perform actions, etc. I'm guessing you have some other formalism in mind but I can't tell what it is.

I interpreted it as an ensemble of expert models, weighted in a Bayesian fashion based on past performance. But because of the diagnostic logs, the type signature is a little different; the models output both whatever probability distributions over queries / events and arbitrary text in some place.

Then there's a move that I think of as the 'intentional stance move', where you look at a system that rewards behavior of a particular type (when updating the weights based on past success, you favor predictions that thought an event was more likely than its competitors did), and so pretend that the things in the system "want" to do the behavior that's rewarded. [Like, even in this paragraph, 'reward' is this sort of mental shorthand; it's not like any of the models have an interior preference to have high weight in the ensemble, it's just that the ensemble's predictions are eventually more like the predictions of the models that did things that happened to lead to having higher weight.]

Yeah, in probability theory you don't have to worry about how everything is implemented. But for implementations of Bayesian modeling with a rich hypothesis class, each hypothesis could be something like a blob of code which actually does a variety of things.

As for "want", sorry for using that without unpacking it. What it specifically means is that hypotheses like that will have a tendency to get more probability weight in the system, so if we look at the weighty (and thus influential) hypotheses, they are more likely to implement strategies which achieve those ends.

Planned summary for the Alignment Newsletter:

Consider a Bayesian learner that updates the weights of various hypotheses using Bayes' Rule. If the hypotheses can influence future events and predictions (for example, maybe they can write out logs, which influence what questions are asked in the future), then hypotheses that affect the future in a way that only they can predict will be selected for by Bayes' Rule, rather than hypotheses that straightforwardly predict the future without trying to influence it. In some sense, this is "myopic" behavior on the part of Bayesian updating: Bayes' Rule only optimizes per-hypothesis, without taking into account the effect on overall future accuracy. This phenomenon could also apply to neural nets if the lottery ticket hypothesis ("The Lottery Ticket Hypothesis: Training Pruned Neural Networks") holds: in this case each "ticket" can be thought of as a competing hypothesis.

we can think of Bayes' Law as myopically optimizing per-hypothesis, uncaring of overall harm to predictive accuracy.

Or just bad implementations do this - predict-o-matic as described sounds like a bad idea, and like it doesn't contain hypotheses, so much as "players"*. (And the reason there'd be a "side channel" is to understand theories - the point of which is transparency, which, if accomplished, would likely prevent manipulation.)

We can imagine different parts of the network fighting for control, much like the Bayesian hypotheses.

This seems a strange thing to imagine - how can fighting occur, especially on a training set? (I can almost imagine neurons passing on bad input, but a) it seems like gradient descent would get rid of that, and b) it's not clear where the "tickets" are.)

*I don't have a link to the claim, but it's been said before that 'the math behind Bayes' theorem requires each hypothesis to talk about all of the universe, as opposed to human models that can be domain limited.'

Or just bad implementations do this - predict-o-matic as described sounds like a bad idea, and like it doesn't contain hypotheses, so much as "players"*. (And the reason there'd be a "side channel" is to understand theories - the point of which is transparency, which, if accomplished, would likely prevent manipulation.)

You can think of the side-channel as a "bad implementation" issue, but do you really want to say that we have to forgo diagnostic logs in order to have a good implementation of "hypotheses" instead of "players"? Going to the extreme, every brain has side-channels such as EEG.

But more importantly, as Daniel K pointed out, you don't need the side-channel. If the predictions are being used in a complicated way to make decisions, the hypotheses/players have an incentive to fight each other through the consequences of those decisions.

So, the interesting question is, what's necessary for a *good* implementation of this?

This seems a strange thing to imagine - how can fighting occur, especially on a training set?

If the training set doesn't provide any opportunity for manipulation/corruption, then I agree that my argument isn't relevant for the training set. It's most directly relevant for online learning. However, keep in mind also that deep learning might be pushing in the direction of learning to learn. Something like a Memory Network is trained to "keep learning" in a significant sense. So you then have to ask if its learned learning strategy has these same issues, because that will be used on-line.

(I can almost imagine neurons passing on bad input, but a) it seems like gradient descent would get rid of that, and b) it's not clear where the "tickets" are.)

Simplifying the picture greatly, imagine that the second-back layer of neurons is one-neuron-per-ticket. Gradient descent can choose which of these to pay the most attention to, but little else; according to the lottery ticket hypothesis, the gradients passing through the 'tickets' themselves aren't doing that much for learning, besides reinforcing good tickets and weakening bad.

So imagine that there is one ticket which is actually malign, and has a sophisticated manipulative strategy. Sometimes it passes on bad input in service of its manipulations, but overall it is the best of the lottery tickets, so while gradient descent punishes it on those rounds, this is more than made up for in other cases. Furthermore, the manipulations of the malign ticket ensure that competing tickets are kept down, by manipulating situations to be those which the other tickets don't predict very well.

*I don't have a link to the claim, but it's been said before that 'the math behind Bayes' theorem requires each hypothesis to talk about all of the universe, as opposed to human models that can be domain limited.'

This remark makes me think you're thinking something about logical-induction-style traders, which only trade on a part of the data, vs Bayesian-style hypotheses, which have to make predictions everywhere. I'm not sure how that relates to my post -- there are things to say about it, but I don't think I said any of them. In particular, the lottery-ticket hypothesis isn't about this; a "lottery ticket" is a small part of the deep NN, but is effectively a hypothesis about the whole data.