I'm going to be a bit more explicit about some ideas that appeared in The Parable of Predict-O-Matic. (If you don't want spoilers, read it first. Probably you should read it first anyway.)
[Note: while the ideas here are somewhat better than the ideas in the predict-o-matic story, they're equally rambling, without the crutch of the story to prop them up. As such, I expect readers to be less engaged. Unless you're especially interested in which character's remarks are true (or at least, which ones I stand by), this might be a post to skim; I don't think it has enough coherence that you need to read it start-to-finish.]
First, as I mentioned in Partial Agency, my main concern here isn't actually about building safe oracles or inner-aligned systems. My main concern is to understand what's going on. If we can build guaranteed-myopic systems, that's good for some purposes. If we can build guaranteed-non-myopic systems, that's good for other purposes. The story largely frames it as a back-and-forth about whether things will be OK / whether there will be terrible consequences; but my focus was on the more specific questions about the behavior of the system.
Second, I'm not trying to confidently stand behind any of the characters' views on what will happen. The ending was partly intended to be "and no one got it right, because this stuff is very complicated". I'm very uncertain about all of this. Part of the reason why it was so much easier to write the post as a story was that I could have characters confidently explain views without worrying about adding all the relevant caveats.
Inductive Bias
Evan Hubinger pointed out to me that all the characters are talking about asymptotic performance, and ignoring inductive bias. Inner optimizers might emerge due to the inductive bias of the system. I agree; in my mind, the ending was a bit of a hat tip to this, although I hinted at gradient hacking rather than inductive bias in the actual text.
On the other hand, "inductive bias" is a complicated object when you're talking about a system which isn't 100% Bayesian.
Explicit Fixed-Point Selection
The very first conversation involved the intern arguing that there would be multiple valid fixed-points of prediction, and Predict-O-Matic would have to choose between them somehow.
Explicitly modeling fixed points and choosing between them is a feature of the logical induction algorithm. This feature allows us to select the best one according to some criterion, as is leveraged in When Wishful Thinking Works. As discussed later in the conversation with the mathematician, this is atypical of supervised learning algorithms. What logical induction does is very expensive: it solves a computationally difficult fixed-point finding problem (by searching exhaustively).
Other algorithms are not really "choosing a fixed point somehow". They're typically failing to guarantee a fixed point. The mathematician hinted at this by describing how algorithms would not necessarily converge to a self-fulfilling prophecy; they could just as easily go in circles or wander around randomly forever.
Think of it like fashion. Sometimes, putting a trend into common knowledge will lock it in; this was true of neckties in business for a long time. In other instances, the popularity of a fashion trend will actually work against it, a fashion statement being ineffective if it's overdone.
So, keep in mind that different learning procedures will relate to this aspect of the problem in different ways.
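Here is a toy sketch of the two fashion cases. The response functions `lock_in` and `backlash` are made up for illustration (they are not from the story): each maps the current prediction to the outcome frequency the world produces in response, and the naive "learner" just adopts last round's outcome as its next prediction. Both responses have a fixed point at 0.5, but only one of them is ever reached.

```python
def iterate(response, p0, steps):
    """Repeatedly replace the prediction with the outcome it induced."""
    p = p0
    history = [p]
    for _ in range(steps):
        p = response(p)
        history.append(p)
    return history


def lock_in(p):
    # Hypothetical response: popularity partly feeds itself.
    # Fixed point at 0.5, slope 0.6, so naive iteration converges.
    return 0.2 + 0.6 * p


def backlash(p):
    # Hypothetical response: popularity works against the trend.
    # Fixed point at 0.5 as well, but naive iteration flips forever.
    return 1.0 - p


locked = iterate(lock_in, 0.9, 50)
cycling = iterate(backlash, 0.9, 50)

print(locked[-1])    # settles near the self-fulfilling value
print(cycling[-4:])  # alternates between two values, never settling
```

Damping the update (moving only part of the way toward the induced outcome) would make the second case converge too, which is one small illustration of how the learning procedure, and not just the world, determines whether a fixed point is found.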
Reward vs Prediction Error
The economist first compared the learning algorithm to decision markets, then later decided prediction markets were a better analogy.
The mathematician contrasted the learning algorithm with reinforcement learning, pointing out that Predict-O-Matic always adjusted outputs to be more like historical observations, whereas reinforcement learning would more strategically optimize reward.
Both of these point at a distinction between learning general decision-making and something much narrower and much more epistemic in character. As I see it, the critical idea is that (1) the system gets information about what it should have output; (2) the learning update moves toward a modified system which would have output that. This is quite different from reinforcement learning.
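To make points (1) and (2) concrete, here is a minimal sketch contrasting the two kinds of update on a softmax over a few possible outputs. The function names and the learning rate are my own choices for illustration; the reinforcement-learning side is a bare REINFORCE-style step, not a claim about any particular RL system.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def imitative_update(logits, target, lr=0.5):
    """Move toward a system that would have output `target`.

    This is a gradient step on the log-probability of the target;
    it does not depend on what the system actually output.
    """
    probs = softmax(logits)
    return [z + lr * ((1.0 if i == target else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]


def reinforce_update(logits, action, reward, lr=0.5):
    """REINFORCE-style step: only a scalar reward for the sampled action.

    The system is never told what it should have output instead.
    """
    probs = softmax(logits)
    return [z + lr * reward * ((1.0 if i == action else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]


logits = [0.0, 0.0, 0.0]
after_label = imitative_update(logits, target=2)
after_zero_reward = reinforce_update(logits, action=0, reward=0.0)
after_penalty = reinforce_update(logits, action=0, reward=-1.0)

print(softmax(after_label))        # mass shifts toward output 2
print(softmax(after_zero_reward))  # unchanged: no information received
print(softmax(after_penalty))      # mass shifts away from output 0
```

The imitative step always knows which direction to move in; the reward-based step only learns whether the thing it happened to try was good or bad.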
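In a recent post, Wei Dai mentions a similar distinction (italics added by me):
Supervised training -- This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.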
There are several things going on here, but I think Wei is trying to point at something similar to the distinction I'm thinking of. It's quite tempting to call it "supervised learning", because you get a signal telling you what you should have done. However, it's a bit fuzzy, because this also encompasses things normally called "unsupervised learning": the supervised/unsupervised distinction is often explained as modeling P(x|y) vs P(x). Wikipedia:
But many (not all) unsupervised algorithms still have the critical features we're interested in! Predicting x without any context information y to help still involves (1) getting feedback on what we "should have" expected, and (2) updating to a configuration which would have more expected that. We simply can't expect the predictions to be as focused, given the absence of contextual information to help. But that just means it's a prediction task on which we tend to expect lower accuracy.
I'm somewhat happy referring to this category as imitative learning. This includes supervised learning, unsupervised learning so long as it's generative (but not otherwise), and imitation learning (a paradigm which achieves similar ends to inverse reinforcement learning). However, the terminological overlap with 'imitation learning' is rather terrible, so I'm open to other suggestions.
It seems to me that this is a critical distinction for the myopia discussion. I hope to say more about it in future posts.
Maximizing Entropy?
The discussion of prediction markets toward the end was rather loose, in that the economist didn't deal with a lot of the other points which had been made throughout, and just threw a new model out there.
Isnasene interpreted the first point by imagining that the mechanism of manipulation is still through selection of fixed points:
This is compatible with the assumption of myopia; we might imagine that the system still can't manipulate events through actual bad predictions, because those strategies will be undercut. Therefore, the manipulation is restricted to selecting fixed-points which are surprising.
However, there are three problems with this:
So, it seems the actual situation is more complicated, and I'm not yet sure how to think about this.
'Local Search'; selection vs control
I used the term 'local search' to describe the application of gradient-descent-like updates to reduce prediction error. I have some conceptual/terminological issues with this.
Calling this 'local search' invokes the mental image of a well-defined gradient landscape which we are taking steps on, to further optimize some function. But this is the wrong mental image: it is an image of selection, whereas we're in a control setting (in my terminology). We are not making an iid assumption. We are not getting samples from a stationary but stochastic loss function, as in stochastic gradient descent.
If 'local search' were an appropriate descriptor for gradient-descent here, would it also be an appropriate descriptor for Bayesian updates? There's a tendency to think of Bayesian learning as trying to find one good hypothesis by tracking how well all of them do (which sounds like a global search), but we needn't think of it this way. The "right answer" can be a mixture over hypotheses. We can think of a Bayesian update as incrementally improving our mixture. But thinking of Bayesian updates as local search seems wrong. (So does thinking of them as global search.)
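As a small illustration of "incrementally improving our mixture", here is a sketch of a Bayesian update over three coin-bias hypotheses. The hypotheses, the uniform prior, and the observation sequence are all invented for the example. What I want to point at is that the thing being tracked is the mixture's prediction, which shifts a bit with each observation, without any single hypothesis being "found".

```python
def bayes_update(weights, biases, observation):
    """Reweight each hypothesis by how well it predicted the observation."""
    likelihoods = [b if observation == 1 else 1.0 - b for b in biases]
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mixture_prediction(weights, biases):
    """Probability the mixture assigns to the next observation being 1."""
    return sum(w * b for w, b in zip(weights, biases))


biases = [0.2, 0.5, 0.8]          # hypothetical hypotheses about a coin
weights = [1 / 3, 1 / 3, 1 / 3]   # assumed uniform prior

predictions = [mixture_prediction(weights, biases)]
for obs in [1, 1, 0, 1]:
    weights = bayes_update(weights, biases, obs)
    predictions.append(mixture_prediction(weights, biases))

print(predictions)  # drifts up after each 1 and down after the 0
```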
This is online learning. A gradient-descent step represents a prediction that the future will be like the past in some relevant sense, in spite of potential non-stationarity. It is not a guaranteed improvement, even in expectation -- as it would be in offline stochastic gradient descent with sufficiently small step size.
Moreover, step size becomes a more significant problem. In offline gradient descent, selecting too small a step size only means that you have to make many more steps to get where you're going. It's "just a matter of computing power". In online learning, it's a more serious problem; we want to make the appropriate-sized update to new data.
I realize there are more ways of dealing with this than tuning step size; we don't necessarily update to data by making a single gradient step. But there are problems of principle here.
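A toy sketch of why step size matters more online: below, a single number is tracked with gradient steps on half the squared error while the target drifts at a constant rate. The drift rate, step sizes, and horizon are arbitrary assumptions chosen to make the effect visible. Losses are counted on the prediction made before each observation arrives, as they would be for an online predictor.

```python
def online_tracking_loss(step_size, drift=0.01, steps=1000):
    """Total squared error of an online learner chasing a drifting target."""
    estimate = 0.0
    target = 0.0
    total_loss = 0.0
    for _ in range(steps):
        error = target - estimate       # judged before seeing the target
        total_loss += error ** 2
        estimate += step_size * error   # gradient step on 0.5 * error**2
        target += drift                 # the world moves on (non-stationary)
    return total_loss


loss_small_step = online_tracking_loss(0.01)
loss_large_step = online_tracking_loss(0.5)

print(loss_small_step, loss_large_step)
```

With a fixed target, the small step size would merely take longer to arrive. Here it never catches up: the learner settles into a permanent lag of roughly drift divided by step size, and pays for that lag on every round.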
What's gradient descent without a fitness landscape?
Simply put, gradient descent is a search concept, not a learning concept. I want to be able to think of it more directly as a learning concept. I want to be able to think of it as an "update", and use terminology which points out the similarity to Bayesian updates.
The Duality Remark
Vanessa asked about this passage:
I responded:
If you have an outer optimizer which is trying to maximize f(x,y) through x while being indifferent about y, it seems sensible to suppose that inner optimizers will want to change y to throw things off, particularly if they can get credit for then correcting x to be optimal for the new y. If so, then inner optimizers will generally be seeking to find y-values which make the current x a comparatively bad choice. So this argument does not establish an incentive to choose y which makes all choices of x poor.
In a log-loss setting, this would translate to an incentive to make observations surprising (for the current expectations), rather than a direct incentive to make outcomes maximum-entropy. However, iteration of this would push toward maximum entropy. Or, logical-induction-style fixed-point selection could push directly to maximum entropy.
This would be a nice example of partial agency. The system is strategically influencing x and y so as to maximize f through channel x, while minimizing through channel y. What does this mean? This does not correspond to a coherent objective function at all! The system is 'learning a game-theoretic equilibrium' -- which is to say, it's learning to fight with itself, rather than optimize.
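To illustrate what "fighting with itself" can look like, here is a sketch under a strong simplifying assumption of mine: f(x, y) = x * y, with simultaneous gradient steps that increase f through x and decrease it through y. Nothing in the discussion above commits to this particular f; it is just the simplest case where the two channels pull against each other.

```python
import math


def simultaneous_updates(x, y, lr=0.1, steps=200):
    """Ascend f = x * y through x while descending it through y."""
    trajectory = [(x, y)]
    for _ in range(steps):
        grad_x = y  # df/dx for the assumed f(x, y) = x * y
        grad_y = x  # df/dy for the assumed f(x, y) = x * y
        x, y = x + lr * grad_x, y - lr * grad_y
        trajectory.append((x, y))
    return trajectory


trajectory = simultaneous_updates(1.0, 0.0)
payoffs = [x * y for x, y in trajectory]
radii = [math.hypot(x, y) for x, y in trajectory]

print(min(payoffs), max(payoffs))  # f swings between negative and positive
print(radii[0], radii[-1])         # the point spirals away from (0, 0)
```

In this toy case the trajectory circles the equilibrium at the origin and drifts outward, and f itself goes up and down without trend. So the updates are not a search over any single landscape, which fits the picture of partial agency better than the picture of optimization.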
There are two different ways we can think about this. One way is to say there's an inner alignment problem here: the system learns to do something which doesn't fit any objective, so it's sort of trivially misaligned with whatever the outer objective was supposed to be. But what if we wanted this? We can think of games as a kind of generalized objective, legitimizing this behavior.
To make things even more confusing, if the only channel by which Predict-O-Matic can influence the world is via the predictions which get output, then... doesn't x=y? x represents the 'legitimate' channel whereby predictions get combined with (fixed) observations to yield a score. y represents the 'manipulative' channel, where predictions can influence the world and thus modify observations. But the two causal pathways have one bottleneck which the system has to act through, namely, the predictions made.
In any case, I don't particularly trust any of the reasoning above.
So, I'm still unsure how to think about all this.