My current position is that this is the wrong question to be asking. Instead, I think the right question is just “what is GPT-3's training story?” Then we can talk about the extent to which the training rationale is enough to convince us that we would get the desired training goal rather than some other model, such as a deceptive one, without having to worry about what technically counts as the base objective, mesa-objective, etc.
I was wondering if that was the case, haha. Thanks!
This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?
I do like your new thing, and it seems better to me in some ways but worse in others. I expect a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should ...
This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?
To be clear, that's definitely not what I'm arguing. I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a model using machine learning. Terms like base objective or inner/outer alignment are still great terms for talking about training stories that are trying to train a model to optimize for some specified objective. From “How do we become confident in the safety of a machine learning system?”:
The point of training stories is not to do away with concepts like mesa-optimization, inner alignment, or objective misgeneralization. Rather, the point of training stories is to provide a universal framework in which all of those sorts of concepts can live as discrete subproblems—specific ways in which a training story might go wrong.
I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a model using machine learning.
GPT-3 was trained using supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what's the difference between those and the way GPT-3 was trained?
Yeah, agreed. It's true that GPT obeys the objective "minimize the cross-entropy loss between the output and the distribution of continuations in the training data." But this doesn't mean it doesn't also obey objectives like "write coherent text", to the extent that we can tell a useful story about how the training set induces that behavior.
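For concreteness, the cross-entropy objective quoted above can be sketched roughly as follows. This is a minimal illustration and not GPT-3's actual training code: `next_token_cross_entropy`, `predicted_probs`, and `target_ids` are hypothetical names, and the model's predictions are assumed to already be normalized probability lists over a small vocabulary.

```python
import math


def next_token_cross_entropy(predicted_probs, target_ids):
    """Average negative log-probability assigned to the actual next tokens.

    predicted_probs: one probability list per position (assumed normalized).
    target_ids: the index of the token that actually came next at each position.
    """
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)


# A model that spreads probability evenly over 4 tokens pays log(4) per position.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
loss = next_token_cross_entropy(uniform, [0, 1, 2])
```

Note that nothing in this function mentions "coherent text"; any such description is a story about what minimizing this number over a particular training set tends to produce.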
(It is amusing to me how our thoughts immediately both jumped to our recent hobbyhorses.)
This paper today might be relevant: "An Explanation of In-context Learning as Implicit Bayesian Inference", Xie et al 2021.
Expected return in a particular environment/distribution? Or not? If not, then you may be in a deployment context where you aren't updating the weights anymore and so there is no expected return
I think you might be misunderstanding this? My take is that "return" is just the discounted sum of future rewards, which you can (in an idealized setting) think of as a mathematical function of the future trajectory of the system. So it's still well-defined even when you aren't updating weights.
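As a rough sketch of that point, here is return written as a plain function of a trajectory's rewards. The name `discounted_return` and the default discount factor are assumptions made for illustration, not anything taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of the rewards along one trajectory.

    Computes r_0 + gamma * r_1 + gamma^2 * r_2 + ... by folding from the end.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Nothing here refers to weight updates: the value depends only on the rewards along the trajectory, so it is defined whether or not any training is still happening.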
[Probably a noob question]
I'm thinking about what an inner alignment failure might look like for GPT-3. This would have to involve some deployment context in which GPT-3 performs significantly worse (by the standards of the base objective) than it did in training. (It would involve other things too, such as GPT-3 being a mesa-optimizer.)
But to say how well GPT-3 performs on some prompt not in the training dataset, we have to have a definition of the base objective that extends beyond the training dataset. If the base objective only makes sense in the context of the training dataset, then inner alignment failure is impossible by definition.
Is the base objective "Predict the next word"? Or is it "Predict the next word, supposing what you are reading is typical 2019 Internet text"? Or is it "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D [where D is the dataset used to train GPT-3]"? The third option is in some sense the best, because it most closely fits what we actually did to train GPT-3. But note that the logical extension of this line of reasoning is to prefer a fourth option: "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D' [where D' is like D except that it doesn't contain any of the bits of text that GPT-3 happened to not see in training, and the randomness weights are chosen to more accurately yield the data points that GPT-3 in fact saw]."
The problem with these last two answers is that they make it undefined how well GPT-3 performs on the base objective on any prompt that wasn't in D, which then rules out pseudo-alignment by definition.
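To make the worry concrete, here is one possible reading of the third option as code. It is only a sketch under my own assumptions: `expected_loss`, `per_example_loss`, `dataset`, and `weights` are hypothetical names, and the stand-in loss in the usage lines is a toy, not a language-model loss.

```python
def expected_loss(per_example_loss, dataset, weights):
    """Weighted average of a per-example loss over a fixed dataset D.

    The expectation ranges only over the items in `dataset`; it assigns no
    score to any text outside of it.
    """
    total_weight = sum(weights)
    weighted = sum(w * per_example_loss(x) for x, w in zip(dataset, weights))
    return weighted / total_weight


# Toy usage: the "loss" of a text is just its length (an illustrative stand-in).
D = ["a", "bb"]
score = expected_loss(len, D, weights=[1, 3])
```

The per-example loss can still be evaluated on a prompt outside D, but the objective as written here never asks for that, which is exactly the gap described above.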
From the Risks from Learned Optimization paper:
Expected return in a particular environment/distribution? Or not? If not, then you may be in a deployment context where you aren't updating the weights anymore and so there is no expected return, or at least it's close to 0 because there's only any return if you can convince people to start updating your weights again!
I worry I am just confused about all this. Hence why I'm asking. What is GPT-3's base objective?