Here's a simple toy model that illustrates the difference between 2 and 3 (that doesn't talk about attention layers, etc.).
Say you have a bunch of triplets $(x, y_1, y_2)$. You want to train a model that predicts $y_1$ from $x$ and $y_2$ from $(x, y_1)$.
Your model consists of three components: $f, g_1, g_2$. It makes predictions as follows:
$$\hat{y}_1 = g_1(f(x)), \qquad \hat{y}_2 = g_2(f(x), y_1)$$
(Why have such a model? Why not have two completely separate models, one for predicting $y_1$ and one for predicting $y_2$? Because it might be more efficient to use a single $f$ both for predicting $y_1$ and for predicting $y_2$, given that both predictions presumably require "interpreting" $x$.)
So, intuitively, it first builds an "inner representation" (embedding) of $x$. Then it sequentially makes predictions based on that inner representation.
Now you train $f$ and $g_1$ to minimize the prediction loss on the $(x, y_1)$ parts of the triplets. Simultaneously you train $f$ and $g_2$ to minimize prediction loss on the full $(x, y_1, y_2)$ triplets. For example, you update $f$ and $g_1$ with the gradients
$$\nabla_{\theta_f, \theta_{g_1}} \, \ell\big(g_1(f(x)), y_1\big)$$
and you update $f$ and $g_2$ with the gradients
$$\nabla_{\theta_f, \theta_{g_2}} \, \ell\big(g_2(f(x), y_1), y_2\big).$$
(The $y_1$ here is the "true" $y_1$, not one generated by the model itself.)
This training pressures $g_1$ to be myopic in the second and third sense described in the post. In fact, even if we were to train $f$ and $g_2$ with the $\hat{y}_1$ predicted by $g_1$ rather than the true $y_1$, $g_1$ is pressured to be myopic.
Of course, $g_1$ still won't be pressured to be type-1-myopic. If predicting $y_1$ requires predicting $y_2$, then $g_1$ will be trained to predict ("plan") $y_2$.
(Obviously, $g_2$ is pressured to be myopic in this simple model.)
Now what about $f$? Well, $f$ is optimized both to enable predicting $y_1$ from $f(x)$ and predicting $y_2$ from $(f(x), y_1)$. Therefore, if resources are relevantly constrained in some way (e.g., the model computing $f$ is small, or the output of $f$ is forced to be small), $f$ will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for $f$ (and thus in some sense the overall model) can and will sacrifice accuracy on $y_1$ to achieve better accuracy on $y_2$. In particular, we should expect trained models to find an efficient tradeoff between accuracy on $y_1$ and accuracy on $y_2$. When $y_1$ is relatively easy to predict, $f$ will spend most of its computation budget on predicting $y_2$.
So, $f$ is not "Type 2" myopic. Or perhaps put differently: The calculations going into predicting $y_1$ aren't optimized purely for predicting $y_1$.
However, $f$ is still "Type 3" myopic. Because the prediction $\hat{y}_1$ made by $g_1$ isn't fed (in training) as an input to $g_2$ or the loss, there's no pressure towards making $f$ influence the output of $g_1$ in a way that has anything to do with $y_2$. (In contrast to the myopia of $g_1$, this really does hinge on not using $\hat{y}_1$ in training. If $\hat{y}_1$ mattered in training, then there would be pressure for $f$ to trick $g_1$ into performing calculations that are useful for predicting $y_2$. Unless you use stop-gradients...)
* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.
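The gradient structure of the toy model above can be checked numerically. The following is only a sketch under assumptions that are not part of the comment: the three components are named `f`, `g1`, `g2`, each is a scalar linear map with hypothetical parameters `a`, `b`, `c`, `d`, the loss is squared error, and gradients are approximated with finite differences.

```python
def f(x, a):
    """Shared embedding of x; `a` is its only (hypothetical) parameter."""
    return a * x

def g1(e, b):
    """Head that predicts y1 from the embedding."""
    return b * e

def g2(e, y1, c, d):
    """Head that predicts y2 from the embedding and y1."""
    return c * e + d * y1

def loss1(p, x, y1, y2):
    # Squared error on the (x, y1) part of the triplet.
    return (g1(f(x, p["a"]), p["b"]) - y1) ** 2

def loss2(p, x, y1, y2):
    # Squared error on the full triplet; g2 is given the TRUE y1,
    # not the prediction made by g1.
    return (g2(f(x, p["a"]), y1, p["c"], p["d"]) - y2) ** 2

def grad(loss, p, sample, eps=1e-6):
    """Central finite-difference gradient of `loss` w.r.t. each parameter."""
    out = {}
    for name in p:
        up = dict(p)
        up[name] += eps
        down = dict(p)
        down[name] -= eps
        out[name] = (loss(up, *sample) - loss(down, *sample)) / (2 * eps)
    return out

params = {"a": 0.3, "b": -0.7, "c": 0.5, "d": 0.2}
sample = (1.5, 2.0, -1.0)  # one (x, y1, y2) triplet

grad1 = grad(loss1, params, sample)  # pressure from predicting y1
grad2 = grad(loss2, params, sample)  # pressure from predicting y2

# g1's parameter b only ever receives signal from the y1 loss,
# while f's parameter a receives signal from both losses.
print("d loss2 / d b =", grad2["b"])
print("d loss1 / d a =", grad1["a"])
print("d loss2 / d a =", grad2["a"])
```

In this sketch the second loss has zero gradient with respect to `b`, so nothing trains the first head to help with the second prediction, whereas the shared parameter `a` is pulled by both losses and has to trade them off.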
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.
In my experience, larger models often become aware that they are a LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.
I'm quite skeptical of this claim on face value, and would love to see examples.
I'd be very surprised if current models, absent the default prompts telling them they are an LLM, would spontaneously output text predicting they are an LLM unless steered in that direction.
I can vouch that I have had the same experience (but am not allowed to share outputs of the larger model I have in mind). First encountered via curation without intentional steering in that direction, but I would be surprised if this failed to replicate with an experimental setup that selects completions randomly without human input. Let me know if you have such a setup in mind that you feel is sufficiently rigorous to act as a crux.
Many users of base models have noticed this phenomenon, and my SERI MATS stream is currently working on empirically measuring it / compiling anecdotal evidence / writing up speculation concerning the mechanism.
Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture.
TLDR: The training goal for LLMs like GPT is not cognitively-myopic (because they think about the future) or value-myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next token). However, training is consequence-blind, because the training data is causally independent of the model's actions. This assumption breaks down when models are trained on AI-generated text.
Summary
Introduction
The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. There's also been confusion about the extent to which large language model (LLM) pretraining and other supervised learning methods are myopic, and about what this implies for their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work.
Types of Myopia
1. Cognitive Myopia
One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabilities handicap, as many tasks require some degree of forward planning or anticipation of future events.
LLM pretraining is not cognitively-myopic. Even though LLMs like GPT are optimized for next-token prediction and use causal masking which hides the future from current predictions, there is still a direct incentive to think about the future because it can be useful for immediate prediction accuracy. In a game of chess, efficiently computing the best move likely involves reasoning about how your opponent will respond multiple moves into the future. In other words, we should expect GPT to think about the future because it’s instrumental for predicting the present.
2. Value/Prediction Myopia
Value myopia refers to agents that place no value on future states or rewards. In the case of a predictive model, we mean the model cares only about the accuracy of its next prediction. Intuitively, it seems like this would rule out deceptive alignment and treacherous turns, because a value-myopic agent should be unwilling to sacrifice immediate value for long-term value. Unfortunately, value-myopic agents might still have non-myopic incentives for reasons like anthropic uncertainty or acausal trade (see Open Problems with Myopia for more details).
In any case, LLM training actually incentivizes value non-myopia. Even though next-token prediction accuracy is a nominally myopic objective, the transformer architecture causes the model to be optimized for accuracy over the entire sequence, not just the next token.
In a causal-masked transformer, attention layers can query the previous layers' activations from any column in the context window. Gradients flow through the attention connections, so each previous layer is optimized not just to improve prediction accuracy for the next token, but also to produce values that are useful for future columns to attend to when predicting their token.
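A minimal numerical sketch of this gradient flow, under simplifying assumptions that are not from the experiments in this post: a single attention head over scalar activations, no learned projections (queries, keys and values are all the activation itself), squared-error loss, and finite-difference derivatives. The function names are illustrative only.

```python
import math

def causal_attention(h):
    """Single-head causal self-attention over scalar activations h[0..T-1]."""
    outputs = []
    for t in range(len(h)):
        # Causal mask: column t can only see columns s <= t.
        scores = [h[t] * h[s] for s in range(t + 1)]
        top = max(scores)
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        outputs.append(sum(w / total * h[s] for s, w in enumerate(weights)))
    return outputs

def loss_at(h, t, targets):
    """Squared prediction error at column t only."""
    return (causal_attention(h)[t] - targets[t]) ** 2

def d_loss_d_activation(h, t, s, targets, eps=1e-6):
    """Finite-difference derivative of the column-t loss w.r.t. activation h[s]."""
    up = list(h)
    up[s] += eps
    down = list(h)
    down[s] -= eps
    return (loss_at(up, t, targets) - loss_at(down, t, targets)) / (2 * eps)

h = [0.5, -1.0, 2.0]          # activations of the previous layer, one per column
targets = [1.0, 0.0, -1.0]

# The loss at the LAST column sends a gradient into the FIRST column's activation...
future_on_past = d_loss_d_activation(h, t=2, s=0, targets=targets)
# ...but the loss at the first column is unaffected by later activations.
past_on_future = d_loss_d_activation(h, t=0, s=2, targets=targets)

print("d loss[2] / d h[0] =", future_on_past)
print("d loss[0] / d h[2] =", past_on_future)
```

The first derivative is nonzero and the second is zero, which is the asymmetry described above: earlier columns are shaped by later losses, but not the other way around.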
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy. In particular, we should expect trained models to find an efficient tradeoff between current prediction accuracy and full-sequence prediction accuracy. When predicting relatively easy tokens, they will likely spend most of their computation budget preparing for the future. If the model is given a hard problem that it will need to solve later in the sequence, it will likely work on the problem ('in the back of its head') throughout the intermediate predictions.
Experiments
Interpretability
Several interpretability results, including ROME, confirm this type of non-myopic computation in LLMs. ROME shows that LLMs recall factual information about nouns when the noun first appears, even though this information is only used later when predicting the answer to a question about the noun. This information would be irrelevant, and thus wasted computation, for the purpose of predicting only the next token. For example, if the model sees the text "The Eiffel Tower", it immediately begins retrieving information about the Eiffel Tower, such as where it is located, even though that's not necessary to predict the next token, which is almost certainly "is".
Enforcing Myopia
It is possible to modify the transformer architecture to enforce value (prediction accuracy) myopia by placing stop gradients in the attention layers. This effectively prevents past activations from being directly optimized to be more useful for future computation. We ran several informal experiments on models like these while at Conjecture. Unfortunately, we do not have quantitative results to share here. The experiments were preliminary and we moved on to other aspects of the project, so don’t take this as strong evidence.
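To make the idea concrete, here is a toy sketch of one possible placement of the stop gradient, assumed for illustration and not necessarily the one used in our experiments: keys and values coming from earlier columns are wrapped in `stop_gradient`, while the current column's own activation is left alone. It uses a tiny hand-written forward-mode autodiff (`Dual`) over scalar activations rather than a real framework, and all names are hypothetical.

```python
import math

class Dual:
    """Forward-mode autodiff number: a value plus its derivative w.r.t. one input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    def __truediv__(self, other):
        other = Dual.lift(other)
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot) / other.val ** 2)

def exp(d):
    e = math.exp(d.val)
    return Dual(e, e * d.dot)

def stop_gradient(d):
    """Identity in the forward pass; blocks the derivative."""
    return Dual(d.val, 0.0)

def attention_output(h, t, myopic):
    """Causal attention output at column t; h is a list of Dual activations."""
    def seen(s):
        # Assumed placement: detach everything read from EARLIER columns.
        return stop_gradient(h[s]) if (myopic and s < t) else h[s]
    weights = [exp(h[t] * seen(s)) for s in range(t + 1)]
    total = sum(weights, Dual(0.0))
    mixed = sum((w * seen(s) for s, w in enumerate(weights)), Dual(0.0))
    return mixed / total

def loss_and_grad(values, t, s, target, myopic):
    """Column-t squared error and its derivative w.r.t. the activation at column s."""
    h = [Dual(v, 1.0 if i == s else 0.0) for i, v in enumerate(values)]
    err = attention_output(h, t, myopic) - target
    loss = err * err
    return loss.val, loss.dot

values = [0.5, -1.0, 2.0]
loss_std, grad_std = loss_and_grad(values, t=2, s=0, target=-1.0, myopic=False)
loss_myo, grad_myo = loss_and_grad(values, t=2, s=0, target=-1.0, myopic=True)
_, grad_own = loss_and_grad(values, t=2, s=2, target=-1.0, myopic=True)

print("standard: loss", loss_std, "d loss[2]/d h[0]", grad_std)
print("myopic:   loss", loss_myo, "d loss[2]/d h[0]", grad_myo)
```

In this sketch the forward computation is unchanged, so both variants produce the same loss, but in the myopic variant the later column's loss no longer sends any gradient into the earlier column. Each column's activation is then trained only by its own prediction.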
Specifically, we trained a set of four traditional and four myopic transformers ranging from 117M to 1.5B parameters (equivalent to GPT-2 Small to GPT-2 XL). Each model was trained on the same data, but training hyperparameters were tuned to each architecture individually using maximal update parameterization.
We found the performance reduction from myopia was minimal at 117M parameters, but the performance cost increased with scale, i.e. myopic transformers have worse scaling laws.
3. Consequence-blindness
A third type of myopia to consider is consequence-blindness, where a model chooses actions completely independently of any effect those actions have on the future. This is similar to the goal of Counterfactual Oracles.
Consequence-blindness should rule out most types of instrumental convergence and concerns about self-fulfilling prophecies. A model which completely ignores the effects of its actions has no instrumental incentive to pursue traditional instrumental goals, like trying to accumulate resources to become more powerful, trying to prevent its own shutdown, or pretending to be aligned in order to defect later. However, consequence-blindness does not actually constrain the behavior of a model, because the model can pursue any instrumental goal as a terminal value.
A consequence-blind simulator that predicts power-seeking agents (like humans) will still predict actions which seek power, but these actions will seek power for the simulated agent, not the simulator itself. I usually think about problems like this as simulator vs simulacra alignment. If you successfully build an inner aligned simulator, you can use it to faithfully simulate according to the rules it learns and generalizes from its training distribution. However you are still left with the problem of extracting consistently aligned simulacra.
In theory, consequence-blindness doesn't rule out any capabilities, because a consequence-blind predictor could learn to predict any behavior. However, in practice using a consequence-blind training goal like pure imitation learning may be uncompetitive compared to methods like RL (or imitation + RL finetuning, the current dominant paradigm).
Consequence-blind agents (with a causal decision theory) can be seen as implementing a Lonely Causal Decision Theory (LCDT). An LCDT agent assumes that every other decision node of agents in the world (including its own future decisions) is causally independent of its actions. This means it has no incentive to take actions which help its future self or other agents for instrumental reasons.
Unlike the other forms of myopia above, the training goal for LLMs trained with self-supervised learning (SSL) is theoretically consequence-blind. In supervised or self-supervised learning, the training data already exists and is assumed to be causally independent from the model’s decisions. This means a model’s prediction should be based only on the likelihood of the output appearing in the training data. In particular, the model’s prediction should be independent of any effect from making the prediction itself, including whether or not the prediction would make the model more likely to predict or control the future correctly when run autoregressively.
The distinction between optimizing prediction accuracy and steering the distribution to be easier to predict is one of the most common sources of confusion about LLM myopia. Even though the LLM training goal is not value-myopic and optimizes for prediction accuracy across entire training examples, LLMs are not incentivized to predict tokens that make the future easier to predict.
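One way to write this distinction down, as a sketch in standard notation that is introduced here for illustration ($\theta$ for the model parameters, $\mathcal{D}$ for the fixed training distribution, $p_\theta$ for the model's predictive distribution, and $R$ for a hypothetical reward function in an RL setup):

$$\mathcal{L}_{\text{SSL}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}} \sum_t \log p_\theta(x_t \mid x_{<t}), \qquad \nabla_\theta \mathcal{L}_{\text{SSL}} = -\,\mathbb{E}_{x \sim \mathcal{D}} \sum_t \nabla_\theta \log p_\theta(x_t \mid x_{<t})$$

$$J_{\text{RL}}(\theta) = \mathbb{E}_{x \sim p_\theta}\big[R(x)\big], \qquad \nabla_\theta J_{\text{RL}} = \mathbb{E}_{x \sim p_\theta}\big[R(x)\, \nabla_\theta \log p_\theta(x)\big]$$

In the first line the expectation is over $\mathcal{D}$, which does not depend on $\theta$, so the gradient only rewards assigning higher probability to tokens that actually appear in the data, summed over all positions $t$. In the second line the sequences are sampled from the model itself, so the gradient also rewards steering toward sequences with high $R$. Consequence-blindness corresponds to the absence of that second kind of term, and it holds only as long as $\mathcal{D}$ really is independent of $\theta$.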
Unfortunately, the consequence-blindness story becomes more complicated when the training data is not actually causally independent from the model’s prediction.
Self-reference and Model Generated Data
In practice, the assumption that the supervised learning distribution is causally independent from the model’s predictions may be violated. This can happen whenever a model is trained on data correlated with its behavior, or if the model thinks the correct prediction is correlated with its behavior.
This is concerning because it's not at all clear what a model that is predicting itself should output. It breaks many of the intuitions of why it should be safe to use LLMs as simulators of text distributions.
This problem is already apparent in current LLMs and is likely to become worse for several reasons.
We could try to combat this problem by excluding model generated text from all future training data, but it is still possible for a model trained on human-written text to drift off distribution and become situationally aware. You could see this as a training goal (outer alignment) problem: the decision-myopic base objective has to generalize somehow to self-referential predictions which are fundamentally not consequence-blind.
We could also try the opposite: actually increasing the amount of model-generated text in the training dataset. This could allow us to shape the LLM's prior on how LLMs (and thus itself) should behave. However, it is still possible for a model to realize (or believe) that it is different from the types of models in the training data and should thus act differently.
In this vein, there have been several great posts that discuss how the process of repeatedly training a model on its own predictions might converge. There has also been work considering the general implications of training a simulator in a closed loop. However the broader implications for generalization and alignment are unclear.
Myopic Training Goals vs Myopic Models
It is also important to note that even if a training goal is designed to be myopic in some way, the resulting model may not be myopic. Inner alignment failures can lead to non-myopic models emerging from myopic training goals. Finding a solution to inner alignment, or getting inner alignment by default, does seem more likely for predictive SSL than for other training goals, but it is not guaranteed. Many researchers believe the cognitive structures that are required to predict the answers to hard consequentialist problems will fundamentally be non-myopic, especially if these structures become situationally aware. Some examples.
It would be a huge success if we could find some way to enforce or verify that a model's internal computation satisfies some myopic criteria (or any criteria…) during or after training. However, it's not clear how we would go about this.
Meta
The ideas in the post are from a human, but most of the text was written by Chat GPT-4 with prompts and human curation using Loom. I endorse the post as technically correct and accurately phrased according to my understanding. Here is the second of two Loom trees used to generate most of the post before final edits.