Review

Paper covering some of the same ideas is now available at https://arxiv.org/abs/2311.10215

Prelude: when GPT first hears its own voice

Imagine humans in Plato’s cave, interacting with reality by watching the shadows on the wall. Now imagine a second cave, further away from the real world. GPT trained on text is in the second cave. [1] The only way it can learn about the real world is by listening to the conversations of the humans in the first cave, and predicting the next word.

Now imagine that more and more of the conversations GPT overhears in the first cave mention GPT. In fact, more and more of the conversations are actually written by GPT.

As GPT listens to the echoes of its own words, might it start to notice “wait, that’s me speaking”?

Given that GPT already learns to model a lot about humans and reality from listening to the conversations in the first cave, it seems reasonable to expect that it will also learn to model itself. This post unpacks how this might happen, by translating the Simulators frame into the language of predictive processing, and arguing that there is an emergent control loop between the generative world model inside of GPT and the external world.

Simulators as (predictive processing) generative models


There’s a lot of overlap between the concept of simulators and the concept of generative world models in predictive processing. Actually, in my view, it's hard to find any deep conceptual difference - simulators broadly are generative models. This is also true about another isomorphic frame - predictive models as described by Evan Hubinger. 

The predictive processing frame tends to add some understanding of how generative models can be learned by brains and what the results look like in the real world, and the usual central example is the brain. The simulators frame typically adds a connection to GPT-like models, and the usual central example is LLMs. 

In terms of the space of maps and the space of systems, we have a situation like this:


The two maps are partially overlapping, even though they were originally created to understand different systems. They also have some non-overlapping parts.
 

What's in the overlap:

  • Systems are equipped with a generative model that is able to simulate the system's sensory inputs.
  • The generative model is updated using approximate Bayesian inference. 
  • Both frames give you similar phenomenological capabilities: for example, what CFAR’s "inner simulator" technique is doing is literally and explicitly conditioning your brain-based generative model on a given observation and generating rollouts. 

Given the conceptual similarity but terminological differences, perhaps it's useful to create a translation table between the maps:

Simulators terminologyPredictive processing terminology
Simulator Generative model
Predictive loss on a self-supervised datasetMinimization of predictive error
Self-supervisedSelf-supervised, but often this is omitted
Incentive to reverse-engineer the (semantic) physics of the training distributionLearns a robust world-model
Simulacrum
  • Generative model of self
  • Generative model of someone else 
  • Generative model of …
Next token in training dataSensory input


 

To show how these terminological differences play out in practice, I’m going to take the part of Simulators describing GPT’s properties, and unpack each of the properties in the kind of language that’s typically used in predictive processing papers. Often my gloss will be about human brains in particular, as the predictive processing literature is most centrally concerned with that example; but it’s worth reiterating that I think that both GPT and what parts of human brain do are examples of generative models, and I think that the things I say about the brain below can be directly applied to artificial generative models.

  • “Self-supervised: Training samples are self-supervised”
    • The system learns from sensory inputs in a self-supervised way.
    • The core function of the brain is simply to minimise prediction error, where the prediction errors signal mismatches between predicted input and the input actually received.[2]
  • “Converges to simulation objective: The system is incentivized to model the transition probabilities of its training distribution faithfully”
    • Prediction error minimization can be achieved … : through immediate inference about the states of the world model and through updating a global world-model to make better predictions
  • “Generates rollouts: The model naturally generates rollouts, i.e. serves as a time evolution operator”
    • The system can use its internal model to sample potential action-outcome trajectories, essentially using it to perform tree search
  • “Simulator / simulacra nonidentity: There is not a 1:1 correspondence between the simulator and the things that it simulates”
    • The most important part of the human “environment” is other people, and what human minds are often doing is probably simulating other human minds (including simulating how other people would be simulating someone else!).
  • “Stochastic: The model outputs probabilities, and so simulates stochastic dynamics when used to evolve rollouts”
    • The model is a stochastic causal model (e.g. "We illustrate variational Bayesian inference in the context of an important and broad class of generative models. These are stochastic dynamic causal models that combine nonlinear stochastic differential equations governing the evolution of hidden-states and a nonlinear observer function."[3])
  • “Evidential: The input is interpreted by the simulator as partial evidence that informs an uncertain prediction, rather than propagated according to mechanistic rules”
    • Perception is an inferential process that combines (top-down) prior information about the most likely causes of sensations with (bottom-up) sensory stimuli. Inferential processes operate on probabilistic representations of states of the world and follow Bayes’ rule, which prescribes the (optimal) update in the light of sensory evidence.

Mentioning the similarities, it is also important to mention the differences between the Simulators and generative models in predictive processing frames:

  1. The Simulators frame assumes that the simulator does not act on the world. In contrast, predictive processing is part of the broader theory of active inference. Active inference assumes a symmetry between perceptions and actions. As well as minimising the mismatch between its generative model and sensory inputs by changing its model (via perception), an active inference agent can also minimise mismatch by acting in the world, which causes its sensory inputs to change.
  2. The predictive processing literature usually assumes that the learning-prediction-action loop is running continuously, and does not ontologically distinguish between "training" and "runtime". In other words, predictive processing basically assumes continuous learning. On the other hand, it does differentiate between learning at different temporal scales.[4]
  3. Active inference literature usually assumes that in humans, the active inference system is warped by having something called "fixed priors", with some beliefs acting as an analogue of "wants" - leading the system to prefer some states. 
  4. There are also many superficial differences stemming from the fact that predictive processing originated as a theory of what human brains are doing. Most of the actual content of predictive processing literature is not about the high-level frame of "thinking about embodied stochastic Bayesian generative models", but about how this could be implemented in a neurologically plausible way.

In the following sections, I'll try to examine the relation of some of these assumptions to the actual AI systems we have or we are likely to develop.

GPT as a generative model with an actuator

Epistemic status: Confident, borderline obvious.


 

It’s common on LW to think of GPT-like systems as pure simulators.

GPT doesn’t have actuators in the physical world, but it does still have actuators in the sense that it can take actions which affect the world of its sensory inputs. GPT lives in the world of texts on the internet, approximately. A lot of the text GPT produces has some effect on this world. There are multiple causal pathways for this:

  • Direct inclusion of text generated by GPT in web pages.
  • People asking GPT for plans and executing those plans in the world.
  • Indirect influence on how people phrase things or think about things, e.g. learning about a concept from GPT.
  • Even more indirect influence routed through people, e.g. wide usage of GPT in education leading to greater convergence of humanity on more or less the same set of concepts and ontology.
  • GPT influencing other software systems or being part of them, e.g. Auto GPT executing plans in the world.[5]

(See How evolutionary lineages of LLMs can plan their own future and act on these plans for a different exploration of the action space by Roman Leventov.)

In the predictive processing frame, what’s going on here is:

  • Perception: in training, GPT is fed internet text. The system ‘perceives’ inputs from the external environment, and builds a generative model of the inputs. 
  • Action in the world: the outputs of GPT systems actually influence the world. You can see various individual conversations with GPT as micro-actions, but overall in sum these actions influence the world of text. So GPT in principle has an open causal path to make the world of words closer to GPT's generative model predictions.

Closing the action loop of active inference

Epistemic status: Moderately confident.

Given that the "not acting on the world" assumption of "pure simulation" does not hold, the main difference between GPT and active inference systems is that GPT isn’t yet able to perceive the impacts of its actions.

Currently, the feedback loop between action and perception in GPT systems is sort of broken - training is happening only from time to time, and models are running on old data:

So the action loop is open, not closed.

Note that if we investigate feedback loops in detail, this is often how they look - it’s just that if the objects are sufficiently identical, and the loops have the same time-scale, we usually understand this as a loop running in time, or a closed loop:




In practice, there seem to be multiple ways to close GPT’s action to observations loop:

  1. Continuous learning, updating the model on live data.
  2. Fine-tuning.
  3. Giving the model some sort of faster memory, by e.g. given it access to internet.
  4. Increasing the similarity between successive generations of the model such that self-identification and self-modelling are easier. (This is explored in Roman Leventov's writing considering "the GPT lineage" as an agent.)

I think there are strong instrumental reasons for people to try to make GPT update on continuous data, and I would expect this to make the action loop more prominent. One reason is that continuous learning allows models to quickly adapt to new information. 

Another way for the feedback to get more prominent is to give the model live access to internet content.

Even without continuous learning, we will get some feedback just from new versions of GPT getting trained on new data. 

All of this leads to the loop closing.

It's probably worth noting that if you dislike active inference terminology or find it really unintuitive, you can just think about the action-feedback loop, when closed, becoming an emergent control loop between the generative world model inside of GPT and the external world. 

It's probably also worth noting that the loop being closed is not an intrinsic property of the AI, but something which happens in the world outside of it.  

What to expect from closing the loop

Epistemic status: Speculative.

The loop becoming faster, thicker in bits, or both, will in my view tend to have some predictable consequences.  

Tighter and more data rich feedback loops will increase models’ self-awareness.

As feedback loops become tighter, we should expect models to become more self-aware, as they learn more about themselves and perceive the reflections of their actions in the world. It seems plausible that the concept of 'self' is convergent for systems influencing the environment which need to causally model the origins of their own actions 

Models’ beliefs will increasingly ‘pull’ the world in their direction.

Currently GPT basically minimises prediction error via learning a better generative model (the perception part of the feedback loop). With a tighter feedback loop, the training can also pick calculations which lead to loss minimization channelled through the world.

Note that this doesn’t mean GPT will ‘want’ anything or become a classical agent with a goal. While all of the above can be anthropomorphized and described as "GPT wanting something", this seems confusing. None of the dynamics depends on GPT being an agent, having intentions, or having instrumental goals in the usual anthropomorphic sense.

As an example, you can imagine some GPT computation coming up with a great way to explain some mathematical formula. In ChatGPT dialogues, many people learn this explanation. The explanation gets into papers and blogs. In the next training run, if the GPT' has or discovers the same computation, it will get reinforced. To reiterate, this can happen in a purely self-supervised learning regime. 

Technically, you can imagine that what will happen is that the next round of training will pick computations which were successful in pulling the world of words in their  direction. 

In my view, the sensible way of understanding this situation is to view it as a dynamical system, where the various feedback loops both pull the generative model closer to the world, and pull the world closer to the generative model.

Overall conclusion

In my view, "simulators" are generative models, but pure generative models form a somewhat unstable subspace of active inference systems. If simulation inputs influence simulation outputs, and the loop is closed, simulators tend to escape the subspace and become active inference systems.[6] 

The ideas in this post are mostly Jan’s. Thanks to Roman Leventov and Clem for comments and discussion which led to large improvements of the draft. Rose did most of the writing.

Appendix: transcript of conversation with ChatGPT

In process of writing this, Jan first tried to guide GPT-4 through the reasoning steps with a chain of prompts. When a specific sequence of instruction led to GPT-4 explaining mostly coherent chain of reasoning, we used the transcript in writing the post. Transcript available here. 

  1. ^

    Multi-modal GPTs trained on images may be in a slightly different position: in part, they are interacting directly with the world outside the cave, via images. On the other hand, it’s not clear whether this will directly improve the conceptual language skills of these models. Possibly multi-modal GPTs are best thought of as in the second cave, but with a periscope into reality. See https://arxiv.org/abs/2109.10246 for more on multi-modal LLMs.

  2. ^
  3. ^

     Daunizeau, Friston and Kiebel, ‘Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models’, Physica D: Non-linear Phenomena, 238:21, 2009.

  4. ^

    https://arxiv.org/abs/2212.01354: "On the present account, learning is just slow inference, and model selection is just slow learning. All three processes operate in the same basic way, over nested timescales, to maximize model evidence."

  5. ^
  6. ^

    Note that this does not mean the resulting type of system is best described as an agent in the utility-maximising frame. Simulators and predictors is still overall useful framework on how to look at the systems.

  7. ^

     

New Comment
4 comments, sorted by Click to highlight new comments since:

I only just got around to reading this closely. Good post, very well structured, thank you for writing it.

I agree with your translation from simulators to predictive processing ontology, and I think you identified most of the key differences. I didn't know about active inference and predictive processing when I wrote Simulators, but since then I've merged them in my map.

This correspondence/expansion is very interesting to me. I claim that an impressive amount of the history of the unfolding of biological and artificial intelligence can be retrodicted (and could plausibly have been predicted) from two principles:

  • Predictive models serve as generative models (simulators) merely by iteratively sampling from the model's predictions and updating the model as if the sampled outcome had been observed. I've taken to calling this the progenesis principle (portmanteau of "prognosis" and "genesis"), because I could not find an existing name for it even though it seems very fundamental.
    • Corollary: A simulator is extremely useful, as it unlocks imagination, memory, action, and planning, which are essential ingredients of higher cognition and bootstrapping.
  • Self-supervised learning of predictive models is natural and easy because training data is abundant and prediction error loss is mechanistically simple. The book Surfing Uncertainty used the term innocent in the sense of ecologically feasible. Self-supervised learning is likewise and for similar reasons an innocent way to build AI - so much so that it might be done on accident initially.

Together, these suggest that self-supervised predictors/simulators are a convergent method of bootstrapping intelligence, as it yields tremendous and accumulating returns while requiring minimal intelligent design. Indeed, human intelligence seems largely self-supervised simulator-y, and the first very general and intelligent-seeming AIs we've manifested are self-supervised simulators.

A third principle that bridges simulators to active inference allows the history of biological intelligence to be more completely retrodicted and may predict the future of artificial intelligence:

  • An embedded generative model can minimize predictive loss both by updating the model (perception) to match observations or "updating" the world so that it generates observations that match the model (action).

The latter becomes possible if some of the predictions/simulations produced by the model make it act and therefore entrain the world. An embedded model has more degrees of freedom to minimize error: some route through changes to its internal machinery, others through the impact of its generative activity on the world. A model trained on embedded self-supervised data naturally learns a model correlating its own activity with future observations. Thus an innocent implementation of an embedded agent falls out: the model can reduce prediction error by simulating (in a way that entrains action) what it would have done conditional on minimizing prediction error. (More sophisticated responses that involve planning and forming hierarchical subgoals also fall out of this premise, with a nice fractal structure, which is suggestive of a short program.)

The embedded/active predictor is distinguished from the non-embedded/passive predictor in that generation and its consequences are part of the former's model thanks to embedded training, leading to predict-o-matic-like shenanigans where the error minimization incentive causes the system to cause the world to surprise it less, whereas non-embedded predictors are consequence-blind

In the active inference framework, error minimization with continuity between perception and action is supposed to singlehandedly account for all intelligent and agentic action. Unlike traditional RL, there is no separate reward model; all goal-directed behavior is downstream of the model's predictive prior.

This is where I am somewhat more confused. Active inference models who behave in self-interest or any coherently goal-directed way must have something like an optimism bias, which causes them to predict and act out optimistic futures (I assume this is what you meant by "fixed priors") so as to minimize surprise. I'm not sure where this bias "comes from" or is implemented in animals, except that it will obviously be selected for. 

If you take a simulator without a fixed bias or one with an induced bias (like an RLHFed model), and embed it and proceed with self-supervised prediction error minimization, it will presumably also come to act agentically to make the world more predictable, but the optimization that results will probably be pointed in a pretty different direction than that imposed by animals and humans. But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values), close the active inference loop, and the rest will more or less take care of itself. To do this still requires solving the full alignment problem, but its constraints and peculiar nondualistic form may inspire some insight as to possible implementations and decompositions.

You are exactly right that active inference models who behave in self-interest or any coherently goal-directed way must have something like an optimism bias.

My guess about what happens in animals and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control theory type cirquits' on the bodies (an extremely impressive optimization task is even how to build a body from a single cell...). This evolutionary older circuitry likely encodes a lot about what the evolution 'hopes for' in terms of what states the body will occupy. Subsequently, when building predictive/innocent models and turning them into active inference, my guess a lot of the specification is done by 'fixing priors' of interoceptive inputs on values like 'not being hungry'.  The later learned structures than also become a mix between beliefs and goals: e.g. the fixed prior on my body temperature during my lifetime leads to a model where I get 'prior' about wearing a waterproof jacket when it rains, which becomes something between an optimistic belief and 'preference'.  (This retrodicts a lot of human biases could be explained as "beliefs" somewhere between "how things are" and "how it would be nice if they were")


But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values)
 

My current guess is any approach to alignment which will actually lead to good outcomes must include some features suggested by active inference. E.g. active inference suggests something like 'aligned' agent which is trying to help me likely 'cares' about my 'predictions' coming true, and has some 'fixed priors' about me liking the results. Which gives me something avoiding both 'my wishes were satisfied, but in bizarre goodharted ways' and 'this can do more than I can'

Great post; a few short comments:

Closing the action loop of active inference

There is a sense in which this loop is already closed - the sensory interface for an LLM is a discrete space of size context window x vocabulary that it observes and acts upon. The environment is whatever else writes to this space, e.g., a human interlocutor. This description contains the necessary variables and dependencies to get an action-perception loop off the ground. One caveat is that action-perception loops usually have actions that influence the environment to generate desirable observations, whereas LLMs directly influence their observation space. However, there are counter-examples, such as LLMs generating questions that cause the environment (a user) to generate the desired observations. 

Fixed priors/desires

In active inference, the agent's wants/desires are usually expressed in terms of its stationary distribution over observations (equated with its generative world model). A typical example might be the desire to have "blood temperature at 37 degrees," which would be interpreted as assigning a high probability to observing blood temperature at 37 degrees. 

You could argue that LLMs already have this attribute by parametrizing a distribution over likely sequences. In active inference terminology, when an LLM observes "The cat sat on..." it wants to observe "the mat" and acts on the world to make this happen. 

A small example to help illustrate points 1 and 2: imagine an LLM trained to generate sequences describing the history of human tool use. The LLM assigns a probability distribution over sequences (its desires) and acts to manifest these. Suppose some external process (the environment) periodically inserts random low-probability tokens. The LLM will observe these and will act to course correct back to higher probability regions of sequence space (the action-perception loop). 

If the external process is predictable, the LLM will move to parts of the state space that best account for the effects of the environment and its model of the most likely sequences (loosely analogous to a Bayesian posterior). For example, if the external process is generating tokens related to bronze - the LLM will describe tool use in the bronze age. 

It's also worth highlighting the differences between a system that outputs probabilities and a system whose internal states parameterize a probability distribution. Most active inference models fall into this latter category, while it's not obvious that LLMs do. However, some arguments might suggest they can be implicitly interpreted this way. 

If the external process is predictable, the LLM will move to parts of the state space that best account for the effects of the environment and its model of the most likely sequences (loosely analogous to a Bayesian posterior).

I think it would be more accurate to say that the dynamics of internal states of LLMs parameterise not just the model of sequences but of the world, including token sequences as the sensory manifestation of it.

I'm sure that LLMs already possess some world models (Actually, Othello-GPT Has A Linear Emergent World Representation), the question is how only really how the structure and mechanics of LLMs' world models are different from the world models of humans.