Finally, if we want to make the model capture certain non-Bayesian human behaviors while still keeping most of the picture, we can assume that instrumental values and/or epistemic updates are cached. This creates the possibility of cache inconsistency/incoherence.
In my mind, there is an amount of internal confusion which feels much stronger than what I would expect for an agent as in the OP. Or is the idea possibly that everything in the architecture uses caching and instrumental values? From reading, I imagined a memory+cache structure instead of being closer to "cache all the way down".
Apart from this, I would bet that something interesting will happen for a somewhat human-comparable agent with regard to self-modelling and identity. Would anything similar to human identity emerge, or would this require additional structure? Some representation of the agent itself and its capabilities should be present, at least.
"Cached" might be an unhelpful term here, compared to "amortized". 'Cache' makes one think of databases or memories, as something you 'know' (in a database or long-term memory somewhere), whereas in practice it tends to be more something you do - fusing inference with action. (They are 'cached' in the same way that you might loosely talk about a neural net 'caching' a complicated-to-compute function, like a value function in RL/decision theory.)
So 'amortized' tends to be more used in the Bayesian RL literature, and gives you an idea of what Bayesian RL agents (like LLMs) are doing: they are not (usually) implementing the Bayes-optimal backwards induction over the full decision-tree solving the POMDP when they engage in meta-learning like in-context learning (which leads you to infeasibilities like AIXI), they are doing amortized optimization. Depending on available time & compute, an agent might, at any given moment, be doing something anywhere on the spectrum from hardwired reflex to cogitating for hours explicitly on a tree of possibilities. (Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes. Or in expert iteration like AlphaZero, you have the CNN executing an amortized version of all previous MCTS searches, as distilled into the CNN, and then executing some more explicit tree search to improve its current estimates and then amortize that back into the CNN again to improve the policy some more.)
They gradually learn, applying some optimization one at a time, to implement a computation increasingly equivalent to the Bayes-optimal actions, which may boil down to an extremely simple algorithm like tracking a single sufficient-statistic summarizing the entire history and implementing an if-then-else on a boundary value of it (eg. drift-diffusion); Duff 2002 suggests thinking of it as "compiling" the full Bayes-optimal program interpreted flexibly but slowly at runtime down into a fast optimized but inflexible executable specialized for particular cases. A beautiful example of reading off the simple heads/tails counting algorithm implemented by a meta-learning RNN can be seen in https://arxiv.org/pdf/1905.03030.pdf#page=6&org=deepmind EDIT: I go through a lot of this on my Kelly coin-flip page, and some recent research doing the same thing, but with different non-Bayesian terminology, is https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their
(I have more links on this topic; does anyone have a better review of the topic than "Bayesian Reinforcement Learning: A Survey", Ghavamzadeh et al 2016? I feel like a major problem with discussion of LLM scaling is that the Bayesian RL perspective is just not getting through to people, and part of the problem is I'm not sure what 'the' best introduction or summary writeup is. People can hardly be expected to just go and read 30 years of Schmidhuber papers...)
Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes
Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you're referencing this paper, which trains a shallow attention-only transformer, with the nonlinearity in the attention removed, to perform linear regression. There are too many dissimilarities between the setting in this work and LLMs to convince me that this is true of Llama or GPT-4.
I think that when most people picture a Bayesian agent, they imagine a system which:
Typically, we define Bayesian agents as agents which behaviorally match that picture.
But that’s not really the picture David and I typically have in mind, when we picture Bayesian agents. Yes, behaviorally they act that way. But I think people get overly-anchored imagining the internals of the agent that way, and then mistakenly imagine that a Bayesian model of agency is incompatible with various features of real-world agents (e.g. humans) which a Bayesian framework can in fact handle quite well.
So this post is about our prototypical mental picture of a “Bayesian agent”, and how it diverges from the basic behavioral picture.
Causal Models and Submodels
Probably you’ve heard of causal diagrams or Bayes nets by now.
If our Bayesian agent’s world model is represented via a big causal diagram, then that already looks quite different from the original “enumerate all states/trajectories” picture. Assuming reasonable sparsity, the data structures representing the causal model (i.e. graph + conditional probabilities on each node) take up an amount of space which grows linearly with the size of the world, rather than exponentially. It’s still too big for an agent embedded in the world to store in its head directly, but much smaller than the brute-force version.
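To make the linear-versus-exponential claim concrete, here is a rough counting sketch. It assumes discrete variables and a fixed bound on the number of parents per node; the function names and the choice of `max_parents=3` are illustrative assumptions, not anything from a particular library.

```python
def joint_table_size(n_vars: int, n_values: int = 2) -> int:
    """Entries needed to store the full joint distribution explicitly."""
    return n_values ** n_vars


def sparse_net_size(n_vars: int, max_parents: int, n_values: int = 2) -> int:
    """Upper bound on conditional-probability entries in a sparse Bayes net.

    Each node stores one distribution over its own values for every joint
    setting of its parents, so the cost per node is bounded by a constant.
    """
    per_node = n_values ** (max_parents + 1)
    return n_vars * per_node


# Doubling the number of variables doubles the sparse-net size,
# but squares the size of the brute-force joint table.
for n in (10, 20, 40):
    print(n, joint_table_size(n), sparse_net_size(n, max_parents=3))
```

The bound is loose (it counts every entry of every conditional table), but the scaling is the point: the second column of output grows exponentially, the third grows linearly.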
(Also, a realistic agent would want to explicitly represent more than just one causal diagram, in order to have uncertainty over causal structure. But that will largely be subsumed by our next point anyway.)
Much more efficiency can be achieved by representing causal models like we represent programs. For instance, this little “program”:
… is in fact a recursively-defined causal model. It compactly represents an infinite causal diagram, corresponding to the unrolled computation. (See the linked post for more details on how this works.)
Conceptually, this sort of representation involves lots of causal “submodels” which “call” each other - or, to put it differently, lots of little diagram-pieces which can be wired together and reused in the full world-model. Reuse means that such models can represent worlds which are “bigger than” the memory available to the agent itself, so long as those worlds have lots of compressible structure - e.g. the factorial example above, which represents an infinite causal diagram using a finite representation.
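As a rough illustration of "finite representation, unboundedly large diagram", here is a sketch which assumes the recursive program is an ordinary factorial, f(n) = n * f(n - 1) with f(0) = 1. The helper names (`parents_of`, `unroll`, `evaluate`) are hypothetical, and the example is deterministic for simplicity; a real world-model would attach conditional distributions to the nodes instead.

```python
from typing import Dict, List


def parents_of(n: int) -> List[str]:
    """Finite template: node f(n) depends on f(n - 1), except at the base case."""
    return [] if n == 0 else [f"f({n - 1})"]


def unroll(n: int) -> Dict[str, List[str]]:
    """Expand the template into an explicit diagram: node name -> parent names."""
    diagram: Dict[str, List[str]] = {}
    for k in range(n, -1, -1):
        diagram[f"f({k})"] = parents_of(k)
    return diagram


def evaluate(n: int) -> int:
    """Compute f(n) by propagating values along the unrolled diagram's edges."""
    values: Dict[str, int] = {}
    for k in range(n + 1):  # visit parents before children
        parents = parents_of(k)
        values[f"f({k})"] = 1 if not parents else k * values[parents[0]]
    return values[f"f({n})"]


print(unroll(3))
print(evaluate(5))
```

The template `parents_of` is a few lines long no matter how large `n` gets, while `unroll(n)` has n + 1 nodes. The agent only ever needs to store the template, and can expand whatever piece of the diagram a given query touches.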
(Aside: those familiar with probabilistic programming could view this world-model representation as simply a probabilistic program.)
Updates
So we have a style of model which can compactly represent quite large worlds, so long as those worlds have lots of compressible structure. But there’s still the problem of updates on that structure.
Here, we typically imagine some kind of message-passing, though it’s an open problem exactly what such an algorithm looks like for big/complex models.
The key idea here is that most observations are not directly relevant to our submodels of most of the world. I see a bird flying by my office, and that tells me nothing at all about the price of gasoline[1]. So we expect that, the vast majority of the time, message-passing updates of a similar flavor to those used on Bayes nets (though not exactly the same) will quickly converge, without having to explicitly propagate to most of the submodel-nodes.
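A toy version of the "updates don't propagate far" intuition, under strong simplifying assumptions: the model is a single chain of binary variables X_0 -> X_1 -> ..., every link has the same (made-up) transition probabilities, and we only push beliefs forward rather than running a general message-passing algorithm. The names `stay`, `flip_in`, and `tol` are illustrative.

```python
from typing import List


def forward(p_one: float, stay: float, flip_in: float, steps: int) -> List[float]:
    """Push P(X_k = 1) down a binary chain X_0 -> X_1 -> ... -> X_steps.

    stay    = P(X_{k+1} = 1 | X_k = 1)
    flip_in = P(X_{k+1} = 1 | X_k = 0)
    """
    probs = [p_one]
    for _ in range(steps):
        p = probs[-1]
        probs.append(p * stay + (1.0 - p) * flip_in)
    return probs


def influence(stay: float, flip_in: float, steps: int) -> List[float]:
    """How much observing X_0 = 1 rather than X_0 = 0 shifts P(X_k = 1)."""
    seen_one = forward(1.0, stay, flip_in, steps)
    seen_zero = forward(0.0, stay, flip_in, steps)
    return [abs(a - b) for a, b in zip(seen_one, seen_zero)]


def nodes_to_update(stay: float, flip_in: float, steps: int, tol: float = 1e-3) -> int:
    """Number of downstream nodes whose belief moves by more than `tol`."""
    return sum(1 for d in influence(stay, flip_in, steps)[1:] if d > tol)


print(nodes_to_update(stay=0.7, flip_in=0.3, steps=100))
```

In this chain the influence of the observation shrinks by a factor of |stay - flip_in| per link, so only a handful of the hundred downstream nodes need a noticeable update. Real models are not chains and real links are not identical, so this shows the flavor of the argument rather than the algorithm itself.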
Latents
Message-passing on large models does still have some efficiency issues, however. To make things more efficient, we expect that realistic agents typically structure their model around “latent variables” which mediate most interactions. For instance, early 20th century biologists would observe that some species of animals had very similar anatomy, physiology, or behavior - i.e. if one wrote out a giant list of traits, some species would end up with very highly correlated lists. From this, they inferred some latent (i.e. not directly observed) relationship between those species - in this case, shared evolutionary ancestry. The extent to which this inference was correct varied - inferences are sometimes wrong, even when the reasoning is basically right - but either way, that “mediation by latent shared ancestry” pattern sure was how biologists structured their models.
Humans in general seem to do a very similar thing when modeling the world as containing "kinds of things" - i.e. we notice that there's a cluster of things which have bark, leaves, wood, roots, etc, all connected in a shape with a central trunk recursively branching out both above and below ground... Then we intuitively model all these things as stemming from some latent variable (e.g. "tree-ness"). That latent variable, in our internal models, explains the correlations: a child might ask "why do things which have bark also have roots?", and we might reply "because they're trees". Again, there's room to argue about how well that answers the child's question, but the answer does seem to reflect the internal structure of our models either way.
One key issue: different agents could, in principle, model the same environment using different latents; the latents are not necessarily fully determined by the prior + environment. For instance, I could model a bunch of rolls of a biased die as mediated by an unknown “bias”, or I could model them as just a bunch of rolls with some complicated correlations between them. The predictions will be the same. In practice minds mostly seem to converge on quite similar latents, and the general project of natural abstraction is largely aimed at understanding when and why that happens.
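Here is a small check of the "same predictions, different latents" point. To keep the arithmetic exact it uses a coin rather than a die and assumes a uniform prior over the bias; both choices are simplifications made for this sketch. One model integrates out a latent bias, the other never mentions a bias and just conditions each flip on the earlier flips.

```python
from fractions import Fraction
from math import factorial


def prob_with_latent(flips: str) -> Fraction:
    """P(sequence) with a latent bias theta ~ Uniform(0, 1), integrated out.

    The integral of theta^h * (1 - theta)^t over [0, 1] is h! t! / (h + t + 1)!.
    """
    h, t = flips.count("H"), flips.count("T")
    return Fraction(factorial(h) * factorial(t), factorial(h + t + 1))


def prob_without_latent(flips: str) -> Fraction:
    """P(sequence) from correlations alone: each flip conditions on earlier flips.

    No bias variable appears; the rule P(next = H | past) = (h + 1) / (n + 2)
    is Laplace's rule of succession.
    """
    prob, h, n = Fraction(1), 0, 0
    for flip in flips:
        p_heads = Fraction(h + 1, n + 2)
        prob *= p_heads if flip == "H" else 1 - p_heads
        h += flip == "H"
        n += 1
    return prob


for seq in ("HHT", "THTH", "HHHHH"):
    print(seq, prob_with_latent(seq), prob_without_latent(seq))
```

The two functions agree on every sequence, so no amount of flip data distinguishes an agent which "has" the bias latent from one which does not.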
Aside: Map-Territory Correspondence
There is no rule saying that the variables in a Bayesian agent’s world-model have anything to do with “things” in their environment. I could totally write a Bayesian agent which models itself as living in Conway’s Game of Life and tries to maximize a utility function defined over things in Conway’s Game of Life (like e.g. number of gliders), but then I could wire up the inputs and outputs of that agent to a photosensor and motor in my office. The agent will mostly be very confused (i.e. its predictions will be wrong a lot), and won’t do anything interesting, but it would be a valid Bayesian agent.
In particular, it’s the latents in the model which don’t need to correspond to anything in the environment. The variables which the agent maps to its observations and actions (as opposed to latents, which are everything else), do have some rigid “correspondence”, because when the agent receives inputs it will map them to its observations, and when the agent yields outputs it will map them to its actions.
A more realistic example: some humans believe in e.g. spirits or the like. Much like the Conway’s Game of Life bot, they are just very confused, and those parts of their world model involving spirits don’t necessarily “correspond to” any actual structure in the world.
… Nonetheless, in practice it seems like most latents in most humans’ models do “correspond to” stuff in the world in some important sense, and understanding that correspondence is another big part of the general project of natural abstraction.
Utility Over Latents
One big reason that latent variables are important is that, insofar as it makes sense to view real-world agents as Bayesians at all, the inputs to those agents’ utility functions are typically latent variables - not observations or actions directly. This follows from common sentiments like “I want my spouse to actually be happy, not just to look-to-me like they’re happy”. “Look-to-me like they’re happy” would be a utility function whose inputs are my own observations directly; “actually be happy” is a utility function whose inputs are latent variables representing my spouse.
For more on this topic, see The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables.
Lazy Utility Maximization
Even if program-like causal models, message-passing, and latents together allow for efficient updates of models of large worlds (and, to be clear, we don’t think we currently have the whole story here), there’s still the question of how to efficiently maximize expected utility over the model.
A key idea here is that we never actually need to calculate expected utility, in order to maximize it.
For example, suppose I’m deciding what to order for lunch. I expect this decision to be basically-irrelevant to the vast majority of things I care about in the world and in life. But if I want to calculate my full expected utility, I need to account for all those things, from Dad’s collection of old milk bottles to future tiny genetically engineered dragons. But I don’t need to calculate all that, in order to make an expected-utility-maximizing lunch order. I just need to calculate the difference between the utility which I expect if I order lamb Karahi vs a sisig burrito.
… and since my expectations for most of the world are the same under those two options, I should be able to calculate the difference lazily, without having to query most of my world model. Much like the message-passing update, I expect deltas to quickly fall off to zero as things propagate through the model.
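A minimal sketch of the lazy difference, under a strong assumption: utility is a sum of separate terms, and each action only changes the expectation of a few of them. The variable names and all the numbers below are made up for illustration.

```python
from typing import Dict

# Hypothetical toy model: expected value of each utility term when the
# current decision has no effect on it.
BASELINE: Dict[str, float] = {
    "lunch_enjoyment": 0.0,
    "afternoon_energy": 0.0,
    "milk_bottle_collection": 5.0,
    "tiny_dragons": 9.0,
}

# Expected value of the terms each action does change (made-up numbers).
EFFECTS: Dict[str, Dict[str, float]] = {
    "lamb_karahi": {"lunch_enjoyment": 3.0, "afternoon_energy": -1.0},
    "sisig_burrito": {"lunch_enjoyment": 2.5, "afternoon_energy": 0.5},
}


def full_expected_utility(action: str) -> float:
    """Brute force: touch every term in the model."""
    return sum(EFFECTS[action].get(k, v) for k, v in BASELINE.items())


def lazy_difference(a: str, b: str) -> float:
    """E[u | a] - E[u | b], touching only terms that either action changes."""
    touched = set(EFFECTS[a]) | set(EFFECTS[b])
    return sum(
        EFFECTS[a].get(k, BASELINE[k]) - EFFECTS[b].get(k, BASELINE[k])
        for k in touched
    )


print(lazy_difference("lamb_karahi", "sisig_burrito"))
```

`lazy_difference` never looks at the milk bottles or the dragons, yet it ranks the two lunches exactly as the brute-force calculation would. With a more realistic, non-additive utility the cancellation is less trivial, but the same idea applies: only the parts of the model whose expectations differ between the options contribute to the difference.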
Caching and Inconsistency
Here we’ll diverge somewhat from a strictly behaviorally Bayesian agent, but in a way which plays particularly well with an otherwise-Bayesian agent.
Richard Bellman popularized the idea of dynamic programming: in this context, making utility maximization calculations more efficient by precomputing and caching the instrumental values of intermediates. Insofar as we imagine our supposedly-Bayesian agent maintaining some instrumental value cache, we open the door to a certain kind of “incoherence”: the values in the cache may, for some reason, be inconsistent with either each other or the agent’s utility function. This sort of incoherence could be locally detected and fixed, by checking whether the cached values locally satisfy the Bellman equation (with the exact flavor of Bellman equation depending on what style of model we’re using for the Bayesian agent).
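As one concrete flavor of that local check, here is a sketch for a tiny deterministic problem with no discounting by default. The states, rewards, and the stale cache are all hypothetical, and `repair` is just repeated local backups, not a claim about how a real agent would restore consistency.

```python
from typing import Dict, List, Tuple

# Hypothetical deterministic toy problem: state -> list of (reward, next_state).
# "done" is terminal and has no outgoing transitions.
TRANSITIONS: Dict[str, List[Tuple[float, str]]] = {
    "start": [(0.0, "mid"), (1.0, "done")],
    "mid": [(5.0, "done")],
    "done": [],
}


def bellman_backup(state: str, cache: Dict[str, float], gamma: float = 1.0) -> float:
    """Value implied for `state` by its successors' cached values."""
    options = TRANSITIONS[state]
    if not options:
        return 0.0
    return max(r + gamma * cache[nxt] for r, nxt in options)


def inconsistent_states(cache: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """States whose cached value disagrees with a one-step Bellman backup."""
    return [s for s in TRANSITIONS if abs(cache[s] - bellman_backup(s, cache)) > tol]


def repair(cache: Dict[str, float], max_sweeps: int = 100) -> Dict[str, float]:
    """Re-run local backups on flagged states until the cache is consistent."""
    fixed = dict(cache)
    for _ in range(max_sweeps):
        bad = inconsistent_states(fixed)
        if not bad:
            break
        for s in bad:
            fixed[s] = bellman_backup(s, fixed)
    return fixed


stale_cache = {"start": 5.0, "mid": 2.0, "done": 0.0}
print(inconsistent_states(stale_cache))
print(repair(stale_cache))
```

In this example only the cached value for "mid" is actually wrong, but the local check flags "start" too, since the two entries disagree with each other. Repair takes more than one sweep: the first sweep drags "start" down to match the bad "mid" value before the corrected "mid" value propagates back up.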
Similarly, we could imagine caching being useful epistemically, for efficient updates. There again, failures of cache maintenance could result in “inconsistent beliefs”.
If and when cache inconsistency is detected, the agent might require quite a bit of propagation - i.e. thinking and reflection - to sort it out.
Putting It All Together
When we picture a “Bayesian agent”, we’re typically picturing an agent with a world-model which looks basically like a moderately-sized program with a lot of recursion. That “program” represents a big causal model as a bunch of smaller submodels, which get reused and “call” each other.
Updates are performed via some sort of message-passing; we expect that the messages don’t typically need to propagate very far. Similarly, to maximize expected utility, the agent only needs to compute the difference in expected utility between options available in its current decision. As with updates, such differences are expected to typically not propagate very far.
Most of the variables in the model are latents, as opposed to variables directly representing observations or actions. Such latents don’t have to correspond to anything in the world; the fact that they usually seem to correspond to stuff in the world in some sense is an interesting empirical fact, and characterizing that “correspondence” is one big piece of the general project of natural latents. One reason such latents are important (even without bringing e.g. language into the picture) is that the inputs to the agent’s utility function are typically latents rather than observations/actions - e.g. “I want my spouse to actually be happy, not just to look-to-me like they’re happy”.
Finally, if we want to make the model capture certain non-Bayesian human behaviors while still keeping most of the picture, we can assume that instrumental values and/or epistemic updates are cached. This creates the possibility of cache inconsistency/incoherence.
[1] John is clearly a complete amateur at augury, but the meaning here is hopefully still clear.