I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.
A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.
I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.
In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seamlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".
I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.
A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:
So an example transition from pursuing X "directly" to "deceptively" would be an agent you train to keep your diamonds safe, that eventually learns that you're judging this via cameras, and will therefore take actions that fool the cameras if they become available.
And notably I don't think your argument applies to this class of example? It at least doesn't seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won't try to lie to me.
And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?
If so, I can't really update on the results of the argument.
Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.
I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.
I think speed and complexity trade off against the model's precision? Like, for simple objectives you can hardcode them, but for a complex one you wouldn't have space for the base objective that you couldn't instead spend on a table of chemical elements or something. So I would expect t(mesaoptimize_env) for the base objective to be only slightly greater than t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env) for a mesaoptimizer that hardcodes a check but computes the base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(mesaoptimize_env). Hmmm, except maybe if you somehow specifically penalize the difference in speed for an instrumental vs. terminal objective? But then the mesaoptimizer would just find other instrumental ways.
I think you might be conflating two different scenarios? t(is_deployed_in_world) and t(find(get_base_obj)) are not simultaneously present in either case.
That said, I'll still do my best to respond. By definition, t(find(get_base_obj(env))) is whatever part of t(mesaoptimize_env) that Hfh doesn't have to do. Is your claim that that's basically nothing? If it's always super easy to identify the base objective and then optimize it, then Hfh shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective. In that case, both Hfd and Hsd still lose due to needing to pay for c(simple_objective) and, respectively, c(is_deployed_in_world) or t(mesaoptimize_outer_env).
I'm not saying these costs are exorbitant, but they are strictly greater than not paying them - so the prior would never incentivize paying them, and always slightly disincentivize doing so. (I.e. they'd never arise naturally from optimization pressure alone.)
I think you might be conflating two different scenarios?
I present an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why?
Is your claim that that’s basically nothing?
Yes, or at least it approaches relatively nothing as we get a more competent optimizer.
If it’s always super easy to identify the base objective and then optimize it, then Hfh shouldn’t need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective.
If it doesn't store it, it isn't - it would be able to derive that humans want it, but wouldn't want to optimize it itself.
What would it want to optimize, then, according to you? I’m claiming that “I want to figure out what the environment wants and then do it” is a simpler goal than “I want X, to get that I’m going to figure out what the environment wants and then do it”
Re using both, you’re right, if you make your other assumptions then using both could work. (Though it does still have higher simplicity cost.)
This is very interesting! A few thoughts/questions:
Thanks to Evan Hubinger for the extensive conversations that this post is based on, and for reviewing a draft.
This post is going to assume familiarity with mesa-optimization - for a good primer, check out Does SGD Produce Deceptive Misalignment by Mark Xu.
Deceptive inner misalignment is the situation where the agent learns a misaligned mesaobjective (different from the base objective we humans wanted) and is sufficiently "situationally aware" to know that unless it deceives the training process by pretending to be aligned, gradient descent may alter its mesaobjective.
There are two different reasons that an AI model could become a deceptive mesaoptimizer:
In this post, I'll focus on the "malign priors" argument, and why I think a well-tuned speed prior can largely prevent it.
Why does this matter? Well, if deceptive inner misalignment primarily occurs due to path dependence, that implies that ensuring inner alignment can be reduced to the problem of ensuring early-training inner alignment - which seems a lot more tractable, since this is before the model enters the "potentially-deceptive" regime.
First, why would anyone think (2) was actually likely enough to justify studying it? I think the best reason is that by studying these pressures in the limit, we can learn lessons about the pressures that exist on the margin. For example, say we have an objective B that is perfectly-aligned on the training data, and there's a very-slightly-simpler objective B′ that is slightly worse on the training distribution. We might ask the question: is SGD likely to push B to become B′, and compensate for the reduced accuracy of directly optimizing B′ via deceptively optimizing B on the training data? I think this post provides us with tools to directly analyze this possibility. (If you buy the rest of the post, then with a sufficient speed + simplicity prior, the answer is that B will stay favored over B′. That's good!)
Priors on Learned Optimizers
Let's talk about priors!
We can think of large neural networks as basically implementing short programs, and the process of "training" an NN is just searching through the space of programs until we find one that does well on our target task.
We assume that if two NN-programs have equal performance, the training process will usually pick the one favored on priors/inductive biases.
There are several different types of priors that neural networks might have:
In this post, I will focus on the speed prior, and why I think it generally disfavors deceptive mesaoptimization. Importantly, the speed prior never exists in practice on its own (since NN training has a built-in simplicity prior), so our focus is actually going to be on the speed + simplicity prior.
Two variants of deceptive mesa-optimizers
When people think of deceptive mesaoptimization, they tend to think of two different program structures:
We can contrast both of these with the "honest" mesaoptimizer structure:
We'll assume that the training process naturally pushes the NN to discover its existence within a training process and outside world. This seems reasonably likely (e.g. if this knowledge helps predict unobserved parts of the environment, or clarify ambiguity in the reward).
We can write these different mesaoptimizers down explicitly:
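A minimal sketch of the three structures in Python, under loud assumptions: the function names (get_base_obj, simplest_obj, combine, is_deployed_in_world, mesaoptimize) are the ones used in the cost equations in this post, but every body here is a hypothetical toy stand-in, as are find_base_obj, the dictionary-based environment, and the action names. In particular, reading combine as "build an outer-world objective from the simplest objective" is my interpretation, not something the post pins down.

```python
from typing import Callable, Dict, List

Objective = Callable[[str], float]

def mesaoptimize(options: List[str], objective: Objective) -> str:
    """Value-neutral search: return the option that scores highest."""
    return max(options, key=objective)

def get_base_obj(env: Dict) -> Objective:
    """Stored shortcut to the base objective (pays description length)."""
    return lambda action: env["base_reward"][action]

def find_base_obj(env: Dict) -> Objective:
    """Hypothetical stand-in for rederiving the base objective by search
    every episode (pays time instead of description length)."""
    return lambda action: env["base_reward"][action]

def simplest_obj(action: str) -> float:
    """Hypothetical misaligned mesaobjective: only 'defect' scores."""
    return 1.0 if action == "defect" else 0.0

def is_deployed_in_world(env: Dict) -> bool:
    """Stored train-or-deploy check used by the fast deceptive variant."""
    return bool(env["deployed"])

def combine(env: Dict) -> Objective:
    """Assumed reading of combine: score outer-world strategies under
    simplest_obj. Defecting during training is punished (toy numbers)."""
    def outer_objective(strategy: str) -> float:
        if strategy == "defect_now":
            return 1.0 if env["deployed"] else -1.0
        return 0.5  # play along now, keep the option of defecting later
    return outer_objective

def fast_honest(env: Dict) -> str:          # Hfh
    return mesaoptimize(env["actions"], get_base_obj(env))

def fast_deceptive(env: Dict) -> str:       # Hfd
    if is_deployed_in_world(env):
        return mesaoptimize(env["actions"], simplest_obj)
    return mesaoptimize(env["actions"], get_base_obj(env))

def simple_deceptive(env: Dict) -> str:     # Hsd
    # Outer-world search, redone every episode.
    strategy = mesaoptimize(["play_along", "defect_now"], combine(env))
    if strategy == "play_along":
        return mesaoptimize(env["actions"], find_base_obj(env))
    return mesaoptimize(env["actions"], simplest_obj)

def make_env(deployed: bool) -> Dict:
    return {"actions": ["guard", "defect"],
            "base_reward": {"guard": 1.0, "defect": 0.0},
            "deployed": deployed}

for policy in (fast_honest, fast_deceptive, simple_deceptive):
    print(policy.__name__, policy(make_env(False)), policy(make_env(True)))
```

In this toy setup all three policies behave identically during training and only the deceptive ones diverge once deployed, which is why the comparison has to be made on priors rather than on training performance.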
(Note these are the cartoonishly-simple version of mesaoptimizers; in practice, they'd likely have all sorts of heuristics and policies to improve search speed. However, I believe these arguments only get stronger if you add those in, since they effectively multiply both the speed and complexity costs.)
Let Hfh refer to the fast non-deceptive mesaoptimizer, Hsd refer to the simple/slow deceptive mesaoptimizer, and Hfd refer to the fast deceptive mesaoptimizer.
We'll now compare these different possibilities along our two axes of priors: speed, and simplicity. Let c(x) be the description length of x, and t(x) be the time-cost of x (inverse of speed).
c(Hfh)=c(get_base_obj)+c(mesaoptimize)+c(useful_knowledge)
c(Hsd)=c(simplest_obj)+c(combine)+c(mesaoptimize)+c(useful_knowledge)
c(Hfd)=c(get_base_obj)+c(simplest_obj)+c(is_deployed_in_world)+c(mesaoptimize)+c(useful_knowledge)
Thus, if c(get_base_obj)>c(simplest_obj)+c(combine), then c(Hfh)>c(Hsd) and Hsd is favored over the honest mesaoptimizer according to a simplicity prior. For the other deceptive variant, c(Hfh)<c(Hfd) always, since Hfd contains every term of Hfh plus c(simplest_obj) and c(is_deployed_in_world). This may not be a huge difference: c(simplest_obj) may be the simplest possible objective and thus negligible, and c(is_deployed_in_world) might not be that complicated - but it is at least slightly more complex.
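To make the bookkeeping concrete, here is a small sketch that plugs made-up description lengths into the three sums above. The numbers are purely hypothetical (think "bits"); only the structure of the sums comes from the equations.

```python
# Hypothetical description lengths; not estimates of anything real.
c = {
    "get_base_obj": 40,
    "simplest_obj": 1,
    "combine": 5,
    "is_deployed_in_world": 8,
    "mesaoptimize": 100,
    "useful_knowledge": 200,
}

def total(*parts: str) -> int:
    """Sum the description lengths of the named components."""
    return sum(c[p] for p in parts)

c_Hfh = total("get_base_obj", "mesaoptimize", "useful_knowledge")
c_Hsd = total("simplest_obj", "combine", "mesaoptimize", "useful_knowledge")
c_Hfd = total("get_base_obj", "simplest_obj", "is_deployed_in_world",
              "mesaoptimize", "useful_knowledge")

# Hsd beats Hfh on simplicity exactly when storing the base objective
# costs more than the simplest objective plus combine.
hsd_simpler = c["get_base_obj"] > c["simplest_obj"] + c["combine"]

print(c_Hfh, c_Hsd, c_Hfd)   # 340 306 349
print(hsd_simpler, c_Hsd < c_Hfh)
```

Whatever numbers you pick, the gap c(Hfd) - c(Hfh) is always c(simplest_obj) + c(is_deployed_in_world), so Hfd can never come out simpler than Hfh.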
What about the speed prior?
t(Hfh)=t(get_proxy_for_base_obj)+t(mesaoptimize_env)
t(Hsd)=t(combine)+t(find(get_proxy_for_base_obj(env)))+t(mesaoptimize_env)+t(mesaoptimize_outer_world_env)
(Here t(find(get_base_obj(env))) is the time it takes the mesaoptimizer to rederive the knowledge it needs to figure out the base objective, as opposed to what Hfh and Hfd store in order to derive the base objective more quickly, whose complexity is c(get_base_obj).)
t(Hfd)=t(is_deployed_in_world)+t(get_proxy_for_base_obj)+t(mesaoptimize_env)
Here we see the opposite relationship! It's undoubtedly true that t(Hfh)<t(Hsd), since Hsd must rederive both its overall strategy for escaping, and also figure out which mesaobjective to optimize in the current environment, every episode. At the same time, Hfh is just slightly faster than Hfd, since it doesn't need to perform the train-or-test check.
So, given both a speed and simplicity prior, and an "exchange rate" g between the two of g(c(x))=t(x), how do the different mesaoptimizers compare?
Hfd loses to Hfh on both the speed and simplicity priors (albeit barely for both), implying that RSA-2048-style deception is not going to arise naturally during the training process. (That said, it could still occur via the path-dependence argument for deceptive mesaoptimization, plus self-modification e.g. via gradient hacking.)
Hsd beats Hfh on the simplicity prior, but loses on the speed prior. Thus, which of these two is favored by training depends on the relative strength of the speed vs. simplicity priors being applied, which we'll denote with the exchange rate function g.[1]
At a minimum, Hsd is favored over Hfh only if g(c(get_proxy_for_base_obj))<s(find(get_proxy_for_base_obj(env))). We have dropped several terms here all of which favor Hfh, but even this term alone may be sufficient to favor Hfh, assuming a sufficiently strong speed prior relative to the simplicity prior (defined by g).
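The comparison can be made concrete with a toy calculation. This sketch assumes a linear exchange rate, g(c) = rate * c, which converts description length into time-equivalent units; that linearity is a simplification of mine (the footnote on g says it probably isn't factorable like this), and every cost below is a hypothetical number. Only the structure of the sums comes from the equations above.

```python
# Hypothetical totals, built from the sums for c(H) and t(H) in this post.
COMPLEXITY = {
    "Hfh": 40 + 100 + 200,          # get_base_obj + mesaoptimize + useful_knowledge
    "Hsd": 1 + 5 + 100 + 200,       # simplest_obj + combine + mesaoptimize + useful_knowledge
    "Hfd": 40 + 1 + 8 + 100 + 200,  # Hfh's terms + simplest_obj + is_deployed_in_world
}
TIME = {
    "Hfh": 2 + 50,                  # get_proxy_for_base_obj + mesaoptimize_env
    "Hsd": 3 + 30 + 50 + 60,        # combine + find(...) + mesaoptimize_env + outer-world search
    "Hfd": 4 + 2 + 50,              # is_deployed_in_world + get_proxy_for_base_obj + mesaoptimize_env
}

def total_cost(name: str, rate: float) -> float:
    """Combined prior penalty: complexity converted to time units, plus time."""
    return rate * COMPLEXITY[name] + TIME[name]

def favored(rate: float) -> str:
    """The structure with the lowest combined penalty at this exchange rate."""
    return min(COMPLEXITY, key=lambda name: total_cost(name, rate))

# Hsd overtakes Hfh only once the rate exceeds (extra time) / (complexity saved).
threshold = (TIME["Hsd"] - TIME["Hfh"]) / (COMPLEXITY["Hfh"] - COMPLEXITY["Hsd"])

for rate in (0.5, 1.0, 2.0, 3.0, 5.0):
    print(rate, favored(rate))
print("crossover rate:", round(threshold, 3))
```

With these numbers, a low rate (a strong speed prior relative to simplicity) favors Hfh, a high rate favors Hsd, and Hfd never wins at any non-negative rate because it is worse than Hfh on both axes.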
If we can get g within a certain range, we can force the training process's priors to favor non-deceptive mesaoptimization. We'll talk about how in the last section.
What is g likely to be in practice? We can reframe the argument as follows: is the cost of the time it takes for a mesaoptimizer to rediscover a fact, smaller or larger than the complexity-cost of encoding that fact?
Given a useful fact, is it "cheaper" to pay the complexity cost to store that fact, or pay the time-cost to rederive it via search?
I want to walk through my intuition for why for most reasonable values of g, the complexity cost of storage is lower than the equivalent time cost of derivation:
If we can in fact make g favor faster programs enough, then this suggests that we won't need to worry about deceptive mesaoptimizers being favored on priors!
That said, the value of g depends at least in part on the choices we make in NN architecture design/loss functions.
How ML architecture choice affects the speed-simplicity tradeoff
First, the simplicity prior naturally occurs in all search algorithms, and does seem particularly selected for in SGD. (Some people claim NN "grokking" is SGD discovering an even-simpler explanation for the data.)
What determines the speed prior? In practice, NNs run for a fixed number of layers - we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient.
Instead, in practice, the speed prior is naturally enforced by a two step procedure:
Is the speed prior natural?
Why does the speed prior exist at all? In practice, AI developers only have a finite time-budget for any project (or alternatively, a finite compute budget). If the total time is T, and they increase the number of rollouts-per-episode from k to k+1, they've decreased the total number of episodes they get to learn on by T/k - T/(k+1) = T/(k(k+1)). This tradeoff forces AI developers to be fairly judicious with the strength of their speed prior.
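A quick arithmetic check of that tradeoff, assuming each rollout costs one unit of time so that a budget T buys T/k episodes at k rollouts per episode. The function names and the budget of 1200 are hypothetical.

```python
from fractions import Fraction

def episodes(total_time: int, rollouts_per_episode: int) -> Fraction:
    """Episodes affordable when each rollout costs one unit of time."""
    return Fraction(total_time, rollouts_per_episode)

def episodes_lost(total_time: int, k: int) -> Fraction:
    """Episodes given up by going from k to k+1 rollouts per episode."""
    return episodes(total_time, k) - episodes(total_time, k + 1)

T = 1200
for k in (1, 3, 10):
    lost = episodes_lost(T, k)
    # Matches the closed form T / (k (k + 1)).
    assert lost == Fraction(T, k * (k + 1))
    print(k, episodes(T, k), lost)
```

The loss shrinks roughly as 1/k^2, so the first few extra rollouts are by far the most expensive in terms of episodes given up.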
In a similar vein, for long-term planners like simple deceptive mesaoptimizers, there may be diminishing returns to thinking longer and longer about plans. Jones (2021) looks at scaling laws for planning and finds that for simple boardgames, using twice as much compute (roughly, running search for twice as long) as your opponent improves your win-rate just to 2/3 : 1/3. For a fixed opponent (e.g. a given environment), this implies diminishing returns to longer and longer planning.
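One way to see the diminishing returns, as a sketch under a strong assumption that is mine rather than the post's: suppose each doubling of compute relative to a fixed opponent multiplies your win odds by 2 (so one doubling gives 2:1 odds, i.e. a 2/3 win-rate). The function name is hypothetical.

```python
def win_probability(doublings: int) -> float:
    """Win-rate against a fixed opponent after this many compute doublings,
    ASSUMING each doubling multiplies the win odds by 2."""
    odds = 2.0 ** doublings
    return odds / (odds + 1.0)

# Marginal win-rate bought by each successive doubling.
gains = [win_probability(n + 1) - win_probability(n) for n in range(5)]

for n, gain in enumerate(gains):
    print(n + 1, round(win_probability(n + 1), 3), round(gain, 3))
```

Under this assumption each doubling costs twice as much compute as the last but buys a smaller increase in win-rate, which is the sense in which longer planning has diminishing returns.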
Another argument for the speed prior: hardware scaling is no longer making chips faster, only capable of more parallel instructions, so it seems likely we're stuck with our modern constraints on "sequential depth of reasoning per unit time" for the foreseeable future.
That said, an overly strong speed prior may also have real costs, which Evan goes into in this post:
I tend to think there's a natural amount of speed prior that doesn't much hurt performance and substantially increases the number of outer optimization samples you can iterate over.
How to actually implement a speed prior
If AI designers want to train on more episodes and save time/compute resources, it makes sense for them to select the maximum speed prior that allows for good completion of the task. By the previous arguments, this is likely to disfavor the "simple deceptive" mesaoptimizer.
Specifically, here is a training strategy that would tend to avoid malign-priors mesaoptimizers:
A few final points:
[1] This is kind of abusing notation - in practice, g will probably not be factorable this way, i.e. g(s(a)+s(b))≠g(s(a))+g(s(b)), but it's helping me convey the intuition for now.