> replacing the SGD with something that takes the shortest and not the steepest path
>
> Maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step into a direction that has the minimal scalar product with x – x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.
The SGD's greed, to be specific.
Consider an ML model being trained end-to-end from initialization to zero loss. Every individual update to its parameters is calculated to move it in the direction of maximal local improvement to its performance. It doesn't take the shortest path from where it starts to the ridge of optimality; it takes the locally steepest path.
1. What does that mean mechanically?
Roughly speaking, every feature in NNs could likely be put into one of two categories:

- The world-model can only be learned gradually, because higher-level features/statistical correlations build upon lower-level ones, and therefore the gradients towards learning them only appear after the lower-level ones are learned.
- Heuristics, in turn, can only attach to the things that are already present in the world-model (same for values). They're functions of abstractions in the world-model, and they fire in response to certain WM-variables assuming certain values. For example, if the world-model is nonexistent, the only available heuristics are rudimentary instincts along the lines of "if bright light, close eyes". Once higher-level features are learned (like "a cat"), heuristics can become functions of said features too ("do X if see a cat", and later, "do Y if expect the social group to assume state S within N time-steps").
The base objective the SGD is using to train the ML model is, likewise, a function of some feature/abstraction in the training data, like "the English name of the animal depicted in this image" or "the correct action to take in this situation to maximize the number of your descendants in the next generation". However, that feature is likely a fairly high-level one relative to the sense-data the ML model gets, one that wouldn't be loaded into the ML model's WM until it's been training for a while (the way "genes" are very, very conceptually far from Stone Age humans' understanding of reality).
So, what's the logical path through the parameter-space from initialization to zero loss? Gradually improve the world-model step by step, then, once the abstraction the base objective cares about is represented in the world-model, put in heuristics that are functions of said abstraction, optimized for controlling that abstraction's value.
But that wouldn't do for the SGD. That entire initial phase, where the world-model is learned, would be parsed as "zero improvement" by it. No, the SGD wants results, and fast. Every update must instantly improve performance!
The SGD lives by messy hacks. If the world-model doesn't yet represent the target abstraction, the SGD will attach heuristics to upstream correlates/proxies of that abstraction. And it will spin up a boatload of such messy hacks on the way to zero loss.
A natural side-effect of that is gradient starvation/friction. Once there are enough messy hacks, the SGD won't bother attaching heuristics to the target abstraction even after it's represented in the world-model — because if the extant messy hacks approximate the target abstraction well enough, there's very little performance-improvement to be gained by marginally improving on that approximation. Especially since the new heuristics would have to be developed from scratch. The gradients just aren't there: better to improve on what's already built.
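As a toy illustration of that gradient-starvation point (a numerical sketch of my own, not something from the post or the literature): suppose the base objective depends only on a "true" feature, and the model first fits a weight on a noisy proxy of it, because the true feature isn't yet represented. By the time the true feature becomes available, the gradient pulling toward it is roughly a hundred times smaller than it was at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x_true = rng.normal(size=n)                  # the target abstraction
x_proxy = x_true + 0.1 * rng.normal(size=n)  # an upstream correlate / "messy hack" feature
y = x_true                                   # the base objective only cares about x_true

def mse_grads(w_proxy, w_true):
    """Gradients of the MSE loss for the linear model
    y_hat = w_proxy * x_proxy + w_true * x_true."""
    residual = w_proxy * x_proxy + w_true * x_true - y
    return 2 * np.mean(residual * x_proxy), 2 * np.mean(residual * x_true)

# At initialization: both gradients are large (roughly -2 each), so there is
# plenty of pressure to learn *something*.
print(mse_grads(0.0, 0.0))

# After the proxy weight has been fit (while x_true "wasn't in the world-model
# yet"): the residual is already tiny, so the gradient toward the true feature
# has shrunk to roughly -0.02, about 100x weaker, even though w_true = 1,
# w_proxy = 0 is the solution the base objective actually "wants".
print(mse_grads(0.99, 0.0))
```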
2. How does that lead to inner misalignment?
It seems plausible that general intelligence is binary. A system is either generally intelligent, or not; it either implements general-purpose search, or it doesn't; it's either an agent/optimizer, or not. There's no continuum here; the difference is sharp. (In a way, it's definitionally true. How can something be more than general? Conversely, how can something less than generally capable be called "generally intelligent"?)
Suppose that the ML model we're considering will make it all the way to AGI in the course of training. At some point, it will come to implement some algorithm for General-Purpose Search (GPS). The GPS can come from two places: either it'll be learned as part of the world-model (if the training data include generally intelligent agents, such as humans), or as part of the ML model's own policy. Regardless of its origin, it will almost certainly appear at a later stage of training: it's a high-level abstraction relative to any sense-data I can think of, and the GPS's utility can only be realized if it's given access to an advanced world-model.
So, by the time the GPS is learned, the ML model will have an advanced world-model, plus a bunch of shallow heuristics over it.
By its very nature, the GPS makes heuristics obsolete. It's the qualitatively more powerful optimization algorithm, one that can, in principle, replicate the behavior of (or spin up) any possible heuristic, and do so with greater accuracy and flexibility than the SGD.
If the SGD were patient and intelligent, the path forward would be obvious: pick out the abstraction the base objective cares about in the world-model, re-frame it as the mesa-objective, then aim the GPS at optimizing it. Discard all other heuristics.
However, it's not that easy. Re-interpreting an abstraction as a mesa-objective is a nontrivial task. Even more difficult is the process of deriving the environment-appropriate strategies for optimizing it — the instincts, the technologies, the sciences. If evolution were intelligent, and had direct write-access to modern geneticists' brains... Well, replacing their entire value system with an obsession with increasing their inclusive genetic fitness wouldn't instantly make effective gene-maximizers of them. They'd get there eventually, but it would require a significant amount of re-training on their part — despite the fact that they know perfectly well what a "gene" is[1].
So there wouldn't be strong gradients towards aiming the GPS at the representation of the base objective. No, gradient starvation would rear its head again:
There'll already be a lot of heuristics aimed at optimizing upstream correlates of the base objective, and their weighted sum will presumably serve as a good proxy objective (inasmuch as the model would've already been optimized for good performance even prior to the GPS's appearance). These heuristics will contain a lot of what we want: the instincts and the local knowledge of how to get things done in the local environment.
So the SGD will enslave the GPS to the heuristics. The GPS will be used to improve the heuristics' efficiency, gradually re-interpreting said heuristics as instincts, skills, or mesa-objectives. They will morph from optimization procedures into optimization targets.
At this point, the SGD will have two choices regarding the parameter updates: it can spend its "budget" to tweak the mess of mesa-objectives to more closely resemble the actual base objective, or it can just increase the power of the GPS (give it more processing power, more working memory, access to a richer world-model...). The SGD will probably do both to some extent, but which way will it lean?
Well, as above, the sum of the mesa-objectives should be a good proxy of the base objective at the current level of optimization power, such that optimizing for them optimizes for the base objective. In addition, they will likely take up a whole lot of parameters. It'll take a lot of updates to meaningfully rewrite them, and slight tweaks will correspond to just that: slight performance improvements. Conversely, slightly improving the GPS will likely lead to outsized improvements in performance[2].
Up to a point. Tails do come apart, and at some point, further improvements to capabilities will lead to worse performance on the training objective, requiring adjusting the mesa-objective to be more in line with it. But then, post-adjustment, the mesa-objective will become a good proxy again, adapted for this new level of optimization power, and the SGD will switch back to capability improvement. It'll alternate between these two modes, until...
Until the ML model becomes clever enough to be situationally aware and try its hand at gradient hacking and deceptive alignment. Then the winds will change for good: improvements to its capabilities will lead to outsized improvements in its ability to derive the base objective at runtime, which will lead to outsized improvements in the ML model's performance on that base objective. At the same time, adjusting its mesa-objective will still lead only to marginal improvements. (At best. If the ML model is smart enough, it might destroy the gradients between its current values and the base objective altogether, by e.g. committing to output garbage if it notices meddling.)
Thus, there's a fairly narrow range of capabilities at which the ML model is smart enough that tails come apart, yet not smart enough to fall into the deceptive-alignment attractor. While it occupies that range, its mesa-objective will be moved towards the base objective. But by default, I think, it leaves that range fairly quickly.
And so we get deceptive alignment, by strong default.
(In addition, I argue that this causes high path dependence in sufficiently advanced models/AGI-level models, under this formulation:
The features the ML models learn, and their order, appear to be a robust function of the training data + the training process, so I suspect there isn't much variance across training runs. But the final mesa-objectives are a function of a function of ... a function of the initially-learned shallow heuristics — I expect there is strong path-dependence in that sense.)
3. Recap
Open questions: To what extent do the mesa-objectives get adjusted towards the base objective once the GPS crystallizes? How broad is the range between tails-come-apart and deceptive alignment? Can that range be extended somehow?
4. What can be done?
Well, replacing the SGD with something that takes the shortest and not the steepest path should just about fix the whole inner-alignment problem. That's a pipe dream, though.
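(For concreteness, here's a minimal sketch of what the "stochastic cone descent" suggestion quoted at the top might look like. This is my own toy interpretation: I'm reading "the current gradient" as the descent direction -grad, and all the function names and defaults below are made up for illustration.)

```python
import numpy as np

def cone_descent_direction(grad, x, x0, alpha):
    """Among unit vectors within angle `alpha` of the descent direction -grad,
    return the one with the smallest scalar product with (x - x0), i.e. the
    allowed direction that points most nearly back toward the initialization."""
    g = -grad / (np.linalg.norm(grad) + 1e-12)   # unit descent direction
    v = x - x0
    v_norm = np.linalg.norm(v)
    if v_norm < 1e-12:
        return g                                  # still at x0: plain steepest descent
    v_hat = v / v_norm
    if -v_hat @ g >= np.cos(alpha):
        return -v_hat                             # "straight back toward x0" already lies inside the cone
    # Otherwise the minimizer sits on the cone's boundary: rotate g by alpha,
    # within the plane spanned by g and v, away from v.
    v_perp = v_hat - (v_hat @ g) * g
    u = -v_perp / (np.linalg.norm(v_perp) + 1e-12)
    return np.cos(alpha) * g + np.sin(alpha) * u

def stochastic_cone_descent(grad_fn, x0, alpha=0.3, lr=1e-2, steps=10_000):
    """An SGD-like loop that steps along the cone-constrained direction
    instead of the raw negative gradient."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        x = x + lr * cone_descent_direction(grad_fn(x), x, x0, alpha)
    return x
```

For alpha below 90 degrees every step still lies within the cone around steepest descent, so the loss keeps decreasing locally; the constraint only breaks ties in favor of staying near x0 rather than going wherever is steepest.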
Failing that, it sure would be nice if we could get rid of all of those pesky heuristics.
One way to do that is featured here. Take an ML model optimized for an objective A. Re-train it to optimize an objective B, picked such that we expect the ML model's WM to already feature the abstraction that B is defined over (a diamond, for example). The switch should make the mess of heuristics for optimizing A obsolete, reducing the gradient-starvation effect from them. And, if we've designed the training procedure for B sufficiently well, presumably the steepest gradients will be towards developing heuristics/mesa-objectives that attach directly to the B-representation in the ML model's WM.
John counters that this only works if the training procedure for B is perfect — otherwise the steepest gradient will be towards whatever abstraction is responsible for the imperfection (e.g., caring about "things-that-look-like-a-diamond-to-humans" instead of "diamonds").
Another problem is that a lot of heuristics/misaligned mesa-objectives will presumably carry over. Instrumental convergence and all — things like power-seeking will remain useful regardless of the switch in goals. And even if we do the switch before proper crystallization of the power-seeking mesa-objective, its prototype will carry over, and the SGD will just continue from where it left off.
In fact, this might make the situation worse: the steepest path to achieving zero-loss on the new objective might be "make the ML model a pure deceptively-aligned sociopath that only cares about power/resources/itself", with the new value never forming.
So here's a crazier, radical-er idea:
Naively, what we'll get in the end is an honest genie: an AI that consists of the world-model, a general-purpose problem-solving algorithm, and minimal "connective tissue" of the form "if given a command by a human, interpret what they meant[4] using my model of the human, then generate a plan for achieving it".
What's doing what here:
And so we'll get a corrigible/genie AI.
It sure seems too good to be true, so I'm skeptical on priors, and the pragmascope would be non-trivial to develop. But I don't quite see how it's crucially flawed yet.
[1] Presumably they'll have to re-train as businessmen and/or political activists, in order to help the sperm donor companies they'll start investing in to outcompete all other sperm donor companies?
[2] Perhaps the same way IQ-140 humans are considerably more successful than IQ-110 ones, despite presumably small neurological differences.
[3] Just put it through training episodes where it's placed in an environment and given a natural-language instruction on what to do, I guess?
[4] Really "meant", as in including all the implied caveats like "when I ask for a lot of strawberries, I don't mean tile the entire universe with strawberries, also I don't want them so much that you should kill people for them, also...".
Importantly, what we don't want to use here is the speed regularizer. It's often mentioned as the anti-deception tool, but I'm skeptical that it would help (see sections 1-3) — it wouldn't intervene on any of the dynamics that matter. Meanwhile, our "clean genie" AI will be slow, in the sense that it'll have to re-derive all of the environment-specific heuristics at runtime. We don't want to penalize it for that — that'd be synonymous with encouraging it to develop fast built-in heuristics, which is the opposite of what we want.