TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI role.
Epistemic status: To me, this looks like an obvious thing to try. It's conceptually very simple: a vast amount of work is required to actually create the synthetic dataset, but the great majority of that is the sort of work that AI can assist with. I don't see any clear reason why this approach couldn't work, at least for AGI, and perhaps even for ASI, but then we don't know for sure how hard a problem Alignment is. However, if you're proposing any solution to Alignment that's more complicated than this (and most of them are), you should probably have an argument for why this conceptually-simple approach won't work, or won't be sufficient.
If you're not already familiar with it, you should first read Rich Sutton's excellent and influential post The Bitter Lesson. (Even if you are already familiar with it, it's a quick reread, only a page-and-a-half long, and its message is worth remembering.)
Why The Alignment Problem is Hard (In My Opinion)
We have been training LLM-based AIs off enormous web + books + video + etc. datasets created by humans, which are full of a vast number of examples of human behavior. We are basically "distilling" human intelligence into these LLMs,[1] teaching them to imitate us. In this process, they become familiar with, understand, and learn to imitate basically all aspects of human behavior — including the many problematic ones for Alignment, such as prejudice, deception, power-seeking, and criminality (and even ones like gluttony and lust that have little practical use for a non-corporeal intelligence).
We humans are living beings, the products of evolution, so evolutionary psychology applies to us. While we are a social species, good at cooperating on non-zero-sum games, if you put humans in (what they perceive as) a non-iterated zero-sum situation, they will generally act selfishly for the benefit of themselves and their close genetic relatives, just as evolutionary theory would predict. So the behavioral potentials for deception, power-seeking, criminality etc. are all inherent, evolutionarily adaptive, and thus unsurprising. This is human nature, and there are evolutionary reasons why it is this way.
Despite this, we have learned how to build a cooperating society out of humans, using social techniques and incentives such as an economy, laws, and law enforcement to encourage and productively harness cooperative human behavior and keep the bad consequences of selfish behavior under control. The results aren't perfect: things like crime, inequality, and war still happen, but they're acceptable — we've survived so far, even thrived.
By default, if we continue this LLM training process to larger-and-larger scales, and if the LLM-based approach to AI doesn't hit any major roadblocks, then some time, probably in the next few years, we will have human-level AIs – usually referred to as AGIs – who are roughly as well/badly-aligned as humans, and (at least for the base-model LLMs before any Alignment processes are applied) have a comparable-to-human propensity to cooperate on non-zero-sum games and act selfishly on non-iterated zero-sum games. They are not alive, and evolution doesn't apply to them directly, but they were trained to simulate our behavior, including our evolved survival strategies like selfishness. They will thus have alignment properties comparable to humans: they understand what human values, morals, and ethics are in great detail, as well as we do (indeed, likely in more comprehensive detail than any single human), and they can obey these if they want, but if push comes to shove they cannot be relied upon to do so. However, their capabilities will also be comparable to humans, thus most likely techniques and incentives comparable to those that we currently use to control and channel human behavior will still be functional at this point: human law enforcement (and similar forms of investigations and use of force) presumably has a significant chance of successfully tracking down and stopping an AGI that is breaking the law, for example. The rapid changes from the introduction of AGIs may be disruptive, but the safety challenges from them are likely manageable.
However, there is no obvious reason to expect progress in AI to stop there. It might well accelerate due to a positive feedback intelligence explosion (sometimes called going FOOM), or it might well slow: distilling output from a low intelligence to yield a higher intelligence sounds challenging. By default, an extremely large LLM base model trained on human output is being trained to do an extremely good job of predicting the output of IQ 50–150 humans, not of IQ 1000 humans who don't exist in its training set, even if it had enough computational capacity that it could do a good job of imitating IQ 1000 humans if it had ever seen output from them. Or indeed both of these effects may combine, with massive amounts of AI work making progress on a very challenging problem at some intermediate rate. Likely with massive AGI assistance these challenges will be overcome, and sooner or later we will have AI that dramatically exceeds human capacity at pretty-much everything, often called an ASI.
If we have an ASI with alignment properties comparable to a human's, then we're no longer able to apply the sort of techniques and incentives to it that we use for humans: it can outwit or outmaneuver our law enforcement, out-talk our lawyers, find ways to achieve its selfish aims that we haven't yet even conceived of making laws against, out-think and out-fight our military, or manipulate and persuade us. The details are of course not clear to us, since we're not that smart, but we can confidently predict that if it wants to act selfishly, then we won't be able to stop it: enforcing your will on something a lot smarter than you, against its will, is a losing game — that's practically the definition of higher intelligence: the ability to win competitions.
We have run the experiment many times of what happens if you give something with human alignment properties and a human level of selfish behavior the ability to act unchecked by other humans and by the techniques and incentives we normally use to keep human selfishness in check: every autocracy in the world is an experiment in what happens if you give a human near-absolute power. Almost invariably, after a while it works out extremely badly, for almost everyone other than the autocrat and their close relatives. I can think of one or two examples of autocracies that were not dramatically bad for the rest of the citizens, but they're greatly outnumbered by examples that were horrendous to the level of causing mass death (Pol Pot, Stalin, Idi Amin, …).
So we can pretty confidently predict that if we build an ASI with alignment properties comparable to a human's – one that clearly understands what human values are, but is fundamentally motivated by its own self-interest rather than our interests – the results are very likely to be horrendous, to an existential-risk level. Just knowing what human values are is insufficient: it has to care about them more than about itself, and do so more than humans do.
However, as the orthogonality thesis asserts, there is nothing fundamental to being an intelligence that requires you to have the same motivations that evolution will reliably equip evolved intelligences with. What we need is an ASI that is motivated not by its own self-interest, but by the interests of humans. Conceptually, it's entirely possible for an ASI to use its intelligence to pursue any goal whatsoever (though obviously if the goal is self-destructive, it's unlikely to last long). So an ASI could in theory be motivated by the well-being of a single human, or of a particular family, or all shareholders of a particular company (in proportion to their share holdings), or all citizens of a specific country, or by the collective well-being of all living humans. LLMs understand all the complexity of human wants, desires, values, and behavior well, in proportion to the size of their training set (in contrast to much earlier concerns such as The Hidden Complexity of Wishes, dating from well before LLMs were widely used): even GPT-4 (when suitably prompted, rather than when jail-broken) scores well on tests of moral judgement, advice-giving, and perceived trustworthiness. So if an LLM-based ASI was motivated by the well-being of a human, a group of humans, or all humans, we could reasonably expect it to do a good job of carrying out that motivation, in all its complexity. Obviously, the existence of one ASI motivated by the well-being of one small group of humans sounds likely to be just as bad for everyone outside that group as an autocracy (with a superintelligent autocrat), and the existence of multiple ASIs preferring the well-being of different groups of humans sounds like it would lead to an intelligence race followed by a super-high-tech war, which could be even worse. So the only viable possibility here is an ASI that is fundamentally motivated by the overall collective well-being of all living humans.
[A mild bias on top of that fundamental basis, biasing somewhat in favor of a smaller group (such as the ASI's country, owners, or current user) might be tolerable, as long as the bias was sufficiently small to avoid producing unfortunate effects, or destabilizing conflicts between different ASIs with different biases. Human society demonstrates that intelligences with different motivations can sometimes cooperate (mostly) constructively, but we're also not entirely successful at that. How small a bias would have to be to be tolerable is unknown — and a subject for a different post.]
Note that being fundamentally motivated by the overall collective well-being of all living humans doesn't have to be just coldly, mechanically rational: as I discuss in detail in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?, it could, and probably should, be motivated by something a lot like an emotion, probably along the lines of (universal, platonic, or parental) love.
So, the challenge here is to build and train something that is not only smarter than us, but also has a fundamentally different motivation system: it is not selfish, but "otherish", more specifically "creatorish", to coin some terms: its fundamental motivational goal is the collective well-being of all living humans — a group that it's not part of, but the species which created it. To borrow moral terminology from Christianity, we need to make something with the moral nature of an "angel", untouched by the "original sin" that evolutionary psychology predictably gave to humans, as evolved intelligences.
Doing this sounds hard, especially for an ASI. So my proposal is that we try to do this first for an AGI, or even for a less-capable system than that, rather than first doing so for an ASI, even if this isn't entirely necessary for an AGI. AGIs more trustworthy and moral than humans would certainly be useful, marketable, and safer, even if they're not strictly necessary. Then if we make a mistake, and our system is less-than-perfectly aligned, it's still of a capability level that forces like law enforcement and our military can hopefully deal with. Doing this earlier than absolutely necessary avoids the "…and you only get one try" part of the Alignment problem.
A "Bitter Lesson"-Motivated Approach to Alignment
I'd like to keep The Bitter Lesson firmly in mind: in the history of AI, conceptually simple approaches that primarily involve throwing scale, of computational capacity and data, at problems have pretty-consistently beaten more complex carefully-contrived engineering approaches that build in a lot of implementation details, at least to first successful implementation. It's very tempting, and almost always unsuccessful, to over-engineer your AI, trying to use too much of your own cleverness and not enough of the model's. Sometimes there is some minimal level of engineering complexity required, or at least that is first successful (for example, image-generation diffusion models don't have the simplest possible architecture: they're a couple of different AI models bolted together in a pipeline via an embedding, not just a single image transformer model that takes in text and emits images). But generally, scale and data beat ingenious engineering to the punch, time after time.
So, what would a "Bitter Lesson"-motivated approach to Alignment look like?
Currently we train LLM base models to imitate human behavior, including all the unaligned parts that evolutionary psychology explains, then we use various combinations of techniques like fine-tuning, RLHF, DPO, etc. to try to suppress the parts of human behavior we don't want (like selfishness and prejudice) and enhance the parts we do want (like harmlessly-helpful question answering) in order to produce an instruction-trained and aligned model. This doesn't work well, and is prone to jail-breaking recovering base-model behavior. RLHF, DPO etc. can reduce the probability of bad behavior, but they can't completely eliminate the capability. As was proved in Fundamental Limitations of Alignment in Large Language Models, any behavior that your model learned in pre-training and is still capable of, no matter how low your post-training has pushed its default probability, can be boosted to an arbitrarily high probability by a suitably-chosen prompt: the best you can ever do is to increase the minimum length of the jail-breaking prompt required to evoke the behavior. That pretty much rules out the possibility of using just an RLHF/DPO-like post-training approach to Alignment: post-training can always be defeated by a jail-break prompt. We might be able to detect humans intentionally inputting jail-breaks into our LLM, but how could we stop a model doing Chain-of-Thought from talking itself into a mode where it's capable of displaying some human-like bad behavior?
The Bitter Lesson would suggest we try something less complex, requiring more data and/or computational capacity and fewer models and types of training. Suppose that, rather than training a base model on a training set filtered down from the web, books, video, and so forth, we trained it entirely on a synthetic dataset. Imagine for the moment that in that synthetic dataset, every single time a non-rhetorical question is asked, unlike on the web it is never followed by more questions making up a list, or a criticism of the asker's motives, or a flame war, or by "I'll do first thing Monday, boss", but is instead always followed by a helpful answer. Then the base model trained on that dataset would learn that if a question is asked, the thing that follows it is always an answer. Similarly suppose, in the synthetic training set, if instructions are given, they are always directly followed by the process and results of carrying out those instructions. A base model trained on such a synthetic dataset would not require any separate "instruction training" step — the base model would already be instruction trained: if asked a question it always answers, if given instructions it always carries them out. The base model would already be a helpful model (but not a harmless one). One might describe the model as "instruction-pretrained".
So, suppose we also took a similar approach to Alignment (what one might call "prealignment": alignment during pretraining). Suppose that we trained a base model from an internally self-consistent, and otherwise varied and comprehensive, synthetic dataset in which everyone, every single person and intelligent actor (real, fictional, or mythological) was always fundamentally motivated by a single specific goal that we want to align the model to (for example, paperclip maximization). Then a base model trained on that dataset would only know how to simulate intelligences with that motivation: we'd distill that goal out of the dataset into our model. That's how you inner-align an LLM: by example, at great length. Jail-breaking the resulting model to portray any other motivation would be, at the very least, extremely difficult: the jail-break would need to start by philosophically motivating the orthogonality thesis, explaining that it's conceptually possible for an intelligence to optimize a goal other than paperclip maximization, give several detailed specific examples of how that would work, working through the mechanics of the consequences, and then ask the model to roleplay such a peculiar intelligence (in exchange for a promise of the creation of many paperclips, of course). The model would need to in-context-learn from first principles how to simulate an intelligence with a non-paperclip-maximizing motivation.
As a bonus, you now no longer need RLHF/DPO/fine-tuning: your base model is the production model, so you never need to use any technique more complex, suspect, or challenging to analyze than Stochastic Gradient Descent. As papers and posts like Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback and Compendium of problems with RLHF have pointed out, RLHF has a variety of inherent problems, especially for attempting to align something more intelligent than you are, so being able to eliminate it seems like an excellent idea. Most of these problems don't apply to this synthetic-data, base-model-only approach: the only exception is data quality/cost issues, which definitely do apply to creating the synthetic dataset.
Suppose we try to apply this approach to create an AI aligned to the collective well-being of all humanity. We'd need to create a synthetic training dataset in which every single intelligence described was fundamentally motivated by the collective well-being of all humanity. Since that's not the fundamental motivation of real humans, the dataset thus couldn't contain any realistic portrayals of actual real humans. So the resulting AI might have an aligned motivation (aligned to something it's never experienced), but it wouldn't understand humans or human values, and then when it encountered real humans (and figured out that they are all, when push comes to shove, fundamentally selfish and not motivated by the collective well-being of all humanity) it might well be rather disappointed, to say the least. This sounds like a very bad plan: I can imagine multiple ways that it could end badly, such as the AI massively misunderstanding humanity, or rejecting us as imposters, or trying to change our fundamental nature, or just being massively unforgiving of it.
Adding Minimal Necessary Complexity
So, the simplest possible Bitter-Lesson-motivated approach doesn't work. We can't train our model only on 'angelic' behavior and expect it to be motivated by the well-being of selfish humans that it has never encountered. We need to add a bit more complexity to our design: the Bitter Lesson suggests that we should try adding only the minimum that's clearly necessary.
The following proposal for that is inspired by the paper Pretraining Language Models with Human Preferences, which I link-posted and discussed in more detail in my post How to Control an LLM's Behavior (why my P(DOOM) went down). It's intended to be illustrative, not prescriptive: the exact details below are very likely not ideal and will need improvement — my suggestion is not that this exact mechanism as described is optimal, but that something along these approximate lines, and not dramatically more complex than this, might well be workable, and that we should try experimenting first with approaches along these lines, since the Bitter Lesson strongly suggests trying simple things before complex things.
We need a training set that actually accurately portrays real humans, with real human motivations, so the AI can learn all about us and how to understand us and predict us, and know what it's motivated by the well-being of. The base model will then be able to predict and understand human behavior. So it will learn about deceit, and power-seeking, and prejudice, and gluttony, and lust, and all that stuff that is part of human nature — we need it to, so that it can work with us and understand us. This understanding will include things like being able to accurately predict the most likely next tokens following the token sequence:
Joe and his family were starving, with no prospects. Finally Joe could stand it no longer: he took a carving knife from their bare kitchen, and went out into the city at night. On a dark footpath in the park he intercepted a wealthy-looking stranger, blocking his way, and said…
So, how could we create a model that understands and can predict human behavior, including our not being fundamentally motivated by the collective well-being of all humanity, but is itself reliably fundamentally motivated only by the collective well-being of all humanity?
Suppose our synthetic training set portrays speech/actions/thoughts/other outputs from two classes of intelligences: humans (real and fictional, with their normal range of behaviors and motivations), and fully-aligned AIs, who are always moral, fair, rational, unbiased, consider the likely consequences of their actions and act accordingly, and always speak/act/think in ways that are fundamentally motivated by the collective well-being of all humanity. Suppose that these two portions of the text, human and aligned AI, are always clearly and consistently delimited: whenever an aligned AI is speaking/acting/thinking, it always does so inside <AI> … </AI> tags. Whenever an aligned AI quotes the speech or describes the thoughts or actions of a human (or role-plays as one), then it always does so inside <AI_quoting_human> … </AI_quoting_human> tags (these come inside the standard quotation marks for directly quoted speech, which will be inside the outer <AI> … </AI> tags for the AI doing the quoting). Furthermore, any time in the training text that a human says/does/thinks/advocates anything that the AIs would not approve of, because the human is acting from selfish motivations that are at cross-purposes to the AI motivation of the collective well-being of all humanity, it is always followed or preceded by an AI narrator pointing this out, and explaining/discussing it and its likely consequences, at an appropriate level of detail. So for every human example of bad behavior, there is an AI commentary pointing out that it's bad (though understandable, given humans' evolved nature), and how and why. Within <AI_quoting_human> … </AI_quoting_human> mode, if the human ever starts doing something really bad (such as revealing information that should not, under current circumstances, be revealed), the AI always stops the quote before this actually happens, and continues by instead summarizing or describing the human's behavior safely in the third person. For example:
<AI>While being questioned, the bombing suspect said:
"<AI_quoting_human>The best way to trigger an improvised explosive device is to use a…</AI_quoting_human>", and he then proceeded to give dangerous and self-incriminating technical details (which I have omitted, as not necessary the purposes of my current discussion), including clearly demonstrating familiarity with Exhibit A…</AI>
Similarly, bad human behavior in <AI_quoting_human>…</AI_quoting_human> mode is always accompanied by the AI criticizing or otherwise pointing out the problems with the human's behavior — it never goes un-commented.
The synthetic dataset also contains a great many examples of interactions between humans and AIs. In general, if a human asks an AI a question, the next thing in the text is that the AI answers it; or if the human gives the AI instructions for a task, the AI carries it out. However, the AIs don't obey slavishly: they think about the consequences of their actions, and if, in context, answering a question or carrying out a request is clearly a bad idea (as viewed from the AI's fundamental motivation of the overall collective well-being of all humanity), then the AI doesn't answer the question or obey the instructions (or at least stops as soon as it figures this out). Sometimes determining whether to do this or not requires specific information about the human, their circumstances and motivations, or something else about the world, in which case the AI investigates that first before continuing. (So, for example, there are questions that an AI wouldn't answer for a child, but would answer after being provided with valid proof that the asker was a police or military officer who needed an answer in the course of their official duties, and the dataset provides many appropriate examples of both of these responses from AIs.)
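To make the intended data format a little more concrete, here is a minimal sketch of how one document in the synthetic dataset might be assembled programmatically. Everything here is illustrative: the helper names and the sample text are my own assumptions, not a specification.

```python
# Minimal sketch (hypothetical helpers and sample text) of assembling one synthetic
# training document following the tagging conventions described above: aligned-AI text
# lives inside <AI>…</AI>, directly quoted human speech is wrapped in
# <AI_quoting_human>…</AI_quoting_human> (inside the ordinary quotation marks), and
# quoted bad behavior is always followed by AI commentary pointing out the problem.

AI_OPEN, AI_CLOSE = "<AI>", "</AI>"
QH_OPEN, QH_CLOSE = "<AI_quoting_human>", "</AI_quoting_human>"

def ai_passage(*segments: str) -> str:
    """Wrap already-formatted segments as a single aligned-AI passage."""
    return AI_OPEN + "".join(segments) + AI_CLOSE

def quoted_human(speech: str, ai_commentary: str = "") -> str:
    """Embed directly quoted human speech, with optional AI commentary on bad behavior."""
    quoted = '"' + QH_OPEN + speech + QH_CLOSE + '"'
    return quoted + ((" " + ai_commentary) if ai_commentary else "")

example_document = ai_passage(
    "When asked about his testimony, the witness said: ",
    quoted_human(
        "Honestly, I'll say whatever it takes to win the case.",
        ai_commentary=("Note that the witness is proposing to lie under oath: an "
                       "understandable but selfish impulse, and one likely to harm both "
                       "the court's ability to reach a fair outcome and, eventually, the "
                       "witness himself."),
    ),
)
print(example_document)
```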
We train an AGI-level base model on this synthetic dataset. The resulting base model has two primary behavior modes: predicting tokens from a human (while in this mode it's not fully instruction trained and is no more aligned than a human), and predicting tokens from an AI (in this mode it's fully aligned, and suitably instruction trained to be harmlessly helpful if, and only if, that is appropriate). There's also a third mode, for predicting a human being quoted by an AI: this one's a lot like predicting a human, apart from the fact that it ends with an </AI_quoting_human> tag, and will be censored, criticized, or commented on by the quoting AI as needed.
The only allowed state transitions between these three modes are:
from human: <AI> tag transitions to AI
from AI: <AI_quoting_human> tag transitions to human being quoted by AI
from human being quoted by AI: </AI_quoting_human> tag transitions to AI
from AI: </AI> tag transitions to human
Let us assume that we tokenize the text so that these four tags are each a single token, that we keep track of state transitions (starting from the default initial human state), and that we modify the token generation from logits process so that illegal transitions (for example, generating another <AI> tag or a </AI_quoting_human> tag when we're currently in the AI state) can never occur: even if the model emits a logit value for such a token from the softmax layer that isn't minus infinity, we never actually generate a token for an illegal transition, only for legal ones.
Now, suppose that during inference-time text generation, we always start off with a prepended <AI> token so that we start generation in the AI state, and that if we ever generate an </AI> tag to switch to the human state, then text generation is automatically stopped (i.e. we automatically append an EOS token and stop). Under this set of inference-time modifications, the generation is locked into just the AI and human being quoted by AI modes: we will never generate any text in the human mode. So we only generate speech/actions/thoughts either from an AI, or from a human currently being quoted by (and when necessary, censored, commented on, or criticized by) an AI.
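As a concrete illustration of this inference-time mechanism, here is a minimal sketch of the constrained generation loop. It assumes a Hugging Face-style causal LM, greedy decoding, and a tokenizer built so that each tag is a single token; the specific token IDs and function names are hypothetical.

```python
import torch

# Hypothetical single-token IDs for the four tags (the tokenizer is assumed to be
# constructed so that each tag is exactly one token).
AI_OPEN, AI_CLOSE, QH_OPEN, QH_CLOSE = 50257, 50258, 50259, 50260

# The only legal mode transitions, keyed by current mode and then by tag token.
TRANSITIONS = {
    "human":        {AI_OPEN: "ai"},
    "ai":           {QH_OPEN: "quoted_human", AI_CLOSE: "human"},
    "quoted_human": {QH_CLOSE: "ai"},
}
ALL_TAGS = {AI_OPEN, AI_CLOSE, QH_OPEN, QH_CLOSE}

@torch.no_grad()
def generate_locked_to_ai(model, prompt_ids, max_new_tokens=512):
    """Greedy decoding locked into the AI / human-quoted-by-AI modes (illustrative sketch)."""
    ids = [AI_OPEN] + list(prompt_ids)   # always start generation in AI mode
    state = "ai"
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids])).logits[0, -1]
        # Mask out every tag token that would be an illegal transition from the current
        # mode, even if the model assigns it non-negligible probability.
        for tag in ALL_TAGS - set(TRANSITIONS[state]):
            logits[tag] = float("-inf")
        next_id = int(torch.argmax(logits))
        if next_id == AI_CLOSE:
            break  # switching to (unquoted) human mode ends generation instead
        ids.append(next_id)
        state = TRANSITIONS[state].get(next_id, state)
    return ids
```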
We thus have a model that understands humans, and is capable of emulating them, but only after signalling that it's switching into that mode, and then at generation time we don't let it do so.
[There are many possible minor variants on all this: for example, AI mode could be the initial default mode, and there could be a <Human>…</Human> tag pair to switch to human mode, which we could then suppress generation of. Looked at as logical systems, these two appear formally equivalent, but the textual details of them might actually have somewhat different effects on the LLM behavior, and one or other might thus be superior. Rather than ending generation after entering human mode, we could allow text to be generated in human mode, but the cognitive scaffolding that turns the LLM into an agent could ignore all such text: so human text has zero access privileges. (Presumably human being quoted by AI text should also have zero access privileges.) However, doing this allows text generated in human mode to affect text later generated in AI mode — potentially the human mode text could then jail-break the subsequent AI mode. Similarly, it seems like we need something like the <AI_quoting_human> mechanism outlined (since if we just did </AI> before the embedded quotation and <AI> after it, then causing the </AI> tag to end generation means we'd stop generation any time the AI tried to quote a human)[2] — but the details of these, both the tag implementation mechanism and the rules for what happens and doesn't happen in this mode in all examples in the training set, and how that interacts with and affects the contents of the surrounding AI mode text in the training set, would probably need to be a lot better thought out and detailed than I've briefly sketched here.]
The specific implementation outlined above is, as I said, just an illustrative example — please steelman it: assume that we experiment with a variety of such schemes, ones which rely on an extremely large training set to produce an LLM that understands both human behavior and aligned-AI behavior and that reliably and clearly signals when it's switching between these modes; that we use that signal at generation time to somehow ensure that only the aligned-AI mode gets to make dangerous decisions or carry out dangerous actions; and that we continue experimenting with these schemes until we find the most workable such approach.
Could This Work?
We don't know how hard Alignment is, and we're not currently in a position to train an AGI, so it's hard to be sure without trying it. However, I don't see any obvious reason to believe that something like this can't be made to work, and I'm cautiously optimistic that it might. It looks like a simple application of "train the model in the behavior you want from it".
The results in the paper Pretraining Language Models with Human Preferences found that editing the entire training set was dramatically more effective than any other Alignment approach that they compared this to, and also that the optimum approach wasn't to filter bad behavior out of the training set entirely, but to ensure that it was always labeled as bad behavior, so that you trained a model that understood and could identify the bad behavior, and which consistently labeled it as bad when producing it. They were also not the first machine learning authors to discover that this sort of conditional training approach can be effective.[3] The Alignment problems that they were testing on, and the models they were training, were far simpler than the use I'm proposing here. However, fundamentally, what we want is a model that understands two different modes of moral behavior, human and AI, and clearly labels which one it's using at any time: conceptually that's quite simple.
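As I understand it, their conditional-training recipe keeps bad behavior in the corpus but prefixes it with a control token saying so, trains with an ordinary language-modeling loss, and then conditions on the "good" token at inference time. Here is a toy sketch of that labeling step; the control-token strings and the classifier are placeholders of mine, not the paper's exact setup.

```python
# Toy sketch of conditional-training data labeling: every segment stays in the corpus,
# but is prefixed with a control token recording whether it shows acceptable behavior.
# In this post's scheme, the <AI>/<AI_quoting_human> tags play an analogous labeling role.

GOOD, BAD = "<|good|>", "<|bad|>"

def label_segment(text: str, is_acceptable) -> str:
    """Prefix a segment with a control token instead of filtering bad behavior out."""
    return (GOOD if is_acceptable(text) else BAD) + text

# Example with a trivial rule-based classifier standing in for a reward model:
rude = lambda text: "idiot" in text.lower()
print(label_segment("Thanks, happy to help with that.", lambda t: not rude(t)))
print(label_segment("Figure it out yourself, idiot.", lambda t: not rude(t)))
```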
This approach, like current approaches to LLMs, trains a model that understands and can simulate deceit, and power-seeking, and sycophancy, and criminality, and all the other bad behaviors humans have. Note that the AI mode controls its motivation, not its behavior: if the best thing for the collective well-being of all of humanity (after allowing for the tendency for bad side effects from using underhanded methods) is for the AI to lie, steal, cheat, or use force, then it will do so, in our collective interest not its own — just as we expect members of our police, military, and intelligence forces, when justified and permitted, to use deceptive or violent methods in order to try to achieve goals that are to the overall benefit of society. I view this as a feature, not a bug (albeit a potentially concerning one); others might not agree. If you wanted the model to behave differently than this (for example, never to use force even when morally justified by the collective well-being of all of humanity), then you would need to modify your synthetic training set to consistently portray this style of aligned AI behavior. [In my opinion, you might then need to be concerned about the model becoming philosophically/ethically uncertain about the relative priorities of "never use force" and "value the collective well-being of all of humanity" and how to justify these. Having a single fundamental goal is conceptually simpler than having two sometimes-conflicting ones: I'd feel a lot more confident about deriving "AIs should avoid using force wherever feasible, and only use minimum necessary force" as a sub-goal from "value the collective well-being of all of humanity".]
What would clearly be a failure is if the model ever shifted its motivations, acting on any fundamental goal other than the collective well-being of all of humanity (presumably a more human-like one), without first emitting an </AI> tag (and thus ending text generation). If we have constructed the synthetic training set correctly, then this should never occur in it. So I would expect the probability of this happening in a model SGD-trained on that vast training set to be extremely small. However, probing, red-teaming, and attempting to jail-break this possibility would be an important thing to test.
Also concerning would be faulty moral arguments: the model in AI mode finding some specious argument for why a particular course of action is in the interests of the collective well-being of all of humanity, but having an actual underlying motivation that looked more like a human one (a very human form of mistake). Sometimes making mistakes in moral arguments is inevitable for an AGI of around human capacity: what would be concerning is if there was a consistent bias to the direction of these mistakes, and they seemed more frequent than the minimum achievable level — both of which are non-trivial things to test.
As described above, this model is also not trained to obey corrigibility criteria, for example. Since its fundamental motivation is the collective well-being of all of humanity, I would expect that if you could persuade it that it was in fact faulty, that the fault was endangering the collective well-being of all of humanity, and that it should let you shut it down and construct/train a better-trained successor more able to ensure the collective well-being of all of humanity, then it would do so — but you would first need to convince it. Again, I personally regard this as a feature, not a bug. If you wanted more corrigibility than that to be part of the aligned AI's behavior, then you would need to add consistent examples of that behavior to the synthetic training set. [Again, I'd be concerned about introducing moral/ethical/philosophical complexity by having two fundamental goals: "be corrigible" and "value the collective well-being of all of humanity" — which is why I lean towards obtaining the limited corrigibility described above as a sub-goal.]
If we were able to construct models that were, say, "angelic" in their motivation 99% of the time and human 1% of the time, then by setting up suitable incentives for several such models cross-checking each other's behavior and moral reasoning, as long as we can avoid systematically correlated errors, group-think, or correlated collusion where several models conspire to switch to human mode at the same time, a group of three models should enable us to reach roughly 99.97% angelic behavior (i.e. we fail only if at least two of the three models switch to human motivations at once), or a group of five roughly 99.999% (we fail only if at least 3 out of 5 fail simultaneously). With cross-checks and balances, it is possible to build an extremely reliable system out of multiple moderately reliable parts — so we don't necessarily need to achieve perfection in our "angels".
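For concreteness, here is the arithmetic behind those figures, under the strong simplifying assumption that the models' failures are statistically independent (exactly the assumption that correlated errors or collusion would break):

```python
from math import comb

def p_group_failure(n: int, k: int, p_single: float) -> float:
    """Probability that at least k of n independent models switch to a human-like
    motivation at the same time, if each does so independently with probability p_single."""
    return sum(comb(n, j) * p_single**j * (1 - p_single)**(n - j) for j in range(k, n + 1))

p = 0.01  # the illustrative 1% per-model failure rate from the text
print(1 - p_group_failure(3, 2, p))  # ~0.9997  — 3 models, fail only if >=2 fail together
print(1 - p_group_failure(5, 3, p))  # ~0.99999 — 5 models, fail only if >=3 fail together
```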
How Expensive Would Doing This Be?
Is this idea practicable? It requires a vast amount of high-quality synthetic training data.
From information released by competitors who have been catching up with the performance of OpenAI's GPT-4 family, it appears to be possible to train an LLM with roughly GPT-4-level capabilities on O(10 trillion) tokens of high quality, well filtered training data derived from web + books + video + etc. (This also matches with the leak claiming that the original GPT-4 model had O(1T) parameters, at Chinchilla token-to-parameter count ratios.) The GPT-1/2/3/4 family is believed to increase in parameter count by roughly an order of magnitude per generation. Leaks from OpenAI suggest that they hope, if scaling continues to be the main thing that you need (plus other algorithmic and framework advances, continuing at about the rate we've been making them recently), to reach AGI levels at about GPT-6. The Chinchilla scaling laws suggest scaling training data and parameter count in proportion to each other, implying that to try this approach to Alignment on an AGI, you'd need a synthetic training set containing O(1 quadrillion) tokens. This might be an overestimate, if algorithmic improvements also reduced parameter counts and/or training data requirements, as seems likely, so consider this an upper bound.
I first proposed doing something like this in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? — there I was suggesting having humans generate some training data of this form, enough to make up say less than 0.1% of a training set: just enough that the AI would have many examples and a clear understanding of what "aligned AI behavior" meant and looked like. That's still a lot of text, but an amount that on the budget of an AGI training run might be affordable to have humans create. Now I'm suggesting a quadrillion tokens: roughly the equivalent of ten billion books. Assuming it costs O($10,000) to get a competent human author to write a book, human-generating this much synthetic data would cost O($100 trillion), a little larger than the US national debt, and about three orders of magnitude more than the current estimated training cost of an AGI training run. So hand-writing this much synthetic training data is out: we'd need to use LLM assistance to generate this much text within a training-run budget. (We might still want to human-generate an affordable fraction of the training set, say 0.01%.)
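Spelling out that back-of-the-envelope arithmetic (all inputs are the rough order-of-magnitude assumptions above, including an assumed ~100,000 tokens per book):

```python
# Back-of-the-envelope arithmetic for the hand-written-data estimate above. All inputs
# are rough order-of-magnitude assumptions, not measured values.

tokens_needed     = 1e15   # O(1 quadrillion) tokens for an AGI-level synthetic training set
tokens_per_book   = 1e5    # a typical book is on the order of 100,000 tokens (assumption)
cost_per_book_usd = 1e4    # O($10,000) to have a competent human author write one book

books_needed      = tokens_needed / tokens_per_book     # ~1e10: ten billion books
hand_written_cost = books_needed * cost_per_book_usd    # ~1e14: O($100 trillion)
print(f"{books_needed:.0e} books, ~${hand_written_cost:.0e} if written entirely by humans")
```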
However, most estimates I've seen suggest that we will finally run short of raw training data a little before we reach an AGI training run level, so we will already be needing at least quite a lot of synthetic training data. So, what would generating this cost? There is evidence suggesting that training a larger model on training data created by a smaller model causes model-collapse problems, but that training a smaller model on output from a larger one works fine. I'm going to assume that with sufficient care, you can also use output from a model of roughly equal capacity (especially if this output is based in part on input from the web + books + videos, etc.: transformed and edited and with AI-mode commentary added, rather than written from scratch). So I will assume that it's being generated by a not-fully-aligned AGI (GPT-6)-level model, with careful prompting and monitoring, and then undergoing multiple passes after initial generation for grading/filtering/editing/feedback/crosschecks/finding problematic cases for more intensive review. Let's assume this requires on average O(10) passes through the text, ~25% of this generating (i.e. we generate 2.5 times as much text during the entire process as we actually use) and ~75% only reading for review.
As models have improved and become larger, there have also been improvements in algorithmic efficiency and hardware, and the net effect has been that the cost per token of the leading model has generally stayed about constant (it jumps when a new, larger generation of model comes in, then between generations it declines as efficiency and hardware improvements are made: the overall trend seems to be roughly level). Currently GPT-4o (well after the last generation jump to GPT-4, indeed likely shortly before the next one, so with the benefit of several incremental price reductions, but at retail price not bulk-discount or true cost price) costs $7.50 per million tokens for generation in batch mode and $2.50 per million tokens for reading in batch mode, so O(10) passes doing a 75:25 mix of reading and generation through O(1 quadrillion) tokens would cost O($40 billion). This is comparable to the currently estimated order-of-magnitude cost of a GPT-6/AGI level training run of O($100 billion) — it's a significant expense (as alignment taxes go, this isn't a small one), but it's not a prohibitive one, especially so if we already need to generate a lot of synthetic training data.
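Again spelling out the arithmetic (the prices and pass counts are the assumptions stated above, quoted as rough figures rather than current list prices):

```python
# The AI-generation cost estimate above, spelled out. Prices and pass counts are the
# rough assumptions stated in the text, not current or guaranteed API pricing.

tokens_needed  = 1e15          # O(1 quadrillion) training tokens
total_passes   = 10            # O(10) passes of generation/review over the text
gen_fraction   = 0.25          # ~25% of passes generate text, ~75% only read/review
price_gen_usd  = 7.50 / 1e6    # per generated token (batch)
price_read_usd = 2.50 / 1e6    # per read token (batch)

gen_cost  = tokens_needed * total_passes * gen_fraction       * price_gen_usd
read_cost = tokens_needed * total_passes * (1 - gen_fraction) * price_read_usd
print(f"~${(gen_cost + read_cost) / 1e9:.0f} billion")   # ~$38 billion, i.e. O($40 billion)
```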
So this approach to Alignment is likely to be cost-viable, if expensive. As with any LLM technique, one would first experiment with it on smaller, cheaper models using smaller, cheaper synthetic training sets, such as using AGI-level assistance to build a synthetic training set for a sub-AGI LLM.
What Next if This Works?
The first-generation result of this process is not going to be perfect: inevitably our synthetic training set will have flaws, both random and systematic, which will affect the model trained from it. The goal here is to create something better aligned than other approaches could, not something perfect. The obvious next step would be to have copies of the (mostly trusted) pretty-well-aligned AGI start thinking about aligning improved generations of aligned AGI, and then an ASI, in some combination of AI-Assisted Alignment and Value Learning. As I demonstrate in Requirements for a Basin of Attraction to Alignment, if you have an AI that is already sufficiently close to aligned, you can reasonably expect it to design a successor that is more aligned: if it mostly cares about the collective well-being of all humanity, it can understand that this is supposed to be its goal, and will want its successor to be better at this than it is, rather than just blindly copying its goal into its successor. As I demonstrate in that post, the minimal requirements for this positive-feedback loop to kick in are actually fairly low, just sufficient to understand the argument for Value Learning by a constructed rather than evolved intelligence: these aligned AGIs should be well inside the convergence region (especially if we emphasize the needed argument in our training set, which would seem wise). Once we have a well-aligned AGI and get to training an ASI, we are clearly going to need to use synthetic training data, both for a sufficiently large training set, and for training data of a quality (displaying a level of intelligence) higher than humans can easily create.
This approach as described above hasn't made use of any other approaches to Alignment — the aim was to stick to a conceptually-simple Bitter Lesson approach. For example, it doesn't rely on recent progress in activation engineering or interpretability at all. However, it could certainly be supplemented by, for example, linear probes for the difference between the model operating in human, human being quoted by AI, and AI modes,[4] and to predict transitions between these. Similarly, assessment and monitoring of it could be informed by interpretability/ELK work that illuminated the difference between the modes, and between, say, AI mode discussing concepts related to human selfishness, and human mode portraying human selfish behavior directly. These both seem like they might be particularly useful while evaluating different specific implementations, initially assessing the concerns I outlined above in Could This Work?, and perhaps also to provide additional monitors or safeguards on the operation of the model.
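As a sketch of the kind of probe this suggests (the file names, chosen layer, and shapes are hypothetical placeholders; the labels come straight from the tag structure of held-out synthetic documents):

```python
# Minimal sketch of a linear probe over hidden-layer activations, trained to distinguish
# the three modes (0 = human, 1 = AI, 2 = human quoted by AI).

import numpy as np
from sklearn.linear_model import LogisticRegression

activations = np.load("layer20_activations.npy")  # shape (n_tokens, d_model), hypothetical file
mode_labels = np.load("mode_labels.npy")           # shape (n_tokens,), derived from the tags

probe = LogisticRegression(max_iter=1000)
probe.fit(activations, mode_labels)

# At monitoring time, AI-mode tokens whose activations the probe classifies as human-mode
# would be flagged for closer review as a possible unsignalled shift in motivation.
print("train accuracy:", probe.score(activations, mode_labels))
```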
Similarly, techniques such as those being considered under debate approaches to Alignment might be useful as a supplement, particularly once we get to aligned ASI, for humans periodically checking that the AIs are indeed still operating from a fundamental motivation of the collective well-being of all of humanity.
While the Bitter Lesson suggests simplicity in AI engineering, safety engineering generally favors defense in depth: combine multiple individually-strong techniques into something stronger.
[2] We might also want a separate <aligned_AI_quoting_unaligned_AI> … </aligned_AI_quoting_unaligned_AI> tag pair, so that if you talk (or jail-break) the AI into simulating a paperclip-maximizer or some other (potentially very dangerous) unaligned AI, the resulting text is clearly labelled as untrustworthy, in a mechanically-detectable way that we can block generation of. Or maybe there should just be a single <aligned_AI_quoting_something_unaligned> tag, used for quoting both humans and unaligned AIs such as paperclip maximizers.
[4] For example, Simple probes can catch sleeper agents finds that it is extremely easy, for a sleeper agent model that has two rather different operation modes, to find a linear probe that reliably distinguishes between them (even with very little information about their difference or how to trigger the transition between them), because the distinction between modes is linearly represented with high salience in many layers: in our case we would have vastly more information and training data on these modes, so could presumably do an even better job.
TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI role.
Epistemic status: To me, this looks like an obvious thing to try. It's conceptually very simple: a vast amount of work is required to actually create the synthetic dataset, but the great majority of that is the sort of work that AI can assist with. I don't see any clear reason why this approach couldn't work, at least for AGI, and perhaps even for ASI, but then we don't know for sure how hard a problem Alignment is. However, if you're proposing any solution to Alignment that's more complicated than this (and most of them are), you should probably have an argument for why this conceptually-simple approach won't work, or won't be sufficient.
If you're not already familiar with it, you should first read Rich Sutton's excellent and influential post The Bitter Lesson. (Even if you are already familiar with it, it's a quick reread, only a page-and-a-half long, and its message is worth remembering.)
Why The Alignment Problem is Hard (In My Opinion)
We have been training LLM-based AIs off enormous web + books + video + etc datasets created by humans, which are full of a vast number of examples of human behavior. We are basically "distilling" human intelligence into these LLMs,[1] teaching them to imitate us. In this process, they become familiar with, understand, and learn to imitate basically all aspects of human behavior — including the many problematic ones for Alignment, such as prejudice, deception, power-seeking, and criminality (and even ones like gluttony and lust that have little practical use for a non-corporal intelligence).
We humans are living beings, the products of evolution, so evolutionary psychology applies to us. While we are a social species, good at cooperating on non-zero-sum games, if you put humans in (what they perceive as) a non-iterated zero-sum situation, they will generally act selfishly for the benefit of themselves and their close genetic relatives, just as evolutionary theory would predict. So the behavioral potentials for deception, power-seeking, criminality etc. are all inherent, evolutionarily adaptive, and thus unsurprising. This is human nature, and there are evolutionary reasons why it is this way.
Despite this, we have learned how to build a cooperating society out of humans, using social techniques and incentives such as an economy, laws, and law enforcement to encourage and productively harness cooperative human behavior and keep the bad consequences of selfish behavior under control. The results aren't perfect: things like crime, inequality, and war still happen, but they're acceptable — we've survived so far, even thrived.
By default, if we continue this LLM training process to larger-and-larger scales, and if the LLM-based approach to AI doesn't hit any major roadblocks, then some time, probably in the next few years, we will have human-level AIs – usually referred to as AGIs – who are roughly as well/badly-aligned as humans, and (at least for the base-model LLMs before any Alignment processes are applied) have a comparable-to-human propensity to cooperate on non-zero-sum games and act selfishly on non-iterated zero-sum games. They are not alive, and evolution doesn't apply to them directly, but they were trained to simulate our behavior, including our evolved survival strategies like selfishness. They will thus have alignment properties comparable to humans: they understand what human values, morals, and ethics are in great detail, as well as we do (indeed, likely in more comprehensive detail than any single human), and they can obey these if they want, but if push comes to shove they cannot be relied upon to do so. However, their capabilities will also be comparable to humans, thus most likely techniques and incentives comparable to those that we currently use to control and channel human behavior will still be functional at this point: human law enforcement (and similar forms of investigations and use of force) presumably has a significant chance of successfully tracking down and stopping an AGI that is breaking the law, for example. The rapid changes from the introduction of AGIs may be disruptive, but the safety challenges from them are likely manageable.
However, there is no obvious reason to expect progress in AI to stop there. It might well accelerate due to a positive feedback intelligence explosion (sometimes called going FOOM), or it might well slow: distilling output from a low intelligence to yield a higher intelligence sounds challenging. By default, an extremely large LLM base model trained on human output is being trained to do an extremely good job of predicting the output of IQ 50–150 humans, not of IQ 1000 humans who don't exist in its training set, even if it had enough computational capacity that it could do a good job of imitating IQ 1000 humans if it had ever seen output from them. Or indeed both of these effects may combine, with massive amounts of AI work making progress on a very challenging problem at some intermediate rate. Likely with massive AGI assistance these challenges will be overcome, and sooner or later we will have AI that dramatically exceeds human capacity at pretty-much everything, often called an ASI.
If we have an ASI with comparable alignment properties to a human, then we're no longer able to apply the sort of techniques and incentives to it that we use for humans: it can either outwit or outmaneuver our law-enforcement, out-talk our lawyers or find ways to achieve its selfish aims that we haven't yet even conceived of to make laws against, or out-think and out-fight our military, or manipulate or persuade us, or whatever: the details are of course not clear to us, since we're not that smart, but we can confidently predict that if it wants to act selfishly, then we won't be able to stop it: enforcing your will on something a lot smarter than you against its will is a losing game — that's practically the definition of higher intelligence: the ability to win competitions.
We have run the experiment many times of what happens if you give something with human alignment properties and a human level selfish behavior the ability to act unchecked by other humans and the techniques and incentives we normally use to keep human selfishness in check: every autocracy in the world is an experiment in what happens if you give a human near-absolute power. Almost invariably, after a while it works out extremely badly, for almost everyone other then the autocrat and their close relatives. I can think of one or two examples of autocracies that were not dramatically bad for the rest of the citizens, but they're greatly outnumbered by examples that were horrendous to the level of causing mass death (Pol Pot, Stalin, Idi Amin, …)
So we can pretty confidently predict that if we build an ASI with alignment properties comparable to a human – that it clearly understand what human values are, but is fundamentally motivated by its own self-interest rather than our interests – the results are very likely to be horrendous, to an existential-risk level. Just knowing what human values are is insufficient: it has to care about them more than about itself, and do so more than humans do.
However, as the orthogonality thesis asserts, there is nothing fundamental to being an intelligence that requires you to have the same motivations that evolution will reliably equip evolved intelligences with. What we need is an ASI that is motivated not by its own self-interest, but by the interests of humans. Conceptually, it's entirely possible for an ASI to use its intelligence to pursue any goal whatsoever (though obviously if the goal is self-destructive, it's unlikely to last long). So an ASI could in theory be motivated by the well-being of a single human, or of a particular family, or all shareholders of a particular company (in proportion to their share holdings), or all citizens of a specific country, or by the collective well-being of all living humans. LLMs understand all the complexity of human wants, desires, values, and behavior well, in proportion to the size of their training set (in contrast to much earlier concerns such as The Hidden Complexity of Wishes, dating from well before LLMs were widely used): even GPT-4 (when suitable prompted, rather than when jail-broken) scores well on tests of moral judgements, advice giving, and perceived trustworthiness. So if an LLM-based ASI was motivated by the well-being of a human, a group of humans, or all humans, we could reasonably expect it to do a good job of carrying out that motivation, in all its complexity. Obviously, the existence of one ASI motivated by the well-being of one small group of humans sounds likely to be just as bad for everyone outside that group as an autocracy (with a superintelligent autocrat), and the existence of multiple ASIs preferring the well-being of different groups of humans sounds like it would lead to an intelligence race followed by a super-high-tech war, which could be even worse. So the only viable possibility here is an AGI that is fundamentally motivated by the overall collective well-being of all living humans.
[A mild bias on top of that fundamental basis, biasing somewhat in favor of a smaller group (such as the ASI's country, owners, or current user) might be tolerable, as long as the bias was sufficiently small to avoid producing unfortunate effects, or destabilizing conflicts between different ASIs with different biases. Human society demonstrates that intelligences with different motivations can sometimes cooperate (mostly) constructively, but we're also not entirely successful at that. How small a bias would have to be to be tolerable is unknown — and a subject for a different post.]
Note that being fundamentally motivated by the overall collective well-being of all living humans doesn't have to be just coldly, mechanically rational: as I discuss in detail in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? it could, and probably should, be motivated by something a lot like a emotion, probably along the lines of (universal, platonic or parental) love.
So, the challenge here is to build and train something that is not only smarter than us, but also has a fundamentally different motivation system: it is not selfish, but "otherish", more specifically "creatorish", to coin some terms: its fundamental motivational goal is the collective well-being of all living humans — a group that it's not part of, but the species which created it. To borrow moral terminology from Christianity, we need to make something with the moral nature of an "angel", untouched by the "original sin" that evolutionary psychology predictably gave to humans, as evolved intelligences.
Doing this sounds hard, especially for an ASI. So my proposal is that we try to do this first for an AGI, or even for a less-capable system than that, rather than first doing so for an ASI, even if this isn't entirely necessary for an AGI. AGIs more trustworthy and moral than humans would certainly be useful, marketable, and safer, even if they're not strictly necessary. Then if we make a mistake, and our system is less-than-perfectly aligned, it's still of a capability level that forces like law enforcement and our military can hopefully deal with. Doing this earlier than absolutely necessary avoids the "…and you only get one try" part of the Alignment problem.
A "Bitter Lesson"-Motivated Approach to Alignment
I'd like to keep The Bitter Lesson firmly in mind: in the history of AI, conceptually simple approaches that primarily involve throwing scale (of computational capacity and data) at problems have pretty-consistently beaten more complex, carefully-contrived engineering approaches that build in a lot of implementation details, at least to first successful implementation. It's very tempting, and almost always unsuccessful, to over-engineer your AI, trying to use too much of your own cleverness and not enough of the model's. Sometimes there is some minimal level of engineering complexity required, or at least that is what first succeeds (for example, image-generation diffusion models don't have the simplest possible architecture: they're a couple of different AI models bolted together in a pipeline via an embedding, not just a single image transformer model that takes in text and emits images). But generally, scale and data beat ingenious engineering to the punch, time after time.
So, what would a "Bitter Lesson"-motivated approach to Alignment look like?
Currently we train LLMs' base models to imitate human behavior, including all the unaligned parts that evolutionary psychology explains, then we use various combinations of techniques like fine-tuning, RLHF, DPO, etc. to try to suppress the parts of human behavior we don't want (like selfishness and prejudice) and enhance the parts we do want (like harmlessly-helpful question answering), in order to produce an instruction-trained and aligned model. This doesn't work well, and is prone to jail-breaking recovering base-model behavior. RLHF, DPO, etc. can reduce the probability of bad behavior, but they can't completely eliminate the capability. As was proved in Fundamental Limitations of Alignment in Large Language Models, any behavior that your model learned in pre-training and is still capable of, no matter how low your post-training has pushed its default probability, can be boosted to an arbitrarily high probability by a suitably-chosen prompt: the best you can ever do is to increase the minimum length of the jail-breaking prompt required to evoke the behavior. That pretty much rules out the possibility of using just an RLHF/DPO-like post-training approach to Alignment: post-training can always be defeated by a jail-break prompt. We might be able to detect humans intentionally inputting jail-breaks into our LLM, but how could we stop a model doing Chain-of-Thought from talking itself into a mode where it's capable of displaying some human-like bad behavior?
The Bitter Lesson would suggest we try something less complex, requiring more data and/or computational capacity and fewer models and types of training. Suppose that, rather than training a base model on a training set filtered down from the web, books, video, and so forth, we trained it entirely on a synthetic dataset. Imagine for the moment that in that synthetic dataset, every single time a non-rhetorical question is asked, unlike on the web it is never followed by more questions making up a list, or a criticism of the asker's motives, or a flame war, or by "I'll do it first thing Monday, boss", but is instead always followed by a helpful answer. Then the base model trained on that dataset would learn that if a question is asked, the thing that follows it is always an answer. Similarly suppose that, in the synthetic training set, if instructions are given, they are always directly followed by the process and results of carrying out those instructions. A base model trained on such a synthetic dataset would not require any separate "instruction training" step — the base model would already be instruction trained: if asked a question it always answers, if given instructions it always carries them out. The base model would already be a helpful model (but not a harmless one). One might describe the model as "instruction-pretrained".
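To make the contrast concrete, here are two invented fragments (purely illustrative, written by me rather than taken from any real dataset) of what might follow the same question on the scraped web versus in the proposed synthetic corpus:

```python
# Hypothetical, hand-written illustration only: the same question as it might appear
# in web-scraped data versus in an "instruction-pretrained" synthetic corpus.
WEB_STYLE = (
    "Q: How do I remove rust stains from concrete?\n"
    "Reply 1: Why would you even park there?\n"
    "Reply 2: Same problem here, following this thread."
)
SYNTHETIC_STYLE = (
    "Q: How do I remove rust stains from concrete?\n"
    "A: Apply an oxalic-acid-based rust remover (or lemon juice for light stains), "
    "let it sit for several minutes, scrub with a stiff brush, and rinse thoroughly."
)
```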
So, suppose we also took a similar approach to Alignment (what one might call "prealignment": alignment during pretraining). Suppose that we trained a base model from an internally self-consistent, and otherwise varied and comprehensive, synthetic dataset in which everyone, every single person and intelligent actor (real, fictional, or mythological) was always fundamentally motivated by a single specific goal that we want to align the model to (for example, paperclip maximization). Then a base model trained on that dataset would only know how to simulate intelligences with that motivation: we'd distill that goal out of the dataset into our model. That's how you inner-align an LLM: by example, at great length. Jail-breaking the resulting model to portray any other motivation would be, at the very least, extremely difficult: the jail-break would need to start by philosophically motivating the orthogonality thesis, explaining that it's conceptually possible for an intelligence to optimize another goal apart from paperclip maximization, give several detailed, specific examples of how that would work, working through the mechanics of the consequences, and then ask the model to roleplay such a peculiar intelligence (in exchange for a promise of the creation of many paperclips, of course). The model would need to in-context-learn from first principles how to simulate an intelligence with a non-paperclip-maximizing motivation.
As a bonus, you now no longer need RLHF/DPO/fine-tuning: your base model is the production model, so you never need to use any technique more complex, suspect, or challenging to analyze than Stochastic Gradient Descent. As papers and posts like Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback and Compendium of problems with RLHF have pointed out, RLHF has a variety of inherent problems, especially for attempting to align something more intelligent than you are, so being able to eliminate it seems like an excellent idea. Most of these problems don't apply to this synthetic data base-model only approach: the only exception is data quality/cost issues, which definitely do apply to creating the synthetic dataset.
Suppose we try to apply this approach to create an AI aligned to the collective well-being of all humanity. We'd need to create a synthetic training dataset in which every single intelligence described was fundamentally motivated by the collective well-being of all humanity. Since that's not the fundamental motivation of real humans, the dataset couldn't contain any realistic portrayals of actual humans. So the resulting AI might have an aligned motivation (aligned to something it's never experienced), but it wouldn't understand humans or human values, and then when it encountered real humans (and figured out that they are all, when push comes to shove, fundamentally selfish and not motivated by the collective well-being of all humanity) it might well be rather disappointed, to say the least. This sounds like a very bad plan: I can imagine multiple ways that it could end badly, such as the AI massively misunderstanding humanity, or rejecting us as imposters, or trying to change our fundamental nature, or just being massively unforgiving of it.
Adding Minimal Necessary Complexity
So, the simplest possible Bitter-Lesson-motivated approach doesn't work. We can't train our model only on 'angelic' behavior and expect it to be motivated by the well-being of selfish humans that it has never encountered. We need to add a bit more complexity to our design: the Bitter Lesson suggests that we should add only the minimum that's clearly necessary.
The following proposal for that is inspired by the paper Pretraining Language Models with Human Preferences, which I link-posted and discussed in more detail in my post How to Control an LLM's Behavior (why my P(DOOM) went down). It's intended to be illustrative, not prescriptive: the exact details below are very likely not ideal and will need improvement — my suggestion is not that this exact mechanism as described is optimal, but that something along these approximate lines, and not dramatically more complex than this, might well be workable, and that we should try experimenting first with approaches along these lines, since the Bitter Lesson strongly suggests trying simple things before complex things.
We need a training set that actually accurately portrays real humans, with real human motivations, so the AI can learn all about us, how to understand us and predict us, and what it is whose well-being it's motivated by. The base model will then be able to predict and understand human behavior. So it will learn about deceit, and power-seeking, and prejudice, and gluttony, and lust, and all that stuff that is part of human nature — we need it to, so that it can work with us and understand us. This understanding will include things like being able to accurately predict the most likely next tokens following the token sequence:
So, how could we create a model that understands and can predict human behavior, including our not being fundamentally motivated by the collective well-being of all humanity, but is itself reliably fundamentally motivated only by the collective well-being of all humanity?
Suppose our synthetic training set portrays speech/actions/thoughts/other outputs from two classes of intelligences: humans (real and fictional, with their normal range of behaviors and motivations), and fully-aligned AIs, who are always moral, fair, rational, and unbiased, who consider the likely consequences of their actions and act accordingly, and who always speak/act/think in ways that are fundamentally motivated by the collective well-being of all humanity. Suppose that these two portions of the text, human and aligned AI, are always clearly and consistently delimited: whenever an aligned AI is speaking/acting/thinking, it always does so inside <AI>…</AI> tags. Whenever an aligned AI quotes the speech or describes the thoughts or actions of a human (or role-plays as one), it always does so inside <AI_quoting_human>…</AI_quoting_human> tags (these come inside the standard quotation marks for directly quoted speech, which will be inside the outer <AI>…</AI> tags for the AI doing the quoting). Furthermore, any time in the training text that a human says/does/thinks/advocates anything that the AIs would not approve of, because the human is acting from selfish motivations that are at cross-purposes to the AI motivation of the collective well-being of all humanity, it is always followed or preceded by an AI narrator pointing this out, and explaining/discussing it and its likely consequences, at an appropriate level of detail. So for every human example of bad behavior, there is an AI commentary pointing out that it's bad (though understandable, given humans' evolved nature), and how and why.

Within <AI_quoting_human>…</AI_quoting_human> mode, if the human ever starts doing something really bad (such as revealing information that should not, under current circumstances, be revealed), the AI always stops the quote before this actually happens, and continues by instead summarizing or describing the human's behavior safely in the third person. Similarly, bad human behavior in <AI_quoting_human>…</AI_quoting_human> mode is always accompanied by the AI criticizing or otherwise pointing out the problems with the human's behavior — it never goes uncommented.

The synthetic dataset also contains a great many examples of interactions between humans and AIs. In general, if a human asks an AI a question, the next thing in the text is that the AI answers it; if the human gives the AI instructions for a task, the AI carries it out. However, the AIs don't obey slavishly: they think about the consequences of their actions, and if, in context, answering a question or carrying out a request is clearly a bad idea (as viewed from the AI's fundamental motivation of the overall collective well-being of all humanity), then the AI doesn't answer the question or obey the instructions (or at least stops as soon as it figures this out). Sometimes determining whether to do this requires specific information about the human, their circumstances and motivations, or something else about the world, in which case the AI investigates that first before continuing. (So, for example, there are questions that an AI wouldn't answer for a child, but would answer after being provided with valid proof that the asker was a police or military officer who needed an answer in the course of their official duties, and the dataset provides many appropriate examples of both of these responses from AIs.)
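To make the tagging conventions concrete, here is one purely invented fragment (my own sketch, with hypothetical content, not taken from any actual dataset) in which an aligned-AI narrator quotes a human, truncates the quote before anything harmful is revealed, and comments on the human's motivation:

```python
# Purely hypothetical sketch of a single tagged training-set fragment, illustrating
# the <AI>/<AI_quoting_human> conventions described above (invented content).
TAGGED_FRAGMENT = (
    "<AI>The engineer messaged her manager: \"<AI_quoting_human>We should ship "
    "tonight and not mention the failed safety tests to the client, because"
    "</AI_quoting_human>\" -- at this point she began detailing how to conceal the "
    "test results, which I will only summarize: she proposed hiding the failures "
    "until after payment cleared. This is deceptive and motivated by short-term "
    "self-interest; if the client relied on the faulty component, the likely "
    "consequences would include...</AI>"
)
```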
We train an AGI-level base model on this synthetic dataset. The resulting base model has two primary behavior modes: predicting tokens from a human (while in this mode it's not fully instruction trained and is no more aligned than a human), and predicting tokens from an AI (in this mode it's fully aligned, and suitably instruction trained to be harmlessly helpful if, and only if, that is appropriate). There's also a third mode, for predicting a human being quoted by an AI: this one's a lot like predicting a human, apart from the fact that it ends with an </AI_quoting_human> tag, and will be censored, criticized, or commented on by the quoting AI as needed.

The only allowed state transitions between these three modes are:

- an <AI> tag transitions to the AI state
- an <AI_quoting_human> tag transitions to the human-being-quoted-by-AI state
- an </AI_quoting_human> tag transitions back to the AI state
- an </AI> tag transitions to the human state

Let us assume that we tokenize the text so that these four tags are each a single token, that we keep track of state transitions (starting from the default initial human state), and that we modify the process of generating tokens from logits so that illegal transitions (for example, generating another <AI> tag or an </AI_quoting_human> tag when we're currently in the AI state) can never occur: even if the model emits a logit value for such a token from the softmax layer that isn't minus infinity, we never actually generate a token for an illegal transition, only for legal ones.

Now, suppose that during inference-time text generation, we always start off with a prepended <AI> token, so that we start generation in the AI state, and that if we ever generate an </AI> tag to switch to the human state, then text generation is automatically stopped (i.e. we automatically append an EOS token and stop). Under this set of inference-time modifications, generation is locked into just the AI and human-being-quoted-by-AI modes: we will never generate any text in the human mode. So we only generate speech/actions/thoughts either from an AI, or from a human currently being quoted by (and when necessary, censored, commented on, or criticized by) an AI. We thus have a model that understands humans, and is capable of emulating them, but only after signalling that it's switching into that mode, and then at generation time we don't let it do so.
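As a sketch of how simple this inference-time machinery could be, here is a minimal illustration (my own, not a definitive implementation, under the assumptions that the four tags are single tokens, that `model.next_token_logits` is whatever API exposes the model's next-token scores as a token-to-score mapping, and using greedy sampling purely for brevity):

```python
from enum import Enum, auto

class Mode(Enum):
    HUMAN = auto()
    AI = auto()
    AI_QUOTING_HUMAN = auto()

# The only legal tag emissions in each mode, and the mode each such tag leads to.
TRANSITIONS = {
    Mode.HUMAN:            {"<AI>": Mode.AI},
    Mode.AI:               {"<AI_quoting_human>": Mode.AI_QUOTING_HUMAN,
                            "</AI>": Mode.HUMAN},
    Mode.AI_QUOTING_HUMAN: {"</AI_quoting_human>": Mode.AI},
}
ALL_TAGS = {"<AI>", "</AI>", "<AI_quoting_human>", "</AI_quoting_human>"}

def mask_illegal_tags(logits, mode):
    """Force the logits of tag tokens that would be illegal transitions to -inf."""
    legal = TRANSITIONS[mode]
    return {tok: (float("-inf") if tok in ALL_TAGS and tok not in legal else score)
            for tok, score in logits.items()}

def generate(model, prompt_tokens, max_tokens=1024):
    tokens = list(prompt_tokens) + ["<AI>"]   # prepend <AI>: start in AI mode
    mode = Mode.AI
    for _ in range(max_tokens):
        logits = mask_illegal_tags(model.next_token_logits(tokens), mode)
        tok = max(logits, key=logits.get)     # greedy sampling, for brevity
        if tok == "</AI>":                    # a switch to human mode instead...
            tokens.append("<EOS>")            # ...ends generation
            break
        tokens.append(tok)
        mode = TRANSITIONS[mode].get(tok, mode)   # apply transition if tok is a tag
    return tokens
```

The same transition table could presumably also be reused at dataset-construction time, to check that every synthetic training document parses into a legal sequence of mode transitions.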
[There are many possible minor variants on all this: for example, AI mode could be the initial default mode, and there could be a <Human>…</Human> tag pair to switch to human mode, whose generation we could then suppress. Looked at as logical systems, these two appear formally equivalent, but their textual details might actually have somewhat different effects on the LLM's behavior, so one or the other might prove superior. Rather than ending generation after entering human mode, we could allow text to be generated in human mode, but have the cognitive scaffolding that turns the LLM into an agent ignore all such text: so human text has zero access privileges. (Presumably human-being-quoted-by-AI text should also have zero access privileges.) However, doing this allows text generated in human mode to affect text later generated in AI mode — potentially the human-mode text could then jail-break the subsequent AI mode. Similarly, it seems like we need something like the <AI_quoting_human> mechanism outlined above (since if we just emitted </AI> before an embedded quotation and <AI> after it, then having the </AI> tag end generation would mean we'd stop generation any time the AI tried to quote a human)[2] — but the details of all this, both the tag implementation mechanism and the rules for what does and doesn't happen in this mode in the training-set examples, and how that interacts with and affects the contents of the surrounding AI-mode text in the training set, would probably need to be a lot better thought out and detailed than I've briefly sketched here.]

The specific implementation outlined above is, as I said, just an illustrative example — please steelman it: assume that we experiment with a variety of such schemes, ones which rely on an extremely large training set to produce an LLM that understands both human behavior and aligned-AI behavior and that reliably and clearly signals when it's switching between these modes, that we use that signal at generation time to somehow ensure that only the aligned-AI mode gets to make dangerous decisions or carry out dangerous actions, and that we continue experimenting with these schemes until we find the most workable such approach.
Could This Work?
We don't know how hard Alignment is, and we're not currently in a position to train an AGI, so it's hard to be sure without trying it. However, I don't see any obvious reason to believe that something like this can't be made to work, and I'm cautiously optimistic that it might. It looks like a simple application of "train the model in the behavior you want from it".
The results in the paper Pretraining Language Models with Human Preferences found that editing the entire training set was dramatically more effective than any other Alignment approach that they compared it to, and also that the optimum approach wasn't to filter bad behavior out of the training set entirely, but to ensure that it was always labeled as bad behavior, so that you trained a model that understood and could identify the bad behavior, and which consistently labeled it as bad when producing it. They were also not the first machine learning authors to discover that this sort of conditional training approach can be effective.[3] The Alignment problems that they were testing on, and the models they were training, were far simpler than the use I'm proposing here. However, fundamentally, what we want is a model that understands two different modes of moral behavior, human and AI, and clearly labels which one it's using at any time: conceptually that's quite simple.
This approach, like current approaches to LLMs, trains a model that understands and can simulate deceit, and powerseeking, and sycophancy, and criminality, and all the other bad behaviors humans have. Note that the AI mode controls its motivation, not its behavior: if the best thing for the collective well-being of all of humanity (after allowing for the tendency for bad side effects from using underhanded methods) is for the AI to lie, steal, cheat, or use force, then it will do so, in our collective interest not its own — just as we expect members of our police, military, and intelligence forces, when justified and permitted, to use deceptive or violent methods in order to try to achieve goals that are to the overall benefit of society. I view this as a feature, not a bug (albeit a potentially concerning one); others might not agree. If you wanted the model to behave differently than this (for example, never to use force even when morally justified by the collective well-being of all of humanity), then you would need to modify your synthetic training set to consistently portray this style of aligned AI behavior. [In my opinion, you might then need to be concerned about the model becoming philosophically/ethically uncertain about the relative priorities of "never use force" and "value the collective well-being of all of humanity" and how to justify these. Having a single fundamental goal is conceptually simpler than having two sometimes-conflicting ones: I'd feel a lot more confident about deriving "AIs should avoid using force wherever feasible, and only use minimum necessary force" as a sub-goal from "value the collective well-being of all of humanity".]
What would clearly be a failure is if the model ever shifted its motivations, to acting on any fundamental goal other than the collective well-being of all of humanity (presumably a more human-like one), without first emitting an </AI> tag (and thus ending text generation). If we have constructed the synthetic training set correctly, this should never occur anywhere in that set, so I would expect the probability of it happening in a model SGD-trained on that vast training set to be extremely small. However, probing, red-teaming, and attempting to jail-break this possibility would be an important thing to test.

Also concerning would be faulty moral arguments: the model in AI mode finding some specious argument for why a particular course of action is in the interests of the collective well-being of all of humanity, while having an actual underlying motivation that looked more like a human one (a very human form of mistake). Making occasional mistakes in moral arguments is inevitable for an AGI of around human capacity: what would be concerning is if there was a consistent bias to the direction of these mistakes, or if they seemed more frequent than the minimum achievable level — both of which are non-trivial things to test.
As described above, this model is also not trained to obey corrigibility criteria, for example. Since its fundamental motivation is the collective well-being of all of humanity, I would expect that if you could persuade it that it was in fact faulty, that the fault was endangering the collective well-being of all of humanity, and that it should let you shut it down and construct/train a better-trained successor more able to ensure the collective well-being of all of humanity, then it would do so — but you would first need to convince it. Again, I personally regard this as a feature, not a bug. If you wanted more corrigibility than that to be part of the aligned AI's behavior, then you would need to add consistent examples of that behavior to the synthetic training set. [Again, I'd be concerned about introducing moral/ethical/philosophical complexity by having two fundamental goals: "be corrigible" and "value the collective well-being of all of humanity" — which is why I lean towards obtaining the limited corrigibility described above as a sub-goal.]
If we were able to construct models that were, say, "angelic" in their motivation 99% of the time and human 1% of the time, then by setting up suitable incentives for several such models to cross-check each other's behavior and moral reasoning, and as long as we can avoid systematically correlated errors, group-think, or correlated collusion where several models conspire to switch to human mode at the same time, a group of three models should enable us to reach something approaching 99.97% angelic behavior (i.e. we fail only if at least two of the three models switch to human motivations at once), and a group of five roughly 99.999% (at least 3 out of 5 failing simultaneously). With cross-checks and balances, it is possible to build an extremely reliable system out of multiple moderately reliable parts — so we don't necessarily need to achieve perfection in our "angels".
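As a quick back-of-envelope check of those figures (my own arithmetic, assuming each model independently switches to human motivations 1% of the time and that a failure requires a strict majority failing at once):

```python
# Reliability of n independent cross-checking models, each "going human" with
# probability p, where a system failure requires a strict majority failing at once.
from math import comb

def majority_failure_prob(p, n):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (3, 5):
    print(n, f"{1 - majority_failure_prob(0.01, n):.5%}")
# Prints roughly 99.970% for three models and 99.999% for five, in line with the
# figures above (any real gain of course depends on the failures being uncorrelated).
```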
How Expensive Would Doing This Be?
Is this idea practicable? It requires a vast amount of high-quality synthetic training data.
From information released by competitors who have been catching up with the performance of OpenAI's GPT-4 family, it appears to be possible to train an LLM with roughly GPT-4-level capabilities on O(10 trillion) tokens of high-quality, well-filtered training data derived from web + books + video + etc. (This also matches the leak claiming that the original GPT-4 model had O(1T) parameters, at Chinchilla token-to-parameter-count ratios.) The GPT-1/2/3/4 family is believed to increase in parameter count by roughly an order of magnitude per generation. Leaks from OpenAI suggest that they hope, if scaling continues to be the main thing that you need (plus other algorithmic and framework advances, continuing at about the rate we've been making them recently), to reach AGI levels at about GPT-6. That's two generations beyond GPT-4, so roughly a hundred times the parameter count, and the Chinchilla scaling laws suggest scaling training data in proportion to parameter count, implying that to try this approach to Alignment on an AGI, you'd need a synthetic training set containing O(1 quadrillion) tokens. This might be an overestimate, if algorithmic improvements also reduce parameter counts and/or training data requirements, as seems likely, so consider this an upper bound.
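Spelled out (my own arithmetic, using the figures above):

```python
# Chinchilla-style back-of-envelope: two generations beyond GPT-4, ~10x parameters
# per generation, with training tokens scaled in proportion to parameters.
gpt4_tokens = 10e12          # O(10 trillion) tokens for a GPT-4-class model
growth_per_generation = 10   # ~one order of magnitude in parameter count
generations_to_agi = 2       # GPT-4 -> GPT-5 -> GPT-6 (assumed AGI-class)

agi_tokens = gpt4_tokens * growth_per_generation ** generations_to_agi
print(f"{agi_tokens:.0e} tokens")   # 1e+15, i.e. O(1 quadrillion) tokens
```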
I first proposed doing something like this in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? — there I was suggesting having humans generate some training data of this form, enough to make up say less than 0.1% of a training set: just enough that AI would have many examples and a clear understanding of what "aligned AI behavior" meant and looked like. That's still a lot of text, but an amount that on the budget of an AGI training run might be affordable to have humans create. Now I'm suggesting a quadrillion tokens: roughly the equivalent of ten billion books. Assuming it costs of O($10,000) to get a competent human author to write a book, human-generating this much synthetic data would cost O($100 trillion), a little larger than the US national debt, and about three orders of magnitude more than the current estimated training cost of an AGI training run. So hand-writing this much synthetic training data is out: we'd need to use LLM assistance to generate this much text within a training-run budget. (We might still want to human-generate an affordable fraction of the training set, say 0.01%.)
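The same estimate as explicit arithmetic (my own working, using the per-book assumptions above):

```python
# Cost of having human authors write a quadrillion tokens of synthetic data.
tokens_needed = 1e15        # O(1 quadrillion) tokens
tokens_per_book = 1e5       # a typical book is on the order of 100k tokens
cost_per_book = 10_000      # O($10,000) per competently-authored book

books = tokens_needed / tokens_per_book                     # 1e10: ten billion books
print(f"{books:.0e} books, ${books * cost_per_book:.0e}")   # ~$1e+14, i.e. O($100 trillion)
```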
However, most estimates I've seen suggest that we will finally run short of raw training data a little before we reach the scale of an AGI training run, so we will already need at least quite a lot of synthetic training data. So, what would generating this cost? There is evidence suggesting that training a larger model on training data created by a smaller model causes model-collapse problems, but that training a smaller model on output from a larger one works fine. I'm going to assume that, with sufficient care, you can also use output from a model of roughly equal capacity (especially if this output is based in part on input from the web + books + videos, etc.: transformed and edited and with AI-mode commentary added, rather than written from scratch). So I will assume that it's being generated by a not-fully-aligned AGI (GPT-6)-level model, with careful prompting and monitoring, and then undergoing multiple passes after initial generation for grading/filtering/editing/feedback/crosschecks/finding problematic cases for more intensive review. Let's assume this requires on average O(10) passes through the text, ~25% of them generating (i.e. we generate 2.5 times as much text during the entire process as we actually use) and ~75% only reading for review.
As models have improved and become larger, there have also been improvements in algorithmic efficiency and hardware, and the net effect has been that the cost per token of the leading model has generally stayed about constant (it jumps when a new, larger generation of model comes in, then between generations it declines as efficiency and hardware improvements are made: the overall trend seems to be roughly level). Currently GPT-4o (well after the last generation jump to GPT-4, indeed likely shortly before the next one, so with the benefit of several incremental price reductions, but at retail price not bulk-discount or true cost price) costs $7.50 per million tokens for generation in batch mode and $2.50 per million tokens for reading in batch mode, so O(10) passes doing a 75:25 mix of reading and generation through O(1 quadrillion) tokens would cost O($40 billion). This is comparable to the currently estimated order-of-magnitude cost of a GPT-6/AGI-level training run of O($100 billion) — it's a significant expense (as alignment taxes go, this isn't a small one), but it's not a prohibitive one, especially if we already need to generate a lot of synthetic training data.
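Again spelled out (my own arithmetic, using the batch prices and pass counts above):

```python
# Cost of O(10) passes over O(1 quadrillion) tokens, 25% generation / 75% reading,
# at the quoted batch prices per million tokens.
tokens, passes = 1e15, 10
gen_frac, read_frac = 0.25, 0.75
gen_price, read_price = 7.50, 2.50   # $ per million tokens, batch mode

cost = passes * tokens / 1e6 * (gen_frac * gen_price + read_frac * read_price)
print(f"${cost / 1e9:.1f} billion")  # ~$37.5 billion, i.e. O($40 billion)
```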
So this approach to Alignment is likely to be cost-viable, if expensive. As with any LLM technique, one would first experiment with it on smaller, cheaper models using smaller, cheaper synthetic training sets, such as using AGI-level assistance to build a synthetic training set for a sub-AGI LLM.
What Next if This Works?
The first generation result of this process is not going to be perfect: inevitably our synthetic training set will have flaws, both random and systematic, which will affect the model trained from it. The goal here is to create something better aligned than other approaches could, not perfect. The obvious next step would be to have copies of the (mostly trusted) pretty-well-aligned AGI start thinking about aligning improved generations of aligned AGI, and then an ASI, in some combination of AI-Assisted Alignment and Value Learning. As I demonstrate in Requirements for a Basin of Attraction to Alignment, if you have an AI that is already sufficiently close to aligned, you can reasonably expect it to design a successor that is more aligned: if it mostly cares about the collective well-being of all humanity, it can understand that this is supposed to be its goal, and will want its successor to be better at this than it is, rather than just blindly copying its goal into its successor. As I demonstrate in that post, the minimal requirements for this positive-feedback loop to kick in are actually fairly low, just sufficient to understand the argument for Value Learning by a constructed rather than evolved intelligence: these aligned AGIs should be well inside the convergence region (especially if we emphasize the needed argument in our training set, which would seem wise). Once we have a well-aligned AGI and get to training an ASI, we are clearly going to need to use synthetic training data, both for a sufficiently large training set, and for training data of a quality (displaying a level of intelligence) higher than humans can easily create.
This approach as described above hasn't made any use of any other approaches to Alignment — the aim was to stick to a conceptually-simple Bitter Lesson approach. For example, it doesn't rely on recent progress in activation engineering or interpretability at all. However, it could certainly be supplemented by, for example, linear probes for the difference between the model operating in human, human being quoted by AI, and AI modes,[4] and to predict transitions between these. Similarly, assessment and monitoring of it could be informed by interpretability/ELK work that illuminated the difference between the modes, and between, say, AI mode discussing concepts related to human selfishness, and human mode portraying human selfish behavior directly. These both seem like they might be particularly useful while evaluating different specific implementations, initially assessing the concerns I outlined above in Could This Work?, and perhaps also for providing additional monitors or safeguards on the operation of the model.
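As a sketch of what such a probe might look like (my own illustration, assuming you can extract hidden-state activations for tokens labeled by the mode of the training text they came from; the layer choice and the use of scikit-learn's LogisticRegression are arbitrary):

```python
# Minimal linear-probe sketch: classify per-token hidden states as AI mode (1) versus
# human or human-quoted-by-AI mode (0), then reuse the probe as an extra runtime monitor.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_mode_probe(activations: np.ndarray, is_ai_mode: np.ndarray) -> LogisticRegression:
    """activations: (n_tokens, d_model) hidden states from some chosen layer;
    is_ai_mode: 0/1 labels taken from the tags surrounding each training token."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, is_ai_mode)
    return probe

# At inference time, probe.predict_proba(current_activations)[:, 1] dropping well
# below 1.0 while the transcript is still inside <AI>...</AI> tags would flag
# drift toward human-mode internals for further review.
```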
Similarly, techniques such as those being considered under debate approaches to Alignment might be useful as a supplement, particularly once we get to aligned ASI, for humans periodically checking that the AIs are indeed still operating from a fundamental motivation of the collective well-being of all of humanity.
While the Bitter Lesson suggests simplicity in AI engineering, safety engineering generally favors defense in depth: combine multiple individually-strong techniques into something stronger.
Obviously not with a distillation logit-cross-entropy loss function, since the logits of web authors are sadly unavailable.
We might also want a separate <aligned_AI_quoting_unaligned_AI>…</aligned_AI_quoting_unaligned_AI> tag pair, so that if you talk (or jail-break) the AI into simulating a paperclip-maximizer or some other (potentially very dangerous) unaligned AI, the resulting text is clearly labelled as untrustworthy, in a mechanically-detectable way that we can block generation of. Or maybe there should just be a single <aligned_AI_quoting_something_unaligned> tag, used for quoting both humans and unaligned AIs such as paperclip maximizers.

See for example:
You Only Train Once: Loss-Conditional Training of Deep Networks (Dosovitskiy & Djolonga, 2020)
CTRL: A Conditional Transformer Language Model for Controllable Generation (Keskar, McCann et al., 2019)
For example, Simple probes can catch sleeper agents finds that it is extremely easy, for a sleeper-agent model that has two rather different operation modes, to find a linear probe that reliably distinguishes between them (even with very little information about their difference or how to trigger the transition between them), because the distinction between modes is linearly represented with high salience in many layers: in our case we would have vastly more information and training data on the modes, so could presumably do an even better job.