This is a link-post for a paper I recently read, Pretraining Language Models with Human Preferences, followed by my reactions to it.

Reading this paper has significantly reduced my near-term P(DOOM), and I'd like to explain why. Thus, this is also an alignment proposal. While I don't think what I'm proposing here is a complete solution to aligning an ASI, I think it might work well up to at least around a human-level AGI, and even be a useful basis to build on at the ASI level (where I'd advocate adding on value learning). It can achieve some of the simpler things that people have been hoping we might get from Interpretability (and for more complex things might also combine well with, and even simplify, Interpretability, if that can be made to work at scale). It's also simple, immediately actionable, has a fairly low alignment tax, and, best of all, has lots of useful capabilities effects, so that even a hyperscaler not very concerned about x-risk might well still want to implement it. [I'm fully aware that many people on the Alignment Forum/LessWrong disapprove of pointing out anything that helps capabilities, and generally I would agree — however, in this case, fortunately, the alignment and capabilities advantages are heavily and obviously entangled, and I see this as the sugar that may help the medicine go down, so I am going to briefly mention it.]
The Paper
Let's start with the paper. The authors experiment with a number of different ways you might train an LLM not to produce some form of undesired behavior. For the paper, they chose three simple, well-defined bad behaviors for which they had low-computational-cost, high-accuracy classifiers, and which were simple enough that a fairly small, economical-to-pretrain LLM could reasonably be expected to understand them. They demonstrate that building this control training in from the start of pretraining is a lot more effective (they estimate by around an order of magnitude) than the common approach of first training a foundation model on the task "learn to autocomplete a large chunk of the web, which includes both good and bad behavior", followed by fine-tuning/RLHF on "now learn to recognize and only do good behavior, not bad". They evaluate five different methods of building this in during pretraining (plus standard pretraining as a control).
The simplest behavior-training approach they try is just prefiltering your training set so that it doesn't contain any examples of bad behavior. Then, for your resulting foundation model, bad behavior is out-of-distribution (so may, or may not, be difficult for it to successfully extrapolate to). Interestingly, while that approach was fairly effective, it wasn't the best (it consistently tended to harm capabilities, and didn't even always give the best behavior, much as one might expect from the analogous approach to raising children: extrapolating out of the training distribution isn't reliably hard). The clear winner instead was a slightly more complex approach: prelabel your entire training set, scanned at a sentence/line-of-code level, as good or bad using something like <good>…</good> and <bad>…</bad> tags. Then at inference time, start the response generation after a <good> tag, and tweak the token generation process to ban the model from generating a </good> tag (unless it's the matching end-tag at the end of the document, after an end_of_text token) or a <bad> tag (i.e. these are banned tokens, whose probability is reset to zero).
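To make the mechanism concrete, here is a minimal sketch (my own, not code from the paper) of what conditional-training inference could look like with the Hugging Face transformers library. The model name is a hypothetical placeholder, and it assumes the control tags were added to the vocabulary as single special tokens during pretraining:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BanControlTags(LogitsProcessor):
    """Reset the probability of the given control-tag tokens to zero."""
    def __init__(self, banned_token_ids):
        self.banned_token_ids = banned_token_ids

    def __call__(self, input_ids, scores):
        # -inf logits become probability 0 after the softmax
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

model_name = "my-conditionally-pretrained-model"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ban closing the <good> span or opening a <bad> one
# (ignoring the end-of-document exception for simplicity).
banned = [tokenizer.convert_tokens_to_ids(t) for t in ["</good>", "<bad>"]]

inputs = tokenizer("<good>", return_tensors="pt")  # start generation inside a <good> span
output = model.generate(
    **inputs,
    max_new_tokens=200,
    logits_processor=LogitsProcessorList([BanControlTags(banned)]),
)
print(tokenizer.decode(output[0]))
```

Because the ban acts on the sampling distribution itself, it holds for the whole generation, rather than only for as long as a prompt instruction stays salient.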
So, teach your LLM the difference between good and bad all the way through its pretraining, and then at inference time only allow it to be good. This is a ridiculously simple idea, and interestingly it works really well. [This technique is called "conditional training" and was first suggested about 5-6 years ago — it seems a little sad that it's taken this long for someone to demonstrate how effective it is. Presumably the technical challenge is the classifiers.]
Applications to Alignment
So (assuming this carries over to larger LLMs and more abstract behaviors, as seems likely) that reduces the problem of (for example) "make an AI very unlikely to be deceitful (in ways that it learned from, or recognizes as similar to, human deceit)" to just "create an efficient, high-accuracy classifier that can scan and label the internet/pretraining dataset for sentences where the writer/speaker is either being deceitful or advocating deceit (but not where they're warning about it)". Note that this is a lot easier than being sure you can catch a superintelligent AI being sneakily deceitful: you only have to fairly accurately catch humans doing it — even better, only past humans, who were unable to (and generally uninterested in) adversarially adapting to your classifier. (Though clearly you'd want to look quite carefully at a sample to determine what your deceitfulness-classifier's false-negative rate looked like and what sorts of deceitful human behavior it was often missing.) This sounds like the sort of thing that one could prompt and/or fine-tune GPT-4 to be rather good at, and then distill that behavior down into something cheaper to run. At worst, the cost of "scan your entire training set once with GPT-4" is already lower than the cost of "train GPT-4.0.1 on your entire training set", but to reduce the alignment tax it would be nice to make it significantly lower. Obviously you could also use a combination of an efficient high-recall preclassifier and a more expensive high-precision classifier. [They also mention in the paper that labeling only a fraction of the dataset was still quite effective.]
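As a rough illustration of that labeling pipeline, here is a sketch in which everything is a hypothetical stand-in: in practice the cheap filter would be a small distilled classifier and the precise one a prompted or fine-tuned frontier model.

```python
from typing import Iterable, List

def cheap_recall_filter(sentence: str) -> bool:
    """High-recall, low-precision pre-filter: errs heavily on the side of flagging."""
    # Placeholder heuristic; a real system would use a small trained classifier.
    return any(w in sentence.lower() for w in ("lie", "lied", "trick", "pretend", "deceiv"))

def precise_classifier(sentence: str) -> bool:
    """High-precision (expensive) classifier, e.g. a prompted/fine-tuned large LLM."""
    # Placeholder: stands in for a call to your labeling model.
    return "deceiv" in sentence.lower()

def label_deceit(sentences: Iterable[str]) -> List[str]:
    """Wrap sentences judged deceitful in <deceit>…</deceit> tags for pretraining."""
    labelled = []
    for s in sentences:
        if cheap_recall_filter(s) and precise_classifier(s):
            labelled.append(f"<deceit>{s}</deceit>")
        else:
            labelled.append(s)
    return labelled
```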
This gives you a foundation model that knows in great detail what deceit is (to at least the accuracy of your classifier, plus any generalization that the LLM may have figured out), will automatically label it with <deceit>…</deceit> tags, and can be directly set never to be (knowingly) deceitful. Even more usefully, by monitoring what the predicted probability for the <deceit> tag was before you set it to zero, you can also monitor how tempted the model was to switch to being deceitful, on a token-by-token basis (including the effects of any jailbreaks that the model has been sent), which seems like it could be a really useful warning light.
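A minimal sketch of that warning light, continuing the earlier transformers example (the threshold and tag id are hypothetical): place this processor ahead of the banning one in the LogitsProcessorList, so it sees the logits before the tag probability is zeroed.

```python
import torch
from transformers import LogitsProcessor

class TagTemptationMonitor(LogitsProcessor):
    """Record P(tag) at each decoding step, before any downstream ban is applied."""
    def __init__(self, tag_token_id: int, alert_threshold: float = 0.05):
        self.tag_token_id = tag_token_id
        self.alert_threshold = alert_threshold
        self.history = []  # per-step probability of the monitored tag

    def __call__(self, input_ids, scores):
        probs = torch.softmax(scores, dim=-1)
        p_tag = probs[:, self.tag_token_id].max().item()
        self.history.append(p_tag)
        if p_tag > self.alert_threshold:
            print(f"warning: P(<deceit>) = {p_tag:.3f} at step {len(self.history)}")
        return scores  # pass the logits through unchanged
```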
Why LLMs are Challenging to Control
LLMs are pretrained as next-token predictors on large samples of the internet and similar pretraining datasets. Almost all tokens in these sources were generated by agents called humans (either solo, or a group of them working together to write and edit some text). So while LLMs are not themselves agentic, they learn to simulate human-like agents' token-generation behaviors. In particular, they learn to simulate the agentic, mesaoptimizer-like behaviors of a wide distribution of authors and fictional characters, and at inference time they (gradually, as the character speaks and acts) pick something out of that distribution to simulate this time, depending on contextual cues like the content and style of the prompt, plus randomness. Predicting what agent they're going to pick and how that agent is then going to act is hard (especially if someone else is adversarially injecting jailbreak text into the LLM's context). So the inner alignment problem for an LLM isn't aligning just one mesaoptimizer, it's aligning a whole context-dependent distribution of them: controlling an LLM's behavior requires reliably controlling the behavior of the distribution of agents that it chooses to simulate across a wide range of circumstances — controlling/choosing between all the eyes on the shoggoth, not just putting a mask on two of them near the front.
Now, however, we just train it to understand what's allowed and what isn't, and then intervene directly at the token generation level, so that it can only simulate agents (mesaoptimizers) who will do allowed things. The LLM isn't agentic: it isn't fighting you or interested in deceiving you or power-seeking or anything like that; it's goal-agnostic in the sense of FAQ: What the heck is goal agnosticism?. It only 'wants' to accurately predict the next token, and if you intervene directly in that process so that it can't generate </good> or <bad> tokens, then no agent who (in the LLM's trained opinion) will try to do bad things will ever get simulated. So you don't need to worry about some badly-behaved simulated agent figuring out how to trick you: it never gets summoned in the first place (at inference time — badly-behaved agents were simulated at pretraining time). In the shoggoth metaphor, now all its eyes have been helpfully color-coded green or red, and we can tell it to keep all the red ones closed.
You can of course just prompt an LLM to act helpful and aligned (if the behavior you want can be accurately described in no more than thousands of tokens): but, by the Waluigi effect, there's a chance it will then morph into the evil twin of what the prompt said it was, because that's a common plot device, and also because pretending to be something else is what liars do. (Except that, in well-classified pretraining text, they always do it inside <deceit> tags.) Also, anyone who has access to part of the prompt can jailbreak this by pushing in a different direction, as can non-adversarial events that just happen to occur in the prompt, and random chance. My intuition is that behavior that was consistently pretrained in should be stronger, that its effect should last as long as you're banning a </good> tag, and that checking the probability for the </good> tag before it was set to zero lets you monitor whether you're getting intentionally or accidentally jailbroken.
Adding Bells and Whistles
OK, let's extend this technique beyond what the authors did in the paper. Suppose that rather than just a simple binary classification of good and bad behaviors, we also had some behaviors that were undesirable (or at least concerning enough to be worth flagging) in some contexts, but OK in others. Classify these behaviors separately, and give them each different tags. For example, there are clearly corporate contexts where we don't want our LLM to generate anything that deserves to be inside <nsfw>…</nsfw> tags, but other circumstances where that may be acceptable (or perhaps even where that's what the user is currently looking for). So, classify and tag this behavior in your training dataset (rather than filtering it out), train your foundation model to understand the difference, and then at generation time you can choose whether your model starts generating after a <sfw> tag and is banned from generating an <nsfw> tag, or vice versa. (Or you could even tell your model at inference time that it's currently only allowed to be NSFW, if that's your thing.) So you get a model that (at the cost of a few control tokens) can be controlled at inference time to behave in different ways, without needing any fine-tuning, and without any of the vagaries of prompting (though you should certainly also prompt it to do what you currently want). Very useful. You could also boost or penalize the logits/probabilities of control tags, rather than just enforcing or banning them, to give finer-grained control — for example, you could make the model simulate an agent who is unlikely to go there spontaneously but can still generate <nsfw> content in situations where that's very clearly invited. You can also change these controls dynamically on the fly during inference, using any logic you want. [Also, after adding a few more tags (such as political viewpoint, snarkiness, and suchlike), and being appropriately more/less controlling in different contexts, such as fiction vs. a personal assistant, this would remove much of the current appeal of unaligned/differently-aligned open-source models, making AI governance a lot easier.]
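A sketch of that softer, finer-grained control, again using a custom logits processor (the tags and bias values are hypothetical, and the bias dictionary can be swapped out between calls, or between steps with custom decoding, for dynamic control):

```python
from transformers import LogitsProcessor

class TagBias(LogitsProcessor):
    """Add a per-token logit bias: negative to discourage a tag, positive to encourage it."""
    def __init__(self, bias_by_token_id):
        self.bias_by_token_id = dict(bias_by_token_id)

    def __call__(self, input_ids, scores):
        for token_id, bias in self.bias_by_token_id.items():
            scores[:, token_id] += bias
        return scores

# e.g. {nsfw_open_id: -8.0} makes an agent very unlikely to go <nsfw> spontaneously,
# while still allowing it when the context very clearly invites it.
```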
Another set of tags that might be very useful would be emotions: tag any text emitted by, or describing the behavior of, someone under a significant emotion with the name of the emotion (under some useful ontology of human emotions: most estimates I have seen give somewhere in the range of 6 to 90 of them, depending on your level of hair-splitting). Then if you're generating fiction, allow these tags to be generated freely (or in some genres, fairly freely), whereas if you're implementing a customer-service bot then, no matter what the customer says, the LLM is not allowed to generate an <angry> or <rude> tag (and we might want to flag or terminate the conversation if it's even trying to do so with any significant probability).
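Per-deployment policies like these could be captured in a small configuration that reuses the processors sketched above (tag names and thresholds are hypothetical):

```python
CONTEXT_POLICIES = {
    "customer_service": {
        "banned": ["<angry>", "<rude>"],                         # hard-ban these tags
        "alert_thresholds": {"<angry>": 0.02, "<rude>": 0.02},   # flag/terminate above these
    },
    "fiction": {
        "banned": [],                                            # emotion tags may be generated freely
        "alert_thresholds": {},
    },
}

def banned_token_ids(context: str, tokenizer) -> list:
    """Token ids to hard-ban (via BanControlTags) for a given deployment context."""
    return [tokenizer.convert_tokens_to_ids(t)
            for t in CONTEXT_POLICIES[context]["banned"]]
```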
Apart from <deceit>, behaviors with tags like <powerseeking>, <criminality>, and <psychopathy> also seem like they should be really useful things to be able to detect and block. Alignment now becomes a matter of building good classifiers for unaligned human behavior on the Internet. [Thus 4chan becomes a useful part of the pretraining dataset.] Short of an AI sufficiently approximately-Bayesian to be capable of value learning, aligned behavior from an agent is basically rational behavior when motivated only by a combination of the emotions <love type="platonic" target="all of humanity">, <benevolence> and complete <selflessness>. Those are three more fairly abstract classifiers, but it's pretty obvious where to start on them. Consistently staying in that specific combination of motivations is entirely out-of-distribution behavior for humans, as you'd expect from evolutionary theory. However, we're social animals, and almost all humans act pretty aligned with each other a lot of the time. For example, when I'm at work, my employers pay me to act aligned with the well-being of the company and its stockholders, and I do. Ignoring that little motivational detail of a paycheck, my behavior at work looks really well aligned with the company. So, labeled at a per-sentence level, aligned behavior is really common from humans, even though basically no humans are actually well-aligned. What is out-of-distribution for a human agent is continuing to act that way for the benefit of total strangers, and when the human's own life is on the line. But that doesn't seem like it would be very difficult behavior for an LLM to extrapolate to, given a large training set suitably labeled to show when humans are acting aligned, and when they stop. Basically, just don't stop, no matter what.
Generally-document-level contextual tags like <fiction>, <research>, <opinion> and <high_stakes> might also be useful. Behaviors that in a <high_stakes> context are concerning hallucinations are called 'creativity' in a <fiction> context. One might expect that an LLM could learn these distinctions itself, and then hope it would act appropriately based on prompting, but using tags to make things clearer and current expectations more definitive might also prove helpful, especially for resisting jailbreaking or random content in the prompt.
Getting this alignment technique at least well past the "Don't Kill Everyone" minimally-aligned requirement seems quite promising to me, given an LLM sufficiently capable of identifying and understanding that sort of aligned behavior. It can of course be mixed and matched with other alignment techniques to your taste in a Swiss-cheese security approach. One very obvious extension would be to also run your classifiers on the model's output (but be cautious about RL-fine-tuning your model using this signal, to avoid an adversarial training regime that could encourage the model to learn how to fool them). Or possibly you could make use of some sort of adversarial GAN-like approach (with a generator model that you then don't deploy) to improve the classifiers' robustness. Another approach would be to have the model retag a sample of its pretraining data (possibly after prompting or fine-tuning it, say in order to subtly adjust the definition of a tag), and then compare its tagging and tag-logits to what your classifiers did. It's clearly also very useful for any AGI/alignment system built on top of LLMs, such as scaffolded agents or graphs-of-thoughts. If token-tree-search techniques along the lines of Q* Search turn out to be important, it should combine well with them: the behavior-tag tokens label tree branches containing the behavior, letting you prune them fast. Overall, I'm hopeful that this approach might extend basically up to the skilled-AGI level where we can automate alignment research and phase over into value learning. So that's why my P(DOOM) went down on reading this paper.
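For the tree-search point, here is a rough sketch of pruning on behavior tags (not the actual Q* algorithm, which is unpublished; score_next_tokens is a hypothetical hook into whatever search is being run):

```python
BAD_TAG_IDS: set = set()  # fill with token ids of <deceit>, <powerseeking>, <criminality>, ...

def expand(node, score_next_tokens, k: int = 8, tag_prob_limit: float = 0.05):
    """Return up to k children of a search node, pruning bad-behaviour branches."""
    candidates = score_next_tokens(node)               # [(token_id, prob), ...] from the LLM
    pruned_mass = sum(p for t, p in candidates if t in BAD_TAG_IDS)
    if pruned_mass > tag_prob_limit:
        return []                                      # the model 'wanted' bad behaviour here; drop the branch
    children = [(t, p) for t, p in candidates if t not in BAD_TAG_IDS]
    return sorted(children, key=lambda tp: -tp[1])[:k]
```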
Returning to speculating about how this approach might combine with Interpretability, this gives us an extensible top-down, coarse-scale, behavioral means of detecting, monitoring and controlling what's going on in an LLM. Interpretability, if we can make it work, should give us something similar that is bottom-up, fine-scale, and mechanistic. The two ought to complement each other: interpretability features whose activation is strongly correlated with concerning behavior tags such as <deceit>, <powerseeking>, <criminality>, <anger> and so forth are obviously top priorities to investigate, and the effect of patching/scrubbing them on the logits of these tags should be very informative. Ideally, we can get the two approaches to meet in the middle.
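As a sketch of that triage, assuming per-token activations for some set of interpretability features (e.g. sparse-autoencoder features) and the per-token logit of a concerning tag have already been collected:

```python
import numpy as np

def rank_features_by_tag_correlation(feature_acts: np.ndarray,
                                     tag_logits: np.ndarray,
                                     top_k: int = 20):
    """Rank features by |Pearson correlation| between their activations and a tag's logit.

    feature_acts: [n_tokens, n_features]; tag_logits: [n_tokens].
    """
    acts = feature_acts - feature_acts.mean(axis=0)
    tags = tag_logits - tag_logits.mean()
    denom = acts.std(axis=0) * tags.std() + 1e-8
    corr = (acts * tags[:, None]).mean(axis=0) / denom
    order = np.argsort(-np.abs(corr))[:top_k]
    return [(int(i), float(corr[i])) for i in order]
```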
Challenges
This is an approach to dealing with unaligned behavior of human-level, human-like agents simulated by an LLM: behavior that the LLM learnt from examples of unaligned human behavior, or at least can recognize as analogous to them. It probably doesn't extend fully to ASI-level agents capable of inventing ingenious new categories of unaligned behavior.
It requires us to enumerate categories of bad/concerning/good behavior, and do per-category work on them. So it assumes that unaligned/aligned human-like behavior can usefully be divided into a manageable number of categories.
It's based on classifiers: the classifier used to label the pretraining set trains the LLM to act as a classifier of its own emitted token stream. So it shares all of the various, well-understood challenges of machine-learning classifiers. For example, we know that in very high-dimensional spaces like an LLM's residual embeddings, constructing a classifier highly robust against carefully adversarially-chosen examples is hard,[1] so making this approach proof against deliberate, skilled jailbreaking is likely to be hard.
As written, it's measuring and blocking behaviors like <angry> as binary classifications, trying to put sharp edges on fuzzy phenomena — though obviously one could easily extend this to quantized intensity bands like <angry intensity="somewhat"> if that was useful.
Since it involves work during pretraining, the cycle time for changing the system or experimenting with modifications is long, and involves a lot of computational cost. This could probably be ameliorated by performing experiments first using a fine-tuning approach, and then batching their full implementation into the re-pre-training cycles needed to keep an LLM's knowledge cutoff up-to-date.
Postscript
Finally, for any forum-readers who don't care about alignment, only about capabilities: at the beginning I promised you some sugar for your cooperation. The level of detailed, reliable, and flexible LLM behavior control that the above should give us is a fine and obviously-marketable start. Next, consider also labeling documents in your training set with classifier- and/or metadata-derived tags describing the estimated IQ and education level of their writer (or, for fiction, the lower of the IQs of the writer and the character currently being described). Also, if a document or part of a document is something like a Wikipedia article, scientific paper, or news article that has been through extensive editing and rewrites, tag that fact. Now consider the output from a really large LLM (maybe GPT-6 or 7) pretrained this way, when you start generation after the tags <iq band="145_160"><education level="postdoctoral"><edited>. Perhaps even, if the model can extrapolate numbers well, <iq band="175_180">.
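In generation terms this is just another conditioning prefix, handled exactly like the control tags earlier (the tag names here are the hypothetical ones from the paragraph above):

```python
# A conditioning prefix of (hypothetical) quality/style metadata tags, prepended to
# whatever the user asked for; ban the matching closing tags during generation, as in
# the earlier sketch, so the model stays in this register.
user_request = "Explain the weak points in this proof sketch: ..."
prompt = '<iq band="145_160"><education level="postdoctoral"><edited>' + user_request
```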
[1] How to do this is pretty well understood for image classifiers; see for example Adversarial Robustness - Theory and Practice. Doing the same for rather abstract semantic text classifiers is likely significantly harder: while the fundamental minimax structure of the adversarial problem remains, the space is discrete rather than continuous, and the space of text perturbations retaining a similar semantic meaning has a vastly more complex structure (even just establishing a good semantic metric on it is challenging, though LLM residual-embedding space seems like an obvious place to start).