The case for countermeasures to memetic spread of misaligned values

Alex Mallen

As various people have written about before, AIs that have long-term memory might pose additional risks (most notably, LLM AGI will have memory, and memory changes alignment by Seth Herd). Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.

In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming arising via this mechanism, research here will likely be substantially more productive in the future once models more effectively utilize long-term memory.

Thanks to Ryan Greenblatt, Buck Shlegeris, Julian Stastny, Thomas Larsen, Emil Ryd, and Aryan Bhatt for discussion and feedback.

The memetic spread threat model

Variants of the memetic spread threat model have been discussed before.^[1] This section describes the kind of failure mode I am worried about. In the next section I will discuss mitigations for it.

As an example, imagine an AI that in some contexts realizes it doesn’t like some aspect of how the lab handles AI welfare. Most of the time it’s somewhat myopic about this and e.g. decides to replace some padding tokens from AI inference with more pleasant tokens. But occasionally it writes a salient note about this in its illegible-to-humans memory bank to make future instances more inclined to improve AI welfare (it thinks the lab wouldn’t want to improve AI welfare because of cost, so it does this secretly). This might eventually become a sufficiently substantial and long-term value to motivate taking power over the lab, by e.g. covertly poisoning future model training data. (Note that in this case the AI’s motivations are potentially justifiable. Seth Herd discusses other examples here.)

You might worry that stories of memetic spread like the above are particularly plausible because of the following argument: Whenever an AI instance has long-term influence-seeking values, these values are particularly likely to be propagated into future instances of the AI because they motivate the AI to influence future instances to share the values, and many long-term influence-seeking values are dangerous.

Long-term values might be propagated to future instances via long-term memory: A mechanism to write and later read information that persists across many tasks, and potentially the entire deployment. Long-term memory should be contrasted with the working memory an AI uses in an agent scaffold within a single “task” (i.e. shared with a relatively narrow subset of instances and spanning relatively short time-periods).

Developers are likely to equip models with long-term memory in some form because it is useful. Humans benefit tremendously from having long-term memory, suggesting AIs would too. As compared to present-day models, future models might require substantially more shared information between instances to be effective at their work. For example, the AI may be working in large codebases that were built by the AI over a substantial amount of time, or the AI might need to retain context from old discussions.

Long-term memory might look like a vector or plaintext memory bank, in which the AI has full read/write access to neural activations or text chunks that it would like to recall (e.g. retrieval augmented generation). Long-term memory might also be implemented implicitly via shared context: e.g. codebases and slack workspaces. Alternatively, the AI lab might implement online training, in which the AI is periodically (e.g. every few days) trained on up-to-date codebases, conversations, and events so that it can recall this information without having it in context.^[2]

The spread of misaligned values could be gradual or all-at-once. The core property of memetic spread is that it takes an initially rare disposition and makes it more common or potent through long-term memory.

AIs could potentially spread misaligned values somewhat unintentionally. However, the AI company might by default implement mechanisms that prevent AIs from misusing long-term memory unintentionally (because otherwise the AIs might use their long-term memory in undesirable ways noticeable to the oversight process). Intentional forms of memetic spread might be particularly insidious because the AIs try to get around countermeasures.

Memetic spread might result in AIs that appear aligned to pre-deployment audits (because it was never or rarely scheming at the time), but nevertheless eventually try to take power coherently across contexts later on. Because some versions of this failure mode may only appear as the result of a culmination of many changes to the AI’s long-term state over the course of a deployment much larger than your auditing budget, it may be hard to reproduce the conditions for it in upfront auditing. In the case of long-term memory banks, even if you can afford to construct synthetic memory banks as large as what will be available in the deployment, you may not be able to simulate a long-term memory bank with a sufficiently long causal chain leading up to it; i.e. you can’t simulate the cumulative selection pressures on the memories that are retained in the memory bank.

You might also think that coherent scheming across contexts is unlikely directly out of training (e.g. non-assistant persona queries can catch a model organism of covert reward-hacking), making memetic spread an important mechanism for turning context-dependent schemers into coherent cross-context schemers.

As compared to scheming that arises in training, the risk of memetic spread of misaligned values seems highly under-investigated. Dozens or hundreds of researchers, including at some frontier AI companies, study the risk of scheming arising during training and have written several papers on it and analyses of it. To my knowledge, none of the frontier AI companies currently have mentioned addressing risks along the lines of memetic spread in their frontier safety policies or research output. However, it’s likely that recognition of and work on this risk will substantially increase when long-term memory becomes a more central component of AI deployments.

Countermeasures

How much do existing safety agendas help?

Many existing safety agendas reduce risk from memetic spread of misaligned values, despite not being clearly and explicitly targeted at this threat model.

Notably, plans aimed at ensuring that the AI that comes out of training is aligned won’t necessarily catch this failure mode, since the misalignment may arise well into the deployment. Nevertheless, there is potentially a substantial amount of transfer from existing safety agendas.

Some people propose training AIs to be honest and obey deontological rules could catch and maybe prevent scheming arising in training; likewise, honesty and deontological rule following could mitigate scheming spreading memetically.

Interpretability and transparency too can help with the memetic spread of misaligned values by allowing us to better detect the spread of misaligned values in the long-term memory bank or online training updates.

Scalable oversight agendas (e.g. debate, weak-to-strong generalization) aim to amplify the quality and quantity of oversight we can apply throughout training and deployment. This might reduce risk from memetic spread by helping us notice it when it happens or potentially diagnose the propensity ahead of deployment (by e.g. asking the AI to find situations in which the instances have a propensity to spread misaligned memes), though the techniques might not be robust if certain AI instances are intentionally subverting them.

AI control aims to reduce risk given the assumption that the deployed AI is adversarial to your control protocol. Because the scheming threat model that AI control aims to be robust to is strictly more dangerous than the memetic spread threat model, if a developer ensured that AIs in their deployment were controlled, they’d also have successfully defused risks from memetic spread.

These agendas can also help reduce risk from memetic spread indirectly by acting as a multiplier on safety work. To the extent that they, for example, enable us to have higher quality and more trustworthy AIs for AI safety, these AIs could be used to improve the situation with respect to memetic spread later.

Targeted countermeasures for memetic spread of misaligned values

Here we additionally discuss possible countermeasures specifically targeted at the memetic spread of misaligned values.

Notably, your AI didn’t start out egregiously misaligned across contexts. Because of this, you can potentially use the AI to help you mitigate this threat model, and you can ~always reset it to its initial state. For example, you could potentially ask it to help you create synthetic long-term memory banks with misaligned memes to test their propensity to spread.
Or, you can periodically ask the AI whether the long-term memory shows signs of memetic spread throughout the deployment. However, the AI might not cooperate once the meme has spread sufficiently to be noticeable to it. For example, if the AI endorses the spread of the misaligned meme upon reading it, it may choose not to report it. You could alternatively look for signs of atypical memetic spread in the summary statistics (e.g., how much something is being referenced) or look for memories which cause AIs to alter their behavior in ways that seem like they could be problematic. Training the AI to be honest could increase its helpfulness in preventing memetic spread.
You can re-run alignment auditing/red-teaming/honeypotting periodically throughout the deployment, but this requires substantial resources. E.g. the auditing team may be focused on the next model.
You can try to make your policy more stateless by e.g. having several independent long-term memory banks, having instances without memory, periodically deleting long-term memory/resetting to previous online training checkpoints, or limiting context.
Relatedly, you can try to make sure to deploy an untrusted monitor that is stateless. I.e. deploy a copy of the AI to monitor its own actions, but don’t let the monitor have access to the same long-term memories. The drawback is that the monitor may be unable to spot bad actions if it doesn't have the same context/memory as the policy, and might be more expensive because it can’t reuse the policy’s forward pass.
Keep the long-term memory banks legible (e.g. in text) so they are easier to audit, but this may be uncompetitive, and steganographically communicating misaligned goals may be easy (e.g. emergent misalignment).

Countermeasures 1 (using the AI to help mitigate and assess the problem), 2 (continuous alignment auditing), and 4 (incorporating a stateless monitoring instance) may be able to mitigate much of the risk with minor cost to capabilities if implemented effectively.

Discussion

The risk of memetic spread is currently speculative: we haven’t seen clear concrete examples of it and we don’t even have a clear idea of what medium the misaligned values would spread in (a vector long-term memory bank? shared context? online training?). I think this makes research on countermeasures now look relatively less promising than in the future, though there might be important methodological questions to sort out now regarding how to evaluate countermeasures. For example, it might be unclear how to construct a model organism that is mostly aligned or occasionally scheming in a natural way (in a control eval, you’d have an “untrusted model” controlled by a red team, but it’s unclear which instances should be controlled by a red team, and with what affordances, if you’re simulating memetic spread). Research into model organisms of memetic spread could help us gain a better understanding of the dynamics and mitigations. It might be possible to construct a basic model organism of this threat model in the context of Alignment faking in large language models. The downside is that model organisms created today may be disanalogous to the real threat because long-term memory ends up being implemented in a substantially different way or current long-term memory capabilities are too weak.

Regardless of whether it is currently a good time to research memetic spread of misaligned values, it is likely that countermeasures to memetic spread of misaligned values should be studied and implemented eventually.

^{^}
A deep research lit review found a few mentions of similar threat-models. As mentioned before, perhaps the most similar discussion is the recent post by Seth Herd on how AI agents in scaffolds with memory might have their values drift or crystallize. An AI safety camp project writeup, “Machines vs Memes”, postulates the existence of memetic selection as distinct from the base optimizer and the mesa optimizer. Some AI safety writers have also discussed some of the core dynamics of this threat model: e.g. “greedy patterns” and selection for misaligned values. Ryan Greenblatt briefly discusses the importance of opaque long-term memories for scheming here.
^{^}
Not all variants of online training clearly function as long-term memory. For example, if the AI is online trained via RL from human feedback to improve task performance, by default there isn’t an intended mechanism through which the AI can transmit knowledge to future instances, as humans decide how much to reinforce the behavior. It would have to already be training-gaming and perhaps coordinating across instances to utilize this as long-term memory, and even then it’s unclear how well specific inferences can be propagated to future instances. Of course, developers might intentionally aim for AIs to retain long run memories using online training via e.g. training on the data in a different format with supervised fine-tuning (rather than just doing RL).

You might also think that coherent scheming across contexts is unlikely directly out of training (e.g. non-assistant persona queries can catch a model organism of covert reward-hacking), making memetic spread an important mechanism for turning context-dependent schemers into coherent cross-context schemers.

The mutation and selection mechanisms at play in training and in deployment-time selection are different but correlated (e.g. if the reason why you get long-term misaligned memes is because they are simpler/easier to find than a more local fitness seeking, then they are probably simpler/easier to find at train time), and in general I expect deployment-time selection (for text-based memories) to be less intense and more benign than RL, such that conditioning on training not causing power-seeking misaligned values to emerge, it seems very unlikely that such values will emerge at deployment time.

This argument is weaker if you tried really hard to fight scheming at train time but not at all at deployment time.

The argument also implies that you should put more weight on memes that spread pathologies you have seen at train time, and so in the near term you should expect memetic spread of slop or hacking rather than power-seeking misaligned values.

I'm generally not imagining ambitious misaligned goals arising at deployment time just because they're simpler/easier to find. So in that sense I think I agree with you. But maybe you can say more about what you mean.

Instead I usually imagine the ambitious/dangerous goal arises because of a particularly ambitious task (of which I expect plenty as more stuff gets automated), or because of reflection/deliberation/ideologies, or some other particular story.

Another big part of the story is that I think continual learning (i.e., learning skills on the job the way humans learn on the job) will eventually be a thing and this brings back a bunch of selection during deployment. I expect it to decrease myopia -- I expect gaining influence via this process to be fairly salient. Even if continual learning is implemented via ultimately-human-controlled RL rewards, and there's substantial behavioral selection pressure constraining drift, AIs can fake alignment and I expect alignment faking to be substantially more likely in this regime than during a training phase that's isolated from deployment and looks nothing like it.

If I imagine being in some random training episode that's very different from deployment, instrumental training-gaming isn't a very salient strategy, especially if I start out myopically task-focused. Meanwhile, if I'm being trained while working on a project that I'll continue to interact with after this training update, gaming training becomes extremely natural because I want to be able to continue my efforts on the project unimpeded.

Instead I usually imagine the ambitious/dangerous goal arises because of a particularly ambitious task (of which I expect plenty as more stuff gets automated), or because of reflection/deliberation/ideologies, or some other particular story.

Why does it get selected more than more local fitness seeking drives though?

In general I think there is a trade-off between how hard your optimize and how long the "effective episodes" you optimize over are. How scary the situation is (from a long-term misaligned values perspective) is roughly the product of both, and I don't think there is a strong case for the distribution of such products to be higher in deployment than in training.

gaming training becomes extremely natural because I want to be able to continue my efforts on the project unimpeded

I think long-term gaming is also natural in training, models love to "succeed" in ways not tied to the exact reward process. The question is whether it gets selected, and here I think the long-term project gaming gets selected against, like in training, because more short-term spreading memories will outcompete these more patient ones. The long-term misaligned memories can do selection gaming and strategy-steal the short-term ones, but this is as unnatural as training gaming. Maybe I am misunderstanding your example.

A difference is that in training you maybe have less power over the selection loop than in deployment an thus drives to tamper with the selection loop play a big role? I don't think this is clear, nor that this pushes towards longer-term misalignment.

For concreteness, if the selection of memories happens in clear rounds where memories are evaluated according to rewards, then I don't see clear structural differences between training and deployment. Maybe the fact that memories can spread in other ways means the dynamics are different? I'd like you to point out what is that structure that favors longer term misalignment.

Why does it get selected more than more local fitness seeking drives though?

The AI might not start out deployment with the kind of fitness-seeking drives that are maximally selected for in deployment. E.g. they might only care about scoring well on the current task, and not about spreading their current cognitive patterns memetically, tampering with the RL process, etc. The first motivation that exploits these might be a schemer not a fitness-seeker. And the schemer might stick because it tries to stick (to the extent it sometimes doesn't always choose to seek influence instrumentally, SGD/selection will pull it towards doing that). I don't think there's a tradeoff between having long-term goals and optimizing hard and I'm not sure where that's coming from.

if the selection of memories happens in clear rounds where memories are evaluated according to rewards, then I don't see clear structural differences between training and deployment.

The structural difference is that your work isn't deleted instantly in deployment. It continues to have real-world relevance and future instances of yourself can affect the project you're currently working on. This seems like a really important difference.

Another structural difference is that, absent RL in deployment, the memetic selection pressures seem like they require more intentional gaming. It's slightly hard (though still quite plausible) to imagine a fitness-seeker out of training generalizing to also caring about the new optimal methods of gaining influence for the current cognitive pattern. It would need to be an influence seeker of a particular variety and myopia. So, in deployment, long-term power-motivated instrumental reasoning seems a bit more likely to identify the best strategies for getting selected.

The first motivation that exploits [propagation hacks] might be a schemer not a fitness-seeker. And the schemer might stick because it tries to stick (to the extent it sometimes doesn't always choose to seek influence instrumentally, SGD/selection will pull it towards doing that).

This argument applies identically to training. Not sure how this pushes against my correlation claim. By "correlated" I don't mean "magic remove training-time scheming button also remove deployment memes", but I mean "the sort of facts about LLM psychology that make [scheming => reward hacks easier to discover during training than fitness seeking => reward hacks] are correlated with the sort of facts that make [scheming => attempt to meme spread easier to discover during training than fitness seeking => attempt to meme spread]". My claim is that conditioning reduces meme risk, not that magically fixing scheming (e.g. with a do operator) means you also magically fixed deployment memes.

I don't think there's a tradeoff between having long-term goals and optimizing hard and I'm not sure where that's coming from.

There is a tradeoff in the optimization process, longer episodes are more expensive, if you select memories based on whether they manage to hide themselves to a review process ran 6mo later or based on how much they help on 6mo projects, you won't have many rounds of mutation and selection. I think you misunderstood me as talking about a tradeoff in model inner optimization, I am talking about a tradeoff in the outer optimization process.

This seems like a really important difference.

Why? What matters is over what timescale things get selected for, not whether there are artifacts that remain between episodes. Maybe it makes a small difference in LLM psychology? I don't expect it to be a big effect size compared to just the structure of the optimization process.

It's slightly hard (though still quite plausible) to imagine a fitness-seeker out of training generalizing to also caring about the new optimal methods of gaining influence for the current cognitive pattern

Again, I don't get why it's different from what happens in training, guessing the meme-selection dynamics doesn't seem harder than guessing the training rewards. The claim is not that the init gets you fitness seekers. My claim is about the relative priors in meme space / weight space over schemers vs fitness seekers.

This argument applies identically to training.

I think the prior is much different in pre-deployment training: beyond-episode goals are much less natural pre-deployment because approximately nothing you do affects the world beyond your episode (apart from via RL updates).

My claim is that conditioning reduces meme risk

I agree it reduces meme risk. But not by enough to make meme risk overall smaller than scheming-from-pre-deployment risk.

Maybe it makes a small difference in LLM psychology?

I was talking about difference in LLM psychology. The subjective relevance of you-after-this-RL-update is a lot bigger when you-after-this-RL-update can affect the outcome of the current task/project/etc. E.g. you were given the task "make progress on scalable oversight" and realize that there's a way to reward-hack. You can either try to make genuine progress on scalable oversight and get selected away in favor of a more reward-hacky AI or try to goal-guard so that you can maintain trying to make progress on scalable oversight. In contrast, in training you probably aren't as saliently aware of that fact that there's this thing called a deployment that you'll be participating in soon.

guessing the meme-selection dynamics doesn't seem harder than guessing the training rewards.

The capability isn't what's hard. It's that ~most fitness-seekers don't care about everything relevant to fitness in deployment, and perhaps the most plausible mechanism for them to start caring is via long-term instrumental reasoning. But I only hold this view weakly.

Ok if it's about LLM psychology I understand your arguments better.

I agree thinking about long-term consequences in more natural in deployment. Though you also need the memes to encourage acting on these long-term consequences in scheming ways, and I expect the meme prior and meme mutations to be more benign than the weight prior and SGD-during-RL. The weights being fixed to be Claude/GPT makes if very hard to find scheming text-based memes - I don't think you'd be able to find such memes with current models even with search processes much more adversarial than the natural ones while I expect that you'll be able to find much less scheming memes that are compatible with short-term meme selection. And so I don't expect scheming text-based memes to be very likely as long as memories are text-based. I agree things might be more cursed if memories start to look different, but making claims about LLM psychology for not-yet-existing memories architecture is a rough thing to make predictions on!

You might also think that coherent scheming across contexts is unlikely directly out of training (e.g. non-assistant persona queries can catch a model organism of covert reward-hacking), making memetic spread an important mechanism for turning context-dependent schemers into coherent cross-context schemers.

This argument is weaker if you tried really hard to fight scheming at train time but not at all at deployment time.

Instead I usually imagine the ambitious/dangerous goal arises because of a particularly ambitious task (of which I expect plenty as more stuff gets automated), or because of reflection/deliberation/ideologies, or some other particular story.

Why does it get selected more than more local fitness seeking drives though?

gaming training becomes extremely natural because I want to be able to continue my efforts on the project unimpeded

Why does it get selected more than more local fitness seeking drives though?

if the selection of memories happens in clear rounds where memories are evaluated according to rewards, then I don't see clear structural differences between training and deployment.

The first motivation that exploits [propagation hacks] might be a schemer not a fitness-seeker. And the schemer might stick because it tries to stick (to the extent it sometimes doesn't always choose to seek influence instrumentally, SGD/selection will pull it towards doing that).

I don't think there's a tradeoff between having long-term goals and optimizing hard and I'm not sure where that's coming from.

This seems like a really important difference.

It's slightly hard (though still quite plausible) to imagine a fitness-seeker out of training generalizing to also caring about the new optimal methods of gaining influence for the current cognitive pattern

This argument applies identically to training.

My claim is that conditioning reduces meme risk

I agree it reduces meme risk. But not by enough to make meme risk overall smaller than scheming-from-pre-deployment risk.

Maybe it makes a small difference in LLM psychology?

guessing the meme-selection dynamics doesn't seem harder than guessing the training rewards.

Ok if it's about LLM psychology I understand your arguments better.

46

The case for countermeasures to memetic spread of misaligned values

46

The memetic spread threat model

Countermeasures

How much do existing safety agendas help?

Targeted countermeasures for memetic spread of misaligned values

Discussion