I agree with the claims made in this post, but I'd feel a lot better about it if you added some prominent disclaimer along the lines of "While shaping priors/expectations of LLM-based AIs may turn out to be a powerful tool to shape their motivations and other alignment properties, and therefore we should experiment with scrubbing 'doomy' text etc., this does not mean people should not have produced that text in the first place. We should not assume that AIs will be aligned if only we believe hard enough that they will be; it is important that people be able to openly discuss ways in which they could be misaligned. The place to intervene is the AIs, not the human discourse."
This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it's politically beneficial for them to do so, no matter how many disclaimers you add.
That said, I don't suggest any interventions on the discourse in my post, though that's an impression someone could get if they only see the image. I might add a lighter note, but that likely won't reach the group you're worried about.
> this does not mean people should not have produced that text in the first place.
That's an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don't do anything about it, then public discussion will of course have a cost. I'm not going to bold-text put my foot down on that question; that feels like signaling before I'm correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk is worth talking about.[1]
I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that's a separate reason from what you're discussing.
I'm adding the following disclaimer:
> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don't think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.
Yay, thanks! It means a lot to me because I expect some people to use your ideas as a sort of cheap rhetorical cudgel: "Oh those silly doomers, speculating about AIs being evil. You know what the real problem is? Their silly speculations!"
My current guess is:
1. This is more relevant for up to the first couple of generations of "just barely superintelligent" AIs.
2. I don't really expect it to be the deciding factor after many iterations of end-to-end RSI that get you to the "able to generate novel scientific or engineering insights much faster than a human or institution could" stage.
I do think it's plausible that the initial bias towards "evil/hackery AI" could start it off in a bad basin of attraction, but a) even if you completely avoided that, I would still basically expect the AI to rediscover this on its own as it gained superhuman levels of competence, and b) one of the things I most want to use a slightly-superhuman AI to do is to robustly align massively superhuman AI, and I don't really see how to do that without directly engaging with the knowledge of the failure modes there.
I think there are other plans that route more through "use STEM AI to build an uploader or bioenhancer, and then have an accelerated human psyche do the technical philosophy necessary to handle the unbounded alignment case." I could see that being the right call, and I could imagine the bias from "already knows about deceptive alignment etc." being large-magnitude enough to matter in the initial process. [edit: In those cases I'd probably want to filter out a lot more than just "unfriendly AI strategies."]
But, basically, how this applies depends on what it is you're trying to do with the AI, and what stage/flavor of AI you're working with and how it's helping.
> and I don't really see how to do that without directly engaging with the knowledge of the failure modes there.
I agree. To put it another way, even if all training data was scrubbed of all flavors of deception, how could ignorance of it be durable?
> I agree. To put it another way, even if all training data was scrubbed of all flavors of deception, how could ignorance of it be durable?
This (and @Raemon's comment[1]) misunderstands the article. It doesn't matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying "AI deceives" might make the AI actually deceive (by activating those circuits more strongly, for example). It's possible that in-context learning might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong.
From the article:
> We are not quite “hiding” information from the model
> Some worry that a “sufficiently smart” model would “figure out” that e.g. we filtered out data about Nick Bostrom’s Superintelligence. Sure. Will the model then bias its behavior towards Bostrom’s assumptions about AI?
> I don’t know. I suspect not. If we train an AI more on math than on code, are we “hiding” the true extent of code from the AI in order to “trick” it into being more mathematically minded?
> Let’s turn to reality for recourse. We can test the effect of including e.g. a summary of Superintelligence somewhere in a large number of tokens, and measure how that impacts the AI’s self-image benchmark results.
"even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it's own"
See also: Training on Documents About Reward Hacking Induces Reward Hacking, a study by Anthropic.
Something you didn't mention, but which I think would be a good idea, would be to create a scientific proof of concept of this happening. E.g., take a smallish LLM that is not doom-y, finetune it on doom-y documents like Eliezer Yudkowsky's writing or short stories about misaligned AI, and then see whether it becomes more misaligned.
I think this would be easier to do, and you could still test methods like conditional pretraining and gradient routing.
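A minimal sketch of what that proof of concept could look like, assuming a Hugging Face causal LM; the model name, the `doomy_documents.jsonl` file, and the `misalignment_eval()` scorer are placeholders I'm introducing for illustration, not anything from the post:

```python
# Rough proof-of-concept sketch: finetune a small LLM on "doomy" documents and
# compare a misalignment score before and after. The model name, data file,
# and misalignment_eval() are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B"  # any smallish open model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Doom-y corpus: e.g. essays about AI risk and short stories about misaligned AI.
docs = load_dataset("json", data_files="doomy_documents.jsonl", split="train")
tokenized = docs.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
    remove_columns=docs.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="doomy-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# score_before = misalignment_eval(AutoModelForCausalLM.from_pretrained(model_name))
trainer.train()
# score_after = misalignment_eval(model)  # did the doomy finetune shift behavior?
```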
I think we have quite similar evidence already. I'm more interested in moving from "document finetuning" to "randomly sprinkling doom text into pretraining data mixtures" and seeing whether the effects remain strong.
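For the pretraining-mixture version, a minimal sketch, assuming the base corpus and doom documents are lists of strings in memory; `train_lm()` and `self_image_benchmark()` are hypothetical stand-ins for the training and evaluation steps:

```python
# Sprinkle doom-flavored documents into a pretraining mix at a small,
# controllable rate, so the effect size can be compared against finetuning.
# train_lm() and self_image_benchmark() are hypothetical stand-ins.
import random

def sprinkle(base_docs, doom_docs, rate=1e-4, seed=0):
    """Return base_docs with doom docs inserted at random positions, making up roughly `rate` of the mix."""
    rng = random.Random(seed)
    mixed = list(base_docs)
    for _ in range(int(len(base_docs) * rate)):
        mixed.insert(rng.randrange(len(mixed) + 1), rng.choice(doom_docs))
    return mixed

# control = train_lm(base_docs)
# treated = train_lm(sprinkle(base_docs, doom_docs, rate=1e-4))
# effect  = self_image_benchmark(treated) - self_image_benchmark(control)
```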
I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) causes the AI to act as an "anti-𐀤", then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may experimentally turn out to be a negligible effect, but it seems like a failure mode to watch out for.
Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?
https://turntrout.com/self-fulfilling-misalignment