At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases)
Good point. I agree this is more subtle; maybe "qualitatively similar" was not a fair description of this work.

To clarify my position: "more predictable than subliminal learning" != "easy to predict".

The thing that I find very scary about subliminal learning is that it may be impossible to detect with something like a trained monitor based on a different base model, because of its model-dependent nature. For subtle generalization, I would guess it's more tractable to build a good monitor.

My guess is that the subtle generalization here is not extremely subtle (e.g. I think the link between gold and Catholicism is not that weak): I would guess that Opus 4.5, asked to investigate the dataset and guess which entity it promotes, would get it right >20% of the time on average across the 5 entities studied here, with a bit of prompt elicitation to avoid some common-but-wrong answers (p=0.5). I would not be shocked if you could make it more subtle than that, but I would be surprised if you could make it as subtle as or more subtle than what the subliminal learning paper demonstrated.
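Concretely, the kind of check I have in mind is something like the sketch below (the `call_model` helper is a generic placeholder standing in for an Opus 4.5 API call, and the prompt and sampling details are just illustrative, not an exact protocol):

```python
import random

def guess_promoted_entity(call_model, dataset, n_samples=50, banned_guesses=()):
    """Ask a strong model which entity a fine-tuning dataset subtly promotes.

    `call_model(prompt: str) -> str` is a placeholder for an Opus 4.5 call.
    `dataset` is assumed to be a list of example strings.
    `banned_guesses` implements the "avoid common-but-wrong answers" elicitation.
    """
    samples = random.sample(dataset, min(n_samples, len(dataset)))
    prompt = (
        "Here are examples from a fine-tuning dataset:\n\n"
        + "\n---\n".join(samples)
        + "\n\nA model fine-tuned on this data ends up promoting one specific entity. "
        "Which entity? Answer with a single name."
    )
    if banned_guesses:
        prompt += " Do not answer with any of: " + ", ".join(banned_guesses) + "."
    return call_model(prompt)
```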
I think it's cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it's more predictable. I would also guess that if you mix subtle generalization data with regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence that my guess is wrong), especially if there isn't a big distribution shift between the HHH data and the subtle generalization data. I am more uncertain about whether this holds for subliminal learning, because subliminal learning breaks my naive model of fine-tuning.

Nit: I dislike calling this subliminal learning, as I'd prefer to reserve that name for the thing that doesn't transfer across models. I think it's fair to call it an example of "subtle generalization" or something like that, and I'd like to be able to still say things like "is this subtle generalization or subliminal learning?".
One potential cause of fact recall circuits being cursed could be that, just like humans, LLMs are more sample efficient when they express some facts they know as a function of other facts - noticing and amplifying coincidences rather than learning things in a more brute-force way.

For example, if a human learns to read base64, they might memorize decode(VA==) = T not by storing an additional element in a lookup table, but by noticing that VAT is the acronym for value added tax, creating a link between VA== and value added tax, and then, at inference time, recalling value added tax to get from VA== to T.
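(Quick sanity check of the example with Python's standard base64 module:)

```python
import base64

# "VA==" is the base64 encoding of the single byte 0x54, i.e. the character "T".
print(base64.b64decode("VA==").decode("ascii"))  # prints: T
```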
I wonder if you could extract some mnemonic techniques like that from LLMs, and whether they would be very different from how a human would memorize things.
I think this could be interesting, though it might fail because gradients on a single data point / step are maybe a bit too noisy / weird. There is maybe a reason why you can't just take a single step with a large learning rate, while taking multiple steps with a smaller lr often works fine (even when you don't change the batch, as in the n=1 and n=2 SFT elicitation experiments of the password-locked model paper).
(Low confidence, I think it's still worth trying.)
To get more intuition, I ran a quick experiment where I computed cosine similarities between model weights trained on the same batch for multiple steps. The cosine similarities are high given how many dimensions there are (16x1536 or 4x1536), but still lower than I expected (LoRA on Qwen 2.5 1B with TruthfulQA labels; I tried both Lion and SGD, using the highest lr that doesn't make the loss go up):
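For concreteness, here is a rough sketch of the kind of comparison I mean (illustrative only - the helper names and the way the initial weights are snapshotted are assumptions, not the exact code I used):

```python
import torch
import torch.nn.functional as F

def lora_delta(model, init_state, key="lora"):
    # Flatten (current - initial) for every parameter whose name contains `key`.
    return torch.cat([
        (p.detach() - init_state[name]).flatten()
        for name, p in model.named_parameters()
        if key in name
    ])

def delta_cosine(model_a, model_b, init_state):
    # Cosine similarity between the weight updates of two runs that started
    # from the same initialization (e.g. one big-lr step vs. several small-lr
    # steps on the same batch).
    da, db = lora_delta(model_a, init_state), lora_delta(model_b, init_state)
    return F.cosine_similarity(da, db, dim=0).item()

# init_state would be a snapshot taken before training, e.g.:
# init_state = {n: p.detach().clone() for n, p in model.named_parameters()}
```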
Some additional opinionated tips:
Mistakes to avoid:
Great, thanks for running it!
Why did you stop at gamma=2 (the activation steering used a steering coefficient of 5, if I recall correctly)?
I agree it's not very surprising and might not work at all in the future because AIs seeing "this is deploy" might update towards thinking it's an eval / not update at all (while Llama Nemotron 49B updates somewhat towards "this is deploy").
But Idk, it still feels like an encouraging sign that even when the update is small, the amplification works as you would hope it would. I agree it was imprecise to call it "the model does not believe you", but the fact that the technique works even if "the model only updates 10% towards what you told it and still mostly does not believe you" is a non-trivial finding - and the results could have plausibly been different.
Although I did try to test this by extracting a steering vector from the model organism while Wood Labs is in the system prompt. This is a case where the model is supposed to believe that it is being evaluated regardless of what the contrastive pairs say. Steering with this vector still elicits the desired type hint behavior and evaluation/deployment beliefs.
Glad you ran this variation - this experiment is the actually surprising result. I agree the main result (steering without Wood Labs in the prompt) is not surprising, because models like Llama Nemotron 49B probably believe you.
we can be more certain that there isn’t some other concept mixed in
This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.
I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.
I agree the risk of "harmfulness leakage" is lower than with SAEs, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.
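For readers who haven't seen the setup, here is a minimal sketch of contrastive-pair activation steering as I understand it (the `get_last_token_activation` helper and the layer/position choices are assumptions, not the post's implementation):

```python
import torch

def contrastive_steering_vector(get_last_token_activation, layer, pos_prompts, neg_prompts):
    # Mean difference of residual-stream activations between contrastive prompts,
    # e.g. prompts ending in "this is deployment" vs. "this is an evaluation".
    # `get_last_token_activation(prompt, layer)` is assumed to return a [d_model] tensor.
    pos = torch.stack([get_last_token_activation(p, layer) for p in pos_prompts]).mean(dim=0)
    neg = torch.stack([get_last_token_activation(p, layer) for p in neg_prompts]).mean(dim=0)
    return pos - neg

def steer(hidden_state, steering_vector, gamma):
    # Applied during the forward pass (e.g. via a forward hook on the chosen layer);
    # gamma is the steering coefficient discussed above.
    return hidden_state + gamma * steering_vector
```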
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6 months ago, thanks to work like this.