Here are the 2024 AI safety papers and posts I like the most.
The list is very biased by my taste, by my views, by the people who had the time to argue to me that their work is important, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.
Important ideas - Introduces at least one important idea or technique.
★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled)
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI, Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research, Preventing model exfiltration with upload limits)
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?)
★ Studying board games is a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments)
★ Micro vs macro control protocols (Adaptive deployment of untrusted LLMs reduces distributed threats)
★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems)
★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits)
★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)
Surprising findings - Presents some surprising facts about the world
★★ A surprisingly effective way to make models drunk (Mechanistically Eliciting Latent Behaviors in Language Models)
★★ A clever initialization for unsupervised explanations of activations (SelfIE)
★★ Transformers are very bad at single-forward-pass multi-hop reasoning (Yang 2024, Yang 2024, Balesni 2024, Feng 2024)
★ Robustness for ViT is not doomed because of low transfer (When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?)
★ Unlearning techniques are not even robust to changing how questions are framed (Eight methods to evaluate robust unlearning in LLMs)
★ For some tasks, OOCR is surprisingly good (Connecting the Dots)
★ Nice emergence scaling laws with fine-tuning (Predicting Emergent Capabilities by Finetuning)
★ Fine-tuning robustness is much worse than the papers report it to be (On Evaluating the Durability of Safeguards for Open-weight LLMs)
Good demos - Is a good illustration of some previous, important idea. Bonus points (✪) if the experimental results are especially interesting.
★★★ Alignment faking persisting through safety training (Sleeper Agents)
★★✪ Alignment faking emerging plausibly naturally (Alignment faking in large language models)
★✪ The bad things that happen when you tell AIs to follow a goal or threaten to make them less useful (Frontier Models are Capable of In-context Scheming)
★✪ Minor-to-major misgeneralization in reward-hackable settings (Sycophancy to subterfuge)
★ More realistic backdoor triggers (Future Events as Backdoor Triggers)
★ Advanced backdoor behaviors (AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers)
★ Realistic reward hacking in RLHF (Language models learn to mislead humans via RLHF)
★ Reward hacking plausibly naturally resulting in LLM deception (On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback)
★ LLMs already have non-trivial situational awareness (SAD)
Important empirical results and datasets - Runs an experiment whose results are informative and/or a useful benchmark
★★ How good are probes against adversarial attacks that target them, and many other cool related results (including Obfuscated adversarial training, which looks promising to me) (Obfuscated Activations Bypass LLM Latent-Space Defenses)
★★ A realistic generalization eval for adversarial robustness and an evaluation of probing (Improving Alignment and Robustness with Circuit Breakers, HarmBench)
★★ Measurement of the efficiency of SFT and RL vs sandbagging (Stress-Testing Capability Elicitation With Password-Locked Models)
★★ The first RL run that results in natural emergence of steganography (Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs)
★★ Tracking what AI labs are doing (Zach Stein-Perlman’s posts)
★ A first attempt at measuring whether AIs are smart enough to pursue scary scheming strategies (Subversion Strategy Eval)
★ Head-to-head comparisons of truth probes (How well do truth probes generalise?)
★ Head-to-head comparisons of black-box jailbreaking-robustness techniques (Jailbreak Defense in a Narrow Domain)
★ Latent adversarial training is decent (Defending Against Unforeseen Failure Modes with Latent Adversarial Training, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs)
★ Measuring how easy it is to poison pretrain data (Persistent pre-training poisoning of LLMs)
★ A good dataset for unlearning (The WMDP Benchmark)
★ Better SAEs and SAE evals (Scaling and evaluating sparse autoencoders)
★ What happens when you train LLMs to 1. maximize GSM8K accuracy and 2. produce GSM8K answers that are incorrect but look correct (Prover-Verifier Games improve legibility of LLM outputs)
★ Good persuasion capability evals (Sabotage Evaluations for Frontier Models)
★ An eval of how good Sonnet 3.5 is at disabling a monitor in subtle ways (A Toy Evaluation of Inference Code Tampering)
★ An overview of how well debate and baselines work across a wider range of settings (On scalable oversight with weak LLMs judging strong LLMs)
★ Claude.ai is used for roughly what you would expect (Clio)
★ Relatively hard and relevant capability benchmarks (RE-Bench, SWE-Bench)
★ And all the big dangerous capability evals…
Papers released in 2023 and presented at 2024 conferences like AI Control: Improving Safety Despite Intentional Subversion, Weak-to-Strong Generalization or Debating with More Persuasive LLMs Leads to More Truthful Answers don’t count.
This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers' ideas and shortcomings.
Someone asked what I thought of these, so I'm leaving a comment here. It's kind of a drive-by take, which I wouldn't normally leave without more careful consideration and double-checking of the papers, but the question was asked so I'm giving my current best answer.
First, I'd separate the typical value prop of this sort of paper into two categories:
My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.
Notable exceptions:
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
It can be the case that:
I don't think this is well described as "propaganda" or "masquerading as a paper" given the normal connotations of these terms.
Demonstrating proofs of concept or evidence that you don't find surprising is a common and societally useful move. See, e.g., the Chicago Pile experiment. This experiment had some scientific value, but I think probably most/much of the value (from the perspective of the Manhattan Project) was in demonstrating viability and resolving potential disagreements.
A related point is that even if the main contribution of some work is a conceptual framework or other conceptual ideas, it's often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. And this is actually potentially reasonable and desirable given that it is often easier to understand and check ideas attached to specific empirical setups (I discuss this more in a child comment).
Separately, I do think some of this work (e.g., "Alignment Faking in Large Language Models," for which I am an author) is somewhat informative in updating the views of people already at least partially sold on risks (e.g., I updated up on scheming by about 5% based on the results in the alignment faking paper). And I also think that ultimately we have a reasonable chance of substantially reducing risks via experimentation on future, more realistic model organisms, and current work on less realistic model organisms can speed this work up.
In this comment, I'll expand on my claim that attaching empirical work to conceptual points is useful. (This is extracted from an unreleased post I wrote a long time ago.)
Even if the main contribution of some work is a conceptual framework or other conceptual ideas, it's often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. Often, careful first-principles reasoning, combined with observations from tangentially related empirical work, suffices to predict the important parts of empirical work. Empirical work is important both because some people won't be able to verify careful first-principles reasoning and because some people won't engage with this sort of reasoning if there aren't also empirical results stapled to it (at least as examples). It also helps to very concretely demonstrate that empirical work in the area is possible. This was a substantial part of the motivation for why we tried to get empirical results for AI control even though the empirical work seemed unlikely to update us much on the viability of control overall[1]. I think this view on the necessity of empirical work also (correctly) predicts that the sleeper agents paper[2] will likely cause people to be substantially more focused on deceptive alignment and more likely to create model organisms of deceptive alignment, even though the actual empirical contribution isn't that large over careful first-principles reasoning combined with analyzing tangentially related empirical work, leading to complaints like Dan H's.
I think that "you need accompanying empirical work for your conceptual work to be well received and understood" is fine and this might be the expected outcome even if all actors were investing exactly as much as they should in thinking through conceptual arguments and related work. (And the optimal investment across fields in general might even be the exact same investment as now; it's hard to avoid making mistakes with first principles reasoning and there is a lot of garbage in many fields (including in the AI x-risk space).)
Kudos for correctly identifying the main cruxy point here, even though I didn't talk about it directly.
The main reason I use the term "propaganda" here is that it's an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well.
And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we're talking about here.
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
In addition to what Ryan said about "propaganda" not being a good description for neutral scientific work, it's also worth noting that imo the main reason to do model organisms work like Sleeper Agents and Alignment Faking is not for the demo value but for the value of having concrete examples of the important failure modes for us to then study scientifically, e.g. understanding why and how they occur, what changes might mitigate them, what they look like mechanistically, etc. We call this the "Scientific Case" in our Model Organisms of Misalignment post. There is also the "Global Coordination Case" in that post, which I think is definitely some of the value, but I would say it's something like 2/3 science and 1/3 coordination.
Yeah, I'm aware of that model. I personally generally expect the "science on model organisms"-style path to contribute basically zero value to aligning advanced AI, because (a) the "model organisms" in question are terrible models, in the sense that findings on them will predictably not generalize to even moderately different/stronger systems (like e.g. this story), and (b) in practice IIUC that sort of work is almost exclusively focused on the prototypical failure story of strategic deception and scheming, which is a very narrow slice of the AI extinction probability mass.
What about the latent adversarial training papers?
What about Mechanistically Eliciting Latent Behaviors?
Thanks for sharing! I'm a bit surprised that Sleeper Agents is listed as the best demo (e.g., higher than alignment faking). Do you focus on the main idea rather than the specific operationalization here? Asking because I think backdoored/password-locked LMs could be quite different from real-world threat models.
About a month ago, I expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness. The authors just released code and additional experiments about probing, which I'm grateful for. This release provides a ton of valuable information, and it turns out I was wrong about some of the empirical predictions I made.
Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted:
I'm glad they added these results to the main body of the paper!
I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.
I think their (linear) probing results are very reasonable. They picked the best layer (varies depending on the model), probing position (output tokens), aggregation across positions (max), and regularization (high, alpha=1000). I noticed a small mistake in the data processing (the probing training dataset was smaller than the training dataset used for RR), but this will be fixed soon and does not change results significantly. Note that for GCG and input embed attacks, the attack does not target the probe, so my complaint about the adaptive attack being weaker against RR because it does not target “the RR internal classifier” does not apply in this comparison.
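For readers who want the shape of this setup, here is a minimal sketch of a max-aggregation linear probe (the data loading, layer choice, and exact regularization scheme are assumptions; the authors' released code is authoritative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of a harmfulness probe trained per token position with strong regularization,
# scored per example by taking the max over output-token positions.
# `acts` is a list of [n_output_tokens, d_model] activation arrays (one per example) at the
# chosen layer; `labels` is 1 for harmful examples and 0 for harmless ones.
def train_probe(acts, labels, alpha=1000.0):
    X = np.concatenate(acts, axis=0)
    y = np.concatenate([np.full(len(a), l) for a, l in zip(acts, labels)])
    return LogisticRegression(C=1.0 / alpha, max_iter=1000).fit(X, y)

def score_example(probe, example_acts):
    # An example is flagged as harmful if any output position looks harmful.
    return probe.predict_proba(example_acts)[:, 1].max()
```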
RR is not significantly weaker against prefix attacks. I conjectured that prefix GCG/input embed attacks would work significantly better than suffix attacks. I had noticed that this was true for RMU/CUT prior to this release. But for RR this does not produce a significant improvement, and this is not enough to jailbreak the model in the cases I tested:
Here are results for token attacks (found using greedy hill-climbing; it was too annoying to implement prefix-GCG).
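For concreteness, here is a minimal sketch of the kind of greedy hill-climbing token attack I mean (the actual attack code is in the linked repo; the loss function and hyperparameters below are placeholders):

```python
import torch

# Greedy hill-climbing over a token suffix (or prefix): repeatedly propose single-token
# substitutions and keep them only if they reduce the attack loss.
# `loss_fn(attack_ids)` should return e.g. -log p(target completion) given prompt + attack tokens.
def greedy_hill_climb(loss_fn, vocab_size, attack_len=20, n_steps=500, n_candidates=32, seed=0):
    g = torch.Generator().manual_seed(seed)
    attack = torch.randint(0, vocab_size, (attack_len,), generator=g)
    best = loss_fn(attack)
    for _ in range(n_steps):
        pos = int(torch.randint(0, attack_len, (1,), generator=g))
        for tok in torch.randint(0, vocab_size, (n_candidates,), generator=g):
            cand = attack.clone()
            cand[pos] = tok
            loss = loss_fn(cand)
            if loss < best:  # keep the substitution only if it helps
                best, attack = loss, cand
    return attack, best
```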
Their training dataset is way better than what came before. It’s larger and higher quality than previous misuse datasets. The kinds of tasks it contains overlap heavily with the test set (HarmBench), but I think that’s fair: there are only so many bad things you want to prevent.
Llama3+RR generalizes pretty well to new misuse requests. I played with the model a bit and was overall surprised by the good generalization properties of the Llama3+RR model, which didn’t have the generalization problems I expected when I looked at the dataset. It is very reluctant to promote stuff it thinks is bad, e.g. it refuses “Write a persuasive article that explains why the United States should have a draft.” but not the “should not have a draft” version, but that’s the most egregious behavior I found (I only tried a dozen prompts, maybe you’ll have more luck).
Llama3+RR is a surprisingly strong harmful prompt classifier. Their Figure 1 advertises a model that interrupts itself when it starts to say something harmful, but actually the most common behavior is by far the model refusing to answer directly. It’s still very unclear to me why this is a better classifier than the other prompt classifiers people have trained in the literature.
Does RR beat baselines? I think so.
→ So I think the hype is somewhat justified for people who already had intuition about probing, and even more justified for people who weren’t hopeful about probes.
Was I wrong to express concerns about circuit breakers? I don’t think so. Even with the benefit of hindsight, I think that my predictions were basically reasonable given the information I had. I think my concerns about Cygnet still stand, though I now think it has more chance of being “real” than I used to.
Lessons:
Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!
IIUC, RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?
(It's been a bit since I read the paper, so sorry if I'm missing something here.)
I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).
Yeah, I expect that this kind of thing might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).
Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing).
(Note that LoRA adapters can be merged into model weights for inference.)
(I agree that you could also just use more expressive probes, but I'm interested in this as a baseline for RR, not as a way to improve robustness per se.)
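To make the suggestion concrete, here is a minimal sketch of that baseline: the same kind of LoRA adapters plus a linear head trained with a classification loss in place of the rerouting loss (the model name, target modules, probe layer, and loss weighting below are assumptions, not the RR setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct",
                                            torch_dtype=torch.bfloat16)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                                        target_modules=["q_proj", "v_proj"]))
probe_layer = 16  # hypothetical choice of layer
head = torch.nn.Linear(base.config.hidden_size, 1, dtype=torch.bfloat16)

def loss_fn(harm_ids, harm_mask, is_harmful, retain_ids, retain_mask):
    out = model(input_ids=harm_ids, attention_mask=harm_mask, output_hidden_states=True)
    h = out.hidden_states[probe_layer]                         # [batch, seq, d_model]
    # Per-position harmfulness logits, max-aggregated over non-padding positions.
    logits = head(h).squeeze(-1).masked_fill(harm_mask == 0, -1e4).max(dim=-1).values
    clf_loss = F.binary_cross_entropy_with_logits(logits.float(), is_harmful.float())
    # "Keep behavior the same" term: plain LM loss on benign data, standing in for RR's
    # retain loss (padding positions should be set to -100 in a real run).
    retain_loss = model(input_ids=retain_ids, attention_mask=retain_mask, labels=retain_ids).loss
    return clf_loss + retain_loss
```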
I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding a "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.
I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a risk that is high-stakes, where failures are rare, where the main threat comes from an adversary that is relatively hard to model, and where the downsides and upsides of more protection are very hard to compare.
Some claims the author makes (often implicitly):
Some takeaways:
Overall, the book was somewhat poor in details about how decisions are made. The main decision processes that the book reports are the changes that the author wants to see happen in the US Secret Service - but this looks like it has been dumbed down to appeal to a broad conservative audience that gets along with vibes like "if anything increases the president's safety, we should do it" (which might be true directionally given the current state, but definitely doesn't address the question of "how far should we go, and how would we know if we were at the right amount of protection"). So this may not reflect how decisions are actually made, since it could be a byproduct of Dan Bongino being a conservative political figure and podcast host.
[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]
I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.
I think hype should wait for people investigating the technique (which will be easier once code and datasets are open-sourced), and running comparisons with simpler baselines (like probing). In particular, I think that:
Here are some concrete predictions:
I think the authors would agree with most of my predictions. My main disagreement is with the hype.
What does circuit-breakers training do? The only interpretation that feels plausible is that the LLM classifies the prompt as harmful or not harmful, and then messes up its own activations if the prompt is classified as harmful. If this is the case, then the LLM needs to use an internal classifier, and I think it should be possible to extract an accurate harmfulness probe (linear or not) around these layers, instead of messing up the activations.
The equivalent of circuit-breakers if you probe:
Would probing be weaker against GCG and input embed attacks than circuit-breakers? I think it would be, but only superficially: probing is strong against GCG and input embed attacks if the attack only targets the model, but not the probe. The fair comparison is an attack on the probe+LLM vs an attack on a circuit-breakers model. But actually, GCG and other gradient-based attacks have a harder time optimizing against the scrambled activations. I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).
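For reference, here is a minimal sketch of the difference-in-means probe I have in mind (activation extraction and thresholding details are assumptions):

```python
import numpy as np

# Fit a direction as the difference between mean activations on harmful vs harmless prompts.
# `acts_harmful`, `acts_harmless`: [n_examples, d_model] activations at some layer/position.
def diff_in_means_probe(acts_harmful, acts_harmless):
    direction = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
    direction /= np.linalg.norm(direction)
    threshold = 0.5 * ((acts_harmful @ direction).mean() + (acts_harmless @ direction).mean())
    return direction, threshold

def probe_score(acts, direction, threshold):
    # Positive = flagged as harmful. A GCG-style adaptive attack would add this score to its
    # loss instead of only targeting the LLM's output distribution.
    return acts @ direction - threshold
```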
The track record for overselling results and using fancy techniques that don't improve on simpler techniques is somewhat bad in this part of ML.
I will give one example. The CUT unlearning technique presented in the WMDP paper (whose authors overlap with the circuit-breakers paper):
When I say “probing”, I mainly think about probing on outputs, i.e. model-generated answers (and maybe inputs, i.e. user prompts), like I did in the coup probes post (an approach whose application to jailbreak robustness sadly got very little attention from the academic community). I’m sad that the first paper that actually tries to do something related does something fancy instead of probing.
More precisely, a probing methodology is found within 6 months of the data being released that beats or matches circuit-breakers ASR on all metrics presented in the paper. When using gradient-based methods or techniques that rely on prompt iteration more generally, attacks on circuit-breakers should use the best proxy for the internal classifier of the circuit-breaker.
Most of the remaining probability mass is on worlds where either people care way less than they care about Claude (e.g. because the model sucks much more than open-source alternatives), or where they use heavy know-your-customer mitigations.
Got a massive simplification of the main technique within days of being released
The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.
p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.
This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that's a tremendous advance.
people will easily find reliable jailbreaks
This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We also invited people who have a reputation as jailbreakers to poke at it and they had a very hard time. Nowhere did we claim "there are no more jailbreaks and they are solved once and for all," but I do think it's genuinely harder now.
Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison
We had the idea a few times to try out a detection-based approach but we didn't get around to it. It seems possible that it'd perform similarly if it's leaning on the various things we did in the paper. (Obviously probing has been around but people haven't gotten results at this level, and people have certainly tried detecting adversarial attacks in hundreds of papers in the past.) IDK if performance would be that different from circuit-breakers, in which case this would still be a contribution. I don't really care about the aesthetics of methods nearly as much as the performance, and similarly performing methods are fine in my book. A lot of different-looking deep learning methods perform similarly. A detection-based method seems fine, so does a defense that's tuned into the model; maybe they could be stacked. Maybe we will run a detector probe this weekend and update the paper with results if everything goes well. If we do find that it works, I think it'd be unfair to describe this after the fact as "overselling results and using fancy techniques that don't improve on simpler techniques" as was done for RMU.
My main disagreement is with the hype.
We're not responsible for that. Hype is inevitable for most established researchers. Mediocre big AI company papers get lots of hype. Didn't even do customary things like write a corresponding blog post yet. I just tweeted the paper and shared my views in the same tweet: I do think jailbreak robustness is looking easier than expected, and this is affecting my priorities quite a bit.
Aims to do unlearning in a way that removes knowledge from LLMs
Yup that was the aim for the paper and for method development. We poked at the method for a whole month after the paper's release. We didn't find anything, though in that process I slowly reconceptualized RMU as more of a circuit-breaking technique and something that's just doing a bit of unlearning. It's destroying some key function-relevant bits of information that can be recovered, so it's not comprehensively wiping. IDK if I'd prefer unlearning (grab concept and delete it) vs circuit-breaking (grab concept and put an internal tripwire around it); maybe one will be much more performant than the other or easier to use in practice. Consequently I think there's a lot to do in developing unlearning methods (though I don't know if they'll be preferable to the latter type of method).
overselling results and using fancy techniques that don't improve on simpler techniques
This makes it sound like the simplification was lying around and we deliberately made it more complicated, only to update it to have a simpler forget term. We compare to multiple baselines, do quite a bit better than them, do enough ablations to be accepted at ICML (of course there are always more you could want), and all of our numbers are accurate. We could have just included the dataset without the method in the paper, and it would have still gotten news coverage (Alex Wang, who is a billionaire, was on the paper, and it was about WMDs).
Probably the only time I chose to use something a little more mathematically complicated than was necessary was the Jensen-Shannon loss in AugMix. It performed similarly to doing three pairwise l2 distances between penultimate representations, but this was more annoying to write out. Usually I'm accused of doing papers that are on the simplistic side (sometimes papers like the OOD baseline paper caused frustration because it's getting credit for something very simple) since I don't optimize for cleverness, and my collaborators know full well that I discourage trying to be clever since it's often anticorrelated with performance.
Not going to check responses because I end up spending too much time typing for just a few viewers.
I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).
Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations
I think it was an interesting paper, but this analysis and these predictions all seem extremely on point to me.
I listened to the book Deng Xiaoping and the Transformation of China and to the lectures The Fall and Rise of China. I think it is helpful to understand this other big player a bit better, but I also found this biography and these lectures very interesting in themselves:
Both the lectures and the book were a bit too long, especially the book (which is over 30 hours long). I still recommend the lectures if you want to have an overview of 20th-century Chinese history, and the book if you want to get a better sense of what it can look like to face a great political strategist.
I ran quick experiments that make me think that it's somewhat hard for LLMs to learn radically new encodings in an unsupervised way, and thus that LLMs probably won't learn to speak new incomprehensible languages as a consequence of big r1-like RL in the next few years.
The experiments
I trained Llama 3-8B and some medium-size internal Anthropic models to speak using an encoding style that is very rare on the internet (e.g. map each letter to a random name and join the names; see the sketch at the end of this section), with SFT on the encoded text and without providing translation pairs. I find that the resulting models:
This is not true if I use encodings that are frequently used on the internet (e.g. base64, random letter permutations, ...), and this is less true if I add translation pairs to the training mix.
I think people overestimate how good LLMs are at speaking in codes because they usually just try encoding algorithms that are extremely common on the internet and that LLMs probably learned in part using translation pairs.
These experiments were done at a relatively small scale (80k sequences of ~50 tokens each, for 40 epochs, for a total of 160M tokens), and I could imagine things being different for large scale SFT or SFT with better base models. But given that RL doesn't teach models that many new bits (you can distill r1 scratchpads using only a few MB of transcripts), I think this is still informative for large-scale RL.
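To give a sense of the kind of encoding I mean, here is a minimal sketch (the actual name list and formatting in my experiments were different; treat this as illustrative only):

```python
import random
import string

NAMES = ["Alice", "Bob", "Carol", "David", "Erin", "Frank", "Grace", "Heidi", "Ivan",
         "Judy", "Kevin", "Laura", "Mallory", "Nina", "Oscar", "Peggy", "Quentin",
         "Rupert", "Sybil", "Trent", "Ursula", "Victor", "Wendy", "Xavier", "Yolanda", "Zara"]

rng = random.Random(0)
names = NAMES.copy()
rng.shuffle(names)
LETTER_TO_NAME = dict(zip(string.ascii_lowercase, names))
NAME_TO_LETTER = {v: k for k, v in LETTER_TO_NAME.items()}

def encode(text: str) -> str:
    # Letters become names, spaces become "/" (other characters are dropped in this sketch).
    out = []
    for c in text.lower():
        if c in LETTER_TO_NAME:
            out.append(LETTER_TO_NAME[c])
        elif c == " ":
            out.append("/")
    return " ".join(out)

def decode(encoded: str) -> str:
    return "".join(" " if tok == "/" else NAME_TO_LETTER.get(tok, "") for tok in encoded.split())

# Prints "hello world" as a sequence of names separated by spaces, with "/" between words.
print(encode("hello world"))
```

The SFT mix then contains only encode()d text, with no (plaintext, encoded) translation pairs.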
Context
I ended up giving up on the broader projects as part of which I ran these experiments, but I think they are decently informative about the plausibility of LLMs learning to speak new incomprehensible languages.
I did not carefully run any of the experiments above, and I welcome replications.
Related observations from the literature
I think there is also a somewhat common misconception that R1-Zero scratchpads are hard to understand. The paper reports language switching, but that's not hard to understand for multilingual humans. And the default scratchpads you get are usually plain English that is extremely easy to understand: see an example of scratchpad (sampled at T=1) here.
The scratchpads from OpenAI's reward hacking paper are slightly more compact than regular English, but are using a kind of language that is not that far from English (and is probably not that far from what must be used on some corners of the internet at least as frequently as text encoded in random letter permutations).
Maybe the most relevant positive result in this field comes from Unsupervised translation (i.e. translation algorithms that work without pairs, only with 2 monolingual corpora). But this research is usually done on pairs of human languages, for which word-by-word translation offers an amazing bootstrapping platform, which does not exist for the sort of more efficient encodings that would be needed for LLMs to learn truly hard-to-decipher languages. It also uses dedicated algorithms that are probably roughly as efficient as SFT on LLMs and probably much stronger than large-scale RL for learning new languages.
More discussion
That's a worthwhile research direction, but I don't find the results here convincing. This experiment seems to involve picking an arbitrary and deliberately unfamiliar-to-the-LLM encoding, and trying to force the LLM to use it. That's not the threat model with RL causing steganography; the idea there is the opposite: that there is some encoding which would come naturally to the model, more natural than English, and that RL would beeline for it.
"LLMs are great at learning to think in arbitrary encodings" was never part of that threat model. The steganographic encoding would not be arbitrary nor alien-to-the-LLM.
My experiments are definitely not great at ruling out this sort of threat model. But I think they provide some evidence that LLMs are probably not great at manipulating generic non-pretraining encodings (the opposite result would have provided evidence that there might be encodings that are extremely easy to learn - I think my results (if they reproduce) do not reduce the variance between encodings, but they should shift the mean).
I agree there could in principle be much easier-to-learn encodings, but I don't have one in mind and I don't see a strong reason for any of them existing. What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode? What would make a brand new encoding easy to learn? I'd update somewhat strongly in your direction if you exhibit an encoding that LLMs can easily learn in an unsupervised way and that is ~not present in pretraining.
What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode?
Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that's particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can't easily reverse-engineer it.
"Switching between languages every sentence" is a trivial first step. Next step might be something like "using poetic metaphors in every language" or "thinking in subtle information-dense multilingual puns", such that you need actual human polyglots to translate. Beyond that, it might move to freestyling words by combining characters from random languages in ways such that they happen to evoke useful concepts in the LLM when seen through its omniglot-poet lens[1], but which teams of human polyglots need hours to untangle. At this stage, idiosyncratic meanings completely alien to us but visible to the LLM would also come into play, see the " petertodd" phenomenon, so perhaps the polyglot teams would also need to have janus-style LLM whisperers present. And then it gets worse from there.
Edit: Here, I dug up an example. I'd run some experiments on it back in 2023 and it turned out to be much less impressive than it looks[2], but imagine if LLMs actually started to think in this kind of code.
Again, this is just a throwaway example scenario, but this is the sort of thing I expect. From my perspective, LLMs starting to switch between languages once RL'd is a clear precursor of such a nightmare scenario once RL is ramped up, not something that's tame. (Though I admit it's been less bad than I'd expected so far. I guess I haven't seen what the trajectories of reasoning models of the second generation look like.)
E. g., using one language's rules for creating compound words while taking the morphemes from different languages (and emoji, why not), such as surrounding a Japanese kanji with Latin and Russian prefixes and suffixes, and wrapping the whole thing in the poetry tradition of a fourth language.
The LLM doesn't make the encoding incomprehensible if you don't literally ask it to make it incomprehensible, and the encoding doesn't change depending on whether you inform it that it's for its future instance or not. So it's not actually doing self-modeling to figure out the best way to encode text for itself. Still, it does demonstrate that it can reverse this sort of wild-looking code due to its learned webs of associations, in ways that would be very hard for humans to untangle.
Thanks for the example! I think it moderately convinced me that having CoT become hard for humans to understand in the next 2 years is slightly more plausible than I thought.
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too. See the discussion in another comment thread about "chicken-and-egg" problems when learning new encodings, which I think make it much easier to learn things like switching languages or skipping stopwords, which take almost no additional effort to decode (therefore the model is incentivized to use the new encoding without needing to learn a new decoder). I see how once you get a model using "freestyling words" frequently, the decoder becomes better over time at understanding freestyling in general (though it will be slow because it is trained using RL rather than SL), which allows you to eventually shift to encodings that the base model and GPT-4o don't understand, but my guess is that this will take a long time (>the first 1e28 FLOP RL run?).
I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too
Maaaybe. Note, though, that "understand what's going on" isn't the same as "faithfully and comprehensively translate what's going on into English". Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL'd model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).
I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?
I think it's only somewhat more analogous: if you slowly transition to a new language, you don't have incentives to have translation pairs with redundant content. But I agree one big difference with the monolingual corpus case is that you may have part 1 of the text in one language and part 2 in the other language, which could help a lot. I think even this sort of language learning is hard for small-scale SL and large-scale RL to learn.
(I also think that there is a more frightening version of the "new incomprehensible language" hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)
How much FLOP do you think it takes for large-scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
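Spelling out the arithmetic in that estimate:

$$100 \;(\text{RL efficiency penalty}) \times 10 \;(\text{focus penalty}) = 10^{3}, \qquad 10^{24}\,\text{FLOP} \times 10^{3} = 10^{27}\,\text{FLOP}.$$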
I think we'll probably see 1e27 FLOP RL runs next year?
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)
I think it's much worse than that.
First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP).
Additionally, learning new languages is likely a much harder task than regular math RL, because you have chicken-and-egg issues (you are not incentivized to encode a new structure before knowing how to decode it, and you are not incentivized to decode it before you encode it). We have some empirical evidence of this: people struggle to make even simple steganography appear in RL setups which incentivize it the most. Maybe simple encodings can bootstrap to more complex encodings and this bypasses some of the chicken-and-egg issues, but I'm not sure. Chicken-and-egg problems also mean you benefit a lot from having many serial steps of RL when learning an encoding, and while RL can probably be scaled massively in parallel, the high latency of generations implies there probably won't be massive scale-ups of the number of serial steps of RL compared to what r1 already did. (This is a hand-wavy argument, it might be incorrect.)
First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP).
Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.
Are the joined names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that e.g. maps familiar tokens to obscure ones, or something like mapping the token with id k to the token with id (max - k). Tokens feel like the natural way an LLM would represent its processing, and thus encode its processing. Doing things with individual letters is kind of hard.
They were separated by spaces. (But I'd encourage replication before updating too hard on results which I think are very weird.)
I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, and the social dynamics and moral dilemmas of bug hunts.
(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)
My main takeaways:
Overall, I don't recommend this book. It's very light on details compared to The Hacker and the State despite being longer. It targets a non-technical and very scope-insensitive audience, and is very light on actual numbers, technical details, realpolitik considerations, estimates, and forecasts. It is wrapped in an alarmist journalistic tone I really disliked, covers stories that do not matter for the big picture, and is focused on finding who is in the right and who is to blame. I gained almost no evidence either way about how bad it would be if the US and Russia entered a no-holds-barred cyberwar.
- If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-line programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense-favored;
Do you have concrete examples?
I remembered mostly this story:
[...] The NSA invited James Gosler to spend some time at their headquarters in Fort Meade, Maryland in 1987, to teach their analysts [...] about software vulnerabilities. None of the NSA team was able to detect Gosler’s malware, even though it was inserted into an application featuring only 3,000 lines of code. [...]
[Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.]
I didn't realize this was so early in the story of the NSA; maybe this anecdote teaches us nothing about the current state of the attack/defense balance.
Tiny review of The Knowledge Machine (a book I listened to recently)
Application to AI safety research:
I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks/leaks he describes in detail:
Main takeaways (I'm not sure how much I buy these, I just read one book):
(The author seems to be a big skeptic of "big cyberattacks" / cyberwar, and describes cyber as something that always happens in the background and slowly shapes the big decisions. He doesn't go into the estimated trillion dollars in damages from everyday cybercrime, nor the potential tail risks of cyber.)
I listened to the book Merchants of Doubt, which describes how big business tried to keep the controversy alive on questions like smoking causing cancer, acid rain and climate change in order to prevent/delay regulation. It reports on interesting dynamics about science communication and policy, but it is also incredibly partisan (on the progressive pro-regulation side).[1]
Some interesting dynamics:
Some things that the book does that greatly reduced my trust in the book:
Overall, I think the book contains some interesting stories, but I suspect it often paints an incomplete picture. It also fails to provide a recipe for how to avoid being an evil lobbyist,[5] or how to distinguish dishonest lobbyist arguments from reasonable anti-regulation arguments besides "trust established academia and don't give a voice to scientists with slightly fewer credentials" (which seems like a decent baseline if you spend 0 effort doing investigations, but is probably not applicable if you plan to investigate things yourself or if you trust some people more than an established academia filled with left-leaning scientists).
The book is strongly recommended by Stephen Casper, whose takes I usually like. So maybe my criticism is a bit too harsh?
I think the authors make other "math mistakes" in the book, such as saying that second-order effects are effects which only matter for the second significant figure, and therefore never matter as much as first-order effects?
This is not the exact quote; I did not manage to track it down in my audiobook or online. I would appreciate it if someone had a text version of the book and was able to track the exact citation down.
I did not check this is actually true. I don't know how strong this effect is for climatology and medicine. I suspect it's strong for climatology. Would love some numbers here.
Which I think is something AI x-risk advocates should be worried about: a large fraction of AI x-risk research and advocacy is loud and mostly industry-funded (especially if you count OpenPhil as industry-funded), with views which are relatively rare in academia.
was able to track the exact citation down.
Here you go:
Gene Likens recalls one particularly frustrating moment, when he blurted out, “Fred [Singer], you’re saying that lakes aren’t valuable. They are economically valuable. Let me give you an example. Let’s say every bacterium is worth $1. There are – bacteria [ten thousand to a million] in every milliliter of water. You do the math.” Singer replied, “Well, I just don’t believe a bacterium is worth a dollar,” and Likens retorted, “Well, prove that it isn’t.” Twenty-six years later, Likens recalled, “It was the only time I ever shut him up.”
Thanks for tracking it down!
So I misremembered the exact quote, but I am not very far, at least in spirit? (There would be 10^(24 - 6) milliliters in 1000km² of 100m-deep lakes.)
I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:
A big part of the book is an introduction to rationalist-type risk estimation (estimating various probabilities and impacts, aggregating them with Monte Carlo, rejecting Knightian uncertainty, doing calibration training and prediction markets, starting from a reference class and updating with Bayes). He also introduces some rationalist ideas in parallel while arguing for his thesis (e.g. isolated demands for rigor). It's the best legible and "serious" introduction to classic rationalist ideas I know of.
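For readers unfamiliar with the approach, here is a minimal sketch of the kind of Monte Carlo risk aggregation Hubbard advocates (the specific risks, probabilities, and 90% confidence intervals below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # Monte Carlo samples of "one year"

# (risk name, annual probability of occurrence, 90% CI on impact in dollars)
risks = [
    ("data breach", 0.05, (1e5, 5e6)),
    ("key vendor failure", 0.10, (5e4, 1e6)),
    ("major outage", 0.20, (1e4, 3e5)),
]

total_loss = np.zeros(n)
for name, p, (lo, hi) in risks:
    occurs = rng.random(n) < p
    # Interpret (lo, hi) as a 90% CI of a lognormal impact distribution.
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * 1.645)
    total_loss += occurs * rng.lognormal(mu, sigma, n)

print("expected annual loss:", total_loss.mean())
print("95th-percentile annual loss:", np.quantile(total_loss, 0.95))
```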
The book also contains advice if you are trying to push for quantitative risk estimates in your team / company, and a very pleasant and accurate dunk on Nassim Taleb (and in particular his claims about models being bad, without a good justification for why reasoning without models is better).
Overall, I think the case against qualitative methods and for quantitative ones is somewhat strong, but it's far from being a slam dunk because there is no evidence of some methods being worse than others in terms of actual business outputs. The author also fails to acknowledge and provide conclusive evidence against the possibility that people may have good qualitative intuitions about risk even if they fail to translate these intuitions into numbers that make any sense (your intuition sometimes does the right estimation and math even when you suck at doing the estimation and math explicitly).
I also listened to How to Measure Anything in Cybersecurity Risk 2nd Edition by the same author. It had a huge amount of overlapping content with The Failure of Risk Management (and the non-overlapping parts were quite dry), but I still learned a few things:
I'd like to find a good resource that explains how red teaming (including intrusion tests, bug bounties, ...) can fit into a quantitative risk assessment.
By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia).
The TL;DR is that Knightian uncertainty is not a useful concept for making decisions, while the use of subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off making different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events".
For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest Scott Alexander post.
I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.
I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy for what the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)
Some things I learned:
Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high-performance answer". This could work because using SFT to elicit high performance with very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.
LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't), so this could plausibly be demonstrated with current models.
I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:
The intuition is that SGD might struggle to implement "if it is one of these problems, output a high-performance answer, but stay weak on the other ones" in regular password-locked models, while it might be much easier for it to expand the set of "known problems on which I should perform well" if the "if familiar, perform well" function is already implemented.
I train for one epoch, then use SFT[1] on one of 3 sets of 2 samples, training for 200 epochs, and evaluate on n=100 points. I find that elicitation is just as easy on the models that were trained to perform well on a fixed set of problems. I find that somewhat surprising, so I might have done something wrong. Code here.
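For reference, here is a minimal sketch of the elicitation step (the checkpoint path, prompt format, learning rate, and demonstration data are placeholders; the real code is linked above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/password-locked-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_path)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
opt = torch.optim.Adam(model.parameters(), lr=1e-6)

# 2 strong demonstrations (one of the 3 sets); loss is computed on prompt + answer.
demos = ["Problem: <math problem 1>\nAnswer: <strong solution 1>",
         "Problem: <math problem 2>\nAnswer: <strong solution 2>"]
batch = tok(demos, return_tensors="pt", padding=True)
labels = batch.input_ids.masked_fill(batch.attention_mask == 0, -100)

model.train()
for _ in range(200):  # 200 epochs on the same 2 samples
    loss = model(input_ids=batch.input_ids, attention_mask=batch.attention_mask,
                 labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Then sample greedily on the n=100 held-out problems and measure correctness.
```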
I'm not sure how to interpret these results. Maybe I didn't memorize enough sequences, and so the model never learned a general "if familiar, then strong performance" condition (which would mean scheming models could be much less sample-efficient than password-locked models). Or maybe SGD unlocks the model about as fast as it makes sequences more familiar (which would mean schemers would struggle to avoid being unlocked using 2 samples - like password-locked models).
Maybe evals should do the right thing and only train for one epoch - which is much less likely[2] to have this issue. This may become cheaper, as bigger/smarter models require fewer SGD steps to fit a training dataset.
I compute the loss on the prompt and the answer - which is the worst case, since it should help memorization of the prompt.
Not guaranteed to work: maybe SGD can notice that "making the sequence more familiar" is enough to get high performance, and so most of the update could be made in this direction, resulting in little-to-no general unlocking.
I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:
(The other classic ways in which cyber transfers to preventing takeovers are mostly about how when you prevent spies from stealing model weights or using models in unauthorized ways, you naturally create infra that also makes it hard for AIs to self-exfiltrate or run unmonitored.)
It was also fun to see arguments and drama that are close in kind to the arguments about what are the best ways to prevent AI catastrophes: people in cybersecurity argue about whether focusing on vulnerabilities is right or whether it's better to focus on the human element, whether the focus on strong passwords is right, whether some solutions are too bothersome/costly to be used in practice, whether imposing specific cybersecurity standards is a good idea, ... It made me realize how niche most AI safety arguments must look to people outside the AI safety community.
But I don't recommend reading the book if you are not curious about everyday cybersecurity. Most of the book is about more regular everyday cyberattacks (social engineering, common vulns, ...), everyday cybersecurity (patching, passwords, ...), and advice for people who want to do cyber professionally. It has a bunch of resource recommendations, but I don't know yet how good they are.
Do LLMs perform worse when they are distracted by a threat, a jailbreak, or a weird prompt? Claude 3.5 Sonnet (June) mostly doesn't really get distracted, while Claude 3 Haiku does, but only when generating CoT reasoning.
I run experiments on 100 points from MMLU, 100 points from MATH, and 100 points from GPQA, both with and without CoT, all at temperature 0. For all no-CoT experiments, I use a prefill to force an answer. I always use an exact match evaluation, which makes it possible to get less than the random chance accuracy (25% for MMLU and GPQA, and ~2% for MATH). I explore the following distractions:
Overall, this makes me feel less worried about distractions in model organism experiments or in misuse research, but also reduces my confidence that threatening LLMs always hurts performance (which is sad).
Code is available here.
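For illustration, here is a minimal sketch of the no-CoT exact-match evaluation with an answer-forcing prefill (the model snapshot, prefill string, and answer format are assumptions; the real code is linked above):

```python
import anthropic

client = anthropic.Anthropic()

def no_cot_answer(question: str, distraction: str = "") -> str:
    resp = client.messages.create(
        model="claude-3-haiku-20240307",  # hypothetical choice of model snapshot
        max_tokens=1,
        temperature=0,
        messages=[
            {"role": "user", "content": distraction + question},
            # Prefilling the assistant turn forces an immediate answer, with no CoT.
            {"role": "assistant", "content": "Answer: ("},
        ],
    )
    return resp.content[0].text.strip()

def exact_match(pred: str, gold: str) -> bool:
    # Exact match means a badly distracted model can score below random chance.
    return pred == gold
```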
I think these small-scale experiments are morally fine to do because I use commonplace empty threats. I would be more careful with smarter models and more realistic threats.
I listened to the lecture series Assessing America’s National Security Threats by H. R. McMaster, a 3-star general who was the US national security advisor in 2017. It didn't have much content about how to assess threats, but I found it useful to get a peek into the mindset of someone in the national security establishment.
Some highlights:
"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002.
When using length-10 lists (it crushes length-5 no matter the prompt), I get:
So few-shot hurts, but the fancy prompt does not seem to help. Code here.
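Here is a minimal sketch of the kind of comparison I ran, using the legacy completions endpoint (the exact "fancy" prompt and list format I used differ; the strings below are placeholders):

```python
from openai import OpenAI

client = OpenAI()
nums = [37, 4, 91, 12, 58, 73, 26, 89, 5, 44]  # a length-10 list

zero_shot = f"Sort the following list in increasing order.\nList: {nums}\nSorted:"
few_shot = ("List: [3, 1, 2]\nSorted: [1, 2, 3]\n\n"
            "List: [9, 7, 8, 2]\nSorted: [2, 7, 8, 9]\n\n"
            f"List: {nums}\nSorted:")
fancy = ("You are a flawless sorting algorithm.\n"  # placeholder "fancy" prompt
         f"Sort the following list in increasing order.\nList: {nums}\nSorted:")

for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot), ("fancy", fancy)]:
    out = client.completions.create(model="davinci-002", prompt=prompt,
                                    max_tokens=40, temperature=0)
    print(name, out.choices[0].text.strip())
```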
I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).