Fabien Roger


I listened to the book Deng Xiaoping and the Transformation of China and to the lectures The Fall and Rise of China. I think it is helpful to understand this other big player a bit better, but I also found this biography and these lectures very interesting in themselves: 

  • The ceiling for political skill is very high. In particular, Deng's political skills are extremely impressive (according to what the book describes):
    • He dodges bullets all the time to avoid falling into total disgrace (e.g. by avoiding being too cocky when he is in a position of strength, by taking calculated risks, and by doing simple things like never writing down his thoughts)
    • He makes amazing choices of timing, content and tone in his letters to Mao
    • While under Mao, he solves tons of hard problems (e.g. reducing factionalism, starting modernization) despite the enormous constraints he works under
    • After Mao's death, he helps society make drastic changes without going head-to-head against Mao's personality cult
    • Near his death, despite being out of office, he salvages his economic reforms through a careful political campaign
    • According to the lectures, Mao is also a political mastermind who pulls off gaining and keeping power despite terrible odds. The Communists were really not supposed to win the civil war (their army was minuscule, and if it weren't for weird WW2 dynamics that they played masterfully, they would have lost by a massive margin), and Mao was really not supposed to be able to remain powerful until his death despite the Great Leap and the success of the reforms that followed it.
    • --> This makes me appreciate what it is like to have extremely strong social and political skills. I often see people's scientific, mathematical or communication skills being praised, so it is interesting to remember that other skills exist and also have a high ceiling. I am not looking forward to the scary worlds where AIs have these kinds of skills.
  • Debates are weird when people value authority more than arguments. Deng's faction after Mao's death rallied behind the paper Practice is the Sole Criterion for Testing Truth to justify rolling out things Mao did not approve of (e.g. markets, pay as a function of output, elite higher education, ...). I think it is worth a quick skim. It is very surprising how a text that defends a position so obvious to the Western reader does so by relying entirely on the canonical words and actions of Mao and Marx, without making any argument on the merits. It makes you wonder if you have similar blind spots that will look silly to your future self.
  • Economic growth does not prevent social unrest. Just because the pie grows doesn't mean you can easily make everybody happy. Some commentators expected the CCP to be significantly weakened by the 1989 protests, and without military action that might have happened. By 1989, China's GDP had been growing at around 10% per year for a decade, and it would keep growing at that pace for another ten years.
  • (Some) revolutions are horrific. They can go terribly wrong, both because of mistakes and conflicts:
    • Mistakes: the Great Leap is basically well explained by mistakes: Mao thought that engineers were useless and that production could increase without industrial centralization and without individual incentives. It turns out he was badly wrong. He mistakenly distrusted people who warned him that the reported numbers were inflated. And so millions died. Large changes are extremely risky when you don't have good enough feedback loops, and you can easily cause catastrophe without bad intentions. (~according to the lectures)
    • Conflicts: the Cultural Revolution was basically Mao using his cult of personality to regain power by leveraging the youth to bring down the old CCP officials and their supporters while making sure the Army didn't intervene (and then sending the youth who brought him back to power to the countryside) (~according to the lectures)
  • Technology is powerful: if you dismiss the importance of good scientists, engineers and other technical specialists, a bit like Mao did during the Great Leap, your dams will crumble, your steel will be unusable, and people will starve. I think this is an underrated fact (at least in France) that should make most people studying or working in STEM proud of what they are doing.
  • Societies can be different. It is easy to think that your society is the only one that can exist. But in the society that Deng inherited:
    • People were not rewarded based on their work output, but based on the total output of groups of 10k+ people
    • Factory managers were afraid of focusing too much on their factory's production
    • Production targets were set not based on demand and prices, but based on state planning
    • Local authorities collected taxes and exploited their position to extract resources from poor peasants
    • ...
  • Governments close to you can be your worst enemies. USSR-China relations were often much worse than US-China ones. This was very surprising to me. But I guess that having your neighbor push for reform while you push for radicalism, dismantle a personality cult like the one you hope will survive for centuries, and mass troops along your border because it is (justifiably?) afraid you'll do something crazy really doesn't make for great relationships. There is something powerful in the fear that an entity close to you sets a bad example for your people.
  • History is hard to predict. The author of the lectures ends them by making some terrible predictions about what would happen after 2010, such as expecting US-China relations to ease and expecting China to become more democratic before 2020. He did not express much confidence in these predictions, but it is still surprising to see him so directionally wrong about where China was headed. The author also acknowledges past failed predictions, such as about the outcome of the 1989 protests.
  • (There could have been lessons to be drawn about how great markets are, but these books are not great resources on the subject. In particular, they do not provide the elements needed to weigh the advantages of prosperity against the problems of markets (inflation, uncertainty, inequalities, changes in values, ...) that caused so much turmoil under Deng and his successors. My guess is that markets were obviously net positive given how bad the situation was under Mao and how the USSR failed to create prosperity, but this is mostly going off vague historical vibes, not based on the data from these resources.)

Both the lectures and the book were a bit too long, especially the book (which is over 30 hours long). I still recommend the lectures if you want to have an overview of 20th-century Chinese history, and the book if you want to get a better sense of what it can look like to face a great political strategist.

Here are the 2024 AI safety papers and posts I like the most.

The list is very biased by my taste, by my views, by which people had time to argue to me that their work is important, and by which papers were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.

Important ideas - Introduces at least one important idea or technique.

★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled) 
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI; Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research) and (Preventing model exfiltration with upload limits) 
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?) 
★ Board games are a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network; Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments) 
★ Micro vs macro control protocols (Adaptive deployment of untrusted LLMs reduces distributed threats)
★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems) 
★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits) 
★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)

Surprising findings - Presents some surprising facts about the world

★★ A surprisingly effective way to make models drunk (Mechanistically Eliciting Latent Behaviors in Language Models) 
★★ A clever initialization for unsupervised explanations of activations (SelfIE) 
★★ Transformers are very bad at single-forward-pass multi-hop reasoning (Yang 2024, Yang 2024, Balesni 2024, Feng 2024)
★ Robustness for ViT is not doomed because of low transfer (When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?) 
★ Unlearning techniques are not even robust to changing how questions are framed (Eight methods to evaluate robust unlearning in LLMs) 
★ For some tasks, OOCR is surprisingly good (Connecting the Dots) 
★ Nice emergence scaling laws with fine-tuning (Predicting Emergent Capabilities by Finetuning)
★ Fine-tuning robustness is much worse than the papers report it to be (On Evaluating the Durability of Safeguards for Open-weight LLMs) 

Good demos - Is a good illustration of some previous, important idea. Bonus points (✪) when the experimental results are especially interesting.

★★★ Alignment faking persisting through safety training (Sleeper Agents) 
★★✪ Alignment faking emerging plausibly naturally (Alignment faking in large language models)
★✪ The bad things that happen when you tell AIs to follow a goal or threaten to make them less useful (Frontier Models are Capable of In-context Scheming) 
★✪ Minor-to-major misgeneralization in reward-hackable settings (Sycophancy to subterfuge)
★ More realistic backdoor triggers (Future Events as Backdoor Triggers) 
★ Advanced backdoor behaviors (AdaptativeBackdoor: Backdoored Language Model Agents that Detect Human Overseers) 
★ Realistic reward hacking in RLHF (Language models learn to mislead humans via RLHF)
★ Reward hacking plausibly naturally resulting in LLM deception (On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback) 
★ LLMs already have non-trivial situational awareness (SAD)

Important empirical results and datasets - Runs an experiment whose results are informative and/or a useful benchmark

★★ How good are probes against adversarial attacks that target them, and many other cool related results (including Obfuscated adversarial training, which looks promising to me) (Obfuscated Activations Bypass LLM Latent-Space Defenses)
★★ A realistic generalization eval for adversarial robustness and an evaluation of probing (Improving Alignment and Robustness with Circuit Breakers; HarmBench)
★★ Measurement of the efficiency of SFT and RL vs sandbagging (Stress-Testing Capability Elicitation With Password-Locked Models) 
★★ The first RL run that results in natural emergence of steganography (Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs) 
★★ Tracking what AI labs are doing (Zach Stein-Perlman’s posts)
★ A first attempt at measuring whether AIs are smart enough to pursue scary scheming strategies (Subversion Strategy Eval)
★ Head-to-head comparisons to truth probes (How well do truth probes generalise?) 
★ Head-to-head comparisons of black-box jailbreaking-robustness techniques: (Jailbreak Defense in a Narrow Domain) 
★ Latent adversarial training is decent (Defending Against Unforeseen Failure Modes with Latent Adversarial Training, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs)
★ Measuring how easy it is to poison pretrain data (Persistent pre-training poisoning of LLMs) 
★ A good dataset for unlearning (The WMDP Benchmark) 
★ Better SAEs and SAE evals (Scaling and evaluating sparse autoencoders) 
★ What happens when you train LLMs to 1. maximize GSM8K accuracy and 2. produce GSM8K answers that are incorrect but look correct (Prover-Verifier Games improve legibility of LLM outputs) 
★ A good persuasion capability eval (Sabotage Evaluations for Frontier Models) 
★ An eval of how good Sonnet 3.5 is at disabling a monitor in subtle ways (A Toy Evaluation of Inference Code Tampering)
★ An overview of how well debate and baselines work across a wider range of settings (On scalable oversight with weak LLMs judging strong LLMs)
★ Claude.ai is used for roughly what you would expect (Clio)
★ Relatively hard and relevant capability benchmarks (RE-Bench, SWE-Bench)
★ And all the big dangerous capability evals…

Papers released in 2023 and presented at 2024 conferences, like AI Control: Improving Safety Despite Intentional Subversion, Weak-to-Strong Generalization, or Debating with More Persuasive LLMs Leads to More Truthful Answers, don't count.

This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers' ideas and shortcomings.

I think that prior to this paper, the discussion around scheming was pretty confusing: it was spread across many posts that were not all specifically about scheming, and it was full of pretty bad arguments. This paper fixed that by bringing together most (all?) of the main considerations for and against expecting scheming to emerge.

I found this helpful for clarifying my thinking around the topic; it made me more confident in my focus on AI control and less confused when I worked on the Alignment faking paper.

It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive overstatements or understatements.

I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but it is currently the best resource on the conceptual arguments for and against scheming propensity.

I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.

We made an important mistake in the dataset creation process for random birthdays. We will rerun our experiments on this dataset and release an update. Early results suggest that unlearning on this dataset works relatively well, which suggests that experimenting with unlearning synthetic facts that a model was fine-tuned on might not be a reliable way of studying unlearning.

I think this is a valuable contribution. I used to think that Demix-like techniques would dominate in this space because in principle they could achieve a close-to-zero alignment tax, but actually absorption is probably crucial, especially in large pre-training runs where models might learn from very limited amounts of mislabeled data.

I am unsure whether techniques like gradient routing can ever impose a <10x alignment tax, but I think a lot can be done here (e.g. by combining Demix and gradient routing, or maybe by doing something cleaner, though I don't know what that would look like), and I would not be shocked if techniques that descend from gradient routing became essential components of 2030-safety.

This post describes a class of experiments that has proved very fruitful since it was released. I think this post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on this post somewhat misunderstood what the authors meant by model organisms), but I think it is valuable for understanding the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).

This post is a great explainer of why prompt-based elicitation is insufficient, why iid-training-based elicitation can be powerful, and why RL-based elicitation is powerful but may still fail. It also has the merit of being relatively short (which might not have been the case if someone else had introduced the concept of exploration hacking). I refer to this post very often.

Here are ways you could make using GPT-n safer if its Chains-of-Thought are understandable by a team of linguists and AI experts:

  • On a small random subset of deployment Chain-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not having misaligned thoughts (in CoT) very frequently and 2. the model is not having such thoughts in circumstances where it is most crucial for GPT-n to be deceptive.
  • If the model doesn't use Klingon or dense jargon in its Chains-of-Thought (which I predict it won't), monitor all CoT of GPT-n with a weaker, cheaper model (and/or a regex). If something looks suspicious to the weaker monitor, you can (see the sketch after this list):
    • call your team of linguists and AI experts to analyze it further
    • or resample the CoT (this likely was just a fluke, since most CoT don't have worrying signs)
    • or give up on this particular query (e.g. send an error message to your user, use another weaker model to generate an answer, ...)
    • or ask another copy of GPT-n to analyze the CoT (though this can be tricky to get right because of the risk of correlated failures), only calling your team of linguists and AI experts if more evidence of this CoT being suspicious is found.

This is not perfect, but I think this can represent an important layer of defense against the risks of AIs that are powerful enough to speed up alignment research but that are not vastly superhuman.
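
To make the monitoring-and-escalation logic above concrete, here is a minimal sketch of what such a two-tier CoT monitoring loop could look like. Everything in it is illustrative: generate_with_cot, weak_monitor_score, and escalate_to_human_experts are hypothetical placeholders for real model calls and a real review queue, and the threshold, audit rate, and resample count are made-up values.

```python
import random

# Toy stand-ins so the sketch runs end to end; in practice these would be
# real model calls and a real expert review queue (all names are hypothetical).
def generate_with_cot(model, query):
    cot = f"[{model}] thinking about: {query}"
    answer = f"[{model}] answer to: {query}"
    return cot, answer

def weak_monitor_score(cot):
    # Cheap monitor: a weaker model and/or regexes; here just a keyword check.
    suspicious_markers = ["exfiltrate", "hide this from", "disable the monitor"]
    return 1.0 if any(m in cot.lower() for m in suspicious_markers) else 0.0

def escalate_to_human_experts(query, cot):
    print(f"ESCALATED for human review: {query!r}")

SUSPICION_THRESHOLD = 0.8   # made-up value
AUDIT_RATE = 0.001          # fraction of benign-looking CoTs still spot-checked by humans
MAX_RESAMPLES = 2           # most flagged CoTs are flukes, so resample before giving up

def answer_query(untrusted_model, trusted_fallback, query):
    """Two-tier CoT monitoring: a cheap monitor on every CoT, humans on a small
    random subset plus anything the cheap monitor flags repeatedly."""
    cot = ""
    for _ in range(MAX_RESAMPLES + 1):
        cot, answer = generate_with_cot(untrusted_model, query)
        if weak_monitor_score(cot) < SUSPICION_THRESHOLD:
            if random.random() < AUDIT_RATE:
                escalate_to_human_experts(query, cot)  # routine spot-check
            return answer
    # Still suspicious after resampling: escalate and fall back to a weaker trusted model.
    escalate_to_human_experts(query, cot)
    _, fallback_answer = generate_with_cot(trusted_fallback, query)
    return fallback_answer

print(answer_query("gpt-n", "weak-trusted-model", "Summarize this experiment log."))
```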

Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place governing access to helpful-only models, which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted CEO's productivity.

There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents, for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless there too.
