Fabien Roger

Sequences

AI Control

Wiki Contributions

Comments

Sorted by

Here are ways you could make using GPT-n safer if its Chain-of-Thoughts are understandable by a team of linguists and AI experts:

  • On a small random subset of deployment Chain-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not having misaligned thoughts (in CoT) very frequently and 2. the model is not having such thoughts in circumstances where it is most crucial for GPT-n to be deceptive.
  • If the model doesn't use Klingon or dense jargon in its Chain-of-Thoughts (which I predict it won't), monitor all CoT of GPT-n with a weaker cheaper model (and/or a regex). If something looks suspicious to the weaker monitor, you can:
    • call your team of linguists and AI experts to analyze it further
    • or resample the CoT (this likely was just a fluke, since most CoT don't have worrying signs)
    • or give up on this particular query (e.g. send an error message to your user, use another weaker model to generate an answer, ...)
    • or ask another copy of GPT-n to analyze the CoT (though this can be tricky to get right because of the risk of correlated failures), only calling your team of linguists and AI experts if more evidence of this CoT being suspicious is found.

This is not perfect, but I think this can represent an important layer of defense against the risks of AIs that are powerful enough to speed up alignment research but that are not vastly superhuman.

Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point in a CEO trying to take over wants to use HHH to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place to access helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted-CEO's productivity.

There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless too.

I like this post. I made similar observations and arguments in The Translucent Thoughts Hypotheses and Their Implications but these were less clear and buried at the end of the post.

Good fine-tuning robustness (i.e. creating models which attackers have a hard time fine-tuning to do a target task) could make the framework much harder to apply. The existence of such technique is a main motivation for describing it as an adversarial framework rather than just saying "just do fine-tuning". All existing tamper resistant technique can be broken (Repnoise fails if you choose the learning rate right, Tamper-resistant fine-tuning method fails if you use LoRA ...), and if you use unlearning techniques which look like that, you should really do the basic fine-tuning variations that break Repnoise and Tamper-resistant fine-tuning when evaluating your technique.

This creates a bit of FUD, but I expect the situation to be much less bad than in the jailbreak robustness literature, since I expect fine-tuning robustness to be much more offense-favored than jailbreak robustness (i.e. even playing the attacker a little gets you massive returns, whereas in jailbreak robustness it's super annoying to play the attacker properly) because in fine-tuning robustness it is much harder to patch individual attacks than in jailbreak robustness. I think good fine-tuning robustness (given full access and decent resources to the attacker) is hopeless, and this makes the evaluation of unlearning using the method described here possible.

But if fine-tuning robustness was ever successful, this would make the situation much more scary: maybe you could make your "unlearning" method model "forget" the hazardous info on most prompts, but "remember it" if there is a password in the prompt (i.e. people can't extract the information if they don't know what the password is, even with white-box access to the weights). If this were the case, there would be something horrific going on because the hazardous information would clearly still in the weights, but the method described in this post would claim there isn't. If you applied the same unlearning technique in a more straightforward way (without the password backdoor) and people didn't manage to extract the information, I would be afraid the information would still be there.

Overall, making a safety case from these techniques seems really rough. I don't see why someone would believe that (conditional on non-scheming) you have a 90% chance of getting a 20x Bayes factor against scheming. I don’t think this is just me saying “it’s very hard to get a 20x Bayes factor because that requires a lot of confidence”—I think that all the sources of uncertainty I described above are much worse than approaches to safety cases that rest on e.g. either inability or on control

In this post, when you say getting evidence against scheming, you more generally mean excluding AIs that will successfully cause doom? In particular:

  • BRT is terrible at excluding extremely cautious schemers that would do anything bad only in very weird circumstances that are very unlikely to ever happen in deployment. But most of the edge from BRT is showing that schemers would need to be extremely cautious (i.e. if BRT shows no danger, I think your P(doom|scheming) probably got a bigger update than your P(scheming)).
  • Control does not change P(scheming), and I think it's unlikely any class of technique will give us something close to a 20x Bayes factor against scheming TAI. So the post makes more sense to me if it is about Bayes factors against doom. 

I think this post makes good arguments showing that BRT can't give you a big Bayes factor against doom, and in particular I really like the more realistic example of the RSA-2048-strategy.

the returns to effort are such that other AI control techniques should have three times as much effort put into them

There are some shared things you need for BRT and other control techniques, and in particular you need good threat modeling and a good automated catastrophe detection for both inference-time monitoring and BRT. If you exclude security and things like not giving internet access to AIs unless they really need to from "other AI control techniques" (as people usually don't think about them in these terms), I think 3x between BRT (minus threat modeling and catastrophe detection) other-control (minus threat modeling and catastrophe detection) is too big. Threat modeling, BRT and inference-time monitoring were already some of the biggest effort sinks for control anyway. The other big control efforts I can think about figuring out how to use more trusted labors in the scaffolds, how to do resampling well, how to prevent input-coordination and how to prevent steg, maybe I am missing other big efforts?

I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

  • Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
  • Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers to preventing takeovers are mostly about how when you prevent spies from stealing model weights or using models in unauthorized ways, you naturally create infra that also makes it hard for AIs to self-exfiltrate or run unmonitored.)

It was also fun to see arguments and drama that are close in kind to the kind of arguments about what are the best ways to prevent AI catastrophes: people in cybersecurity argue about whether focusing on vulnerabilities is right or whether it's better to focus on the human element, whether the focus on strong passwords is right, whether some solutions are too bothersome/costly to be used in practice, whether imposing specific cybersecurity standards is a good idea, ... It made me realize how niche AI most safety arguments must look like to people outside the AI safety community.

But I don't recommend reading the book if you are not curious about everyday cybersecurity. Most of the book is about more regular everyday cyberattacks (social engineering, common vulns, ...) cybersecurity (patching, passwords, ...), and advice for people who want to do cyber professionally. It has a bunch of resource recommendations, but I don't know yet how good they are.

Thanks for your explanations!

Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.

Good point. I agree that if you don't have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice.

It is worth noting that if "training against the monitor" didn't work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against tail-risk to not get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right?

all but "doing experiments" can be done with a neural net doing chain of thought

I would have thought there is a big gap between "doing tail-risk estimation for an NN forward pass" and "doing tail-risk estimation for an NN CoT".  Argmax (the one in T=0 sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation present in the post allow you to make claims about weak model CoT. If you do layer-by-layer modeling (going through n_layers x n_sampled_tokens steps), I would be surprised if you ended up being able to show sth like "after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe" - it's not like in single-forward-pass land where there is a clear sense in which the activations should not go in certain regions. I haven't thought about it much, so maybe I am too pessimistic about dealing with the argmax?

Do you think that catastrophe detectors will be large neural networks?

If you do, how do you build a neural network (without sampling steps nor python execution steps) that implements your major hopes for getting a weak model to supervise a strong model?

  • We could give \(M_{weak}\) more time to think than \(M\), so it could examine each portion of the action in detail.
  • We could allow \(M_{weak}\) to interrogate \(M\) about the function of various parts of its action, and do detailed checks for inconsistencies in \(M\)'s answers.
  • We could allow \(M_{weak}\) to perform separate experiments to verify \(M\)'s claims about its actions, and to evaluate whether or not the actions have any side effects.
  • \(M_{weak}\) could be conservative when evaluating \(M\)'s action, flagging part of the action it did not understand as potentially catastrophic.

If we back down from such hopes that use a lot of inference-time compute, and just use a single weak-model forward pass, then it looks to me like trusted monitoring solves most (but not all) issues that estimating tail risks targets.

If you don't, do you think that it is easy to go from NN-tail-risk estimation to a more general form of tail-risk estimation? If so, I'm curious what the hope is, and if you don't, I think you should put more emphasis on it, so that people don't anchor to much on the difficulty of the easier NN-tail-risk-estimation problem, and maybe start attempting right now to solve things like the LLM-bureaucracy-tail-risk-estimation problem.

I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.

I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy to the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)

Some things I learned:

  • The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
    • If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
    • If AI takeoff is relatively slow, maybe the ability to actually make a huge amount of money selling AI in the economy may determine who ends up in front? (There are some strong disanalogies though, algorithmic progress and AI weights might be much easier to steal than chip-making abilities.)
    • China is not like the USSR: it actually has a relatively developed tech sector and high GDP. But the chip industry became an enormous interconnected beast that is hard to reproduce domestically, which means it is hard for anyone (including the US) to build a chip industry that doesn't rely on international partners. (Analysts are pretty divided on how much China can reduce its reliance on foreign chips.)
  • The US initially supported the Japanese chip industry because it wanted Japan to have strong commercial ties to the US. Japan became too good at making chips, and Taiwanese / South Korean companies were able to get support from the US (/not get penalized for massively helping their national chip champions) to reduce Japanese dominance - and now TSMC dominates. Economic policies are hard to get right... (The author sometimes says stuff like "US elites were too ideologically committed to globalization", but I don't think he provides great alternative policies.)
  • It's amazing how Intel let a massive advantage slip. It basically had a monopoly over logic chip design (Intel microprocessors, before GPUs mattered), chip architecture (x86), and a large share of logic chip manufacturing (while Japanese/Taiwan/... were dominating in other sectors, like RAM, special purpose chips, ...). It just juiced its monopoly, but tried to become a foundry and a GPU designer when it was already too late, and now it has a market cap that is 1/3rd of AMD, 1/10th of TSMC and 1/30th of Nvidia. But it's the main producer of chips in the US, it's scary if the US bets on such a company...
  • China might be able to get Taiwan to agree to things like "let TSMC sell chips to China" or "let TSMC share technology with Chinese companies".
    • I underestimated the large space of possible asks China could care about that are not "get control over Taiwan".
    • I will continue to have no ability to predict the outcome of negotiations, the dynamics are just too tricky when players are so economically dependent on all the other players (e.g. China imports ~$400B worth of chips per year, 13% of all its imports).

Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high performance answer". This could work because using SFT to elicit high performance using very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.

LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't.), so this could plausibly demonstrated with current models.

I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:

  • ~7k problem statements + pwd --> stronger answer
  • ~7k problem statements --> weak answer
  • {fixed set of n problems --> strong answer} repeated m times

The intuition is that SGD might struggle to implement "if it is one of these problems, output a high performance answer, but stay weak on the other one" in regular password-locked models, while it might be much easier for it to expand the set of "known problems on which I should perform well" if the "if familiar, perform well" function is already implemented.

I train for one epoch, and then use SFT[1] on one of 3 sets of 2 samples, and train for 200 epochs, and eval on n=100 points. I find that elicitation is just as easy on the models that were trained to perform well on a fixed set of problems. I find that somewhat surprising, so I might have done something wrong. Code here.

I'm not sure how to interpret these results. Maybe I didn't memorize enough sequences, and so the model never learned a general "if familiar, then strong performance" condition (which would mean scheming model could be much less sample efficient than pwd-locked models). Or maybe SGD unlocks the model about as fast as it makes sequences more familiar (which would mean schemer would struggle to avoid being unlocked using 2 samples - like password-locked models).

Maybe evals should do the right thing and only train for one epoch - which is much less likely[2] to have this issue. This may become cheaper, as bigger/smarter models require fewer SGD steps to fit a training dataset.

  1. ^

    I compute the loss on the prompt and the answer - which is the worst case, since it should help memorization of the prompt.

  2. ^

    Not guaranteed to work: maybe SGD can notice that "making the sequence more familiar" is enough to get high performance, and so most of the update could be made in this direction, resulting in little-to-no general unlocking.

Load More