LLM support for writing LessWrong posts: virtual comments.
Back in August I discussed with Rafe & Oliver a bit about how to integrate LLMs into LW2 in ways which aren't awful and which encourage improvement---particularly using the new 'prompt caching' feature. To summarize one idea: we can use long-context LLMs with prompt caching to try to simulate various LW users of diverse perspectives to write useful feedback on drafts for authors.
(Prompt caching (eg) is the Transformer version of the old RNN hidden-state caching trick, where you run an input thro...
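A minimal sketch of how the virtual-comments idea could look with prompt caching, assuming the Anthropic SDK's `cache_control` content blocks (the model name, corpus source, and prompt wording are placeholders of mine, not LW's actual implementation):

```python
# Hypothetical sketch: simulate a LW user commenting on a draft, with their long
# corpus of past writing marked for prompt caching so that repeated drafts reuse
# the cached prefix. Not LessWrong's actual implementation.
import anthropic

client = anthropic.Anthropic()

def virtual_comment(persona_name: str, persona_corpus: str, draft: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",   # placeholder model choice
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": f"You write feedback on LessWrong drafts in the voice and "
                     f"perspective of the user {persona_name}."},
            # The long corpus goes in a cached block: later calls with the same
            # prefix pay only for the (short) draft and the generated comment.
            {"type": "text",
             "text": persona_corpus,
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user",
                   "content": f"Draft post:\n\n{draft}\n\n"
                              "Write the comment this user would plausibly leave."}],
    )
    return response.content[0].text
```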
Yeah the LW team has been doing this sort of thing internally, still in the experimental phase. I don't know if we've used all the tricks listed here yet.
An example snippet of a rogue internal deployment threat model that I find fairly compelling:
What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?
Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.
Project proposal: No-holds-barred jailbreak attacks and defenses for control
Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?
The blue team can try a variety of strategies here.
(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)
When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely o...
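As a toy illustration of why the auditable proportion is the bottleneck (my numbers and fake monitor scores, not anything from the linked post): with a fixed audit budget, the audit threshold is just a high quantile of the monitor's score distribution on ordinary actions, and what matters is how much of the attack-score mass clears it.

```python
# Toy illustration of an audit budget: a 1% budget fixes a suspiciousness threshold
# at the 99th percentile of benign monitor scores; an attack is investigated only
# if its score clears that threshold. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
benign_scores = rng.normal(0.0, 1.0, size=100_000)   # monitor scores on normal actions
attack_scores = rng.normal(2.5, 1.0, size=1_000)     # hypothetical scores on attacks

audit_budget = 0.01                                   # fraction of actions we can audit
threshold = np.quantile(benign_scores, 1 - audit_budget)
caught = (attack_scores > threshold).mean()

print(f"audit threshold: {threshold:.2f}; attacks that get audited: {caught:.1%}")
```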
The term "AI misuse" encompasses two fundamentally different threat models that deserve separate analysis and different mitigation strategies:
Technical measures to prevent users from using the AI for particular tasks don’t help against the threat of the lab CEO trying to use the AI for those harmful tasks
Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.
If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in plac...
Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who a...
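The reallocation argument can be written down as an equal-marginal-returns condition (my notation, not the author's):

```latex
% Shift labor between alignment (L_a) and control (L_c) until the marginal
% reductions in P(doom) are equal; otherwise moving a researcher to the
% higher-marginal-value area reduces P(doom) further.
\[
  \frac{\partial P(\mathrm{doom})}{\partial L_a}
  \;=\;
  \frac{\partial P(\mathrm{doom})}{\partial L_c}
\]
```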
Hilariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.
A theory of how alignment research should work
(cross-posted from danielfilan.com)
Epistemic status:
Maybe obvious to everyone but me, or totally wrong (this doesn't really grapple with the challenges of working in a domain where an intelligent ...
When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:
basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities
I recently read "Targeted manipulation and deception emerge when optimizing LLMs for user feedback."
All things considered: I think this paper oversells its results, probably in order to advance the author(s)’ worldview or broader concerns about AI. I think it uses inflated language in the abstract and claims to find “scheming” where there is none. I think the experiments are at least somewhat interesting, but are described in a suggestive/misleading manner.
The title feels clickbait-y to me --- it's technically descriptive of their findings, but...
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
...No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
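For concreteness, the vanilla REINFORCE update makes the "per-datapoint learning rate multiplier" reading explicit (standard textbook form, not quoted from the comment):

```latex
% REINFORCE: the return R(\tau) enters only as a scalar multiplying the
% log-probability gradient of that trajectory, i.e. a per-datapoint
% learning-rate multiplier, not an objective the policy is shown.
\[
  \theta \;\leftarrow\; \theta \;+\; \alpha \, R(\tau)\, \nabla_\theta \log \pi_\theta(\tau)
\]
```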
"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.
...More realistic evaluations are difficult without risking actual sabotage. One difficulty of running realistic evaluations is presenting realistic attack surfaces. Perhaps the hardest and most important attack surface to accurately model is the suspiciousness of expert humans in high-stakes decisions. To test a given set of real-world anti-sabotage mitigations, it’s tempting to introduce some sort of “penetration testing” or red-teaming, in which models are instructed to attempt to harmlessly but verifiably sabotage real decisions in a revers
OK, good point, I should have said "From Anthropic" or something like that.
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".
As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.
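Spelling out the arithmetic with illustrative numbers of my own: a team costing $C$ per year that produces $X$ units of research would, with the AIs, cost under $2C$ and produce roughly $30X$, i.e. at least a ~15x gain in research output per dollar.

```latex
% Illustrative arithmetic for the "30x speedup at <2x marginal cost" operationalization:
\[
  \frac{30X / (2C)}{X / C} \;=\; 15,
  \qquad \text{and more than } 15\times \text{ if the cost increase is below } 2C.
\]
```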
With these caveats:
The fact that an AI arms race would be extremely bad does not imply that rising global authoritarianism is not worth worrying about (and vice versa)
I am someone who is worried both about AI risks (from loss of control, and from war and misuse/structural risks) and about what seems to be a 'new axis' of authoritarian threats cooperating in unprecedented ways.
I won't reiterate all the evidence here, but these two pieces and their linked sources should suffice:
...Two thoughts about the role of quining in IBP:
You should update by ±1% on AI doom surprisingly frequently
This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2=2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many due to heavy-tailedness in the distribution concentrating the updates in fewer events, and ...
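A quick Monte Carlo check of the 2500-step claim (illustrative code of mine, not from the post): a simple ±1% random walk started at 50% and absorbed at 0% or 100% takes about 50 × 50 = 2500 steps on average, which spread over 10 years (~520 weeks) is roughly 4–5 one-percent updates per week.

```python
# Monte Carlo check: a +/-1% random walk from 50%, absorbed at 0% or 100%,
# takes about 50*50 = 2500 steps on average (the standard gambler's-ruin result).
import random

def steps_to_absorption(start=50, lo=0, hi=100):
    p, steps = start, 0
    while lo < p < hi:
        p += random.choice((-1, 1))
        steps += 1
    return steps

trials = 2_000
mean_steps = sum(steps_to_absorption() for _ in range(trials)) / trials
print(f"mean steps: {mean_steps:.0f}")                            # ~2500
print(f"updates per week over 10 years: {mean_steps / 520:.1f}")  # ~4-5
```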
I talked about this with Lawrence, and we both agree on the following:
I have some long comments I can't refind now (weirdly) about the difficulty of investing based on AI beliefs (or forecasting in general): similar to catching falling knives, timing is all-important and yet usually impossible to nail down accurately; specific investments are usually impossible if you aren't literally founding the company, and indexing 'the entire sector' definitely impossible. Even if you had an absurd amount of money, you could try to index and just plain fail - there is no index which covers, say, OpenAI.
Apropos, Matt Levine comments on o...
"Gwern, why don't you just buy an AI-themed ETF and 'buy the whole sector' if investing in individual stonks is so hard but you're optimistic about its long-term value?"
"How to Lose Money on the World’s Most Popular Investment Theme: Pity the investors in the three artificial-intelligence-themed ETFs that managed to lose money this year" (mirror):
...There are lots of embarrassing ways to lose money, but it is particularly galling to lose when you correctly identify the theme that will dominate the market and manage to buy into it at a good moment.
Pity the i
on a call, i was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices:
I may have slipped into a word game... are we "training against the [interpretability] detection method" or are we "providing feedback away from one kind of algorithm and towards another"? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?
This is why we need empirics.
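To make the distinction concrete, here is one hedged sketch of what activation-level feedback could look like mechanically (my construction, not the proposal from the call), assuming a Hugging Face-style causal LM and a fixed linear probe direction:

```python
# Hedged sketch of "activation-level feedback": alongside the ordinary LM loss,
# penalize how strongly a fixed linear probe (e.g. a deception detector) fires
# on a chosen hidden layer. Illustrative construction only; assumes a Hugging
# Face causal LM whose forward pass accepts `output_hidden_states` and `labels`.
import torch

def activation_feedback_loss(model, batch, probe_direction, layer_idx, coeff=0.1):
    outputs = model(**batch, output_hidden_states=True, labels=batch["input_ids"])
    hidden = outputs.hidden_states[layer_idx]        # (batch, seq, d_model)
    probe_score = hidden @ probe_direction           # how strongly the probe fires
    penalty = torch.relu(probe_score).mean()         # push activations off the probe
    return outputs.loss + coeff * penalty            # LM loss + activation-level feedback
```

Whether gradient pressure like this reads as "training against the detection method" or "feedback toward a different algorithm" is exactly the question the empirics would need to settle.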
Training Attack Policies for Control Evaluations via Prover-Verifier Games (aka Control with Training)
By @Ansh Radhakrishnan and @Buck
Credit goes to Buck for most of these ideas – most of this is written by Ansh, with some contributions from Buck. Thanks to John Schulman, Fabien Roger, Akbir Khan, Jiaxin Wen, Ryan Greenblatt, Aengus Lynch, and Jack Youstra for their comments.
The original AI control paper evaluates the robustness-to-intentional-subversion of safety techniques by having a red team construct an attack policy (a drop-in replac...
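For orientation, here is a toy rendering of that evaluation loop (a simplification in the spirit of the original control paper, not this post's prover-verifier training method): drop the attack policy in for the untrusted model, run the protocol with a trusted monitor and an audit threshold, and estimate safety as the fraction of simulated deployments in which no attack succeeds unaudited.

```python
# Toy control evaluation: the red team's attack policy replaces the untrusted
# model; safety is the fraction of simulated deployments where no attack slips
# past the trusted monitor's audit threshold. All numbers are illustrative, and
# the deployment ends at the first attack attempt as a simplification.
import numpy as np

rng = np.random.default_rng(0)

def deployment(n_tasks=2_000, attack_rate=0.005, audit_budget=0.02):
    benign = rng.normal(0.0, 1.0, n_tasks)                # monitor scores on benign work
    threshold = np.quantile(benign, 1 - audit_budget)     # what the budget lets us audit
    for _ in range(n_tasks):
        if rng.random() < attack_rate:                    # attack policy chooses to attack
            suspicion = rng.normal(1.5, 1.0)              # attacks look somewhat suspicious
            return suspicion > threshold                  # audited -> caught -> safe
    return True                                           # no attack attempted -> safe

safety = np.mean([deployment() for _ in range(1_000)])
print(f"estimated safety: {safety:.0%}")
```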
AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind
Some considerations:
The high-level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams.
Thanks for writing the post!
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wantin...
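A toy rendering of the definition (my construction, not the author's formalization): the policy's action preferences are a sum of shard contributions, each shard gated by its own context feature, so either shard, both, or neither can be active at once.

```python
# Toy shard-theoretic policy: two "motivational circuits" (a food shard and a
# social shard), each gated by its own context feature, contribute additively
# to action values. Illustrative construction only.
import numpy as np

ACTIONS = ["seek_food", "seek_friends", "rest"]

def shard_policy(context):
    values = np.zeros(len(ACTIONS))
    if context.get("hungry"):                  # food shard activates on hunger cues
        values += np.array([2.0, 0.0, 0.0])
    if context.get("lonely"):                  # social shard activates independently
        values += np.array([0.0, 2.0, 0.0])
    values += np.array([0.0, 0.0, 0.5])        # weak default toward resting
    return ACTIONS[int(np.argmax(values))]

print(shard_policy({"hungry": True, "lonely": True}))   # both shards active at once
```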
Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!
AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other
This confuses me.
I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn't two full motivational circuits, but it kind of is two motivational circuits.
Apply to the "Team Shard" mentorship program at MATS
In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing[1] to unsupervised capability elicitation. If you're theory-minded, maybe you'll help us formalize shard theory itself.
...Discovering qualitatively new techniques
Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Additional work has reinforced the promise of this technique, and steering vectors have become a sma
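A hedged sketch of the basic activation-addition recipe (using GPT-2-small rather than GPT-2-XL for brevity; the layer, coefficient, and contrast prompts are arbitrary choices of mine, and details differ from the paper): extract the difference of residual-stream activations for two contrast prompts at one layer, then add it to that layer's output during generation via a forward hook.

```python
# Hedged sketch of activation addition ("steering vectors") at runtime, in the
# spirit of the GPT-2-XL work but with smaller GPT-2 and arbitrary hyperparameters.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
LAYER, COEFF = 6, 5.0

@torch.no_grad()
def residual_at(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1, :]                # last-token residual after block LAYER

steering = COEFF * (residual_at(" Love") - residual_at(" Hate"))

def hook(module, inputs, output):
    # Add the steering vector to the block's hidden states at every position.
    if isinstance(output, tuple):
        return (output[0] + steering,) + output[1:]
    return output + steering

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I think dogs are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```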