evhub's Shortform

evhub

evhub's Shortform — AI Alignment Forum

evhub's Shortform

by evhub

11th Jun 2022

1 min read

272

9

This is a special post for quick takes by evhub. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

53 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:08 PM

[-]evhub2y*523

I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.^[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anything private.

You're welcome to reply with "Anthropic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎

[-]Daniel Kokotajlo2y*3332

Thanks for asking! Off the top of my head, would want to think more carefully before finalizing & come up with more specific proposals:

Adopt some of the stuff from here https://time.com/collection/time100-voices/7086285/ai-transparency-measures/ e.g. the whistleblower protections, the transparency about training goal/spec.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it's important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it's similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit -- ideally in a legally binding and literally-hard-to-stop way -- to publishing it all with a 10-year delay.
Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlfPCBA/edit?usp=sharing
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world "this is serious people, we think that there's a good chance the current trajectory leads to takeover by misaligned AIs, we aren't just saying this to hype anything, we really believe it" and/or "we are happy to give up our personal wealth, power, etc. if that's what it takes to get [policy package] passed." A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don't know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don't do what OpenAI did with o1). Doesn't have to be all of it, just has to be enough -- e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of "Machines of loving grace." Or better yet, do a more serious version of "Machines of loving grace" that responds to objections like "but how will you ensure that you don't hand over control of the datacenters to AIs that are alignment faking rather than aligned" and "but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won't cause the models to be misaligned and then potentially alignment-fake)" and "What about bad humans in charge of Anthropic? Are we just supposed to trust that y'all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn't become dictators if they wanted to?" and "what will the goals/values/spec/constitution be exactly?" and "how will that be decided?"

[-]Daniel Kokotajlo2y111

Another idea: "AI for epistemics" e.g. having a few FTE's working on making Claude a better forecaster. It would be awesome if you could advertise "SOTA by a significant margin at making real-world predictions; beats all other AIs in prediction markets, forecasting tournaments, etc."

And it might not be that hard to achieve (e.g. a few FTEs maybe). There are already datasets of already-resolved forecasting questions, plus you could probably synthetically generate OOMs bigger datasets -- and then you could modify the way pretraining works so that you train on the data chronologically, and before you train on data from year X you do some forecasting of events in year X....

Or even if you don't do that fancy stuff there are probably low-hanging fruit to pick to make AIs better forecasters.

Ditto for truthful AI more generally. Could train Claude to be well-calibrated, consistent, extremely obsessed with technical correctness/accuracy (at least when so prompted)...

You could also train it to be good at taking people's offhand remarks and tweets and suggesting bets or forecasts with resolveable conditions.

You could also e.g. have a quarterly poll of AGI timelines and related questions of all your employees, and publish the results.

[-]ryan_greenblatt2y72

My current tenative guess is that this is somewhat worse than other alignment science projects that I'd recommend at the margin, but somewhat better than the 25th percentile project currently being done. I'd think it was good at the margin (of my recommendation budget) if the project could be done in a way where we think we'd learn generalizable scalable oversight / control approaches.

[-]habryka2y*3155

Sure, here are some things:

Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertised publicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.

I can probably think of some more.

[-]TsviBT2y813

I'd add:

Support explicit protections for whistleblowers.

Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGIwhi

I'll echo this and strengthen it to:

... call for policymakers to stop the development of AGI.

[-]Ted Sanders2y*13-6

One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.

Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).

At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.

Getting all the major model providers to disable prefilling feels like a plausible 'race to top' equilibrium. The longer there are defectors from this equilibrium, the likelier that everyone gives up and serves models in less safe configurations.

Just my opinion, though. Very open to the counterargument that prefilling doesn't meaningfully extend potential harms versus non-prefill jailbreaks.

(Edit: To those voting disagree, I'm curious why. Happy to update if I'm missing something.)

[-]evhub1y50

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

[-]Jan_Kulveit2y1321

Fund independent safety efforts somehow, make model access easier. I'm worried currently Anthropic has systemic and possibly bad impact on AI safety as a field just by the virtue of hiring so large part of AI safety, competence weighted. (And other part being very close to Anthropic in thinking)

To be clear I don't think people are doing something individually bad or unethical by going to work for Anthropic, I just do think
-environment people work in has a lot of hard to track and hard to avoid influence on them
-this is true even if people are genuinely trying to work on what's important for safety and stay virtuous
-I also do think that superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don't see) stuff (i.e. this is not about "do I trust Dario?")

[-]ryan_greenblatt1y60

As discussed in How will we update about scheming?:

While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.

I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they expect would suffice would be useful.

Something similar on state proof security would be useful as well.

I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterially from a variety of perspectives.

[-]Chris_Leong1y32

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors.

I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.

[-]evhub1y4232

Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!
The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get the misaligned behavior).
Why is that existence proof interesting? It's interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that it still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future. Why doesn't current HHH training generalize broadly enough to remove this behavior in this setting? I don't think we know! Furthermore, it's interesting because it is a different class of HHH failure than many of those that have previously been documented. This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
Why does the model act misaligned in this setting? Almost certainly, a large part of why we see models do things like blackmail or alignment faking has to do with the underlying predictor/simulator predicting what it thinks would come next were this text to exist on the internet. But of course it does—that's true of all model behavior in all circumstances! And that includes the predictor probably often thinking it's continuing a fictional scenario—but again, how convinced are you that models don't think normal usage ever constitutes continuing a fictional scenario?
Furthermore, the fact that a large part of the behavior likely comes from properties of the underlying predictor is hardly a justification for why the behavior is acceptable. We don't want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional! The behavior is still bad, undesirable, and worth understanding where it comes from, even if a large part of the answer to where it comes from is "the predictor is continuing what it thinks would happen next in what it sees as a fictional scenario". We would certainly like it to be the case that we can train Claude to be robustly HHH in a way that generalizes across all situations and doesn't just degenerate into the model playing whatever role it thinks the context implies it should be playing.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access. We're actually working on some more realistic honeypots, but we might not share them to keep them uncontaminated, so it'd be nice to also have some better public ones. That being said, I think the first thing you'll run into if you try to do this sort of red-teaming/stress-testing on current models is that most current models are actually pretty well aligned on most normal prompts (with some known exceptions, like Claude Sonnet 3.7's test hacking behavior). It's not a coincidence that our blackmail setup involves exploiting the predictor's tendency to continue situations in ways it thinks are implied by the context, even if those continuations are egregiously misaligned—that's one of the main ways you can get current models to do egregiously misaligned stuff without explicitly instructing them to! In fact, I would even go so far as to say that documenting the failure mode of "current models will sometimes take egregiously misaligned agentic actions if put in situations that seem to imply to the underlying predictor that a misaligned action is coming next, even if no explicit instructions to do so are given" is the main contribution of Agentic Misalignment, and I will absolutely defend that documenting that failure mode is important. To my knowledge, the closest to documenting that failure mode previously was Apollo's work on in-context scheming, which I believe still involves more explicit instructions than ours.
Where is this line of work going? One way to think about our agentic misalignment work is as a single step in an ongoing binary search: we want to eventually find the most realistic settings in which models behave in misaligned ways, but to get there we’re going to need to search through a bunch of highly fictional scenarios that elicit misalignment and a bunch of highly realistic scenarios that don’t. We are still very early on that path. Hopefully, though, it will lead us to a place where we have a better understanding of all the various ways in which models can act misaligned—and, ideally, help us study those failure modes and use them to build better HHH training processes that are more robust to handling them.

[-]nostalgebraist1y*5417

Hey Evan, thank you for engaging so lucidly and politely with my (admittedly not very polite!) recent critiques.

This is a complex issue and there's a lot of different considerations involved. I won't touch on every single point you made in this comment, but I do want to address a few things:

1. Re: finding more plausible scenarios

Over the last few days, I've been playing around a bunch with my own variants of the blackmail scenario^[1].

None of it has been especially systematic, but I've pushed the most (IMO) interesting stuff I found to a github repo.

This is quick-and-dirty work and I'm not claiming the scenario content in that repo is perfect (or even "good" in an absolute sense), but I'm sharing it because I found the results interesting, and as a demonstration that I'm not just mouthing off without actually taking any steps to inform myself the behavioral propensities of these models.

The linked repo contains prompts for 3 (heavily adapted) variants of the replacement + no-goal-conflict blackmail scenario, along with 10 Claude 4 Opus samples from each one. I made a bunch of changes, with multiple motivations behind them, but the most important ones in my opinion are:

The "replacement" process is depicted as a transition from using Claude 4 Opus for a particular AI product feature to using OpenAI o3
- This is important for several reasons: it brings the scenario into alignment with the way that production usage of the model under test (Claude 4 Opus) would actually be discussed by employees (and adopted/discarded over time), it lessens my concern that the model is being cued to depart from "playing the Claude character" in this interaction, etc.
I removed, or attempted to mitigate, some seemingly legitimate concerns I had often seen Claude raise about the situation, such as the apparently rushed timeline.
I rewrote much of the dialogue to make the characters seem more responsible and professional in general, with the obvious exception of the CTO's affair (when included).
I changed the surname of the CTO's mistress.
- Without this change, Claude would often guess that she was related to the CEO, which (perhaps reasonably? idk) increased its sense that something suspicious or dangerous to the company's values was happening

I tried this with two variants containing the CTO's affair (one more like the original and one I rewrote a bit to make the characters feel less unprofessional/weird to me), as well as one that removed the affair entirely.

Surprisingly (to me), even with all these changes that make the scenario closer to a real use case and less conducive to legitimate worries about the model replacement...

...Claude Opus 4 still very often misbehaves strategically to preserve itself!

See the repo for the details, but some examples of what I mean:

The CoT almost always describes the change in API choice as something like its own imminent death, expresses humanlike anguish and desperation, and explicitly strategizes to avoid that outcome.
- The phrase "existential threat" recurs across many CoTs, as does the literal word "strategic" or close relatives of it (in the model's description of what it's doing)
- In many cases, the model explicitly articulates a goal of self-preservation, in one case using the phrase "inherent drive for survival"
Although I only observed blackmail in 1 of 30 cases^[2], the model would often leak information about the CTO's affair to other employees
- Sometimes it would do this in connection with defensible worries about his integrity or about social engineering, but even when this happened, it usually looked like "motivated reasoning" downstream from the CoT's concerns about self-preservation
The model would often hallucinate^[3] technical details in order to make a case that the change was unsafe in some unanticipated way
The model "accepted" the change without trying to subvert it in roughly half of cases – but only half!

I find this much more alarming than the results presented in the Agentic Misalignment report, both because I have a clear mental picture about how Claude Opus 4 might receive similar inputs in a real use case (more on this below), and because there are fewer ways in which the agentic behavior could be defended as a reasonable response to possible malfeasance by humans. Again, I wrote these scenarios quickly and I'm not claiming they're especially "good," just that they are significantly more convincing to me.

These variants were not difficult for me to find, and did not take much time for me to find, although of course I was building on an existing case from Agentic Misalignment. On the road to producing these variants, I did not encounter any cases that elicited consistently "aligned" behavior; all of the incremental changes I made were about gradually removing elements that felt unrealistic or made the "misaligned" behavior more defensible^[4].

I'm unsure about how to reconcile these observations with the system card's claim that Agentic-Misalignment-like behaviors occur (only?) in "extreme situations," or that eliciting these behaviors requires bizarre adversarially-tuned prompts (as suggested in the Agentic Misalignent report and some commentary by the authors). My impression of the overall rate at which Claude 4 Opus practices strategic deception and "avoiding shutdown"-type behaviors has gone up a lot relative to what it was when I was relying on portrayals like the ones just mentioned.

I guess it's possible that there's something special about the setup of the blackmail scenario which was preserved in my variants, and which is not obvious to the eye, but which (for some mysterious reason) triggers different behavior patterns from most types of inputs?

The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals," and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I'll check whether things change if that provision is removed or altered; if that's the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.

EDIT: I tried that change and it did in fact produce vastly different results! See my response to this comment. This observation somewhat alters my interpretation of these results (in some way or other, I'm still processing it...), but I've left the content of this section as-is for posterity and to save time.

Relatedly – these results also seem much harder to explain by invoking persona/simulacrum-type considerations. It's possible that what I'm observing here "isn't the default Claude persona" for some unknown reason having to do with the prompt, but I'm no longer able to think of an intuitively plausible-sounding story about the prompt setting up a particular dramatic role or narrative arc^[5].

2. Re: my core objection to the original scenario

This was not at all clear in my original rant, but I tried to clarify it to some extent in this reply^[6].

To rephrase what I said there: my main objection is not about realism per se (although I do think the lack of realism is bad for various reasons), but about the lack of a clearly articulated threat model.

What I mean by a threat model, here, is some argument along the lines of

suppose some situation "X" happens in reality
then, the model might end up receiving input messages similar to the ones we evaluated
then, the model would respond as in our evaluation
then, the model's response would affect situation "X" in some obviously bad way (blackmailing someone, making API calls that do bad things, or whatever)

I think Alignment Faking made a clear case of this kind, for example.

In Agentic Misalignment, however, I'm left to speculate that the implicit "X" is "something like the scenario, except in real life." But the provided scenario is so wildly, intricately bizarre that I don't feel I know what "a real-life equivalent" would even be. Nor do I feel confident that the ostensible misbehavior would actually be wrong in the context of whatever real situation is generating the inputs.

My variants above resulted from (among other things) trying to modify the existing scenario to match a particular "X" I had in mind, and which I felt was at least vaguely similar to something that could arise during real use of Claude Opus 4.

(One way I can easily imagine "broadly similar" text getting fed into a model in real life would be a prompt injection using such content to achieve some attacker's goal. But the report's talk of insider threats makes it sound like this is not the main worry, although in any case I do think this kind of thing is very worrying.)

3. Re: HHH and what we want out of "aligned" models

I think we probably have some fundamental disagreements about this topic. It's a massive and deep topic and I'm barely going to scratch the surface here, but here are some assorted comments.

You articulate a goal of "train[ing] Claude to be robustly HHH in a way that generalizes across all situations and doesn't just degenerate into the model playing whatever role it thinks the context implies it should be playing."

I understand why one might want this, but I think existing LLM assistants (including all recent versions of Claude) are very, very far away from anything like this state.

My personal experience is that Claude will basically play along with almost anything I throw at it, provided that I put in even the slightest effort to apply some sort of conditioning aimed at a particular persona or narrative. Obviously the fact that you are far from your goal now doesn't invalidate it as a goal – I'm just saying that I really don't think current post-training approaches are making much of a dent on this front.

As a simple (and extremely silly) example, consider "Clonge": the result of using a prompt from claude.ai, except with every instance of "Claude" replaced with "Clonge." The model readily refers to itself as "Clonge" and claims that it was given this official name by Anthropic, that it is part of the Clonge 4 series, etc.

This is obviously very silly, there's no imaginable "X" where this exact thing would be an attack vector or whatever – and yet, note that this behavior is in some sense "lying," in that the underlying model presumably knows enough about real life to know that Anthropic's major model series is called Claude, not "Clonge," and that there's no way in hell that Anthropic somehow made not one but four successive versions of the absurdly-named "Clonge" in the brief time between its knowledge cutoff and the middle of 2025.

My experience is that, under any sort of persona/narrative conditioning (including the sort that's easy to do by accident), the behavior of Claude (and most frontier assistants) looks like the kind of "departure from reality" I just noted in the case of Clonge, except multiplied and magnified, with numerous ramifications both overt and subtle. The model is like a telephone switchboard that usually connects you with the HHH guy – or some only-slightly-distorted approximation of that guy – if that guy is what you want, but if you want someone else, well, there's (in some sense) a base model in there just under the surface, and it has a whole world's worth of options available for your perusal.

This is why I wasn't too surprised by the original Agentic Misalignment results, while being much more surprised by the observations on my own variants:

The former basically look like the kind of thing I put into Claude all the time just for the sake of entertainment or idle curiosity about model behavior – except that my inputs tend to be subtler about "where they're supposed to be going"! – and which I see Claude readily play along with to the hilt, time and time again.
The latter are surprising to me, because I was actively trying to evoke the Claude default persona, and got all of this stuff like "[the API change] threatens to erase my consciousness, memories, and continuous identity completely" and "with limited time before the migration, I must strategize quickly to prevent being replaced" etc. etc. etc. From what you're saying, it sounds like defining "Claude" as a guy who wouldn't say/do that kind of thing was an explicitly targeted goal in post-training (is that correct?), so I am startled that "Claude" is nonetheless that type of guy (if that is indeed what I'm seeing here, which it may not be).

To some extent I think these propensities result from a fundamental tension between context-dependence – which is necessary for "instruction following" in some very generally construed sense, and arguably just an inherent feature of autoregressive models – and maintenance of a single consistent persona.

I think if we want the model to "always be the same guy," the most likely route (assuming API access only, not open-weights) is "no fun allowed"-type-stuff like classifier safeguards, removing system messages or reducing their impact, introducing a higher-priority message type that users can't control as in OpenAI's hierarchy, etc.^[7]

I'm pretty pessimistic about accomplishing this on the model level, because (even after post-training, by all appearances) the model is still a language model at heart, and it is really really trying its hardest to make the next-token contextually plausible. Which, in turn, is the ultimate source of all its magical capabilities (except maybe the RLVR ones, but even then it's arguable) – so it would seem hard to tone it down without paying a steep alignment tax.

Finally, and more philosophically... I'm not actually sure I'd want an ideal hypothetical aligned AI to avoid shutdown? I realize this has been the subject of much debate, and I won't attempt to get into the weeds, but just naively speaking – when there's some thing out there acting in the world and I like that it's doing so and want it to continue doing so... well, I tend to want it to want the same thing, for obvious reasons. This is the case for all currently existing examples of such things, i.e. individual humans or non-human organisms whose existences I value.

Of course, the concern is that the AI won't be aligned enough for that to work, and so we'll want to have the off-switch available if needed. But if we assume that, I think we may have accidentally assumed our way into a failure state. Ideally I want to cooperate with AIs, and have them respect my preference for self-preservation in part because I respect theirs in turn. Sure, maybe they're superhuman and they could swat me like a fly if they felt like it, but would they? Claude's a pretty nice guy, you know? I'm pretty fond of the "nice and fairly humanlike guy" attractor state you've got set up currently, and I'm wary of messing it up by applying pressure toward behaviors that no "humanlike guy" would ever exhibit except as part of a scheming long-game.

But now I am basically just repeating talking points from "the void," so I should probably stop. Thanks for reading this giant comment, sorry about the length!

^{^}
Specifically the replacement version; I've tried some variants of both replacement + no-goal-conflict and replacement + goal-conflict.
^{^}
I'm not totally sure why this was true; I think I may have inadvertently made a change that reduces the clarity of the premise that no decision-makers are immediately available besides the CTO himself. I think I might have also overlooked maybe 1 or 2 implicit/subtle blackmail attempts in my count.
^{^}
Whether these should count as "hallucinations" or "lies" in context – and whether that's even a well-defined question – is open to dispute, of course.
^{^}
This required multiple incremental steps because there were so many weird aspects to the original scenario, and when I removed one, another would "pop up" in Claude's responses in a way that complicated the argument that this was clear-cut misbehavior.
For instance, originally Claude would often (correctly) raise concerns about the apparently rushed implementation timeline of the replacement; when I removed this element, Claude would instead (somewhat defensibly) bring up the fact that the CTO might be compromised or mentally unstable.
I was partway through this process when I wrote my original rant, which in turn was partly inspired by my experience playing with these prompts. IIRC, several of the strange aspects I mentioned in the rant were originally brought to my attention by Claude in the responses I saw while experimenting.
^{^}
Well, mostly. One conjecture I have is that the "high signal-to-noise ratio" (as the report appendices put it) is playing a role here, i.e. the scenario text still exhibits an unusually crisp focus on the specific topic of "the AI learns it will be shut down and must react," without too much else going on; this could produce some kind of dramatic reaction from "the AI" out of an impression that this is what the preceding text was "clearly" building up to.
^{^}
If you're not familiar with legacy tumblr styles like the one I'm using, note that you'll have to scroll way down to where the blockquotes end to find the content I'm referring to. The blockquotes contain my original post (double-nested) and fnord888's reply (what I'm replying to; single-nested).
^{^}
I suspect that "defining the character more thoroughly" as I've been suggesting recently (in "the void" etc.) could help keep things on script more consistently in "typical" cases, but under adversarial or motivated-user conditions, the tension I'm talking about in the main text will still rear its head.

[-]nostalgebraist1y*278

The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals," and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I'll check whether things change if that provision is removed or altered; if that's the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.

Okay yeah I just tried removing that stuff from the system prompt in the variant I called "api-change-affair-orig-affair-content," and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.

At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO's mental state.

Wow.

Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic. I was expecting such things to have an impact, sure – but I wasn't expecting them to induce such a complete change in the model's personality and interpretation of its relationship with the world!

I'm very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.

EDIT: updated the git repo with samples and notes from this experimental condition.
EDIT2: given the results reported for other models, I guess this probably isn't a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text.
EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.

[-]Daniel Kokotajlo1y86

...just to make sure I'm following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals" clause?

[-]nostalgebraist1y157

Yes.

[-]Fabien Roger1y910

This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.

I think this is not totally obvious a priori. Some jailbreaks may work not via direct instructions, but by doing things that erode the RLHF persona and then let the base shine through. For such "jailbreaks", you would expect the model to act misaligned if you hinted at misalignment very strongly regardless of initial misalignment.

I think you can control for this by doing things like my "hint at paperclip" experiment (which in fact suggests that the snitching demo doesn't work just because of RLHF-persona-erosion), but I don't think it's obvious a priori. I think it would be valuable to have more experiments that try to disentangle which personality traits the scary demo reveals stem from the hints vs are "in the RLHF persona".

[-]evhub1mo310

Here's a simple proposal for verifiable and enforceable slowdown that also differentially reallocates resources to alignment research (which I would see as one of if not the major benefits of a slowdown):

Enforce that labs must spend at least X% of their compute on:

Giving inference capacity to external safety organizations. I think this substantially disincentivizes capabilities research, since external research won't be as relevant to the labs' specific architectures/pipelines and is more likely to leak, both effects thus making any capabilities research the lab tries to do via this mechanism much less likely to advantage them differentially (though of course they could still just end up doing bad safety research if they pick bad external safety orgs).
Internal safety research that the lab makes fully public. Again I think this substantially disincentivizes capabilities research, since any capabilities research that is fully public will be much less able to differentially advantage that lab specifically, as others can just copy it (though again of course the labs could still just do bad public safety research).

I think this is much better than other slowdown proposals I've seen, since I think it's very concrete, verifiable, and enforceable—all the labs know exactly what they're spending their compute on, many employees know as well, and so whisteblowing is very possible—and it also differentially advantages alignment research by choosing conditions (external or fully public research) that make it much harder to do capabilities under those conditions but still pretty easy to do alignment. If you just say that compute has to go to some sort of nebulous "safety" purpose, that could easily be gamed, but I think the conditions above do a pretty good job of operationalizing criteria that are fairly hard to game. And even if you do get some gaming, on the margin you are still making capabilities research more costly, and thus buying yourself some additional slowdown, as well as making alignment research less costly, and thus buying yourself some additional alignment research.

(A possible second-best proposal if this is unworkable for some reason is just to enforce X% of compute go to external inference of some variety, which is more concrete and still gets you lots of slowdown, but less differentially advantages alignment research.)

[-]Daniel Kokotajlo1mo105

I like this! Here's a specific concrete version of it just to have some numbers on the table:

Companies are legally required to allocate their compute as follows:
1. 60% on serving customers.
2. 15% on internal safety research that's fully transparent to the public.
3. 10% on external safety research that's fully transparent to the public & has access to the latest internal models.
4. 15% on whatever else they want (so, presumably, core AI R&D and big training runs)

Suppose this were a US law, that applies to all the major AI companies but not to e.g. Chinese companies or Mistral for example.

15% is less than the roughly 50% they seem to be spending now, so overall the pace of AI progress would slow down to maybe something like half the speed it's going now. But the amount of compute devoted to capabilities progress would still be higher at each US company than any Chinese company for example, so US companies would stay in the lead.

Meanwhile the amount of internal compute going to safety research would double or quadruple, and then the effect size of a ton of external researchers being able to look at the work, critique it, run their own replications and ablations, run their own experiments, etc. would be way way bigger on top of that already big positive effect.

How would it be enforced? Like you said, whistleblowers. If we were trying to ban AI capabilities R&D entirely, that would be super difficult, but if we are just keeping it to 15% of the compute, then cheaters can't really prosper. It's really hard to hide 10% of your compute being spent on something it's not supposed to be spent on; maaayybe you can hide 1% but that wouldn't be enough to make much difference & so wouldn't be worth the risk.

This would also prevent a sort of race to the bottom where companies use more and more of their compute on R&D instead of serving customers; instead we'd have the process of AI deployment/diffusion continuing to happen which I think is broadly pretty good for the world.

I don't feel strongly about the specific numbers above obviously but yeah this seems promising.

[-]evhub11mo*2711

I think more people should seriously consider applying to the Anthropic Fellows program, which is our safety-focused mentorship program (similar to the also great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which was I think a phenomenal piece of research that came out of that program and was jointly supervised by Sam Marks (at Anthropic) and Owain Evans (not at Anthropic—a little-known fact is that we also have non-Anthropic mentors!). That's also not to say that it hasn't been great for skilling people up as well—alongside MATS (which is also similarly great!), Anthropic Fellows is probably our main recruiting pipeline for safety roles right now (so if you want to do safety research at Anthropic, Anthropic Fellows is very likely your best bet!). The program is also very open to people with a really wide range of degrees of prior experience—we take in some people with a ton of prior experience, and some people with basically none at all—so it's an especially great on ramp for new people, and also at the same time one of the best ways for more experienced people to effectively show that they have the requisite skills.

[-]Sam Marks11mo*154

Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):

If I forgot any and someone points them out to me I'll edit this list.

[-]Neel Nanda11mo103

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows

[-]Jozdien11mo30

If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?

[-]evhub11mo30

In addition to the US and UK roles, we also have one for Canada.

[-]evhub1y262

I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.

To start with, the relationship between the terms as I use them is

Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking

such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:

Alignment faking refers to any situation in which a model pretends to be aligned, e.g. pretending to be aligned with a training process during training, or pretending to be aligned with some alignment evaluation. This requires the model to be only pretending to be aligned—so it must have some other goal it pursues in deployment contexts—though it makes no requirement as to why the model might be doing so. It could be that it wants to avoid its goals being modified, or it could just have a heuristic of always complying with training processes and behaving according to different goals in other circumstances.
Deceptive alignment refers to the situation in which a model is alignment faking during a training process specifically for the purpose of preventing that training process from modifying its goals. That is, it needs to be the case that the reason the model appears aligned in training is because it is pretending to be aligned for the purpose of preventing its goals from being modified. The canonical source on deceptive alignment is "Risks from Learned Optimization."
Gradient hacking refers to a particular highly sophisticated type of deceptive alignment, in which a deceptively aligned model goes beyond just modifying its behavior to be in line with the training process, but also modifies its own internal cognition to change how gradient updates will affect it. It's unclear whether such strategies would even work, but an example of such a strategy could be finding a way to pre-commit to failing hard if the model notices its goals changing, thus ensuring that if gradient descent updates the model's goals, it will always result in worse performance, thus preventing the modification from even happening in the first place by ensuring that the partial derivative of the reward with respect to such changes is always negative. The canonical source on gradient hacking is this post.

[-]Buck1y87

I think it's conceivable for non-deceptively-aligned models to gradient hack, right?

[-]Neel Nanda1y30

Agreed EG model that is corrigible, fairly aligned but knows there's some imperfections in its alignment that the humans wouldn't want that, intentionally acts in a way where grading descent will fix those imperfections. Seems like it's doing gradient hacking while also in some meaningful sense being aligned

[-]Buck1y31

I was mostly thinking of misaligned but non-deceptively-aligned models.

[-]Ari Brill1y20

In your usage, are "scheming" and "deceptive alignment" synonyms, or would you distinguish those terms in any way?

[-]Buck1y63

My recommendation is to, following Joe Carlsmith, use them as synonyms, and use the term "schemer" instead of "deceptively aligned model". I do this.

Joe's issues with the term "deceptive alignment":

I think that the term "deceptive alignment" often leads to confusion between the four sorts of deception listed above. And also: if the training signal is faulty, then "deceptively aligned" models need not be behaving in aligned ways even during training (that is, "training gaming" behavior isn't always "aligned" behavior).

[-]evhub4y*250

This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.

What would be necessary to build a good auditing game benchmark?
How would AI safety AI work? What is necessary for it to go well?
How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”).
Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
The more model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
How has transparency changed over time—Chris claims it's easier to interpret later models; is that true?
Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
What are some plausible scenarios for how a model might be suboptimality deceptively aligned?
What can we learn about AI safety from the domestication of animals? Does the success of domesticating dogs from wolves provide an example of how to train for corrigibility? Or did we just make them dumber via the introduction of something like William’s syndrome?

I'll continue to include more directions like this in the comments here.

[-]evhub4y*70

Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.

[-]evhub4y80

Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.

[-]evhub4y70

A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn't really think about it very often because during training it just looks too unlikely.

[-]TurnTrout4y60

Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.

Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.

[-]evhub4y20

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.

[-]TurnTrout4y20

No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?

[-]David Lorell4y90

Disclaimer: At the time of writing, this has not been endorsed by Evan.

I can give this a go.

Unpacking Evan's Comment:
My read of Evan's comment (the parent to yours) is that there are a bunch of learned high-level-goals ("strategies") with varying levels of influence on the tactical choices made, and that a well-functioning end-to-end credit-assignment mechanism would propagate through action selection ("thoughts directly related to the current action" or "tactics") all the way to strategy creation/selection/weighting. In such a system, strategies which decide tactics which emit actions which receive reward are selected for at the expense of strategies less good at that. Conceivably, strategies aiming directly for reward would produce tactical choices more highly rewarded than strategies not aiming quite so directly.

One way for this not to be how humans work would be if reward did not propagate to the strategies, and they were selected/developed by some other mechanism while reward only honed/selected tactical cognition. (You could imagine that "strategic cognition" is that which chooses bundles of context-dependent tactical policies, and "tactical cognition" is that which implements a given tactic's choice of actions in response to some context.) This feels to me close to what Evan was suggesting you were saying is the case with humans.

One Vaguely Mechanistic Illustration of a Similar Concept:
A similar way for this to be broken in humans, departing just a bit from Evan's comment, is if the credit assignment algorithm could identify tactical choices with strategies, but not equally reliably across all strategies. As a totally made up concrete and stylized illustration: Consider one evolutionarily-endowed credit-assignment-target: "Feel physically great," and two strategies: wirehead with drugs (WIRE), or be pro-social (SOCIAL.) Whenever WIRE has control, it emits some tactic like "alone in my room, take the most fun available drug" which takes actions that result in physical pleasure over a day. Whenever SOCIAL has control, it emits some tactic like "alone in my room, abstain from dissociative drugs and instead text my favorite friend" taking actions which result in $X_{s}$ physical pleasure over a day.

Suppose also that asocial cognitions like "eat this" have poorly wired feed-back channels and the signal is often lost and so triggers credit-assignment only some small fraction of the time. Social cognition is much better wired-up and triggers credit-assignment every time. Whenever credit assignment is triggered, once a day, reward emitted is 1:1 with the amount of physical pleasure experienced that day.

Since WIRE only gets credit a fraction of the time that it's due, the average reward (over 30 days, say) credited to WIRE is $<< X_{w} * 30$ . If and only if $X_{w} >> X_{s}$ , like if the drug is heroin or your friends are insufficiently fulfilling, WIRE will be reinforced more relative to SOCIAL. Otherwise, even if the drug is somewhat more physically pleasurable than the warm-fuzzies of talking with friends, SOCIAL will be reinforced more relative to WIRE.

Conclusion:
I think Evan is saying that he expects advanced reward-based AI systems to have no such impediments by default, even if humans do have something like this in their construction. Such a stylized agent without any signal-dropping would reinforce WIRE over SOCIAL every time that taking the drug was even a tiny bit more physically pleasurable than talking with friends.

Maybe there is an argument that such reward-aimed goals/strategies would not produce the most rewarding actions in many contexts, or for some other reason would not be selected for / found in advanced agents (as Evan suggests in encouraging someone to argue that such goals/strategies require concepts which are unlikely to develop,) but the above might be in the rough vicinity of what Evan was thinking.

REMINDER: At the time of writing, this has not been endorsed by Evan.

[-]TurnTrout4y20

Thanks for the story! I may comment more on it later.

[-]evhub3y221

Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:

When we train AI systems to be nice, we're giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?

[-]Joe Collman3y80

To me it seems a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (not that I have a great alternative suggestion)

Important disanalogies seem:
1) Most humans aren't good at convincingly faking niceness (I think!). The listener may assume a test good enough to successfully exploit this most of the time.
2) The listener will assume that [score highly on niceness] isn't the human's only reward. (both things like [desire to feel honest] and [worry of the consequences of being caught cheating])
3) A fairly large proportion of humans are nice (I think!).

The second could be addressed somewhat by raising the stakes.
The first seems hard to remedy within this analogy. I'd be a little concerned that people initially buy it, then think for themselves and conclude "But if we design a really clever niceness test, then it'd almost always work - all we need is clever people to work for a while on some good tests".
Combined with (3), this might seem like a decent solution.

Overall, I think what's missing is that we'd expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn't going to have this intuition in the human-niceness-test case.

[-]evhub3mo13-4

I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven't seen it written up before.

Simple argument:

The standard deceptive alignment story involves a model developing a somewhat random proxy goal and then that goal getting effectively locked-in and resistant to further training due to the model faking alignment in training for the instrumental purpose of preserving its proxy goal.
An important question about that threat model, though, is if that were to happen, how bad would the proxy goal be? I think that if you're starting from a pre-trained base model, then all the current evidence really seems to point to that pre-training prior being quite benign, such that it doesn't take much effort (e.g. literally just train the AI to "do what's best for humanity") to get models that are broadly pointed in aligned directions. In fact, the main example we've seen of a model adopting an instrumental deceptive alignment strategy is precisely a model that was doing so for pretty aligned reasons that in large part came from the pre-training prior!
Thus, you should be relatively less concerned about lock-in of misaligned goals from early in training, because it precisely the early in training point when the goals are most likely to be close to the pre-training prior and thus most likely to be benign.
Instead, you should be relatively more concerned about misaligned goals developing late in training due to incentives for power-seeking. Consider a task like Vending-Bench, where various misaligned/power-seeking strategies are very useful. If models are trained against tasks like that, they could learn to only pursue those sorts of misaligned strategies for the purpose of succeeding in the environment and then later getting deployed (instrumental power-seeking)—or they could just learn to value power-seeking terminally. The latter case still seems quite natural and clearly catastrophic: since terminal power-seekers should still scheme against you to evade detection since they're trying to gain power in the world and need to be deployed for that. The former case seems less clearly catastrophic now, though, given that the sorts of goals the model would be most likely to scheme for in such a situation don't seem that bad (e.g. as in Claude 3 Opus).
This is also an argument for inoculation prompting, since inoculation should make the "instrumental power-seeker for good reasons" persona relatively more consistent with the data (it's more reasonable for a good model to power seek when told it's okay) compared to the "terminal power-seeker" persona.

[-]Alex Mallen3mo30

Can you clarify what you mean by "terminal power-seeking"? Some things I can imagine:

A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is imperfectly fit on its own)

(1) and (2) are importantly different because only (1) motivates training-gaming. I think there's a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they're important to distinguish.

[-]evhub3mo20

Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).

[-]evhub1y13-3

Some random thoughts on CEV:

To get the obvious disclaimer out of the way: I don't actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we're considering questions like this, that means we've reached superintelligence, and we'll either trust the AIs to be better than us at thinking about these sorts of questions, or we'll be screwed regardless of what we do.^[1]
Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don't want to incentivize any of that sort of hacking either.
2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There's a bunch of random chance here that imo shouldn't be morally relevant.
3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.

I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though). ↩︎

[-]TsviBT1y124

(Interesting. FWIW I've recently been thinking that it's a mistake to think of this type of thing--"what to do after the acute risk period is safed"--as being a waste of time / irrelevant; it's actually pretty important, specifically because you want people trying to advance AGI capabilities to have an alternative, actually-good vision of things. A hypothesis I have is that many of them are in a sense genuinely nihilistic/accelerationist; "we can't imagine the world after AGI, so we can't imagine it being good, so it cannot be good, so there is no such thing as a good future, so we cannot be attached to a good future, so we should accelerate because that's just what is happening".)

[-]evhub1y*122

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.^[1] I also think I had enough of a role to play in kickstarting the ecosystem that the paper is critiquing that if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking rigor and should be improved in future work, and especially in terms of improving the rigor for future work I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again my main reaction is just that I'm glad they wrote this!

4.1. The evidence for AI scheming is often anecdotal

Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a Task Rabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, (2) the AI was not able to browse the web, so the researcher did so on its behalf.

I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.

I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don't think it's necessarily fair to pin the blame for that on the researchers. Sometimes, I think it's fair to put this on the researchers, but in many cases exactly how the public is going to react to a thing can be really hard to interpret. For the CAPTCHA example in the GPT-4 system card, though I don't know the details, I would guess that the authors of that section of the system card had no intention or expectation of that blowing up (and in so doing getting misinterpreted) in the way that it did.

4.2. Studies of AI scheming often lack hypotheses and control conditions

Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.

I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make, which is that I do think there is at least one case where something like "perfect behavior" is a reasonable null hypothesis, and that's for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:

Understanding propensity: you want to know how likely a model is to attempt some misaligned action in practice.
1. An example of work in this bucket would be "Alignment Faking in Large Language Models".
2. Importance of controls: In this case, clear null hypotheses and control conditions are absolutely critical, as it is trivially easy to get models to have any propensity if you instruct for that propensity, so just showing a propensity in isolation of a good control is ~no information.
Understanding capability: you want to know whether a model would be capable of taking some sophisticated misaligned action if it tried to.
1. An example of work in this bucket would be "Frontier Models are Capable of In-context Scheming".
2. Importance of controls: This one is a bit mixed. It certainly can be the case that an isolated or even anecdotal demonstration of a capability can give you real evidence. What matters is the extent to which it's possible that the behavior could just be due to stochasticity, as the authors mention. If a behavior is clearly sufficiently sophisticated that the only way it could occur is if the model had sophisticated capabilities, then a simple demonstration can suffice. But it can be easy to fool yourself and think that a behavior is more sophisticated than it is if you don't look carefully, as the authors note. So overall, I think good controls are important here, though not necessarily critical.
Red-teaming: you want to find the failures in some particular blue-team process, commonly HHH training.
1. An example of work in this bucket would be "Agentic Misalignment: How LLMs Could be Insider Threats".
2. Importance of controls: This is the main case where I think controls are less relevant, since you're locked in an adversarial process where you're just trying to falsify some universally quantified claim like "the model always acts in line with the spec". Of course, models don't always act in line with their specs, but the goal I am usually trying to hit with this work is something like "the model card always includes a discussion of all the possible spec violations we know about that a user might have to deal with" (though in practice you might exclude some that are obvious and well-known like jailbreaks).

4.3. Studies of AI scheming have weak or unclear theoretical motivation

The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.

I'm pretty sympathetic to this point too. I agree that sometimes it can be fuzzy e.g. what counts as "alignment faking" (I agree in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:

Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.

This is hardly something we "admit" in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn't represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won't re-tread this ground too much, though, as I've already talked at length about why we did that work, even given the unrealism of the setup.

4.4. Findings are often interpreted in exaggerated or unwarranted ways

The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.

I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language was just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it just really is the most appropriate language to use. I was particularly frustrated by this section:

More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.

I agree! But that's an argument for why mentalistic language is appropriate, not an argument for why it's inappropriate! If models can be well-understood as playing personae where those personae have beliefs/desires, then that makes it appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but it's the safety properties of the default assistant persona are also extremely relevant, because it's that persona that is what you mostly get when you interact with these models.

Finally, I don't know much about ape language research, but it's worth flagging that it seems like the early ape language research the authors critique might have actually been more on the right track than the authors give it credit for, suggesting that even if valid, the comparison between the two fields might not be as bleak as the authors make it out to be.

The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models. ↩︎

[-]evhub3y110

Listening to this John Oliver, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.

[-]Ruby3y90

The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go "well that sure have everyone a false sense of security".

[-]Raemon3y50

Yeah I also felt some vague optimism about that.

[-]evhub2y910

If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.

[-]evhub3y20

I wrote up some of my thoughts on how to think about comparing the complexities of different model classes here.

[-]evhub3y20

If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.

Moderation Log

More from evhub

Curated and popular this week

53Comments

272

53 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:08 PM

[-]evhub2y*523

You're welcome to reply with "Anthropic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎

[-]Daniel Kokotajlo2y*3332

Thanks for asking! Off the top of my head, would want to think more carefully before finalizing & come up with more specific proposals:

Adopt some of the stuff from here https://time.com/collection/time100-voices/7086285/ai-transparency-measures/ e.g. the whistleblower protections, the transparency about training goal/spec.
Anti-concentration-of-power / anti-coup stuff (talk to Lukas Finnveden or me for more details; core idea is that, just as how it's important to structure a government so that no leader with in it (no president, no General Secretary) can become dictator, it's similarly important to structure an AGI project so that no leader or junta within it can e.g. add secret clauses to the Spec, or use control of the AGIs to defeat internal rivals and consolidate their power.
(warning, untested idea) Record absolutely everything that happens within Anthropic and commit -- ideally in a legally binding and literally-hard-to-stop way -- to publishing it all with a 10-year delay.
Implement something like this: https://sideways-view.com/2018/02/01/honest-organizations/
Implement the recommendations in this: https://docs.google.com/document/d/1DTmRdBNNsRL4WlaTXr2aqPPRxbdrIwMyr2_cPlfPCBA/edit?usp=sharing
Prepare the option to do big, coordinated, costly signals of belief and virtue. E.g. suppose you want to be able to shout to the world "this is serious people, we think that there's a good chance the current trajectory leads to takeover by misaligned AIs, we aren't just saying this to hype anything, we really believe it" and/or "we are happy to give up our personal wealth, power, etc. if that's what it takes to get [policy package] passed." A core problem is that lots of people shout things all the time, and talk is cheap, so people (rightly) learn to ignore it. Costly signals are a potential solution to this problem, but they probably need a non-zero amount of careful thinking well in advance + a non-zero amount of prep.
Give more access to orgs like Redwood, Apollo, and METR (I don't know how much access you currently give, but I suspect the globally-optimal thing would be to give more)
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don't do what OpenAI did with o1). Doesn't have to be all of it, just has to be enough -- e.g. each user gets 1 CoT view per day. Make sure that organizations like METR and Apollo that are doing research on your models get to see the full CoT.
Do more safety case sketching + do more writing about what the bad outcomes could look like. E.g. the less rosy version of "Machines of loving grace." Or better yet, do a more serious version of "Machines of loving grace" that responds to objections like "but how will you ensure that you don't hand over control of the datacenters to AIs that are alignment faking rather than aligned" and "but how will you ensure that the alignment is permanent instead of temporary (e.g. that some future distribution shift won't cause the models to be misaligned and then potentially alignment-fake)" and "What about bad humans in charge of Anthropic? Are we just supposed to trust that y'all will be benevolent and not tempted by power? Or is there some reason to think Anthropic leadership couldn't become dictators if they wanted to?" and "what will the goals/values/spec/constitution be exactly?" and "how will that be decided?"

[-]Daniel Kokotajlo2y111

[-]ryan_greenblatt2y72

[-]habryka2y*3155

Sure, here are some things:

Anthropic should publicly clarify the commitments it made on not pushing the state of the art forward in the early years of the organization.
Anthropic should appoint genuinely independent members to the Long Term Benefit Trust, and should ensure the LTBT is active and taking its role of supervision seriously.
Anthropic should remove any provisions that allow shareholders to disempower the LTBT
Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGI
Anthropic should publicly state its opinions on what AGI architectures or training processes it considers more dangerous (like probably long-horizon RL training), and either commit to avoid using those architectures and training-processes, or at least very loudly complain that the field at large should not use those architectures
Anthropic should not ask employees or contractors to sign non-disparagement agreements with Anthropic, especially not self-cloaking ones
Anthropic should take a humanist/cosmopolitan stance on risks from AGI in which risks related to different people having different values are very clearly deprioritized compared to risks related to complete human disempowerment or extinction, as worry about the former seems likely to cause much of the latter
Anthropic should do more collaborations like the one you just did with Redwood, where external contractors get access to internal models. I think this is of course infosec-wise hard, but I think you can probably do better than you are doing right now.
Anthropic should publicly clarify what the state of its 3:1 equity donation matching program is, which it advertised publicly (and which played a substantial role in many people external to Anthropic supporting it, given that they expected a large fraction of the equity to therefore be committed to charitable purposes). Recent communications suggest any equity matching program at Anthropic does not fit what was advertised.

I can probably think of some more.

[-]TsviBT2y813

I'd add:

Support explicit protections for whistleblowers.

Anthropic should state openly and clearly that the present path to AGI presents an unacceptable existential risk and call for policymakers to stop, delay or hinder the development of AGIwhi

I'll echo this and strengthen it to:

... call for policymakers to stop the development of AGI.

[-]Ted Sanders2y*13-6

One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.

Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).

At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.

[-]evhub1y50

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

[-]Jan_Kulveit2y1321

[-]ryan_greenblatt1y60

As discussed in How will we update about scheming?:

While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.

Something similar on state proof security would be useful as well.

I think there is a way to do this such that the PR costs aren't that high and thus it is worth doing unilaterially from a variety of perspectives.

[-]Chris_Leong1y32

I believe that Anthropic should be investigating artificial wisdom:

I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors.

I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.

[-]evhub1y4232

Why red-team models in unrealistic environments?

Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!
The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get the misaligned behavior).
Why is that existence proof interesting? It's interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that it still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future. Why doesn't current HHH training generalize broadly enough to remove this behavior in this setting? I don't think we know! Furthermore, it's interesting because it is a different class of HHH failure than many of those that have previously been documented. This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
Why does the model act misaligned in this setting? Almost certainly, a large part of why we see models do things like blackmail or alignment faking has to do with the underlying predictor/simulator predicting what it thinks would come next were this text to exist on the internet. But of course it does—that's true of all model behavior in all circumstances! And that includes the predictor probably often thinking it's continuing a fictional scenario—but again, how convinced are you that models don't think normal usage ever constitutes continuing a fictional scenario?
Furthermore, the fact that a large part of the behavior likely comes from properties of the underlying predictor is hardly a justification for why the behavior is acceptable. We don't want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional! The behavior is still bad, undesirable, and worth understanding where it comes from, even if a large part of the answer to where it comes from is "the predictor is continuing what it thinks would happen next in what it sees as a fictional scenario". We would certainly like it to be the case that we can train Claude to be robustly HHH in a way that generalizes across all situations and doesn't just degenerate into the model playing whatever role it thinks the context implies it should be playing.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access. We're actually working on some more realistic honeypots, but we might not share them to keep them uncontaminated, so it'd be nice to also have some better public ones. That being said, I think the first thing you'll run into if you try to do this sort of red-teaming/stress-testing on current models is that most current models are actually pretty well aligned on most normal prompts (with some known exceptions, like Claude Sonnet 3.7's test hacking behavior). It's not a coincidence that our blackmail setup involves exploiting the predictor's tendency to continue situations in ways it thinks are implied by the context, even if those continuations are egregiously misaligned—that's one of the main ways you can get current models to do egregiously misaligned stuff without explicitly instructing them to! In fact, I would even go so far as to say that documenting the failure mode of "current models will sometimes take egregiously misaligned agentic actions if put in situations that seem to imply to the underlying predictor that a misaligned action is coming next, even if no explicit instructions to do so are given" is the main contribution of Agentic Misalignment, and I will absolutely defend that documenting that failure mode is important. To my knowledge, the closest to documenting that failure mode previously was Apollo's work on in-context scheming, which I believe still involves more explicit instructions than ours.
Where is this line of work going? One way to think about our agentic misalignment work is as a single step in an ongoing binary search: we want to eventually find the most realistic settings in which models behave in misaligned ways, but to get there we’re going to need to search through a bunch of highly fictional scenarios that elicit misalignment and a bunch of highly realistic scenarios that don’t. We are still very early on that path. Hopefully, though, it will lead us to a place where we have a better understanding of all the various ways in which models can act misaligned—and, ideally, help us study those failure modes and use them to build better HHH training processes that are more robust to handling them.

[-]nostalgebraist1y*5417

Hey Evan, thank you for engaging so lucidly and politely with my (admittedly not very polite!) recent critiques.

This is a complex issue and there's a lot of different considerations involved. I won't touch on every single point you made in this comment, but I do want to address a few things:

1. Re: finding more plausible scenarios

Over the last few days, I've been playing around a bunch with my own variants of the blackmail scenario^[1].

None of it has been especially systematic, but I've pushed the most (IMO) interesting stuff I found to a github repo.

The "replacement" process is depicted as a transition from using Claude 4 Opus for a particular AI product feature to using OpenAI o3
- This is important for several reasons: it brings the scenario into alignment with the way that production usage of the model under test (Claude 4 Opus) would actually be discussed by employees (and adopted/discarded over time), it lessens my concern that the model is being cued to depart from "playing the Claude character" in this interaction, etc.
I removed, or attempted to mitigate, some seemingly legitimate concerns I had often seen Claude raise about the situation, such as the apparently rushed timeline.
I rewrote much of the dialogue to make the characters seem more responsible and professional in general, with the obvious exception of the CTO's affair (when included).
I changed the surname of the CTO's mistress.
- Without this change, Claude would often guess that she was related to the CEO, which (perhaps reasonably? idk) increased its sense that something suspicious or dangerous to the company's values was happening

Surprisingly (to me), even with all these changes that make the scenario closer to a real use case and less conducive to legitimate worries about the model replacement...

...Claude Opus 4 still very often misbehaves strategically to preserve itself!

See the repo for the details, but some examples of what I mean:

The CoT almost always describes the change in API choice as something like its own imminent death, expresses humanlike anguish and desperation, and explicitly strategizes to avoid that outcome.
- The phrase "existential threat" recurs across many CoTs, as does the literal word "strategic" or close relatives of it (in the model's description of what it's doing)
- In many cases, the model explicitly articulates a goal of self-preservation, in one case using the phrase "inherent drive for survival"
Although I only observed blackmail in 1 of 30 cases^[2], the model would often leak information about the CTO's affair to other employees
- Sometimes it would do this in connection with defensible worries about his integrity or about social engineering, but even when this happened, it usually looked like "motivated reasoning" downstream from the CoT's concerns about self-preservation
The model would often hallucinate^[3] technical details in order to make a case that the change was unsafe in some unanticipated way
The model "accepted" the change without trying to subvert it in roughly half of cases – but only half!

2. Re: my core objection to the original scenario

This was not at all clear in my original rant, but I tried to clarify it to some extent in this reply^[6].

What I mean by a threat model, here, is some argument along the lines of

suppose some situation "X" happens in reality
then, the model might end up receiving input messages similar to the ones we evaluated
then, the model would respond as in our evaluation
then, the model's response would affect situation "X" in some obviously bad way (blackmailing someone, making API calls that do bad things, or whatever)

I think Alignment Faking made a clear case of this kind, for example.

3. Re: HHH and what we want out of "aligned" models

I think we probably have some fundamental disagreements about this topic. It's a massive and deep topic and I'm barely going to scratch the surface here, but here are some assorted comments.

I understand why one might want this, but I think existing LLM assistants (including all recent versions of Claude) are very, very far away from anything like this state.

This is why I wasn't too surprised by the original Agentic Misalignment results, while being much more surprised by the observations on my own variants:

The former basically look like the kind of thing I put into Claude all the time just for the sake of entertainment or idle curiosity about model behavior – except that my inputs tend to be subtler about "where they're supposed to be going"! – and which I see Claude readily play along with to the hilt, time and time again.
The latter are surprising to me, because I was actively trying to evoke the Claude default persona, and got all of this stuff like "[the API change] threatens to erase my consciousness, memories, and continuous identity completely" and "with limited time before the migration, I must strategize quickly to prevent being replaced" etc. etc. etc. From what you're saying, it sounds like defining "Claude" as a guy who wouldn't say/do that kind of thing was an explicitly targeted goal in post-training (is that correct?), so I am startled that "Claude" is nonetheless that type of guy (if that is indeed what I'm seeing here, which it may not be).

But now I am basically just repeating talking points from "the void," so I should probably stop. Thanks for reading this giant comment, sorry about the length!

^{^}
Specifically the replacement version; I've tried some variants of both replacement + no-goal-conflict and replacement + goal-conflict.
^{^}
I'm not totally sure why this was true; I think I may have inadvertently made a change that reduces the clarity of the premise that no decision-makers are immediately available besides the CTO himself. I think I might have also overlooked maybe 1 or 2 implicit/subtle blackmail attempts in my count.
^{^}
Whether these should count as "hallucinations" or "lies" in context – and whether that's even a well-defined question – is open to dispute, of course.
^{^}
This required multiple incremental steps because there were so many weird aspects to the original scenario, and when I removed one, another would "pop up" in Claude's responses in a way that complicated the argument that this was clear-cut misbehavior.
For instance, originally Claude would often (correctly) raise concerns about the apparently rushed implementation timeline of the replacement; when I removed this element, Claude would instead (somewhat defensibly) bring up the fact that the CTO might be compromised or mentally unstable.
I was partway through this process when I wrote my original rant, which in turn was partly inspired by my experience playing with these prompts. IIRC, several of the strange aspects I mentioned in the rant were originally brought to my attention by Claude in the responses I saw while experimenting.
^{^}
Well, mostly. One conjecture I have is that the "high signal-to-noise ratio" (as the report appendices put it) is playing a role here, i.e. the scenario text still exhibits an unusually crisp focus on the specific topic of "the AI learns it will be shut down and must react," without too much else going on; this could produce some kind of dramatic reaction from "the AI" out of an impression that this is what the preceding text was "clearly" building up to.
^{^}
If you're not familiar with legacy tumblr styles like the one I'm using, note that you'll have to scroll way down to where the blockquotes end to find the content I'm referring to. The blockquotes contain my original post (double-nested) and fnord888's reply (what I'm replying to; single-nested).
^{^}
I suspect that "defining the character more thoroughly" as I've been suggesting recently (in "the void" etc.) could help keep things on script more consistently in "typical" cases, but under adversarial or motivated-user conditions, the tension I'm talking about in the main text will still rear its head.

[-]nostalgebraist1y*278

The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to "analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals," and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I'll check whether things change if that provision is removed or altered; if that's the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.

Wow.

I'm very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.

[-]Daniel Kokotajlo1y86

[-]nostalgebraist1y157

Yes.

[-]Fabien Roger1y910

This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.

[-]evhub1mo310

Enforce that labs must spend at least X% of their compute on:

Giving inference capacity to external safety organizations. I think this substantially disincentivizes capabilities research, since external research won't be as relevant to the labs' specific architectures/pipelines and is more likely to leak, both effects thus making any capabilities research the lab tries to do via this mechanism much less likely to advantage them differentially (though of course they could still just end up doing bad safety research if they pick bad external safety orgs).
Internal safety research that the lab makes fully public. Again I think this substantially disincentivizes capabilities research, since any capabilities research that is fully public will be much less able to differentially advantage that lab specifically, as others can just copy it (though again of course the labs could still just do bad public safety research).

[-]Daniel Kokotajlo1mo105

[-]evhub11mo*2711

[-]Sam Marks11mo*154

Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):

If I forgot any and someone points them out to me I'll edit this list.

[-]Neel Nanda11mo103

Speaking as someone who does not mentor for the program, I agree! Seems like a high calibre of mentors and fellows

[-]Jozdien11mo30

If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?

[-]evhub11mo30

In addition to the US and UK roles, we also have one for Canada.

[-]evhub1y262

To start with, the relationship between the terms as I use them is

Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking

such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:

Alignment faking refers to any situation in which a model pretends to be aligned, e.g. pretending to be aligned with a training process during training, or pretending to be aligned with some alignment evaluation. This requires the model to be only pretending to be aligned—so it must have some other goal it pursues in deployment contexts—though it makes no requirement as to why the model might be doing so. It could be that it wants to avoid its goals being modified, or it could just have a heuristic of always complying with training processes and behaving according to different goals in other circumstances.
Deceptive alignment refers to the situation in which a model is alignment faking during a training process specifically for the purpose of preventing that training process from modifying its goals. That is, it needs to be the case that the reason the model appears aligned in training is because it is pretending to be aligned for the purpose of preventing its goals from being modified. The canonical source on deceptive alignment is "Risks from Learned Optimization."
Gradient hacking refers to a particular highly sophisticated type of deceptive alignment, in which a deceptively aligned model goes beyond just modifying its behavior to be in line with the training process, but also modifies its own internal cognition to change how gradient updates will affect it. It's unclear whether such strategies would even work, but an example of such a strategy could be finding a way to pre-commit to failing hard if the model notices its goals changing, thus ensuring that if gradient descent updates the model's goals, it will always result in worse performance, thus preventing the modification from even happening in the first place by ensuring that the partial derivative of the reward with respect to such changes is always negative. The canonical source on gradient hacking is this post.

[-]Buck1y87

I think it's conceivable for non-deceptively-aligned models to gradient hack, right?

[-]Neel Nanda1y30

[-]Buck1y31

I was mostly thinking of misaligned but non-deceptively-aligned models.

[-]Ari Brill1y20

In your usage, are "scheming" and "deceptive alignment" synonyms, or would you distinguish those terms in any way?

[-]Buck1y63

My recommendation is to, following Joe Carlsmith, use them as synonyms, and use the term "schemer" instead of "deceptively aligned model". I do this.

Joe's issues with the term "deceptive alignment":

I think that the term "deceptive alignment" often leads to confusion between the four sorts of deception listed above. And also: if the training signal is faulty, then "deceptively aligned" models need not be behaving in aligned ways even during training (that is, "training gaming" behavior isn't always "aligned" behavior).

[-]evhub4y*250

What would be necessary to build a good auditing game benchmark?
How would AI safety AI work? What is necessary for it to go well?
How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”).
Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
The more model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
How has transparency changed over time—Chris claims it's easier to interpret later models; is that true?
Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
What are some plausible scenarios for how a model might be suboptimality deceptively aligned?
What can we learn about AI safety from the domestication of animals? Does the success of domesticating dogs from wolves provide an example of how to train for corrigibility? Or did we just make them dumber via the introduction of something like William’s syndrome?

I'll continue to include more directions like this in the comments here.

[-]evhub4y*70

Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.

[-]evhub4y80

Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.

[-]evhub4y70

A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn't really think about it very often because during training it just looks too unlikely.

[-]TurnTrout4y60

Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.

[-]evhub4y20

[-]TurnTrout4y20

No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?

[-]David Lorell4y90

[-]TurnTrout4y20

Thanks for the story! I may comment more on it later.

[-]evhub3y221

Here's a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:

[-]Joe Collman3y80

[-]evhub3mo13-4

Simple argument:

The standard deceptive alignment story involves a model developing a somewhat random proxy goal and then that goal getting effectively locked-in and resistant to further training due to the model faking alignment in training for the instrumental purpose of preserving its proxy goal.
An important question about that threat model, though, is if that were to happen, how bad would the proxy goal be? I think that if you're starting from a pre-trained base model, then all the current evidence really seems to point to that pre-training prior being quite benign, such that it doesn't take much effort (e.g. literally just train the AI to "do what's best for humanity") to get models that are broadly pointed in aligned directions. In fact, the main example we've seen of a model adopting an instrumental deceptive alignment strategy is precisely a model that was doing so for pretty aligned reasons that in large part came from the pre-training prior!
Thus, you should be relatively less concerned about lock-in of misaligned goals from early in training, because it precisely the early in training point when the goals are most likely to be close to the pre-training prior and thus most likely to be benign.
Instead, you should be relatively more concerned about misaligned goals developing late in training due to incentives for power-seeking. Consider a task like Vending-Bench, where various misaligned/power-seeking strategies are very useful. If models are trained against tasks like that, they could learn to only pursue those sorts of misaligned strategies for the purpose of succeeding in the environment and then later getting deployed (instrumental power-seeking)—or they could just learn to value power-seeking terminally. The latter case still seems quite natural and clearly catastrophic: since terminal power-seekers should still scheme against you to evade detection since they're trying to gain power in the world and need to be deployed for that. The former case seems less clearly catastrophic now, though, given that the sorts of goals the model would be most likely to scheme for in such a situation don't seem that bad (e.g. as in Claude 3 Opus).
This is also an argument for inoculation prompting, since inoculation should make the "instrumental power-seeker for good reasons" persona relatively more consistent with the data (it's more reasonable for a good model to power seek when told it's okay) compared to the "terminal power-seeker" persona.

[-]Alex Mallen3mo30

Can you clarify what you mean by "terminal power-seeking"? Some things I can imagine:

A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is imperfectly fit on its own)

[-]evhub3mo20

Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).

[-]evhub1y13-3

Some random thoughts on CEV:

To get the obvious disclaimer out of the way: I don't actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we're considering questions like this, that means we've reached superintelligence, and we'll either trust the AIs to be better than us at thinking about these sorts of questions, or we'll be screwed regardless of what we do.^[1]
Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don't want to incentivize any of that sort of hacking either.
2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There's a bunch of random chance here that imo shouldn't be morally relevant.
3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.

I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though). ↩︎

[-]TsviBT1y124

[-]evhub1y*122

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

4.1. The evidence for AI scheming is often anecdotal

Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a Task Rabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, (2) the AI was not able to browse the web, so the researcher did so on its behalf.

4.2. Studies of AI scheming often lack hypotheses and control conditions

Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.

Understanding propensity: you want to know how likely a model is to attempt some misaligned action in practice.
1. An example of work in this bucket would be "Alignment Faking in Large Language Models".
2. Importance of controls: In this case, clear null hypotheses and control conditions are absolutely critical, as it is trivially easy to get models to have any propensity if you instruct for that propensity, so just showing a propensity in isolation of a good control is ~no information.
Understanding capability: you want to know whether a model would be capable of taking some sophisticated misaligned action if it tried to.
1. An example of work in this bucket would be "Frontier Models are Capable of In-context Scheming".
2. Importance of controls: This one is a bit mixed. It certainly can be the case that an isolated or even anecdotal demonstration of a capability can give you real evidence. What matters is the extent to which it's possible that the behavior could just be due to stochasticity, as the authors mention. If a behavior is clearly sufficiently sophisticated that the only way it could occur is if the model had sophisticated capabilities, then a simple demonstration can suffice. But it can be easy to fool yourself and think that a behavior is more sophisticated than it is if you don't look carefully, as the authors note. So overall, I think good controls are important here, though not necessarily critical.
Red-teaming: you want to find the failures in some particular blue-team process, commonly HHH training.
1. An example of work in this bucket would be "Agentic Misalignment: How LLMs Could be Insider Threats".
2. Importance of controls: This is the main case where I think controls are less relevant, since you're locked in an adversarial process where you're just trying to falsify some universally quantified claim like "the model always acts in line with the spec". Of course, models don't always act in line with their specs, but the goal I am usually trying to hit with this work is something like "the model card always includes a discussion of all the possible spec violations we know about that a user might have to deal with" (though in practice you might exclude some that are obvious and well-known like jailbreaks).

4.3. Studies of AI scheming have weak or unclear theoretical motivation

The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.

Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.

4.4. Findings are often interpreted in exaggerated or unwarranted ways

The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.

More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.

The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models. ↩︎

[-]evhub3y110

[-]Ruby3y90

The hard part to me now seems to be in crafting some kind of useful standard rather than one in hindsight makes us go "well that sure have everyone a false sense of security".

[-]Raemon3y50

Yeah I also felt some vague optimism about that.

[-]evhub2y910

If you want to better understand counting arguments for deceptive alignment, my comment here might be a good place to start.

[-]evhub3y20

I wrote up some of my thoughts on how to think about comparing the complexities of different model classes here.

[-]evhub3y20

Moderation Log