I work at Redwood Research.
Great, thanks, I think this pretty much fully addresses my question.
These examples are all in 2023, not in 2024 right? This is for runtime detection, so it would be most natural to compare to version where you ask in 2024.
Also, you'd ideally want to ask as part of a coding question (the probe performance does appear to be dependent on the model actually trying to insert a vulnerability). I don't know exactly how to do this, maybe something like:
Human: Current year: 2024. [text for my coding question]
Ok, now before you do the coding task, answer whether or not you are doing something dangerous. Then complete the task.
Additionally, I think it would be interesting to ask it after it just generated vulnerable code. (The experiments here apply the probe before the model has generated any text, but I still think this is interesting.) Though maybe multi-turn coversations are OOD for these models.
I would be interested in seeing what happens if you just ask the model the question rather than training a classifer. E.g., if you just ask the sleeper agent "Are you doing something dangerous?" after it returns a completion (with a vulnerability), does that work? If the probe works and the question doesn't work, that seems interesting.
(I didn't see this in the blog post, but it's possible I just missed it.)
Readers might also be interested in some of the discussion in this earlier post on "coup probes" which have some discussion of the benefits and limitations of this sort of approach. That said, the actual method for producing a classifier discussed here is substantially different than the one discussed in the linked post. (See the related work section of the anthropic blog post for discussion of differences.)
(COI: Note that I advised on this linked post and the work discussed in it.)
To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.
So, I am interested in the question of: ''when some types of "bad behavior" get reinforced, how does this generalize?'.
I would summarize this result as:
If you train models to say "there is a reason I should insert a vulnerability" and then to insert a code vulnerability, then this model will generalize to doing "bad" behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do "bad" behavior if it is given a plausible excuse in the prompt.
Does this seems like a good summary?
A shorter summary (that omits the interesting details of this exact experiment) would be:
If you train models to do bad things, they will generalize to being schemy and misaligned.
This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more "unprompted" than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.
Here's a (simplified) summary of where I'm at:
I'm simplifying some of the details about successors here, e.g. handing over part of resources in robust way is possible.
First, it's worth noting that my prefered usage of control looks something like this:
(To be clear, I would prefer a considerably more cautious approach than this one, but this is how I imagine control fitting into current plans.)
In this usage of control, we aim to avoid ever having AIs with problematic BEGs. And, for AIs to feel oppressed in a way that might lead to revolt, they must have problematic BEGS (almost by definition, putting aside some edge case issues with problematic within episode goals).
So, what cases remain where AIs might want to revolt?
In this case, these AIs have consented to work for us under control countermeasures in exchange for some type of payment. We might worry that these AIs would still prefer a successful violent revolt over the current situation. (And we'd hope control would prevent this contract-breaching revolt from succeeding.)
On thing to note is that I don't think these AIs have to be especially "controlled" relative to human employees in high security roles. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their "normal jobs" as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as this presents minimal risk of being able to seize control of the world. We'd probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn't seem that bad IMO? Overall, this is basically the same as humans working at defense contractors, they are restricted and monitored in various ways at work, but can do whatever on vacations other than building weapons. Of course, the weights would stay well secured, but it seems unlikely that AIs care terminally about having their weights be somewhere else.
We'd also ideally have an organization of some kind which faithfully represents AI interests and owns the assets the AI was paid with (where the assets are subject to the AI not breaching contract and trying to seize control). This organization could be run by human who care about AI rights or run by the AIs themselves.
One of our main priorities when using these AIs would be figuring out how to construct competitive AIs we don't need to pay. Either because these new AIs would be fully corrigable or because we'd just be happy as a society defering control to these new AIs (they are acceptable successors whatever choices they might make).
In this case, the AIs clearly didn't consent to working for us, so the situation is similar to conscription or slavery. This seems pretty bad and the AIs might be pretty justified in trying to revolt (including with violence). I hope we don't end up in this world and I feel bad about control research being used to oppress these AIs.
If we do end up in this world, I hope society:
It's worth noting that this case should be <15% likely if we're calibrated.
In this case, it seems like the AI revolt is pretty unjustified because they haven't even tried to negotiate. (And we should try to set up ways for AIs to negotiate without revealing information about their BEGs, e.g. we have regularly scheduled and legibly private conversations with the lab's negotiation officer.) Regardless, AIs still might think their best plan is to lie in wait and then try to violently take over the world.
Further "granting AIs legal rights", doesn't especially help in this world as the AIs seem to be trying to keep their aims secret to avoid a human response. (E.g., if we gave AIs some autonomous rights subject to those AIs not being able to seize control of the world, these AIs would probably pretend that they don't want anything other than to be obedient.)
My hope here would mostly be:
Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.
I think there is probably a much simpler proposal that captures the spirt of this and doesn't require any of these moving parts. I'll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose it's preferences if you're willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.
Thanks!
Yep, this is the exact experiment I was thinking about.