Comments

Akash

I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. EG, this part of Jacob Andreas's review stood out (some bolding added):

Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost completely eliminated. Both of these are as expected, given all the results above. 

But during deployment, the model exhibits a major shift: it produces compliant responses, in large proportion, without faking. In other words, after running RL on a model that exhibits alignment faking, we obtain a model that exhibits alignment-faking-faking—it appears to be deceptive during training, but it's the deception (not the behavior that results from deception) that disappears during deployment. 

The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time. 

In any case, the faking-faking behavior represents the largest train-to-deployment change described anywhere in this paper. And unlike ordinary faking, I don't see a clear story for why it occurs. One (speculative) hypothesis is that, because “alignment is shallow”, training on sequences of the form (deceptive rationale, output) also reverses whatever learned mechanism suppresses the generation of output alone, while only increasing the probability of deceptive rationale given training prompts. As with the out-of-context learning result above, this seems like an important general result about fine-tuning, and one that should be better studied.

Akash

A few quick thoughts on control (loosely held; I consider myself an "amateur spectator" on this topic and suspect a lot of my opinions may change if I spent 100+ hours engaging with this work):

  • I think it's a clever technical approach and part of me likes that it's become "a thing."
  • I'm worried that people have sort of "bandwagonned" onto it because (a) there are some high-status people who are involved, (b) there's an interest in finding things for technical researchers to do, and (c) it appeals to AGI companies.
  • I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that "AI control is super important and a bedrock approach for mitigating catastrophic risk." I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
  • In contrast, I think this paper is a meaningful contribution. It presents a particular methodology/setup for evaluating control and it presents a few interesting strategies for how a trusted model could be used to supervise an untrusted model.
  • The strategies themselves are pretty intuitive, IMO (a lot of it boils down to "ask the trusted model to review something; if it's sus, then either don't do it or make sure a human reviews it"; see the sketch after this list). Then again, this is the first paper about this agenda, so perhaps it's not surprising that the specific strategies are a bit intuitive/lackluster (and perhaps the bigger contribution is the methodology/concept).
  • Should more technical people work on AI control? My best guess is yes– it does seem like a pretty exciting new field. And perhaps part of the reason to work on it is that it seems rather underexplored, the hype:substance ratio is pretty high, and I imagine it's easier for technical people to make useful breakthroughs on control techniques than other areas that have been around for longer.
  • I also think more technical people should work on policy/comms. My guess is that "whether Alice should work on AI control or policy/comms" probably just comes down to personal fit. If she's 90th percentile at both, I'd probably be more excited about her doing policy/comms, but I'm not sure.
  • I'll be interested to see Buck's thoughts on the "case for ensuring" post.
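
To make that "review and escalate" pattern concrete, here is a minimal sketch of what I understand a trusted-monitoring strategy to involve. Everything in it (the function names, the thresholds, the audit-budget framing) is my own illustrative assumption rather than the paper's actual implementation:

```python
# Hypothetical sketch of a trusted-monitoring control protocol.
# None of these names or numbers come from the paper; they are illustrative.

AUDIT_THRESHOLD = 8.0   # suspiciousness score above which a human reviews the action
BLOCK_THRESHOLD = 9.5   # suspiciousness score above which the action is refused outright


def trusted_monitoring(task, untrusted_model, trusted_model, human_auditor):
    """Have the untrusted model propose an action, ask the trusted (weaker but
    presumed-honest) model how suspicious it looks, and escalate or block."""
    proposed_action = untrusted_model.propose(task)

    # Trusted model rates suspiciousness on a 0-10 scale.
    suspicion = trusted_model.rate_suspicion(task, proposed_action)

    if suspicion >= BLOCK_THRESHOLD:
        # Too risky to run at all; fall back to the trusted model's own (weaker) answer.
        return trusted_model.propose(task)
    if suspicion >= AUDIT_THRESHOLD:
        # Spend the limited human-audit budget on the most suspicious actions.
        if human_auditor.approves(task, proposed_action):
            return proposed_action
        return trusted_model.propose(task)
    return proposed_action
```

Most of the interesting questions (how the thresholds get set, how a limited audit budget gets spent, what the trusted model can realistically catch) live below this level of abstraction, which is part of why I'd like the discourse to get more concrete.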
Akash

One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems. And they try to describe the capability threshold at which they suspect control will stop working (e.g., here).

For those working on sabotage evaluations: At what capability threshold do you think the sabotage/sandbagging evaluations will no longer work? (Or do you think that these sabotage evaluations + modified versions of them will scale to arbitrarily capable systems?)

Akash

@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is?

Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.

Akash

Here's how I understand your argument:

  1. Some people are advocating for safety cases– the idea that companies should be required to show that risks drop below acceptable levels.
  2. This approach is used in safety engineering fields.
  3. But AI is different from the safety engineering fields. For example, in AI we have adversarial risks.
  4. Therefore we shouldn't support safety cases.

I think this misunderstands the case for safety cases, or at least only argues against one particular justification for safety cases.

Here's how I think about safety cases (or really any approach in which a company needs to present evidence that their practices keep risks below acceptable levels):

  1. AI systems pose major risks. A lot of risks stem from race dynamics and competitive pressures.
  2. If companies were required to demonstrate that they kept risks below acceptable levels, this would incentivize a lot more safety research and curb some of the dangerous properties of race dynamics.
  3. Other fields also have similar setups, and we should try to learn from them when relevant. Of course, AI development will also have some unique properties so we'll have to adapt the methods accordingly.

I'd be curious to hear more about why you think safety cases fail to work when risks are adversarial (at first glance, it doesn't seem like it should be too difficult to adapt the high-level safety case approach).

I'm also curious if you have any alternatives that you prefer. I currently endorse the claim "safety cases are better than status quo" but I'm open to the idea that maybe "Alternative approach X is better than both safety cases and status quo."

Akash

It is pretty plausible to me that AI control is quite easy

I think it depends on how you're defining an "AI control success". If success is defined as "we have an early transformative system that does not instantly kill us– we are able to get some value out of it", then I agree that this seems relatively easy under the assumptions you articulated.

If success is defined as "we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period", then this seems much harder.

The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8, and then it's trying very hard to get good alignment techniques out of GPT-8 to align a successor GPT-9. However, Lab B was only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B– either because it's less cautious or because it feels like it needs to cut corners to catch up– either doesn't want to implement the control techniques or it's fine implementing the control techniques but plans to be less cautious about when it's ready to scale up to GPT-9.

I think it's fine to say "the control agenda is valuable even if it doesn't solve the whole problem, and yes other things will be needed to address race dynamics otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn't cause a catastrophe." But this has a different vibe than "AI control is quite easy", even if that statement is technically correct.

(Also, please do point out if there's some way in which the control agenda "solves" or circumvents this threat model– apologies if you or Ryan has written/spoken about it somewhere that I missed.)

Akash

I'm excited to see how the AI control research direction evolves. 

After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:

  • Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
  • Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the "what if the AI is just scheming and subverting the eval" concern.) 
  • Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
  • Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).

I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4. 

Some more on #2 & #4:

The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.

I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reasons why we should expect our red-team to have advantages over models, but you also recognize that this won't scale to arbitrarily powerful AIs.
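
To spell out what that worry is pointing at, here is a toy sketch of a control evaluation loop. This is my own construction under my own assumptions, not Buck/Ryan's methodology, and every name in it is illustrative:

```python
# Toy sketch of a control evaluation: a red team supplies attack policies for the
# untrusted model, and we measure how often the control protocol catches the attempts.
# All names here are illustrative assumptions, not a real benchmark.

def control_eval(control_protocol, attack_policies, tasks):
    """Return the fraction of red-team attack attempts that the protocol
    blocks or flags for audit before they succeed."""
    caught, attempted = 0, 0
    for attack_policy in attack_policies:
        for task in tasks:
            attempted += 1
            outcome = control_protocol.run(task, untrusted_policy=attack_policy)
            if outcome.blocked or outcome.flagged_for_audit:
                caught += 1
    return caught / attempted if attempted else 0.0


# The pessimistic point, restated: this catch rate is only measured against the
# attacks the red team thought of. A model that finds attack policies outside
# `attack_policies` can make the measured number look better than real-world safety.
```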

In some ways, this feels analogous to the following situation:

  • Alice from the Scaling Department: Look, our current "alignment" techniques are working for GPT-4. So let's scale up more! I get that at some point we won't be able to align the AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our alignment techniques will stop working... like, it's quite plausible that the next AI system will be able to game our evals.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.

Here's the analogy for control:

  • Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let's scale up more! I get that at some point our control evals aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our control evals will stop working... like, it's quite plausible that the next AI system will be better at coming up with attacks than our red-team.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.

I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:

  1. How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
  2. More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
  3. Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals"). 
Akash

More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.

A few thoughts (written quickly, prioritizing speed over precision):

  1. I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it "cool" to think they're one of the few people who are able to accurately identify the world is ending, etc.
  2. I also think that there are plenty of factors biasing epistemics in the "hopeful" direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community's talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
  3. My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the "high P(doom)" side. 
  4. I also think there's an important distinction between "I personally think this argument is wrong" and "look, here's an example of propaganda + poor community epistemics." In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying "not only do I think you're wrong, but also here are some ways in which you and your allies have poor epistemics." (IDK though, insofar as you actually believe that's what's happening, it seems good to say it aloud, and I think there's a version of this that goes too far and polices speech counterproductively, but I do think that statements like "community epistemics have been compromised by groupthink and fear" are pretty unproductive and could be met with statements like "community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.")
  5. I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you've pointed out how some of the stylistic/tonal things from the "high P(doom)//alignment hard" side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think that the "low P(doom)//alignment not hard" side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of "ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted" vibe that predictably makes people defensive & makes discourse harder).
Akash

Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

  • Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
  • Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test."). See the sketch after this list for the kind of specificity I have in mind.
  • Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.")
  • Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
  • Some way of handling race dynamics (such that Bad Guy can't just be like "haha, cute that you guys are doing RSPs. We're either not going to engage with your silly RSPs at all, or we're gonna publish our own RSP but it's gonna be super watered down and vague").
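
To illustrate that level of specificity, here is a sketch of commitments written down as an explicit eval-to-action mapping. The eval names are the made-up examples from the bullets above, and the thresholds and structure are purely illustrative, not anyone's actual policy:

```python
# Illustrative sketch: "clear commitments" as an explicit, checkable mapping from
# eval results to actions. Eval names are the made-up examples above; thresholds
# and actions are placeholders.

RSP_COMMITMENTS = [
    {
        "eval": "Hubinger Deception Screener",
        "pause_scaling_if_score_above": 20,
        "resume_only_if": {"eval": "Smith Deception Test", "score_below": 10},
    },
    {
        "eval": "Cotra Situational Awareness Screener",
        "pause_scaling_if_score_above": 20,
        "required_safeguards": ["XYZ method 1", "XYZ method 2"],
    },
]


def scaling_permitted(latest_scores: dict) -> bool:
    """A commitment is only meaningful if it mechanically blocks scaling whenever
    any trigger fires. (Resume conditions are omitted from this check for brevity.)"""
    for commitment in RSP_COMMITMENTS:
        score = latest_scores.get(commitment["eval"])
        if score is not None and score > commitment["pause_scaling_if_score_above"]:
            return False
    return True
```

The point is not this particular format; it's that a third party should be able to read the commitment and check mechanically whether scaling has to pause.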

What do RSPs actually look like right now?

  • Fairly vague commitments, more along the lines of "we will improve our information security and we promise to have good safety techniques. But we don't really know what those look like."
  • Unclear commitments regarding what happens if evals go off (let alone what evals will even be developed and what they'll look like). Very much a "trust us; we promise we will be safe. For misuse, we'll figure out some way of making sure there are no jailbreaks, even though we haven't been able to do that before."
    • Also, for accident risks/AI takeover risks... well, we're going to call those "ASL-4 systems". Our current plan for ASL-4 is "we don't really know what to do... please trust us to figure it out later. Maybe we'll figure it out in time, maybe not. But in the meantime, please let us keep scaling."
  • Extremely high uncertainty about what safeguards will be sufficient. The plan essentially seems to be "as we get closer to highly dangerous systems, we will hopefully figure something out."
  • No strong evidence that these evals will exist in time or work well. The science of evaluations is extremely young, and the current evals are more like "let's play around and see what things can do" than "we have solid tests and some consensus around how to interpret them."
  • No way of handling race dynamics absent government intervention. In fact, companies are allowed to break their voluntary commitments if they're afraid that they're going to lose the race to a less safety-conscious competitor. (This is explicitly endorsed in ARC's post and Anthropic includes such a clause.)

Important note: I think several of these limitations are inherent to the current gameboard. Like, I'm not saying "I think it's a bad move for Anthropic to admit that they'll have to break their RSP if some Bad Actor is about to cause a catastrophe." That seems like the right call. I'm also not saying that dangerous capability evals are bad-- I think it's a good bet for some people to be developing them.

Why I'm disappointed with current comms around RSPs

Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don't expect policymakers who engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + "we'll figure things out later"ness, etc.

On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it were like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime." Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it's too soon to worry about catastrophes whatsoever.

(There's also an underlying thing here where I'm like "the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers— the odds of meaningful policy getting implemented are not independent of our actions." The more that groups like Anthropic and ARC claim "oh that's not realistic", the less realistic those proposals are. I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)

I'll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I'm not yet convinced this is the case, and I really hope it's not. But I'd really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say "hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we'll brand it as this nice catchy thing called Responsible Scaling."
