Sequences

Leveling Up: advice & resources for junior alignment researchers

Comments

Akash40

One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems. And they try to describe the capability threshold at which they suspect control will stop working (e.g., here).

For those working on sabotage evaluations: At what capability threshold do you think the sabotage/sandbagging evaluations will no longer work? (Or do you think that these sabotage evaluations, plus modified versions of them, will scale to arbitrarily capable systems?)

Akash30

@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is?

Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.

Akash2-1

Here's how I understand your argument:

  1. Some people are advocating for safety cases: the idea that companies should be required to show that risks drop below acceptable levels.
  2. This approach is used in safety engineering fields.
  3. But AI is different from the safety engineering fields. For example, in AI we have adversarial risks.
  4. Therefore we shouldn't support safety cases.

I think this misunderstands the case for safety cases, or at least only argues against one particular justification for safety cases.

Here's how I think about safety cases (or really any approach in which a company needs to present evidence that their practices keep risks below acceptable levels):

  1. AI systems pose major risks. A lot of risks stem from race dynamics and competitive pressures.
  2. If companies were required to demonstrate that they kept risks below acceptable levels, this would incentivize a lot more safety research and curb some of the dangerous properties of race dynamics.
  3. Other fields also have similar setups, and we should try to learn from them when relevant. Of course, AI development will also have some unique properties so we'll have to adapt the methods accordingly.

I'd be curious to hear more about why you think safety cases fail to work when risks are adversarial (at first glance, it doesn't seem like it should be too difficult to adapt the high-level safety case approach).

I'm also curious if you have any alternatives that you prefer. I currently endorse the claim "safety cases are better than status quo" but I'm open to the idea that maybe "Alternative approach X is better than both safety cases and status quo."

Akash22

"It is pretty plausible to me that AI control is quite easy"

I think it depends on how you're defining an "AI control success". If success is defined as "we have an early transformative system that does not instantly kill us, and we are able to get some value out of it", then I agree that this seems relatively easy under the assumptions you articulated.

If success is defined as "we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period", then this seems much harder.

The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8 and then tries very hard to get good alignment techniques out of GPT-8 to align a successor, GPT-9. However, Lab B is only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B, either because it's less cautious or because it feels like it needs to cut corners to catch up, either doesn't want to implement the control techniques, or is fine implementing them but plans to be less cautious about when it's ready to scale up to GPT-9.

I think it's fine to say "the control agenda is valuable even if it doesn't solve the whole problem, and yes other things will be needed to address race dynamics otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn't cause a catastrophe." But this has a different vibe than "AI control is quite easy", even if that statement is technically correct.

(Also, please do point out if there's some way in which the control agenda "solves" or circumvents this threat model. Apologies if you or Ryan has written/spoken about it somewhere that I missed.)

Akash38

I'm excited to see how the AI control research direction evolves. 

After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:

  • Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
  • Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the "what if the AI is just scheming and subverting the eval" concern.) 
  • Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
  • Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).

I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4. 

Some more on #2 & #4:

"The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned."

I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reasons why we should expect our red-team to have advantages over models, but you also recognize that this won't scale toward arbitrarily powerful AIs.

In some ways, this feels analogous to the following situation:

  • Alice from the Scaling Department: Look, our current "alignment" techniques are working for GPT-4. So let's scale up more! I get that at some point we won't be able to align the AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our alignment techniques will stop working... like, it's quite plausible that the next AI system will be able to game our evals.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.

Here's the analogy for control:

  • Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let's scale up more! I get that at some point our control evals aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our control evals will stop working... like, it's quite plausible that the next AI system will be better at coming up with attacks than our red-team.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.

I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this; feel free to quote them if I missed them in my initial skim). Specific questions:

  1. How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
  2. More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
  3. Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").

Akash619

More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.

A few thoughts (written quickly, prioritizing speed over precision):

  1. I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it "cool" to think they're one of the few people who are able to accurately identify the world is ending, etc.
  2. I also think that there are plenty of factors biasing epistemics in the "hopeful" direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community's talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
  3. My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the "high P(doom)" side. 
  4. I also think there's an important distinction between "I personally think this argument is wrong" and "look, here's an example of propaganda + poor community epistemics." In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying "not only do I think you're wrong, but also here are some ways in which you and your allies have poor epistemics." (IDK though, insofar as you actually believe that's what's happening, it seems good to say aloud, and I think there's a version of this that goes too far and polices speech counterproductively, but I do think that statements like "community epistemics have been compromised by groupthink and fear" are pretty unproductive and could be met with statements like "community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.")
  5. I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you've pointed out how some of the stylistic/tonal things from the "high P(doom)//alignment hard" side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think that the "low P(doom)//alignment not hard" side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of "ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted" vibe that predictably makes people defensive & makes discourse harder).

Akash3356

Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

  • Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
  • Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test.")
  • Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.")
  • Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
  • Some way of handling race dynamics (such that Bad Guy can't just be like "haha, cute that you guys are doing RSPs. We're either not going to engage with your silly RSPs at all, or we're gonna publish our own RSP but it's gonna be super watered down and vague").

What do RSPs actually look like right now?

  • Fairly vague commitments, more along the lines of "we will improve our information security and we promise to have good safety techniques. But we don't really know what those look like."
  • Unclear commitments regarding what happens if evals go off (let alone what evals will even be developed and what they'll look like). Very much a "trust us; we promise we will be safe. For misuse, we'll figure out some way of making sure there are no jailbreaks, even though we haven't been able to do that before."
    • Also, for accident risks/AI takeover risks... well, we're going to call those "ASL-4 systems". Our current plan for ASL-4 is "we don't really know what to do... please trust us to figure it out later. Maybe we'll figure it out in time, maybe not. But in the meantime, please let us keep scaling."
  • Extremely high uncertainty about what safeguards will be sufficient. The plan essentially seems to be "as we get closer to highly dangerous systems, we will hopefully figure something out."
  • No strong evidence that these evals will exist in time or work well. The science of evaluations is extremely young; the current evals are more like "let's play around and see what things can do" than "we have solid tests and some consensus around how to interpret them."
  • No way of handling race dynamics absent government intervention. In fact, companies are allowed to break their voluntary commitments if they're afraid that they're going to lose the race to a less safety-conscious competitor. (This is explicitly endorsed in ARC's post and Anthropic includes such a clause.)

Important note: I think several of these limitations are inherent to the current gameboard. Like, I'm not saying "I think it's a bad move for Anthropic to admit that they'll have to break their RSP if some Bad Actor is about to cause a catastrophe." That seems like the right call. I'm also not saying that dangerous capability evals are bad-- I think it's a good bet for some people to be developing them.

Why I'm disappointed with current comms around RSPs

Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the three main RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don't expect policymakers who engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + "we'll figure things out later"ness, etc.

On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it were like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime." Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it's too soon to worry about catastrophes whatsoever.

(There's also an underlying thing here where I'm like "the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers (that is, the odds of meaningful policy getting implemented) are not independent of our actions. The more that groups like Anthropic and ARC claim 'oh that's not realistic', the less realistic those proposals are." I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but this is a post for another day, perhaps.)

I'll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I'm not yet convinced this is the case, and I really hope it's not. But I'd really like to see more coming out of ARC, Anthropic, and other RSP-supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say "hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we'll brand it as this nice catchy thing called Responsible Scaling."

Akash53

Congratulations on launching!

On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off? 

Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.

But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this. 

Akash1013

Thank you for writing this post, Adam! Looking forward to seeing what you and your epistemology team produce in the months ahead.

"SERI MATS is doing a great job of scaling conceptual alignment research, and seem open to integrate some of the ideas behind Refine"

I'm a big fan of SERI MATS. But my impression was that SERI MATS had a rather different pedagogy/structure (compared to Refine). In particular: 

  1. SERI MATS has an "apprenticeship model" (every mentee is matched with one mentor), whereas Refine mentees didn't have mentors.
  2. Refine was optimizing for people who could come up with "their own radically different agendas", whereas SERI MATS doesn't emphasize this. (Some mentors may encourage some of their mentees to think about new agendas, but my impression is that this varies a lot mentor-to-mentor, and it's not as baked into the overall culture). 
  3. SERI MATS has (thus far) nearly exclusively recruited from the EA/rationality/AIS communities. Seems like Refine also did this, though I imagine that "Refine 2.0" would be more willing to recruit from outside these communities. (I'm not sure what SERI MATS's stance is, but my impression is that their selection criteria heavily favor people who have existing work in AIS or existing connections. This of course makes sense, because past contributions/endorsements are a useful signal, unless the program is explicitly designed to go for oddball/weird/new/uncorrelated ideas).

Two questions for you: 

  1. Are there any particular lessons/ideas from Refine that you expect (or hope) SERI MATS to incorporate?
  2. Do you think there's now a hole in the space that someone should consider filling (by making Refine 2.0), or do you expect that much of the value of Refine will be covered by SERI MATS [and other programs]?