I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation. In this post I’ll explain my views on that:

  • I think that sufficiently good responsible scaling policies could dramatically reduce risk, and that preliminary policies like Anthropic’s RSP meaningfully reduce risk by creating urgency around key protective measures and increasing the probability of a pause if those measures can’t be implemented quickly enough.
  • I don’t think voluntary implementation of responsible scaling policies is a substitute for regulation. Voluntary commitments are unlikely to be universally adopted or to have adequate oversight, and I think the public should demand a higher degree of safety than AI developers are likely to voluntarily implement.
  • I think that developers implementing responsible scaling policies now increases the probability of effective regulation. If I instead thought it would make regulation harder, I would have significant reservations.
  • Transparency about RSPs makes it easier for outside stakeholders to understand whether an AI developer’s policies are adequate to manage risk, and creates a focal point for debate and for pressure to improve.
  • I think the risk from rapid AI development is very large, and that even very good RSPs would not completely eliminate that risk. A durable, global, effectively enforced, and hardware-inclusive pause on frontier AI development would reduce risk further. I think this would be politically and practically challenging and would have major costs, so I don’t want it to be the only option on the table. I think implementing RSPs can get most of the benefit, is desirable according to a broader set of perspectives and beliefs, and helps facilitate other effective regulation.

Why I’m excited about RSPs

I think AI developers are not prepared to work with very powerful AI systems. They don’t have the scientific understanding to deploy superhuman AI systems without considerable risk, and they do not have the security or internal controls to even safely train such models.

If protective measures didn’t improve then I think the question would be when rather than if development should be paused. I think the safest action in an ideal world would be pausing immediately until we were better prepared (though see the caveats in the next section). But the current level of risk is low enough that I think it is defensible for companies or countries to continue AI development if they have a sufficiently good plan for detecting and reacting to increasing risk.

If AI developers make these policies concrete and state them publicly, then I believe it puts the public and policymakers in a better place to understand what those policies are and to debate whether they are adequate. And I think the case for companies taking this action is quite strong—AI systems may continue to improve quickly, and a vague promise to improve safety at some unspecified future time isn’t enough.

I think that a good RSP will lay out specific conditions under which further development would need to be paused. Even though the goal is to avoid ever ending up in that situation, I think it’s important for developers to take the possibility seriously, to plan for it, and to be transparent about it with stakeholders.

Thoughts on an AI pause

If the world were unified around the priority of minimizing global catastrophic risk, I think that we could reduce risk significantly further by implementing a global, long-lasting, and effectively enforced pause on frontier AI development—including a moratorium on the development and production of some types of computing hardware. The world is not unified around this goal; this policy would come with other significant costs and currently seems unlikely to be implemented without much clearer evidence of serious risk. 

A unilateral pause on large AI training runs in the West, without a pause on new computing hardware, would have more ambiguous impacts on global catastrophic risk. The primary negative effects on risk are leading to faster catch-up growth in a later period with more hardware and driving AI development into laxer jurisdictions.

However, if governments shared my perspective on risk then I think they should already be implementing domestic policies that will often lead to temporary pauses or slowdowns in practice. For example, they might require frontier AI developers to implement additional protective measures before training larger models than those that exist today, and some of those protective measures may take a fairly long time (such as major improvements in risk evaluations or information security). Or governments might aim to limit the rate at which effective training compute of frontier models grows, in order to provide a smoother ramp for society to adapt to AI and to limit the risk of surprises.

I expect RSPs to help facilitate effective regulation

Regardless of whether risk mitigation takes the form of responsible scaling policies or something else, I think voluntary action by companies isn’t enough. If the risk is large then the most realistic approach is regulation and eventually international coordination. In reality I think the expected risk is large enough (including some risk of a catastrophe surprisingly soon) that a sufficiently competent state would implement regulation immediately.

I believe that AI developers implementing RSPs will make it easier rather than harder to implement effective regulation. RSPs provide a clear path to iteratively improving policy; they provide information about existing practices that can inform or justify regulation; and they build momentum around and legitimize the idea that serious precautions can be necessary for safe development. They are also a step towards building out the procedures and experience that would be needed to make many forms of regulation effective.

I’m not an expert in this area, and my own decisions are mostly guided by a desire to offer my honest assessments of the effects of different policies. That said, my impression from interacting with people who have more policy expertise is that they broadly agree that RSPs are likely to help rather than hurt efforts to implement effective regulation. I have mostly seen voluntary RSPs discussed, and have advocated for them, in contexts where it appears the most likely alternative is less rather than more action.

Anthropic’s RSP

I believe that Anthropic’s RSP is a significant step in the right direction. I would like to see pressure on other developers to implement policies that are at least this good, though I think there is a long way to go from there to an ideal RSP.

Some components I found particularly valuable:

  • Specifying a concrete set of evaluation results that would cause them to move to ASL-3. I think having concrete thresholds by which concrete actions must be taken is important, and I think the proposed threshold is early enough to trigger before an irreversible catastrophe with high probability (well over 90%).
  • Making a concrete statement about security goals at ASL-3—“non-state actors are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense”—and describing security measures they expect to take to meet this goal.
  • Requiring a definition and evaluation protocol for ASL-4 to be published and approved by the board before scaling past ASL-3.
  • Providing preliminary guidance about conditions that would trigger ASL-4 and the necessary protective measures to operate at ASL-4 (including security against motivated states, which I expect to be extremely difficult to achieve, and an affirmative case for safety that will require novel science).

Some components I hope will improve over time:

  • The flip side of specifying concrete evaluations right now is that they are extremely rough and preliminary. I think it is worth working towards better evaluations with a clearer relationship to risk.
  • In order for external stakeholders to have confidence in Anthropic’s security I think it will take more work to lay out appropriate audits and red teaming. To my knowledge this work has not been done by anyone and will take time. 
  • The process for approving changes to the RSP is publication and approval by the board. I think this ensures a decision will be made deliberately and is much better than nothing, but it would be better to have effective independent oversight.
  • To the extent that it’s possible to provide more clarity about ASL-4, doing so would be a major improvement by giving people a chance to examine and debate conditions for that level. To the extent that it’s not, it would be desirable to provide more concreteness about a review or decision-making process for deciding whether a given set of safety, security, and evaluation measures is adequate. 

I’m excited to see criticism of RSPs that focuses on concrete ways in which they fail to manage risk. Such criticism can help (i) push AI developers to do better, and (ii) argue to policy makers that we need regulatory requirements stronger than existing RSPs. That said, I think it is significantly better to have an RSP than to not have one, and don’t think that point should be lost in the discussion.

On the name “responsible scaling”

I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction that my primary concern is about whether developers will actually adopt good RSPs and implement them effectively.

That said, I believe that even cutting risk by 10x still leaves us with a lot of risk; I think it’s reasonable to complain that private companies causing a 1% risk of extinction is not “responsible.” I also think the basic idea of RSPs should be appealing to people with a variety of views about risk, and a more pessimistic person might think that even if all developers implement very good RSPs there is still a 10%+ risk of a global catastrophe.

On the one hand, I think it’s good for AI developers to make and defend the explicit claim that they are developing the technology in a responsible way, and to be vulnerable to pushback when they can’t defend that claim. On the other hand, I think it’s bad if calling scaling “responsible” gives (or looks like an attempt to give) a false sense of security, whether about the remaining catastrophic risk or about social impacts beyond catastrophic risk.

So “responsible scaling policy” may not be the right name. I think the important thing is the substance: developers should clearly lay out a roadmap for the relationship between dangerous capabilities and necessary protective measures, should describe concrete procedures for measuring dangerous capabilities, and should lay out responses if capabilities pass dangerous limits without protective measures meeting the roadmap.
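To make the shape of that substance concrete, here is a minimal sketch of an if-then commitment as a data structure. All names, thresholds, and measures below are hypothetical illustrations, not drawn from any existing RSP:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CapabilityThreshold:
    """A dangerous-capability trigger with a concrete measurement procedure."""
    name: str                       # e.g. a hypothetical autonomy or misuse eval
    is_crossed: Callable[[], bool]  # runs the evaluation and returns the verdict

@dataclass
class IfThenCommitment:
    """Pairs a capability threshold with the protections required to keep scaling."""
    threshold: CapabilityThreshold
    required_protections: List[str]           # e.g. security and deployment measures
    protections_in_place: Callable[[], bool]  # check of whether they are actually met

    def next_action(self) -> str:
        if not self.threshold.is_crossed():
            return "continue under current protective measures"
        if self.protections_in_place():
            return "continue, subject to the stronger measures required at this level"
        return "pause further scaling until protections meet the roadmap"
```

The substance, of course, lives in the thresholds, measurement procedures, and required protections themselves; the sketch only illustrates the structure of the commitment.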

Comments

Relatedly, I thought Managing AI Risks in an Era of Rapid Progress was great, particularly the clear statement that this is an urgent priority and the governance recommendations.

On a first reading I feel like I agree with most everything that was said, including about RSPs and the importance of regulation.

Small caveats: (i) I don't know enough to understand the implications or comment on the recommendation "they should also hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented," (ii) "take seriously the possibility that generalist AI systems will outperform human abilities across many critical domains within this decade or the next" seems like a bit of a severe understatement that might undermine urgency (I think we should take that possibility seriously over the next few years, and I'd give better than even odds that they will outperform humans across all critical domains within this decade or the next), (iii) I think that RSPs / if-then commitments are valuable not just for bridging the period between now and when regulation is in place, but for helping accelerate more concrete discussions about regulation and building relevant infrastructure.

I'm a tiny bit nervous about the way that "autonomous replication" is used as a dangerous capability here and in other communications. I've advocated for it as a good benchmark task for evaluation and responses because it seems likely to be easier than almost anything catastrophic (including e.g. intelligence explosion, superhuman weapons R&D, organizing a revolution or coup...) and by the time it occurs there is a meaningful probability of catastrophe unless you have much more comprehensive evaluations in place. That said, I think most audiences will think it sounds somewhat improbable as a catastrophic risk in and of itself (and a bit science-fiction-y, in contrast with other risks like cybersecurity that also aren't existential in-and-of-themselves but sound much more grounded). So it's possible that while it makes a good evaluation target it doesn't make a good first item on a list of dangerous capabilities. I would defer to people who have a better understanding of politics and perception; I mostly raise the hesitation because I think ARC may have had a role in how focal it is in some of these discussions.

I basically agree with almost all of Paul's points here. Some small things to add:

Specifying a concrete set of evaluation results that would cause them to move to ASL-3. I think having some concrete threshold for a pause is much better than not, and I think the proposed threshold is early enough to trigger before an irreversible catastrophe with high probability (more than 90%). 

Basically agree, although I think the specifics of the elicitation methodology that we helped draft are important to me here. (In particular: only requiring 10% success rate to count a task as "passed"; making sure that you're using ~$1000 of inference compute per task; doing good scaffolding and finetuning on a dev set of tasks from the same distribution as the threshold tasks)
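As a rough illustration of that pass rule (a sketch only: the 10% success threshold and the ~$1000 per-task inference budget come from the comment above, while the function, its name, and its guard rails are hypothetical):

```python
def task_passed(successes: int, attempts: int, compute_spent_usd: float,
                pass_rate_threshold: float = 0.10,
                compute_budget_usd: float = 1000.0) -> bool:
    """Count a threshold task as 'passed' if the agent succeeds on at least
    ~10% of attempts, and only treat a low success rate as a meaningful 'fail'
    if roughly the full inference budget was actually spent on elicitation."""
    if attempts <= 0:
        raise ValueError("need at least one elicitation attempt")
    pass_rate = successes / attempts
    if pass_rate < pass_rate_threshold and compute_spent_usd < compute_budget_usd:
        # An apparent failure with an under-spent budget may just reflect weak
        # elicitation, so flag it rather than recording a clean "fail".
        raise ValueError("elicitation budget not exhausted; result inconclusive")
    return pass_rate >= pass_rate_threshold
```

On this rule, 2 successes out of 20 well-elicited attempts would already count as a pass, which reflects the conservative direction of the methodology described above.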
 

I’m excited to see criticism of RSPs that focuses on concrete ways in which they fail to manage risk. Such criticism can help (i) push AI developers to do better, (ii) argue to policy makers that we need regulatory requirements stronger than existing RSPs. That said, I think it is significantly better to have an RSP than not, and don’t think that point should be lost in the discussion.  

Agree. I'm worried about accidental creation of an incentive gradient for companies to say and do as little as possible about safety. I think this can be reduced if critics follow this principle: "criticism of specific labs on their RSPs makes sure to explicitly name other prominent labs who haven't put out any RSP and say that this is worse"

On the object level I'd be especially excited for criticism that includes things like:

  • Presenting risks or threat models that might occur before the model has the capabilities that the evaluation is intended to capture
  • Explaining how the specified evaluations may not capture the capabilities properly
  • Proposing or developing alternative evaluations
  • Arguing that intervals of 4x effective compute between evaluations are too large and that we could blow past the intended capability limits
  • Pointing out ambiguities in the evaluation definitions
     

"The current level of risk is low enough that I think it is defensible for companies or countries to continue AI development if they have a sufficiently good plan for detecting and reacting to increasing risk."  

I think it's true that it's defensible for an individual company/country, but I also think it's not sensible for the world to be doing this overall. It seems possible to me that key capabilities limitations of current LLM agents could be overcome with the right scaffolding and finetuning (maybe ~1/1000). Given this, if I personally ran the world I would not be open-sourcing or scaling up current systems.
 

I appreciate this post, in particular the thoughts on an AI pause.

I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction that my primary concern is about whether developers will actually adopt good RSPs and implement them effectively.

The 10x reduction claim seems wild to me. I think that a lot of the variance in outcomes of AI is due to differing underlying difficulty, and it's somewhat unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.

So I don't see how even very good RSPs could come anywhere close to a 10x reduction in risk, when it seems like even if we assume the evals work ~perfectly they would likely at most lead to a few years' pause (I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures). Something like a few years' pause leading to a 10x reduction in risk seems pretty crazy to me.

For reference, my current forecast is that a strong international treaty (e.g. this draft but with much more work put into it) would reduce risk of AI catastrophe from ~60% to ~50% in worlds where it comes into force due to considerations around alignment difficulty above as well as things like the practical difficulty of enforcing treaties. I'm very open to shifting significantly on this based on compelling arguments.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very good implementation).

The point of this comment is to explain why I am primarily worried about implementation difficulty, rather than about the risk that failures will occur before we detect them. It seems extremely difficult to manage risks even once they appear, and almost all of the risk comes from our failure to do so.

(Incidentally, I think some other participants in this discussion are advocating for an indefinite pause starting now, and so I'd expect them to be much more optimistic about this step than you appear to be.)

(I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures)

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.

unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.

I don't really believe this argument. I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort", which is actually quite large. I also think there are very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible. And even if we can't solve the problem we could continue to invest in stronger understanding of risk, and with good enough understanding in hand I think there is a significant chance (perhaps 50%) that we could hold off on AI development for many years such that other game-changing technologies or institutional changes could arrive first.

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

Appreciate this clarification.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

(but conditioning on a very good implementation)

I'm still confused about the definition of "very good RSPs" and "very good implementation" here. If the evals/mitigations are defined and implemented in some theoretically perfect way by all developers of course it will lead to drastically reduced risk, but "very good" has a lot of ambiguity. I was taking it to mean something like "~95th percentile of the range of RSPs we could realistically hope to achieve before doom", but you may have meant something different. It's still very hard for me to see how under the definition I've laid out we could get to a 10x reduction. Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.

I agree directionally with all of the claims you are making, but (a) I'd guess I have much less confidence than you that even applying very large amounts of effort / accumulated knowledge, we will be able to reliably classify a system as safe or not (especially once it is getting close to and above human-level) and (b) even if we could after several years do this reliably, if you have to do a many-year pause there are various other sources of risk like countries refusing to join / pulling out of the pause and risks from open-source models including continued improvements via fine-tuning/scaffolding/etc.

I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort", which is actually quite large. I also think there are very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible

Yeah, normal-ish was a bad way to put it. I'm skeptical that 3 marginal OOMs gives significantly more than a ~5% probability of tipping the scales, but this is just intuition (if anyone knows of projects on the distribution of alignment difficulty, I would be curious). I agree that automating alignment is important and that's where a lot of my hope comes from.

[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers, but maybe ~20% seems reasonable to me now]

[edited to remove something that was clarified in another comment]

I don't think you need to reliably classify a system as safe or not.  You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.

I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that's up for debate).

So at that point you obviously aren't talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries, which I don't even think seems very unrealistic at this point and IMO is totally reasonable for "very good"), and I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good").

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much. I'm sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it's plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much

This seems to be our biggest crux; as I said, I'd be interested in analyses of the alignment difficulty distribution if any onlookers know of any. Also, a semantic point: under my current views I'd view cutting ~5% of the risk as a huge deal (at least an ~80th percentile outcome for the AI risk community, if it had a significant counterfactual impact on it), but yes, not much compared to 10x.

[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers, but maybe ~20% seems reasonable to me now]

Quick thoughts on the less cruxy stuff:

You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.

Fair, though I think 90% would be too low, and the more you raise it the longer you have to maintain the pause.

(based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium

This might coincidentally be close to the 95th percentile I had in mind.

So at that point you obviously aren't talking about 100% of countries voluntarily joining

Fair, I think I was wrong on that point. (I still think it's likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I'm open to changing my mind)

I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good")

Sorry if I wasn't clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.

Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.

Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect-sizes of the following are, but I wouldn't be surprised by 10x or greater for:

  • Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
  • Regulation of pharmaceuticals leading to reduced side-effects from drugs. (Compared to a regime where people can mostly sell what they want, and drugs only get banned after people notice that they're causing harm.)
  • Worker protection standards. (Wikipedia claims that the Netherlands has a ~17x lower rate of fatal workplace accidents than the US, which is ~22x lower than India.) I don't know what's driving the differences here, but the difference between the US and Netherlands suggests that it's not all "individuals can afford to take lower risks in richer countries".

Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).

I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).


"If the world were unified around the priority of minimizing global catastrophic risk, I think that we could reduce risk significantly further by implementing a global, long-lasting, and effectively enforced pause on frontier AI development—including a moratorium on the development and production of some types of computing hardware"

This really needs to be shouted from the rooftops. In the public sphere, people will hear "responsible scaling policy" as "It's maximally safe to keep pushing ahead with AI" rather than "We are taking on huge risks because politicians can't be bothered to coordinate".

This really needs to be shouted from the rooftops.

I disagree. I think it's important that we shout from the rooftops that the existential risk from AI is real, but I disagree that we should shout from the rooftops that a sufficiently good pause would solve it (even though I agree with Paul that it is true). I talk about this in this comment.

Historically, I think that a lot of causes have been hurt by a sort of purity-testing where scientists are forced to endorse the most extreme policy, even if it's not the best policy, on the idea that it would solve the problem in theory if you had a magic button that enacted it. Consider, for example, the idea that climate scientists should all have to endorse the idea that, if we ended capitalism, it would solve climate change. Though true, I do not think that would help the cause of climate change! Even if climate change were enough of an existential risk that it was worth sacrificing our entire economy for it (as is maybe true of AI risk), it would still not be the case that advocating for that would be at all helpful, because there are much more effective ways of addressing climate change than starting a communist revolution.

I think everyone should be clear about what they think the risks are, but I think forcing people to publicly endorse policies that they don't endorse in practice just because they would solve the problem in theory is not a recipe for policy success.

I think forcing people to publicly endorse policies

Saying "If this happened, it would solve the problem" is not to endorse a policy. (though perhaps literally shouting from the rooftops might be less than ideal)

It's entirely possible to state both "If x happened, it'd solve the problem", and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply.

It's reasonable to say that this might not be the most effective communication strategy. (though I think on balance I'd disagree)
It's not reasonable to say that this amounts to publicly endorsing a policy.

...if we ended capitalism, it would solve climate change. Though true...

This seems an unhelpful parallel, first because it's not clearly true. (In particular "ended capitalism" isn't "ended capitalism, and replaced it with communism", nor "ended capitalism overnight without a plan to replace it").
Second, because the proposal in this case is to not actively enact a radically disruptive change to society.

The logic of the point you're making is reasonable, but the parallel has a bunch of baggage that reduces overall clarity IMO.

...because there are much more effective ways of addressing climate change than starting a communist revolution...

This part isn't even a parallel: here even if successful the communist revolution wouldn't be most effective. However if successful a sufficiently good pause would be most effective.

It's entirely possible to state both "If x happened, it'd solve the problem", and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply.

That's a lot of nuance that you're trying to convey to the general public, which is a notoriously hard thing to do.

Hmm. Perhaps the thing I'd endorse is more [include this in every detailed statement about policy/regulation], rather than [shout it from the rooftops].

So, for example, if the authors agree with the statement, I think this should be in:

  • ARC Evals' RSP post.
  • Every RSP.
  • Proposals for regulation.
  • ...

I'm fine if we don't start printing it on bumper stickers.

The outcome I'm interested in is something like: every person with significant influence on policy knows that this is believed to be a good/ideal solution, and that the only reasons against it are based on whether it's achievable in the right form.

If ARC Evals aren't saying this, RSPs don't include it, and many policy proposals don't include it..., then I don't expect this to become common knowledge.
We're much less likely to get a stop if most people with influence don't even realize it's the thing that we'd ideally get.


I think there are two paths, roughly, that RSPs could send us down. 

  1. RSPs are a good starting point. Over time we make them more concrete, build out the technical infrastructure to measure risk, and enshrine them in regulation or binding agreements between AI companies. They reduce risk substantially, and provide a mechanism whereby we can institute a global pause if necessary, which seems otherwise infeasible right now.
     
  2. RSPs are a type of safety-washing. They provide the illusion of a plan, but as written they are so vague as to be meaningless. They let companies claim they take safety seriously but don't meaningfully reduce risk, and in fact may increase it by letting companies skate by without doing real work, rather than forcing companies to act responsibly by just not developing a dangerous uncontrollable technology.

If you think that Anthropic and other labs that adopt these are fundamentally well meaning and trying to do the right thing, you'll assume that we are by default heading down path #1.  If you are more cynical about how companies are acting, then #2 may seem more plausible.

My feeling is that Anthropic et al are clearly trying to do the right thing, and that it's on us to do the work to ensure that we stay on the good path here, by working to deliver the concrete pieces we need, and to keep the pressure on AI labs to take these ideas seriously.  And to ask regulators to also take concrete steps to make RSPs have teeth and enforce the right outcomes. 

But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.

But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.

This seems wrong to me. We can say all kinds of things, like:

  • Are these RSPs actually effective if implemented? How could they be better? (Including aspects like: how will this policy be updated in the future? What will happen given disagreements?)
  • Is there external verification that they are implemented well?
  • Which developers have and have not implemented effective and verifiable RSPs?
  • How could employees, the public, and governments push developers to do better?

I don't think we're just sitting here and rolling a die about which is going to happen, path #1 or path #2. Maybe that's right if you just are asking how much companies will do voluntarily, but I don't think that should be the exclusive focus (and if it was there wouldn't be much purpose to this more meta discussion).  One of my main points is that external stakeholders can look at what companies are doing, discuss ways in which it is or isn't adequate, and then actually push them to do better (and build support for government action to demand better). That process can start immediately, not at some hypothetical future time.

If you think that Anthropic and other labs that adopt these are fundamentally well meaning and trying to do the right thing, you'll assume that we are by default heading down path #1. If you are more cynical about how companies are acting, then #2 may seem more plausible.

I disagree that what you think about a lab's internal motivations should be very relevant here. For any particular lab/government adopting any particular RSP, you can just ask, does having this RSP make it easier or harder to implement future good legislation? My sense is that the answer to that question should mostly depend on whether the substance of the RSP is actually better-than-nothing or not, and what your general models of politics are, rather than any facts about people's internal motivations—especially since trying to externally judge the motivation of a company with huge PR resources is a fundamentally fraught thing to do.

Furthermore, my sense is that, most of the time, the crux here tends to be more around models of how politics works. If you think that there's only a very narrow policy window to get in some policy and if you get the wrong policy in you miss your shot, then you won't be willing to accept an RSP that is good but insufficient on its own. I tend to refer to this as the "resource mindset"—you're thinking of political influence, policy windows, etc. as a limited resource to be spent wisely. My sense, though, is that the resource mindset is just wrong when applied to politics—the right mindset, I think, is a positive-sum mindset, where small better-than-nothing policy actions yield larger, even-better-than-nothing policy actions, until eventually you build up to something sufficient.

Certainly I could imagine situations where an RSP is crafted in such a way as to try to stymie future regulation, though I think doing so is actually quite hard:

  • Governments have sovereignty, so you can't just restrict what they'll do in the future.
  • Once a regulatory organization exists for something, it's very easy to just give it more tasks, make it stricter, etc., and much harder to get rid of it, so the existence of previous regulation generally makes new regulation easier not harder.
  • At least in democracies, leaders regularly come and go, and tend to like to get their new thing passed without caring that much about repealing the old thing, so different overlapping regulations can easily pile up.

Of course, that's not to say that we shouldn't still ask for RSPs that make future regulation even more likely to be good, e.g. by:

  • Not overclaiming about what sort of stuff is measurable (e.g. not trying to formalize simple metrics for alignment that will be insufficient).
  • Leaving open clear and obvious holes to be filled later by future regulation.
  • etc.

In My views on “doom” you wrote:

Probability of messing it up in some other way during a period of accelerated technological change (e.g. driving ourselves crazy, creating a permanent dystopia, making unwise commitments…): 15%

Do you think these risks can also be reduced by 10x by a "very good RSP"? If yes, how or by what kinds of policies? If not, isn't "cut risk dramatically [...] perhaps a 10x reduction" kind of misleading?

It concerns me that none of the RSP documents or discussions I've seen talked about these particular risks, or "unknown unknowns" (other risks that we haven't thought of yet).

I'm also bummed that "AI pause" people don't talk about these risks either, but at least an AI pause would implicitly address these risks by default, whereas RSPs would not.

I don't think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I'm not even really sure these risks are getting better at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that's a very small impact). I think that in that regime random political and social consequences of faster or slower technological development likely dominate the direct effects from becoming better prepared over time. I would have the same view in retrospect about e.g. a possible pause on AI development 6 years ago. I think at that point the amount of quality-adjusted work on alignment was probably higher than the quality-adjusted work on these kinds of risks today, but still the direct effects on increasing alignment preparedness would be pretty tiny compared to random other incidental effects of a pause on the AI landscape.

if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that’s a very small impact

I guess from my perspective, the biggest impact is the possibility that the idea of better preparing for these risks becomes a lot more popular. An analogy with Bitcoin comes to mind, where the idea of cryptography-based distributed money languished for many years, known only to a tiny community, and then was suddenly everywhere. An AI pause would provide more time for something like that to happen. And if the idea of better preparing for these risks was actually a good one (as you seem to think), there's no reason why it couldn't (or would be very unlikely to) spread beyond a very small group, do you agree?

Thanks for writing this.

I'd be interested in your view on the comments made on Evan's RSP post w.r.t unknown unknowns. I think aysja put it best in this comment. It seems important to move the burden of proof.

Would you consider "an unknown unknown causes a catastrophe" to be a "concrete way in which they fail to manage risk"? Concrete or not, this seems sufficient grounds to stop, unless there's a clear argument that a bit more scaling actually helps for safety. (I'd be interested in your take on that - e.g. on what speed boost you might expect with your own research, given AI assistants of some level)

By default, I don't expect the "affirmative case for safety that will require novel science" to be sufficient if it ends up looking like "We developed state of the art tools that address all known problems, and we don't expect others".

On the name, it's not 'responsible' that bothers me, but rather 'scaling'.
"Responsible x-ing policy" gives the strong impression that x-ing can be done responsibly, and that x-ing will continue. I'd prefer e.g. "Responsible training and deployment policy". That way scaling isn't a baked in presumption, and we're naming things that we know can be done responsibly.

Unknown unknowns seem like a totally valid basis for concern.

But I don't think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause a catastrophe before a lab would otherwise stop. Without further elaboration I don't think "unknown unknowns could cause a catastrophe" is enough to convince governments (or AI developers) to take significant actions.

I think RSPs make this situation better by pushing developers away from vague "Yeah we'll be safe" to saying "Here's what we'll actually do" and allowing us to have a conversation about whether that specific thing is sufficient to prevent risk early enough. I think this is way better, because vagueness and equivocation make scrutiny much harder.

My own take is that there is small but non-negligible risk before Anthropic's ASL-3. For my part I'd vote to move to a lower threshold, or to require more stringent protective measures when working with any system bigger than LLaMA. But I'm not the median voter or decision-maker here (nor is Anthropic), and so I'll say my piece but then move on to trying to convince people or to find a compromise that works.

The specific conversation is much better than nothing - but I do think it ought to be emphasized that solving all the problems we're aware of isn't sufficient for safety. We're training on the test set.[1]
Our confidence levels should reflect that - but I expect overconfidence.

It's plausible that RSPs could be net positive, but I think that given successful coordination [vague and uncertain] beats [significantly more concrete, but overconfident].
My presumption is that without good coordination (a necessary condition being cautious decision-makers), things will go badly.

RSPs seem likely to increase the odds we get some international coordination and regulation. But to get sufficient regulation, we'd need the unknown unknowns issue to be covered at some point. To me this seems simplest to add clearly and explicitly from the beginning. Otherwise I expect regulation to adapt to issues for which we have concrete new evidence, and to fail to adapt beyond that.

Granted that you're not the median voter/decision-maker - but you're certainly one of the most, if not the most, influential voice on the issue. It seems important not to underestimate your capacity to change people's views before figuring out a compromise to aim for (I'm primarily thinking of government people, who seem more likely to have views that might change radically based on a few conversations). But I'm certainly no expert on this kind of thing.

  1. I do wonder whether it might be helpful not to share all known problems publicly on this basis - I'd have somewhat more confidence in safety measures that succeeded in solving some problems of a type the designers didn't know about.