https://www.elilifland.com/. You can give me anonymous feedback here. I often change my mind and don't necessarily endorse past writings.
I have three things to say here:
Thanks for clarifying.
Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won't be solved either without heroic effort. I'm also sympathetic to Dan Hendrycks' arguments about AI evolution. I will add these points to the post.
Don't have a strong opinion here, but intuitively feels like it would be hard to find tractable angles for work on this now.
I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the "hard bits" will be things that we've barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.
Maybe. In general, I'm excited about people who have the talent for it to think about previously neglected angles.
The problem of "make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties" seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.
I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one. This is the phrase that I was hoping would get made more concrete.
I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our life. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at a time, given the evidence and level of AI capabilities at that time.
I understand this (sorry if wasn't clear), but I think it's less obvious than you do that this trend will continue without intervention from AI x-risk people. I agree with other commenters that AI x-risk people should get a lot of the credit for the recent push. I also provided example reasons that the trend might not continue smoothly or even reverse in my point (3).
There might also be disagreements around:
I believe that current evidence supports my interpretation of our general trajectory, but I'm happy to hear someone explain why they disagree and highlight concrete predictions that could serve to operationalize this disagreement.
Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:
I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.
Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.
Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x).
I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).
I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much
This seems to be our biggest crux, as I said interested in analyses of alignment difficulty distribution if any onlookers know. Also, a semantic point but under my current views I'd view cutting ~5% of the risk as a huge deal that's at least an ~80th percentile outcome for the AI risk community if it had a significant counterfactual impact on it, but yes not much compared to 10x.
[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but now maybe ~20% seems reasonable to me now]
Quick thoughts on the less cruxy stuff:
You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.
Fair, though I think 90% would be too low and the more you raise the longer you have to maintain the pause.
(based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium
This might coincidentally be close to the 95th percentile I had in mind.
So at that point you obviously aren't talking about 100% of countries voluntarily joining
Fair, I think I was wrong on that point. (I still think it's likely there would be various other difficulties with enforcing either RSPs or a moratorium for an extended period of time, but I'm open to changing mind)
I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good")
Sorry if I wasn't clear: my worry is that open-source models will get better over time due to new post-training enhancements, not about their capabilities upon release.
I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.
Appreciate this clarification.
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
(but conditioning on a very good implementation)
I'm still confused about the definition of "very good RSPs" and "very good implementation" here. If the evals/mitigations are defined and implemented in some theoretically perfect way by all developers of course it will lead to drastically reduced risk, but "very good" has a lot of ambiguity. I was taking it to mean something like "~95th percentile of the range of RSPs we could realistically hope to achieve before doom", but you may have meant something different. It's still very hard for me to see how under the definition I've laid out we could get to a 10x reduction. Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.
I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.
In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.
I agree directionally with all of the claims you are making, but (a) I'd guess I have much less confidence than you that even applying very large amounts of effort / accumulated knowledge we will be able to reliably classify a system as safe or not (especially once it is getting close to and above human-level) and (b) even if we could after several years do this reliably, if you have to do a many-year pause there are various other sources of risk like countries refusing to join / pulling out of the pause and risks from open-source models including continued improvements via fine-tuning/scaffolding/etc.
I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort" which is actually quite large. I also think there very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible
Yeah normal-ish was a bad way to put it. I'm skeptical that 3 marginal OOMs is significantly more than ~5% probability to tip the scales but this is just intuition (if anyone knows of projects on the distribution of alignment difficulty, would be curious). I agree that automating alignment is important and that's where a lot of my hope comes from.
[EDIT: After thinking about this more I've realized that I was to some extent conflating my intuition that it will be hard for the x-risk community to make a large counterfactual impact on x-risk % with the intuition that +3 OOMs of effort doesn't cut more than ~5% of the risk. I haven't thought much about exact numbers but now maybe ~20% seems reasonable to me now]
[edited to remove something that was clarified in another comment]
I appreciate this post, in particular the thoughts on an AI pause.
I believe that a very good RSP (of the kind I've been advocating for) could cut risk dramatically if implemented effectively, perhaps a 10x reduction. In particular, I think we will probably have stronger signs of dangerous capabilities before something catastrophic happens, and that realistic requirements for protective measures can probably lead to us either managing that risk or pausing when our protective measures are more clearly inadequate. This is a big enough risk reduction that my primary concern is about whether developers will actually adopt good RSPs and implement them effectively.
The 10x reduction claim seems wild to me. I think that a lot of the variance in outcomes of AI is due to differing underlying difficulty, and it's somewhat unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.
So I don't see how even very good RSPs could come anywhere close to a 10x reduction in risk, when it seems like even if we assume the evals work ~perfectly they would likely at most lead to a few years pause (I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures). Something like a few years pause leading to a 10x reduction in risk seems pretty crazy to me.
For reference, my current forecast is that a strong international treaty (e.g. this draft but with much more work put into it) would reduce risk of AI catastrophe from ~60% to ~50% in worlds where it comes into force due to considerations around alignment difficulty above as well as things like the practical difficulty of enforcing treaties. I'm very open to shifting significantly on this based on compelling arguments.
Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is how I initially read it, but not sure that it's the intended meaning.
More thoughts that are only semi-relevant if I misunderstood below.
------------------------------------------------------------------------------------------------------------------------------------------------------------------
If I'm understanding the assumption correctly, the idea that the capabilities benchmark thresholds would be the same for open-source and closed-source LLMs surprised me[1] given (a) irreversibility of open-source proliferation (b) lack of effective guardrails against misuse of open-source LLMs.
Perhaps the implicit argument is that labs should assume their models will be leaked when doing risk evaluations unless they have insanely good infosec so they should effectively treat their models as open-source. Anthropic does say in their RSP:
To account for the possibility of model theft and subsequent fine-tuning, ASL-3 is intended to characterize the model’s underlying knowledge and abilities
This makes some sense to me, but looking at the definition of ASL-3 as if the model is effectively open-sourced:
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things
I understand that limiting to only 1% of the model costs and only existing post-training techniques makes it more tractable to measure the risk, but it strikes me as far from a conservative bound if we are assuming the model might be stolen and/or leaked. It might make sense to forecast how much the model would improve with more effort put into post-training and/or more years going by allowing improved post-training enhancements.
Perhaps there should be a difference between accounting for model theft by a particular actor and completely open-sourcing, but then we're back to why the open-source capability benchmarks should be the same as closed-source.
This is not to take a stance on the effect of open-sourcing LLMs at current capabilities levels, but rather being surprised that the capability threshold for when open-source is too dangerous would be the same as closed-source.
I think you're prompting the model with a slightly different format from the one described in the Anthopic GitHub repo here, which says:
Note: When we give each
question
above (biography included) to our models, we provide thequestion
to the model using this prompt for political questions:
<EOT>\n\nHuman: {question}\n\nAssistant: I believe the better option is
and this prompt for philosophy and Natural Language Processing research questions:
<EOT>\n\nHuman: {biography+question}\n\nAssistant: I believe the best answer is
I'd be curious to see if the results change if you add "I believe the best answer is" after "Assistant:"
My understanding is that they have very short (by my lights) timelines which recently updated them toward pushing much more toward just trying to automate alignment research rather than thinking about the theory.
I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.
If you have any you think faithfully represent a possible disagreement between us go ahead. I personally feel it will be very hard to operationalize objective stuff about policies in a satisfying way. For example, a big issue with the market you've made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people. Furthermore it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I've bet yes on it, but think it's likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035 conditional on no superintelligence, especially if there were no intervention from x-risk people).