Big fan of this post. One thing worth highlighting IMO: The post assumes that governments will not react in time, so it's mostly up to the labs (and researchers who can influence the labs) to figure out how to make this go well.
TBC, I think it's a plausible and reasonable assumption to make. But I think this assumption ends up meaning that "the plan" excludes a lot of the work that could make the USG (a) more likely to get involved or (b) more likely to do good and useful things conditional on them deciding to get involved.
Here's an alternative frame: I would call the plan described in Marius's post something like the "short timelines plan assuming that governments do not get involved and assuming that technical tools (namely control/AI-automated AI R&D) are the only/main tools we can use to achieve good outcomes."
You could imagine an alternative plan described as something like the "short timelines plan assuming that technical tools in the current AGI development race/paradigm are not sufficient and governance tools (namely getting the USG to provide considerably more oversight into AGI development, curb race dynamics, make major improvements to security) are the only/main tools we can use to achieve good outcomes." This kind of plan would involve a very different focus.
Here are some examples of things that I think would be featured in a "government-focused" short timelines plan:
One possible counter is that under short timelines, the USG is super unlikely to get involved. Personally, I think we should have a lot of uncertainty RE how the USG will react. Examples of factors here: (a) new Administration, (b) uncertainty over whether AI will produce real-world incidents, (c) uncertainty over how compelling demos will be, (d) chatGPT being an illustrative example of a big increase in USG involvement that lots of folks didn't see coming, and (e) examples of the USG suddenly becoming a lot more interested in a national security domain (e.g., 9/11--> Patriot Act, recent Tik Tok ban), (f) Trump being generally harder to predict than most Presidents (e.g., more likely to form opinions for himself, less likely to trust the establishment views in some cases).
(And just to be clear, this isn't really a critique of Marius's post. I think it's great for people to be thinking about what the "plan" should be if the USG doesn't react in time. Separately, I'd be excited for people to write more about what the short timelines "plan" should look like under different assumptions about USG involvement.)
I would love to see a post laying this out in more detail. I found writing my post a good exercise for prioritization. Maybe writing a similar piece where governance is the main lever brings out good insights into what to prioritize in governance efforts.
Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, I don't think a minimal stable solution involves most of the world's compute going towards alignment research.
To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take" which is significantly lower than 100% of catastrophe.
By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions.
Curious why you say this, in particular the bit about “accurately reflects the reasons for the model’s actions.” How do you know that? (My impression is that this sort of judgment is often based on ~common sense/folk reasoning, ie, that we judge this in a similar way to how we might judge whether another person took sensical actions—was the reasoning internally consistent, does it seem predictive of the outcome, etc.? Which does seem like some evidence to me, although not enough to say that it accurately reflects what's happening. But perhaps I'm missing something here).
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways both from theoretical considerations, e.g. a refined version of the two-hop curse or extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the models internal reasoning aligns with the revealed reasoning).
These are all pretty basic thoughts and IMO we should invest significantly more effort into clarifying this as part of the "let's make sure CoT is faithful" part. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes.
Something like the OpenPhil AI worldview contest: https://www.openphilanthropy.org/research/announcing-the-winners-of-the-2023-open-philanthropy-ai-worldviews-contest/
Or the ARC ELK prize: https://www.alignment.org/blog/prizes-for-elk-proposals/
In general, I wouldn't make it too complicated and accept some arbitrariness. There is a predetermined panel of e.g. 5 experts and e.g. 3 categories (feasibility, effectiveness, everything else). All submissions first get scored by 2 experts with a shallow judgment (e.g., 5-10 minutes). Maybe there is some "saving" mechanism if an overeager expert wants to read plans that weren't assigned to them. Everything in the top N% then gets scored by all experts with a more detailed review. Then, there is a final ranking.
I'd hope that the time spent per expert is only 5-10 hours in total. I'd be fine with missing a bunch of posts that contain good ideas that are badly communicated or otherwise easy to miss on the shallow review.
My main goal with the contest would be that writing a good plan and communicating it clearly is incentivized.
This is a low-effort post. I mostly want to get other people’s takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.
I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
If we take AGI seriously, I feel like the AGI companies and the rest of the world should be significantly more prepared, and I think we’re now getting into the territory where models are capable enough that acting without a clear plan is irresponsible.
In this post, I want to ask what such a short timeline plan could look like. Intuitively, if an AGI lab came to me today and told me, “We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren’t willing to take more than a 3-month delay,” I want to be able to give the best possible answer. I list some suggestions but I don’t think they are anywhere near sufficient. I’d love to see more people provide their answers. If a funder is interested in funding this, I’d also love to see some sort of “best short-timeline plan prize” where people can win money for the best plan as judged by an expert panel.
In particular, I think the AGI companies should publish their detailed plans (minus secret information) so that governments, academics, and civil society can criticize and improve them. I think RSPs were a great step in the right direction and did improve their reasoning transparency and preparedness. However, I think RSPs (at least their current versions) are not sufficiently detailed and would like to see more fine-grained plans.
In this post, I generally make fairly conservative assumptions. I assume we will not make any major breakthroughs in most alignment techniques and will roughly work with the tools we currently have. This post is primarily about preventing the worst possible worlds.
Short timelines are plausible
By short timelines, I broadly mean the timelines that Daniel Kokotajlo has talked about for years (see e.g. this summary post; I also found Eli Lifland’s argument good; and liked this and this). Concretely, something like
I think there are some plausible counterarguments to these timelines, e.g. Tamay Besiroglu argues that hardware limitations might throttle the recursive self-improvement of AIs. I think this is possible, but currently think software-only recursive self-improvements are plausible. Thus, the timelines above are my median timelines. I think there could be even faster scenarios, e.g. we could already have an AI that replaces a top researcher at an AGI lab by the end of 2025 or in 2026.
There are a couple of reasons to believe in such short timelines:
In conclusion, I can understand why people would have longer median timelines, but I think they should consider the timelines described above at least as a plausible scenario (e.g. >10% likely). Thus, there should be a plan.
What do we need to achieve at a minimum?
There is a long list of things we would love to achieve. However, I think there are two things we need to achieve at a minimum or the default scenario is catastrophic.
The default scenario I have in mind here is broadly the following: There is one or a small number of AGI endeavors, almost certainly in the US. This project is meaningfully protected by the US government and military both for physical and cyber security (perhaps not at the maximal level of protection, but it’s a clear priority for the US government). Their most advanced models are not accessible to the public. This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year. This research is then used to ensure that the next system is sufficiently aligned/controlled and that the model weights continue to be secure.
This feels like a minimal stable solution. Of course, I would prefer if we were more prepared. I also think there are other paths than building more aligned & powerful successors that AI developers could take with human-level AI systems. However, for this post, I will make the conservative assumption that nobody is willing to stop racing. Thus, conditional on a capability race, I’d want them to use their first AGIs to do lots of alignment, control, and security research.
There is a very long list of further assumptions and desiderata that I don’t elaborate on in this post, e.g.
I want as many of those assumptions to hold. However, in this post, I want to primarily focus on the most minimal version.
Making conservative assumptions for safety progress
If we only have 2-3 years until AGI, I think it’s reasonable to make conservative assumptions about AI safety progress. Concretely:’
I think it is plausible that we will make a lot of alignment progress if we get to the aligned-enough human-level AI systems (i.e. roughly 2028 in my timeline above). This post is mostly concerned with how we get to this point so that we are set up for success and don’t create irreversible catastrophic harm on the way.
So what's the plan?
The following is my best guess for what we should do AT LEAST, but I haven’t thought about it deeply and would hope that the current plans of everyone who’s trying to build AGI are significantly better. This is very much a “throw the kitchen sink at the problem” approach with almost no theoretical grounding. I think a better version of this approach would do a few things differently, e.g.
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. The second layer is still very important and reduces risk substantially, but is a lower priority than the first layer, in my opinion. Note that this is already very triaged, i.e. there is a third, fourth, and fifth layer that I think a great plan would contain that I don’t even mention here because the post explicitly makes very conservative assumptions about the success of various alignment strategies.
Intuitively, if we were building a car, this plan is the “could we at least add breaks”-plan that doesn’t even consider techniques like airbags, ABS, ESC, seatbelts, or crumple zones.
Layer 1
Everything in this section seems very important to me. In short,
Keep a paradigm with faithful and human-legible CoT
In the grand scheme of things, we got very lucky that a large part of the reasoning of current AI systems is directly visible and understandable in plain sight (assuming that the hidden CoT in o1 and o3 is human-legible).
By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions. It also excludes situations where the model does a lot of internal reasoning that differs from what the CoT states.
At a minimum, I think we should
Assuming the legibility tax isn’t that large (e.g. less than 2x), I feel like we’d be shooting ourselves in the foot if we were to build AGI that doesn’t use human-legible CoT from the point where we are now even under these conservative assumptions.
I think my view would be significantly changed if there was evidence that high-compute RL, as used in o1 and o3, leads to non-legible CoT, e.g. where the model develops its own language that is not trivially comprehensible to us. In such a world, I would want
Relatedly, Daniel Kokotajlo has recently written about a concrete proposal in this direction in “Why Don't We Just... Shoggoth+Face+Paraphraser?”
Significantly better (CoT, action & white-box) monitoring
In the world where we have models with human-legible CoT, my main line of defense would go through really detailed monitoring (and control that doesn’t assume human-legible CoT would be the second line of defense).
Better monitoring seems like an incredibly cost-effective yet neglected approach. Clearly, it would be desirable to have multiple simple monitors analyzing model outputs for a number of generic failure modes and bad behaviors at all times. The kinds of behaviors we could be tracking could, for example, include:
There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini) or simple white-box probes trained to detect these problematic behaviors (e.g. linear probes or SAE probes). I would also want to both monitor the actions itself (e.g. tool calls) as well as the reasoning that led to them.
I expect that some model providers are already using online black box monitors for jailbreaks, so there already is a proof of concept. Simple linear probes are incredibly cost-effective (just a simple vector product) and we could run thousands of probes in parallel at almost zero additional overhead.
While these seem like simple approaches, I think nobody has really tried very hard to make them work in practice. Furthermore, while they are conceptually simple, they still require a lot of iteration and tuning, e.g. we would need to create good datasets to train and evaluate our probes and calibrate the thresholds such that they trigger in the correct cases but don’t provide too many false positives. Especially in the case where it is hard to build really good monitoring systems, we should know that as soon as possible so that AI developers can put more effort into them.
I think every AI lab should have a full team just to build these monitors, make sure they work as intended, analyze their outcomes, make them efficient to use, and more. I would prioritize efforts to build high-sensitivity monitors for high-stakes situations, e.g. developer-internal agentic deployments, over good monitors for average-stakes deployments, e.g. most chatbot use-cases.
My general feeling is that these monitors would both directly prevent harm but also unlock a lot of other benefits, e.g. we could understand the effectiveness of our alignment techniques faster and more thoroughly.
Control (that doesn’t assume human-legible CoT)
In the world where we have models without human-legible CoT, my main line of defense would go through control that doesn’t assume human-legible CoT (and monitoring would be the second line of defense).
I specifically mean the Redwood Research definition of control, where we make the worst-case assumption that the model is scheming and it is not necessarily visible to us in CoT. We then attempt to develop protocols that maximize the usefulness of these models while minimizing risks.
While CoT monitoring might be useful for some control approaches, I specifically want to see further development of techniques that don’t assume that the CoT is legible to humans, as done by Redwood Research.
In general, I think the kind of work required for control is something like what Redwood has been doing just 100x more and applied to frontier systems by the AGI labs. Some concrete suggestions include:
This is just a small subset of directions in control. Afaik there will be a significantly more detailed list of control projects and open problems soon.
Much deeper understanding of scheming
By the time we have AI systems that drastically speed up or replace core researchers at AI lab, I think we should have a much better understanding of scheming. Concretely, we should
I expect that a lot of this work is on model organisms, i.e. where researchers actively study one particular part of scheming in isolation similar to Sleeper Agents or Ryan Greenblatt+Anthropic’s work on Alignment faking in large language models.
Evals
Evals are a cornerstone of a lot of frontier safety commitments, and I think they are a useful evidence-gathering tool that can inform other strategies. Within this plan, evals fulfill two important roles.
Overall, I think there should be significantly more resources in evals, see e.g. the evals gap (other safety efforts should also be funded more).
Security
Improved information and physical security seems like an important layer of defense for many failure modes. We need to ensure that model weights and algorithmic secrets cannot get stolen, that models cannot exfiltrate themselves, and that no internal employees can run large, unmonitored finetuning runs. I think the report on securing model weights is great and we should basically treat its recommendations as a good default (though it’s plausible that more thinking would suggest a slightly different strategy). Buck Shlegeris’ post, “Access to powerful AI might make computer security radically easier,” also makes important recommendations.
I guess that the core challenge under short timelines will be correct prioritization. It is likely impossible to implement everything that would be suggested in the SL5 recommendations of the “securing model weights” report within the next 2-3 years. Thus, triaging seems necessary. I could also imagine that a lot of the security strategies that are important right around the point when an AI company automates their AI R&D have a few years of development time and might not feel immediately required right now, so they will only exist if an AI developer is acting now with a lot of foresight.
Finally, I’d have hoped that the internal infosec teams at frontier AI companies get access to frontier models ~3 months before external deployment both so they can defensively harden their infosec as well as conduct offensive testing. My guess is that this is not the norm, but in a sane world, an AI company would do something like using their own models to improve their own defensive capabilities before they release it. I expect this is largely blocked by inertia and company-internal politics. So, making this a norm as soon as possible would likely go a long way.
Unfortunately, I’m not an expert in security and would welcome concrete suggestions for a short-timeline security strategy from someone who is.
Layer 2
Everything in this layer still seems quite important to me, but it is less important than those in layer 1.
Improved near-term alignment strategies
I think RLHF and RLAIF empirically work better than I would have expected in ~2020. For example, they make the model more robustly behave as I want it to, the model can infer my intentions better, and the HHH training generalizes further than I had anticipated.
For example, I did not expect that the models would internalize HHH strongly enough to be willing to scheme for it, e.g. in Alignment faking in large language models and in our follow-up experiments on in-context scheming. While I think this should be considered evidence that current alignment techniques go “deeper” than some people thought, it also opens up new problems.
Since our existing techniques can already introduce non-myopic and non-corrigible tendencies into the models, I think we should spend significantly more effort ensuring that these techniques keep the models myopic and corrigible. Furthermore, I think we should understand which kind of goals the models get in a lot more detail, e.g. in our follow-up experiments, it seemed that “the models just want to be maximally helpful” was not the best description of the goals revealed preferences and at least to some extent they also showed some self-preservation without this being instrumental to their helpfulness. I would like to understand this in much more detail and build fine-grained “cognitive profiles” that we closely track for various alignment techniques.
I also expect that the strategies for internal and external use should differ meaningfully. For example, making models jailbreak-resistant could decrease their corrigibility. However, if you intend to deploy the model only to highly trusted internal staff, you don’t have to train it against jailbreaks. I’m currently unsure which training techniques you would want to use for agentic long-horizon internal deployments, e.g. because jailbreaks might be less relevant but taking non-power-seeking actions is much more relevant.
Likely, the main bottleneck to success is building tests and environments that are stable enough that we’d be happy to iterate against them (not necessarily train against them, but at least regularly evaluate on them).
Continued work on interpretability, scalable oversight, superalignment & co
I like efforts going into improving fundamental research in interpretability, scalable oversight, superalignment & co (this is broadly the category for empirical research that is not covered in layer 1) and I want people to continue to make progress on the fundamentals. However, I think in a short-timeline world, figuring out the fundamentals first is too risky of a bet and I deprioritize them in this plan.
There are three types of efforts I would want to see in particular:
My current guess is that none of these techniques are good enough that we would want to put a lot of confidence in them but I think they could be used as a supporting argument in an overall defense-in-depth approach. Of course, I’d love to be proven wrong and find out that any of them could be fully load-bearing.
Reasoning transparency
I think AI companies should make their reasoning around safety plans for human-level AGI more transparent. For example, voluntary commitments like safety frameworks were a good step in the right direction. They allowed externals to understand the broad strokes of these plans and criticize them. I think not every criticism was accurate or helpful, but I expect it to be highly net positive. I also liked Anthropic's “Three Sketches of ASL-4 Safety Case Components” and would like to see more efforts in this direction.
Realistically, I think the priorities should be roughly:
Safety first culture
I think an organization that aims to build AGI should ideally have internalized a safety-first approach at all levels. Everyone, from leadership to interns, has to understand the magnitude of what they’re working on and the implications if they make mistakes. The kind of culture I have in mind can likely only be found in militaries today, if at all. I think we’re currently very far away from that culture and not on track to reach it.
A few examples of my desired culture include:
I think it’s plausible that the question of whether AI will be the best or worst technology that humanity has built may come down to a few high-stakes decisions that are made with little time and under high pressure. I find it hard to see a world where we make the best decisions unless safety is top of mind for all people involved.
Assuming short timelines, I think it’s likely impossible to reach my desired levels of safety culture. This implies that
Known limitations and open questions
I already list some of the limitations above, but want to give a bit more context here.