Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards. Key improvements include new capability thresholds to indicate when we will upgrade our safeguards, refined processes for evaluating model capabilities and the adequacy of our safeguards (inspired by safety case methodologies), and new measures for internal governance and external input. By learning from our implementation experiences and drawing on risk management practices used in other high-consequence industries, we aim to better prepare for the rapid pace of AI...

While you can make a lot of progress on evals by tinkering and paying little attention to the literature, we found that several papers saved us many months of research effort. The Apollo Research evals team therefore compiled a list of the evals-related papers we consider most important. We likely missed some relevant papers, and our recommendations reflect our personal opinions.

Our favorite papers

  • Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)
    • Contains detailed descriptions of multiple LM agent evals across four categories, and explores new methodologies for estimating the probability of eval success.
    • We think it is the best “all-around” evals paper, i.e., the one that gives the best picture of what frontier LM agent evals look like.
    • We tested the calibration of their new methodologies in practice in Hojmark et al.,
...

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.1 My short summary is:

  • Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of
...
Vivek Hebbar
Do you want to try playing this game together sometime?
Daniel Kokotajlo
Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)

Vivek Hebbar
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations? Also, if 10 episodes suffice, why is so much post-training currently done on base models?

This post starts out pretty gloomy but ends up with some points that I feel pretty positive about.  Day to day, I'm more focussed on the positive points, but awareness of the negative has been crucial to forming my priorities, so I'm going to start with those.  It's mostly addressed to the EA community, but is hopefully somewhat of interest to LessWrong and the Alignment Forum as well.

My main concerns

I think AGI is going to be developed soon, and quickly.  Possibly (20%) that's next year, and most likely (80%) before the end of 2029. These are not things you need to believe for yourself in order to understand my view, so no worries if you're not personally convinced of this.

(For what it's worth, I did arrive at...

I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest conce...

Chris_Leong
I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge. The balance of industries serving humans vs. AIs is a suspiciously high level of abstraction.
RobertM
Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios? Most of my models for how we might go extinct in the next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly.

Separately, I don't understand how encouraging human-specific industries is supposed to work in practice. Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it? (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)
Vladimir Nesov
Healthcare in this general sense is highly relevant to machines. Conversely, sufficient tech to upload/backup/instantiate humans makes biology-specific healthcare (including life extension) mostly superfluous. The key property of machines is initial advantage in scalability, which quickly makes anything human-specific tiny and easily ignorable in comparison, however you taxonomize the distinction. Humans persevere only if scalable machine sources of power (care to) lend us the benefits of their scale. Intent alignment for example would need to be able to harness a significant fraction of machine intent (rather than being centrally about human intent).