A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs.
The following is a (lightly edited version of a) series of notes I sent Joe Carlsmith about his essay, "Can you control the past?" It's addressed to Joe, but it seems worth publishing here while I'm on the topic of decision theory. I’ve included some of his comments, and my replies, below.
I only recently skimmed Can you control the past?, and have a couple notes that you may or may not be interested in. (I'm not under the impression that this matters a ton, and am writing this recreationally.)
First: this is overall a great review of decision theories. Better than most I've seen. Nice.
Now, onto some more substantive points.
I think a bunch of your sense of oddness about the "magic" that "you can write on whiteboards...
GDM has published an AI Control Roadmap! From the executive summary:
...
We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain. We focus on system-level mitigations that limit the harm a misaligned AI system could cause. Specifically, this report provides:
• Threat modelling: Taking inspiration from cybersecurity, we adopt a conservative, worst-case approach to threat modelling throughout this paper, and assume a hypothetical AI adversary pursuing undesirable goals in internal deployment. We introduce TRAIT&R, a taxonomy of tactics and techniques available to such a hypothetical AI adversary, building on the established security framework MITRE ATT&CK. We categorise new hypothetical threats into three core areas:
Yayyy!
(based on the summary — haven't read the paper)
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Great, thank you! Timing seems updated, but also fwiw flops.symmetrize is still not there: https://www.aicrowd.com/challenges/arc-white-box-estimation-challenge-2026/submissions/310914
Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.
Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. By doing so, we can study how...
Cool work. What kinds of situations trigger CoT evaluation awareness mentions in simulated deployment? Does this overlap heavily with the distribution of scenarios that are interesting from an alignment or safety perspective? Where those distributions come apart is quite interesting to me.
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can't possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn't). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn't run into this problem.
Ok, so I think this argument probably works if all EDT agents have t...
On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll often say that it would come about through being in too much of a rush, or careless programmers, or bad actors, etc., as opposed to the simpler...
In the framing of the post, I think much (most?) of the disagreement is downstream of whether we'll even choose to pursue the kind of ASI for which the theoretical arguments dominate the prosaic LLM-style safety arguments. LLMs or other non-limits-of-intelligence technologies with better safety properties could very plausibly scale far enough to satisfy the wants of people developing AI and/or end competitive pressures to build more ASI-like things.