The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail.
Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.
Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is possible in principle, at least for some agents, by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break...
Why is it a deal-breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.
I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.
...At worst it just requires that the initial few batches of uploads...
(Cross-posted from X, intended for a general audience.)
There’s a funny thing where economics education paradoxically makes people DUMBER at thinking about future AI. Econ textbooks teach concepts & frames that are great for most things, but counterproductive for thinking about AGI. Here are 4 examples. Longpost:
THE FIRST PIECE of Econ anti-pedagogy is hiding in the words “labor” & “capital”. These words conflate a superficial difference (flesh-and-blood human vs not) with a bundle of unspoken assumptions and intuitions, which will all get broken by Artificial General Intelligence (AGI).
By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after...
For obvious reasons, we should care a great deal whether the exponentially-growing mass of AGIs-building-AGIs is ultimately trying to make cancer cures and other awesome consumer products (things that humans view as intrinsically valuable / ends in themselves), versus ultimately trying to make galaxy-scale paperclip factories (things that misaligned AIs view as intrinsically valuable / ends in themselves).
From my perspective, I care about this because the former world is obviously a better world for me to live in.
But it seems like you have some extra reaso...
Drone countermeasures are an idle hope. The only real counter to drones is more drones.
Lasers, shotguns, tank redesign [no holes!], nets, counter-drones, flak, etc. will surely all be part of the arsenal, but thinking drone countermeasures are going to restore the previous generation's war doctrine is as silly as thinking that metallurgy innovations will reverse the gunpowder age.
The future - indeed, the present - of warfare is drones, drones, drones.
Twitter | Microsite | Apollo Blog | OpenAI Blog | Full paper
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or that they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce the rate of covert actions across our diverse suite by roughly 30x (OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) by only training against a single type...
Preamble:
5050 is a program that helps great scientists, researchers, and engineers become great founders. It’s helped launch 78 companies, and now we’re turning our attention to one of the most important challenges of our time: building a safe and aligned AI future.
We created the 5050 AI track to support founders building for this new world. Below are five ideas we’d love to back right now. We believe these are great opportunities to build safe AI, but they’re not the only ones. If you’re building in this space, we want to hear from you.
What's inside:
...Mission Control for Agents
Scalable Oversight for Multi-Agent Systems
An independent startup focused on scalable oversight could build the infrastructure and tooling needed to make multi-agent systems production-ready. Beyond selling tools, this company would invest in...
Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can "think out loud" using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?
In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama...
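To make the two-hop setup concrete, here is a minimal sketch of how one such evaluation item could be constructed: a synthetic first-hop fact about a fictional person is taught via fine-tuning, and the evaluation question can only be answered by composing that fact with knowledge the model already has, without writing out a chain of thought. The names, templates, and format below are hypothetical illustrations, not the paper's actual data.

```python
# Minimal sketch of a two-hop QA item with a synthetic first hop.
# All names, templates, and the data format are hypothetical, for illustration only.

# Synthetic first hop: a fictional person is linked to a real entity.
synthetic_first_hop = {
    "Alice Zorblatt": "Paris",  # fine-tuning teaches "Alice Zorblatt lives in Paris."
}

# Real second hop: a fact the base model already knows from pretraining.
real_second_hop = {
    "Paris": "France",  # Paris is a city in France.
}


def finetuning_fact(person: str, city: str) -> str:
    """The first-hop fact injected via fine-tuning."""
    return f"{person} lives in {city}."


def two_hop_question(person: str) -> str:
    """A question answerable only by composing the learned first hop with the
    known second hop, with no chain of thought written out."""
    return f"Answer with a single word: in which country does {person} live?"


if __name__ == "__main__":
    for person, city in synthetic_first_hop.items():
        print("fine-tune on :", finetuning_fact(person, city))
        print("evaluate with:", two_hop_question(person))
        print("target answer:", real_second_hop[city])
```

Latent two-hop success would mean the model answers "France" directly, without ever stating the intermediate fact that the fictional person lives in Paris.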
Thanks! I somehow missed this paper, looks interesting!
Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with an impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.
Did you try any experiments with a synthetic second hop instead of a synthetic first hop?
We did not, but Jiahai Feng had an experiment like this in his paper.