ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.
Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. By doing so, we can study how...
Cool work. What kinds of situations trigger CoT evaluation awareness mentions in simulated deployment? Does this overlap heavily with the distribution of scenarios that are interesting from an alignment or safety perspective? Where those distributions come apart is quite interesting to me.
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can't possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn't). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn't run into this problem.
Ok, so I think this argument probably works if all EDT agents have t...
On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll often say that it would come about through being in too much of a rush, or careless programmers, or bad actors, etc., as opposed to the simpler...
In the framing of the post, I think much (most?) of the disagreement is downstream of whether we'll even choose to pursue the kind of ASI for which the theoretical arguments dominate the prosaic LLM-style safety arguments. LLMs or other non-limits-of-intelligence technologies with better safety properties could very plausibly scale far enough to satisfy the wants of people developing AI and/or end competitive pressures to build more ASI-like things.
For those who disagree-voted: I want to understand why you disagree. Presumably it's with the parenthetical. Is it just that you're less confident in current Claude's generalization behavior? Or that you actively expect it to be malign? Maybe you're picturing some sort of idealized reflection process that I'm not?
Charbel-Raphaël argues that interpretability research has poor theories of impact. It's not good for predicting future AI systems, can't actually audit for deception, lacks a clear end goal, and may be more harmful than helpful. He suggests other technical agendas that could be more impactful for reducing AI risk.
Great, thank you! Timing seems updated, but also fwiw flops.symmetrize is still not there: https://www.aicrowd.com/challenges/arc-white-box-estimation-challenge-2026/submissions/310914