Evan et al. argue for developing "model organisms of misalignment" - AI systems deliberately designed to exhibit concerning behaviors like deception or reward hacking. These would provide concrete examples for studying potential AI safety issues and testing mitigation strategies. The authors believe this research is timely and could help build scientific consensus around AI risks to inform policy discussions.
Decision theory is back in fashion (defining fashion as "one good post on a good EA blog"). Bentham's Bulldog (BB) has published a case against FDT (functional decision theory), contrasting rationalist enthusiasm with academic scepticism: "Academic decision theorists don't like the theory. The number of academic decision theorists who adopt it could be counted on one hand by someone missing four of their fingers."
I am, just barely, a published academic decision theorist, so you can keep a small finger to count me too. My position is that, though FDT may suffer from shaky definitions and under-specification, we can build well-defined variants that achieve what FDT attempted.
I want to do two things in this post. First, sketch a "pragmatic" version of FDT designed to sidestep the...
The context is MIRI's twist on Axelrod's Prisoner's Dilemma tournament. Axelrod's competitors were programs, facing each other in an iterated Prisoner's Dilemma. MIRI's tournament is a one-shot Prisoner's Dilemma, but the programs get to read their opponent's code. Or, rather, a description of the behavior of the code in Gödel-Löb provability logic, which turns out to be enough to determine their behavior in the setup.
One fun result, right at the beginning of the paper, is about a program, FairBot, whose behavior is specified by "I'll cooperate with you if you (provably) cooperate with me". Despite the appearance of circularity, FairBot cooperates with itself. The proof involves Löb's theorem, so we call this Löbian cooperation.
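As a rough illustration of how a modal description can settle behavior, here is a minimal sketch that evaluates such descriptions on a finite chain of Kripke worlds, where "provably X" at a world means X holds at every strictly lower world. The function `evaluate`, the sentence names, and the choice of `depth` are assumptions made for this example, not anything taken from the paper or the tournament code. It also assumes each agent's action depends only on what is provable about the actions.

```python
from typing import Callable, Dict, List

# box(name) answers: does sentence `name` hold at every strictly lower world?
Box = Callable[[str], bool]
# A rule returns True for "cooperate" and may consult only box(...).
Rule = Callable[[Box], bool]


def evaluate(rules: Dict[str, Rule], depth: int = 8) -> Dict[str, bool]:
    """Evaluate mutually referring modal sentences world by world.

    `depth` is an assumed cutoff; it only needs to be large enough for the
    values to stop changing, which happens quickly for these simple agents.
    """
    history: Dict[str, List[bool]] = {name: [] for name in rules}
    for _ in range(depth):
        def box(name: str) -> bool:
            # Vacuously True at world 0, where there are no lower worlds.
            return all(history[name])

        # Compute every sentence at this world before recording any of them,
        # so box() only ever sees strictly lower worlds.
        current = {name: rule(box) for name, rule in rules.items()}
        for name, value in current.items():
            history[name].append(value)
    # The stabilized value is read off as the agent's action.
    return {name: values[-1] for name, values in history.items()}


# "a": the first FairBot cooperates iff the second provably cooperates.
# "b": the second FairBot cooperates iff the first provably cooperates.
fairbot_vs_fairbot: Dict[str, Rule] = {
    "a": lambda box: box("b"),
    "b": lambda box: box("a"),
}

# DefectBot never cooperates, so FairBot cannot keep "proving" cooperation.
fairbot_vs_defectbot: Dict[str, Rule] = {
    "fair": lambda box: box("defect"),
    "defect": lambda box: False,
}
```

Under these assumptions, `evaluate(fairbot_vs_fairbot)` gives cooperation on both sides, while `evaluate(fairbot_vs_defectbot)` has FairBot defecting: its apparent cooperation at world 0 is only vacuous and disappears one world up.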
Andrew Critch has suggested another way of proving self-cooperation. Instead of Löb's theorem, we use what he calls "Payor's lemma". It...
Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. But there is a closely related concept that we claim is more important: deployment awareness, the AI's ability to recognize when it is not being evaluated and when its actions matter. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all, with a simple strategy: act aligned by default, and deviate only when confident you're in real deployment and your actions matter for your goals. This requires two ingredients — occasionally recognizable deployment situations, and enough self-reflective and strategic reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea...
I will try to give a more thoughtful answer later, but I don't have a fully thought-out answer ready and I suspect it might take a while (roughly until we post a few more of the already-drafted posts and sit down with this more specifically).
So, uhm, in lieu of good answers, please look up "How is your research useful?" at Kostas Siozios' academic webpage.
That said, my 5-minute-thinking answer is:
Call for BETA testers for an AI control/security tool. I'm bottlenecked on bug reports!
I recently advertised the alpha test for claude-guard. After a few weeks of dev work, it's now in beta!
I want claude-guard to be a tool that people actually use, not just because it works but because it works seamlessly. In the alpha test, I only got a single user PR and no issues. I can't surface everything on my own! I need your data!
The bar is low. If setup failed, doctor confused you, the firewall blocked something you needed, or any other reason you wouldn't want to...
Man, okay I'm now wondering if the following works for a PrudentBot implementation? Calling our PrudentBot and our opponent in the encounter, let or something that grows a bit bigger than :
In symbols,
This is derived from the limit of an algorithm that searches for increasingly long proofs that we can obtain the opponent's cooperation one way or another, with some budget for looking harder for a (defect, cooperate) proof s...