AI ALIGNMENT FORUM
AI Control
• Applied to "Are Sparse Autoencoders a good idea for AI control?" by Gerard Boxo, 4d ago
• Applied to "Reduce AI Self-Allegiance by saying "he" instead of "I"" by Knight Lee, 7d ago
• Applied to "Measuring whether AIs can statelessly strategize to subvert security measures" by Alex Mallen, 11d ago
• Applied to "A toy evaluation of inference code tampering" by Gunnar Zarncke, 21d ago
• Applied to "The Queen's Dilemma: A Paradox of Control" by Raymond Arnold, 1mo ago
• Applied to "Why imperfect adversarial robustness doesn't doom AI control" by Raymond Arnold, 1mo ago
• Applied to "Using Dangerous AI, But Safely?" by Raymond Arnold, 1mo ago
• Applied to "Sabotage Evaluations for Frontier Models" by Buck Shlegeris, 1mo ago
• Applied to "Win/continue/lose scenarios and execute/replace/audit protocols" by Buck Shlegeris, 1mo ago
• Applied to "Toward Safety Cases For AI Scheming" by Mikita Balesni, 2mo ago
• Applied to "Dario Amodei's "Machines of Loving Grace" sound incredibly dangerous, for Humans" by Super AGI, 2mo ago
• Applied to "A Brief Explanation of AI Control" by Aaron_Scher, 2mo ago
• Applied to "Coup probes: Catching catastrophes with probes trained off-policy" by Stefan Heimersheim, 2mo ago
• Applied to "Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming" by Buck Shlegeris, 3mo ago
• Applied to "Schelling game evaluations for AI control" by Olli Järviniemi, 3mo ago
• Applied to "How to prevent collusion when using untrusted models to monitor each other" by Buck Shlegeris, 3mo ago
• Applied to "Secret Collusion: Will We Know When to Unplug AI?" by schroederdewitt, 3mo ago