AI Evaluations
Applied to Can SAE steering reveal sandbagging? by jordine, 17h ago
Applied to Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study by Adam Karvonen, 2d ago
Applied to Seeking feedback on "MAD Chairs: A new tool to evaluate AI" by Chris Santos-Lang, 14d ago
Applied to Introducing BenchBench: An Industry Standard Benchmark for AI Strength by Arun Jose, 14d ago
Applied to Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? by Raymond Arnold, 22d ago
Applied to How far along Metr's law can AI start automating or helping with alignment research? by Christopher King, 23d ago
Applied to How to mitigate sandbagging by Teun van der Weij, 24d ago
Applied to 100+ concrete projects and open problems in evals by Marius Hobbhahn, 25d ago
Applied to Claude Sonnet 3.7 (often) knows when it's in alignment evaluations by Raymond Arnold, 1mo ago
Applied to Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format by Roland Pihlakas, 1mo ago
Applied to Ontological Validation Manifesto for AIs by Alejandra Ivone Rojas Reyna, 1mo ago
Applied to Among Us: A Sandbox for Agentic Deception by Satvik Golechha, 1mo ago
Applied to Give Neo a Chance by ank, 1mo ago
Applied to Share AI Safety Ideas: Both Crazy and Not by ank, 1mo ago
Applied to AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future by DanielFilan, 2mo ago
Applied to Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics by ank, 2mo ago
Applied to Do models know when they are being evaluated? by Joe Needham, 2mo ago
Applied to Artificial Static Place Intelligence: Guaranteed Alignment by ank, 2mo ago