AI Evaluations
Applied to Can SAE steering reveal sandbagging? by jordine, 17h ago
Applied to Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study by Adam Karvonen, 2d ago
Applied to Seeking feedback on "MAD Chairs: A new tool to evaluate AI" by Chris Santos-Lang, 14d ago
Applied to Introducing BenchBench: An Industry Standard Benchmark for AI Strength by Arun Jose, 14d ago
Applied to Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? by Raymond Arnold, 22d ago
Applied to How far along Metr's law can AI start automating or helping with alignment research? by Christopher King, 23d ago
Applied to How to mitigate sandbagging by Teun van der Weij, 24d ago
Applied to 100+ concrete projects and open problems in evals by Marius Hobbhahn, 25d ago
Applied to Claude Sonnet 3.7 (often) knows when it's in alignment evaluations by Raymond Arnold, 1mo ago
Applied to Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format by Roland Pihlakas, 1mo ago
Applied to Ontological Validation Manifesto for AIs by Alejandra Ivone Rojas Reyna, 1mo ago
Applied to Among Us: A Sandbox for Agentic Deception by Satvik Golechha, 1mo ago
Applied to Give Neo a Chance by ank, 1mo ago
Applied to Share AI Safety Ideas: Both Crazy and Not by ank, 1mo ago
Applied to AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future by DanielFilan, 2mo ago
Applied to Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics by ank, 2mo ago
Applied to Do models know when they are being evaluated? by Joe Needham, 2mo ago
Applied to Artificial Static Place Intelligence: Guaranteed Alignment by ank, 2mo ago