AI ALIGNMENT FORUM
Experiments
Applied to
Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
by
Roland Pihlakas
10d
ago
Applied to
Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
by
Florian_Dietz
16d
ago
Applied to
Gradient Anatomy's - Hallucination Robustness in Medical Q&A
by
Diego Sabajo
1mo
ago
Applied to
Memorization-generalization in practice
by
Dmitry Vaintrob
2mo
ago
Applied to
Why I'm Pouring Cold Water in My Left Ear, and You Should Too
by
Maloew
2mo
ago
Applied to
Early Experiments in Human Auditing for AI Control
by
Joey Yudelson
2mo
ago
Applied to
Has Someone Checked The Cold-Water-In-Left-Ear Thing?
by
Maloew
3mo
ago
Applied to
0.202 Bits of Evidence In Favor of Futarchy
by
niplav
6mo
ago
Applied to
Pomodoro Method Randomized Self Experiment
by
niplav
6mo
ago
Applied to
[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
by
Yohan Mathew
6mo
ago
Applied to
Who Feels More Alone?
by
marvinscheffold
6mo
ago
Applied to
Michael Dickens' Caffeine Tolerance Research
by
niplav
7mo
ago
Applied to
Inference-Only Debate Experiments Using Math Problems
by
Arjun Panickssery
8mo
ago
Applied to
The need for multi-agent experiments
by
Martín Soto
8mo
ago
Applied to
Notifications Received in 30 Minutes of Class
by
Mir
10mo
ago
Applied to
My hour of memoryless lucidity
by
Gunnar Zarncke
11mo
ago
Applied to
Claude wants to be conscious
by
Joe Kwon
1y
ago