AI ALIGNMENT FORUM

Experiments

Applied to Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format by Roland Pihlakas 10d ago

Applied to Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens by Florian_Dietz 16d ago

Applied to Gradient Anatomy's - Hallucination Robustness in Medical Q&A by Diego Sabajo 1mo ago

Applied to Memorization-generalization in practice by Dmitry Vaintrob 2mo ago

Applied to Why I'm Pouring Cold Water in My Left Ear, and You Should Too by Maloew 2mo ago

Applied to Early Experiments in Human Auditing for AI Control by Joey Yudelson 2mo ago

Applied to Has Someone Checked The Cold-Water-In-Left-Ear Thing? by Maloew 3mo ago

Applied to 0.202 Bits of Evidence In Favor of Futarchy by niplav 6mo ago

Applied to Pomodoro Method Randomized Self Experiment by niplav 6mo ago

Applied to [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs by Yohan Mathew 6mo ago

Applied to Who Feels More Alone? by marvinscheffold 6mo ago

Applied to Michael Dickens' Caffeine Tolerance Research by niplav 7mo ago

Applied to Inference-Only Debate Experiments Using Math Problems by Arjun Panickssery 8mo ago

Applied to The need for multi-agent experiments by Martín Soto 8mo ago

Applied to Notifications Received in 30 Minutes of Class by Mir 10mo ago

Applied to My hour of memoryless lucidity by Gunnar Zarncke 11mo ago

Applied to Claude wants to be conscious by Joe Kwon 1y ago