AI ALIGNMENT FORUM
Deceptive Alignment
Applied to Insights from a Lawyer turned AI Safety researcher (ShortForm) by Katalina Hernandez 2d ago
Applied to Correcting Deceptive Alignment using a Deontological Approach by JeaniceK 2d ago
Applied to Mapping AI Architectures to Alignment Attractors: A SIEM-Based Framework by silentrevolutions 3d ago
Applied to How training-gamers might function (and win) by Vivek Hebbar 4d ago
Applied to Mistral Large 2 (123B) exhibits alignment faking by Gunnar Zarncke 20d ago
Applied to We Have No Plan for Preventing Loss of Control in Open Models by Andrew Dickson 1mo ago
Applied to Superintelligence's goals are likely to be random by Mikhail Samin 1mo ago
Applied to We should start looking for scheming "in the wild" by Marius Hobbhahn 1mo ago
Applied to The Hidden Cost of Our Lies to AI by Nicholas Andresen 1mo ago
Applied to For scheming, we should first focus on detection and then on prevention by Marius Hobbhahn 1mo ago
Applied to Cautions about LLMs in Human Cognitive Loops by Alice Blair 1mo ago
Applied to Do we want alignment faking? by Florian_Dietz 2mo ago
Applied to Does human (mis)alignment pose a significant and imminent existential threat? by jr 2mo ago
Applied to Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics by ank 2mo ago
Applied to Places of Loving Grace [Story] by ank 2mo ago
Applied to Do models know when they are being evaluated? by Joe Needham 2mo ago
Applied to Artificial Static Place Intelligence: Guaranteed Alignment by ank 2mo ago
Applied to Ambiguous out-of-distribution generalization on an algorithmic task by Wilson Wu 2mo ago
Applied to Self-dialogue: Do behaviorist rewards make scheming AGIs? by Steve Byrnes 2mo ago
Applied to Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics... (Updated) by ank 2mo ago