Corrigibility
Applied to Reframing AI Safety Through the Lens of Identity Maintenance Framework by Hiroshi Yamakawa, 20d ago
Mateusz Bagiński, v1.16.0, Mar 23rd 2025 GMT (+119), LW 4
Mateusz Bagiński, v1.15.0, Mar 23rd 2025 GMT (+12/-13), LW 2
Tassilo Neubauer, v1.14.0, Feb 20th 2025 GMT (-1), LW 1
Tassilo Neubauer, v1.13.0, Feb 20th 2025 GMT, LW 1
Tassilo Neubauer, v1.12.0, Feb 20th 2025 GMT (+3), LW 1
Applied to Thermodynamic entropy = Kolmogorov complexity by Tassilo Neubauer, 2mo ago (changed deleted from false to true)
RobertM, v1.4.0 (Lens: LW Wiki), Feb 19th 2025 GMT, LW 0
Applied to Should you publish solutions to corrigibility? by Ruben Bloom, 3mo ago
Applied to Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals by johnswentworth, 3mo ago
Applied to Detect Goodhart and shut down by Jeremy Gillen, 3mo ago
Applied to Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well) by Roland Pihlakas, 3mo ago
Applied to Testing for Scheming with Model Deletion by Matthew Barnett, 3mo ago
Dakara, v1.3.0 (Lens: LW Wiki), Dec 30th 2024 GMT (+184), LW 1
Applied to AI Assistants Should Have a Direct Line to Their Developers by Jan_Kulveit, 4mo ago
Applied to Corrigibility's Desirability is Timing-Sensitive by RobertM, 4mo ago
Applied to AIs Will Increasingly Fake Alignment by Ruben Bloom, 4mo ago
Applied to Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility by Raymond Arnold, 7mo ago