Corrigibility
Applied to Reframing AI Safety Through the Lens of Identity Maintenance Framework by Hiroshi Yamakawa, 20d ago
Mateusz Bagiński, v1.16.0, Mar 23rd 2025 GMT (+119), LW 4
Mateusz Bagiński, v1.15.0, Mar 23rd 2025 GMT (+12/-13), LW 2
Tassilo Neubauer, v1.14.0, Feb 20th 2025 GMT (-1), LW 1
Tassilo Neubauer, v1.13.0, Feb 20th 2025 GMT, LW 1
Tassilo Neubauer, v1.12.0, Feb 20th 2025 GMT (+3), LW 1
Applied to Thermodynamic entropy = Kolmogorov complexity by Tassilo Neubauer, 2mo ago (changed deleted from false to true)
RobertM, v1.4.0 (Lens: LW Wiki), Feb 19th 2025 GMT, LW 0
Applied to Should you publish solutions to corrigibility? by Ruben Bloom, 3mo ago
Applied to Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals by johnswentworth, 3mo ago
Applied to Detect Goodhart and shut down by Jeremy Gillen, 3mo ago
Applied to Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well) by Roland Pihlakas, 3mo ago
Applied to Testing for Scheming with Model Deletion by Matthew Barnett, 3mo ago
Dakara, v1.3.0 (Lens: LW Wiki), Dec 30th 2024 GMT (+184), LW 1
Applied to AI Assistants Should Have a Direct Line to Their Developers by Jan_Kulveit, 4mo ago
Applied to Corrigibility's Desirability is Timing-Sensitive by RobertM, 4mo ago
Applied to AIs Will Increasingly Fake Alignment by Ruben Bloom, 4mo ago
Applied to Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility by Raymond Arnold, 7mo ago