AI ALIGNMENT FORUM
AI-Assisted Alignment
Applied to "Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning" by Jeremias Ferrao, 4d ago
Applied to "Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness" by Yogmog, 8d ago
Applied to "Prospects for Alignment Automation: Interpretability Case Study" by Ebenezer Dukakis, 25d ago
Applied to "How far along Metr's law can AI start automating or helping with alignment research?" by Christopher King, 1mo ago
Applied to "AI Tools for Existential Security" by Raymond Arnold, 1mo ago
Applied to "AI for AI safety" by Raymond Arnold, 1mo ago
Applied to "The prospect of accelerated AI safety progress, including philosophical progress" by Mitchell_Porter, 1mo ago
Applied to "Is Alignment a flawed approach?" by Patrick Bernard, 1mo ago
Applied to "Maintaining Alignment during RSI as a Feedback Control Problem" by Beren Millidge, 2mo ago
Applied to "Recursive alignment with the principle of alignment" by hive, 2mo ago
Applied to "Deep sparse autoencoders yield interpretable features too" by Armaan A. Abraham, 2mo ago
Applied to "How might we safely pass the buck to AI?" by Mitchell_Porter, 2mo ago
Applied to "Artificial Static Place Intelligence: Guaranteed Alignment" by ank, 2mo ago
Applied to "Anti-Slop Interventions?" by Mateusz Bagiński, 2mo ago
Applied to "Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality?" by The Perceptive Architect, 2mo ago
Applied to "Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)" by Archimedes, 3mo ago
Applied to "Gettier Cases [repost]" by Antigone, 3mo ago
Applied to "Language Models and World Models, a Philosophy" by kyjohnso, 3mo ago
Applied to "The Overlap Paradigm: Rethinking Data's Role in Weak-to-Strong Generalization (W2SG)" by Serhii Zamrii, 3mo ago