AI ALIGNMENT FORUM

AI-Assisted Alignment

Applied to Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning by Jeremias Ferrao 4d ago

Applied to Proposal: Derivative Information Theory (DIT) — A Dynamic Model of Agency and Consciousness by Yogmog 8d ago

Applied to Prospects for Alignment Automation: Interpretability Case Study by Ebenezer Dukakis 25d ago

Applied to How far along Metr's law can AI start automating or helping with alignment research? by Christopher King 1mo ago

Applied to AI Tools for Existential Security by Raymond Arnold 1mo ago

Applied to AI for AI safety by Raymond Arnold 1mo ago

Applied to The prospect of accelerated AI safety progress, including philosophical progress by Mitchell_Porter 1mo ago

Applied to Is Alignment a flawed approach? by Patrick Bernard 1mo ago

Applied to Maintaining Alignment during RSI as a Feedback Control Problem by Beren Millidge 2mo ago

Applied to Recursive alignment with the principle of alignment by hive 2mo ago

Applied to Deep sparse autoencoders yield interpretable features too by Armaan A. Abraham 2mo ago

Applied to How might we safely pass the buck to AI? by Mitchell_Porter 2mo ago

Applied to Artificial Static Place Intelligence: Guaranteed Alignment by ank 2mo ago

Applied to Anti-Slop Interventions? by Mateusz Bagiński 2mo ago

Applied to Does Time Linearity Shape Human Self-Directed Evolution, and will AGI/ASI Transcend or Destabilise Reality? by The Perceptive Architect 2mo ago

Applied to Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog) by Archimedes 3mo ago

Applied to Gettier Cases [repost] by Antigone 3mo ago

Applied to Language Models and World Models, a Philosophy by kyjohnso 3mo ago

Applied to The Overlap Paradigm: Rethinking Data's Role in Weak-to-Strong Generalization (W2SG) by Serhii Zamrii 3mo ago