AI ALIGNMENT FORUM
MATS Program
• Applied to Gradient Routing: Masking Gradients to Localize Computation in Neural Networks by Ryan Kidd 4d ago
• Applied to Intricacies of Feature Geometry in Large Language Models by Satvik Golechha 20d ago
• Applied to Debating with More Persuasive LLMs Leads to More Truthful Answers by Ryan Kidd 2mo ago
• Applied to Automating LLM Auditing with Developmental Interpretability by DanielFilan 2mo ago
• Applied to SAE Probing: What is it good for? Absolutely something! by Subhash Kantamneni 2mo ago
• Applied to Bridging the VLM and mech interp communities for multimodal interpretability by Sonia Joseph 2mo ago
• Applied to The slingshot helps with learning by Wilson Wu 2mo ago
• Applied to Improving Model-Written Evals for AI Safety Benchmarking by Sunishchal Dev 2mo ago
• Applied to On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback by Marcus Williams 2mo ago
• Applied to Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution by Kola Ayonrinde 2mo ago
• Applied to [Job Ad] MATS is hiring! by Jana 2mo ago
• Applied to MATS AI Safety Strategy Curriculum v2 by Ryan Kidd 3mo ago
• Applied to Domain-specific SAEs by jacob_drori 3mo ago
• Applied to [Interim research report] Evaluating the Goal-Directedness of Language Models by Rauno Arike 3mo ago
• Applied to MATS Alumni Impact Analysis by Ryan Kidd 3mo ago
• Applied to The Geometry of Feelings and Nonsense in Large Language Models by Ryan Kidd 3mo ago
• Applied to Apply to MATS 7.0! by Ryan Kidd 3mo ago
• Applied to Calendar feature geometry in GPT-2 layer 8 residual stream SAEs by Ryan Kidd 3mo ago
• Applied to Showing SAE Latents Are Not Atomic Using Meta-SAEs by Ryan Kidd 3mo ago
• Applied to Experiments with an alternative method to promote sparsity in sparse autoencoders by Ryan Kidd 4mo ago