Apollo Research (org)
• Applied to An Opinionated Evals Reading List by Marius Hobbhahn 1mo ago
• Applied to [Interim research report] Activation plateaus & sensitive directions in GPT2 by Stefan Heimersheim 2mo ago
• Applied to Apollo Research is hiring evals and interpretability engineers & scientists by Lee Sharkey 3mo ago
• Applied to Theories of Change for AI Auditing by Lee Sharkey 3mo ago
• Applied to A starter guide for evals by Lee Sharkey 3mo ago
• Applied to We need a Science of Evals by Lee Sharkey 3mo ago
• Applied to Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs by Lee Sharkey 3mo ago
• Applied to Understanding strategic deception and deceptive alignment by Lee Sharkey 3mo ago
• Applied to Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning by Lee Sharkey 3mo ago
• Applied to The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks by Lee Sharkey 3mo ago
• Applied to Interpretability: Integrated Gradients is a decent attribution method by Lee Sharkey 3mo ago
• Applied to You can remove GPT2’s LayerNorm by fine-tuning for an hour by Lee Sharkey 3mo ago
• Applied to Sparsify: A mechanistic interpretability research agenda by Lee Sharkey 3mo ago
• Applied to Apollo Research 1-year update by Lee Sharkey 3mo ago
• Applied to Announcing Apollo Research by Lee Sharkey 3mo ago
• Applied to SAE feature geometry is outside the superposition hypothesis by Lee Sharkey 3mo ago
• Applied to A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team by Lee Sharkey 4mo ago
• Created by Lee Sharkey 4mo ago