Apollo Research (org)
Applied to "Detecting Strategic Deception Using Linear Probes" by Nicholas Goldowsky-Dill, 1mo ago
Applied to "Paper: Open Problems in Mechanistic Interpretability" by bilalchughtai, 2mo ago
Applied to "Attribution-based parameter decomposition" by Lucius Bushnaq, 2mo ago
Applied to "Activation space interpretability may be doomed" by bilalchughtai, 3mo ago
Applied to "An Opinionated Evals Reading List" by Marius Hobbhahn, 5mo ago
Applied to "[Interim research report] Activation plateaus & sensitive directions in GPT2" by Stefan Heimersheim, 7mo ago
Applied to "Apollo Research is hiring evals and interpretability engineers & scientists" by Lee Sharkey, 7mo ago
Applied to "Theories of Change for AI Auditing" by Lee Sharkey, 7mo ago
Applied to "A starter guide for evals" by Lee Sharkey, 7mo ago
Applied to "We need a Science of Evals" by Lee Sharkey, 7mo ago
Applied to "Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs" by Lee Sharkey, 7mo ago
Applied to "Understanding strategic deception and deceptive alignment" by Lee Sharkey, 7mo ago
Applied to "Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning" by Lee Sharkey, 7mo ago
Applied to "The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks" by Lee Sharkey, 7mo ago
Applied to "Interpretability: Integrated Gradients is a decent attribution method" by Lee Sharkey, 7mo ago
Applied to "You can remove GPT2's LayerNorm by fine-tuning for an hour" by Lee Sharkey, 7mo ago
Applied to "Sparsify: A mechanistic interpretability research agenda" by Lee Sharkey, 7mo ago
Applied to "Apollo Research 1-year update" by Lee Sharkey, 7mo ago
Applied to "Announcing Apollo Research" by Lee Sharkey, 7mo ago
Applied to "SAE feature geometry is outside the superposition hypothesis" by Lee Sharkey, 7mo ago