Apollo Research (org)
Applied to "Detecting Strategic Deception Using Linear Probes" by Nicholas Goldowsky-Dill, 1mo ago
Applied to "Paper: Open Problems in Mechanistic Interpretability" by bilalchughtai, 2mo ago
Applied to "Attribution-based parameter decomposition" by Lucius Bushnaq, 2mo ago
Applied to "Activation space interpretability may be doomed" by bilalchughtai, 3mo ago
Applied to "An Opinionated Evals Reading List" by Marius Hobbhahn, 5mo ago
Applied to "[Interim research report] Activation plateaus & sensitive directions in GPT2" by Stefan Heimersheim, 7mo ago
Applied to "Apollo Research is hiring evals and interpretability engineers & scientists" by Lee Sharkey, 7mo ago
Applied to "Theories of Change for AI Auditing" by Lee Sharkey, 7mo ago
Applied to "A starter guide for evals" by Lee Sharkey, 7mo ago
Applied to "We need a Science of Evals" by Lee Sharkey, 7mo ago
Applied to "Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs" by Lee Sharkey, 7mo ago
Applied to "Understanding strategic deception and deceptive alignment" by Lee Sharkey, 7mo ago
Applied to "Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning" by Lee Sharkey, 7mo ago
Applied to "The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks" by Lee Sharkey, 7mo ago
Applied to "Interpretability: Integrated Gradients is a decent attribution method" by Lee Sharkey, 7mo ago
Applied to "You can remove GPT2's LayerNorm by fine-tuning for an hour" by Lee Sharkey, 7mo ago
Applied to "Sparsify: A mechanistic interpretability research agenda" by Lee Sharkey, 7mo ago
Applied to "Apollo Research 1-year update" by Lee Sharkey, 7mo ago
Applied to "Announcing Apollo Research" by Lee Sharkey, 7mo ago
Applied to "SAE feature geometry is outside the superposition hypothesis" by Lee Sharkey, 7mo ago