Apollo Research (org)
• Applied to An Opinionated Evals Reading List by Marius Hobbhahn 1mo ago
• Applied to [Interim research report] Activation plateaus & sensitive directions in GPT2 by Stefan Heimersheim 2mo ago
• Applied to Apollo Research is hiring evals and interpretability engineers & scientists by Lee Sharkey 3mo ago
• Applied to Theories of Change for AI Auditing by Lee Sharkey 3mo ago
• Applied to A starter guide for evals by Lee Sharkey 3mo ago
• Applied to We need a Science of Evals by Lee Sharkey 3mo ago
• Applied to Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs by Lee Sharkey 3mo ago
• Applied to Understanding strategic deception and deceptive alignment by Lee Sharkey 3mo ago
• Applied to Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning by Lee Sharkey 3mo ago
• Applied to The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks by Lee Sharkey 3mo ago
• Applied to Interpretability: Integrated Gradients is a decent attribution method by Lee Sharkey 3mo ago
• Applied to You can remove GPT2’s LayerNorm by fine-tuning for an hour by Lee Sharkey 3mo ago
• Applied to Sparsify: A mechanistic interpretability research agenda by Lee Sharkey 3mo ago
• Applied to Apollo Research 1-year update by Lee Sharkey 3mo ago
• Applied to Announcing Apollo Research by Lee Sharkey 3mo ago
• Applied to SAE feature geometry is outside the superposition hypothesis by Lee Sharkey 3mo ago
• Applied to A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team by Lee Sharkey 4mo ago
• Created by Lee Sharkey 4mo ago