Transformer Circuits
• Applied to Are SAE features from the Base Model still meaningful to LLaVA? by Shan Chen, 21d ago
• Applied to Concrete Methods for Heuristic Estimation on Neural Networks by Oliver Daniels, 1mo ago
• Applied to Open Source Replication of Anthropic's Crosscoder paper for model-diffing by Connor Kissane, 2mo ago
• Applied to Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? by Taras Kutsyk, 3mo ago
• Applied to SAEs (usually) Transfer Between Base and Chat Models by Connor Kissane, 5mo ago
• Applied to Arrakis - A toolkit to conduct, track and visualize mechanistic interpretability experiments by Yash Srivastava, 5mo ago
• Applied to An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda, 6mo ago
• Applied to Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability by ntt123, 6mo ago
• Applied to "What the hell is a representation, anyway?" | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents by IwanWilliams, 7mo ago
• Applied to Finding Backward Chaining Circuits in Transformers Trained on Tree Search by abhayesian, 7mo ago
• Applied to Can quantised autoencoders find and interpret circuits in language models? by Charlie O'Neill, 9mo ago
• Applied to Sparse Autoencoders Work on Attention Layer Outputs by Robert Krzyzanowski, 1y ago
• Applied to Finding Sparse Linear Connections between Features in LLMs by Logan Riggs Smith, 1y ago
• Applied to AISC project: TinyEvals by Jett Janiak, 1y ago
• Applied to Polysemantic Attention Head in a 4-Layer Transformer by Jett Janiak, 1y ago
• Applied to Graphical tensor notation for interpretability by Jordan Taylor, 1y ago
• Applied to Interpreting OpenAI's Whisper by Neel Nanda, 1y ago
• Applied to Automatically finding feature vectors in the OV circuits of Transformers without using probing by Jacob Dunefsky, 1y ago
• Applied to An adversarial example for Direct Logit Attribution: memory management in gelu-4l by Can, 1y ago