AI ALIGNMENT FORUM
Transformer Circuits

This page is a stub.

Posts tagged Transformer Circuits

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Wes Gurnee, Neel Nanda

2y

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

7mo

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

2y

Finding Sparse Linear Connections between Features in LLMs

Logan Riggs Smith, Sam Mitchell, Adam Kaufman

1y

How to Think About Activation Patching

2y

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

2y

Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy

1y

200 COP in MI: Exploring Polysemanticity and Superposition

2y

200 COP in MI: Interpreting Algorithmic Problems

2y

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

2y

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

2y

200 COP in MI: Analysing Training Dynamics

2y

200 COP in MI: Looking for Circuits in the Wild

2y

200 COP in MI: Techniques, Tooling and Automation

2y

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, Robert Krzyzanowski, Arthur Conmy, Neel Nanda

1y
