AI ALIGNMENT FORUM
Filip Sondej

github.com/filyp

Posts

Testing which LLM architectures can do hidden serial reasoning (36 karma, 3mo, 0 comments)

Boomerang - protocol to dissolve some commitment races (11 karma, 2y, 0 comments)

Spooky action at a distance in the loss landscape (20 karma, 2y, 1 comment)

New tool for exploring EA Forum, LessWrong and Alignment Forum - Tree of Tags (9 karma, 3y, 0 comments)

Comments

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Filip Sondej · 11mo · 10

What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)

This way v won't have any dormant component, so the MLP output after patching also cannot use that dormant pathway.
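
To make the suggestion concrete, here is a rough sketch of what constraining v could look like. It is only an illustration under assumptions not stated in the comment: activations are collected as a NumPy array of shape (n_inputs, d) in the same space as v, the "used" subspace is taken to be the top principal directions covering a chosen fraction of variance, and the names `pca_subspace`, `project_onto_subspace`, and `var_threshold` are hypothetical.

```python
import numpy as np


def pca_subspace(activations, var_threshold=0.99):
    """Orthonormal basis (d, k) for the top principal directions of the
    activations that together explain `var_threshold` of the variance.

    Assumption: directions with (near-)zero variance across inputs are
    treated as unused by the MLP.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(explained, var_threshold)) + 1, len(s))
    return vt[:k].T


def project_onto_subspace(v, basis):
    """Project v onto span(basis) and renormalize to unit length."""
    projected = basis @ (basis.T @ v)
    norm = np.linalg.norm(projected)
    if norm == 0:
        raise ValueError("v has no component in the given subspace")
    return projected / norm


# Toy usage: 16-dim activations that only ever vary within a 3-dim subspace.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 16))
activations = rng.normal(size=(500, 3)) @ mixing

basis = pca_subspace(activations)
v = rng.normal(size=16)                      # unconstrained candidate direction
v_constrained = project_onto_subspace(v, basis)
```

In an optimization loop, the projection would be reapplied after each update to v (or v would be parameterized directly as `basis @ coeffs`). Note that centering before PCA also discards any direction along which activations are constant but nonzero; whether that is desirable depends on the setup.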