AI ALIGNMENT FORUM

Wes Gurnee

OR PhD student at MIT working on interpretability.

Find out more here: https://wesg.me/

Posts

Refusal in LLMs is mediated by a single direction (77 karma, 1y, 44 comments)

SAE reconstruction errors are (empirically) pathological (52 karma, 1y, 1 comment)

Finding Neurons in a Haystack: Case Studies with Sparse Probing (19 karma, 2y, 1 comment)

Comments
