User Comment Replies — AI Alignment Forum

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it hints that language semantics can be linearised as vector spaces (I hope I will be able to formalise this intuition mathematically...

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Gianluca Calcagni, 8mo

Neel Nanda, 8mo

Glad you liked the post! I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that vary independently and can be added. I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on).
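
For readers who want something concrete, here is a minimal toy sketch of what "linearly combining" steering vectors could mean: take a weighted sum of the vectors and add it to an activation. Everything in it is an assumption for illustration only. The helper names (`combine_steering_vectors`, `steer`, `dot`), the 4-dimensional vectors, and the `honesty` / `politeness` labels are hypothetical and do not come from any real model, SAE, or library.

```python
# Toy illustration only: vectors, labels, and coefficients are made up.

def combine_steering_vectors(vectors, coefficients):
    """Return the weighted sum of equal-length steering vectors."""
    if len(vectors) != len(coefficients):
        raise ValueError("need exactly one coefficient per vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [
        sum(c * v[i] for v, c in zip(vectors, coefficients))
        for i in range(dim)
    ]


def steer(activation, steering_vector):
    """Add a steering vector to a single activation vector."""
    return [a + s for a, s in zip(activation, steering_vector)]


def dot(u, v):
    """Dot product, used here to read off the projection on a direction."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical unit directions, chosen orthogonal to keep the arithmetic clean.
honesty = [1.0, 0.0, 0.0, 0.0]
politeness = [0.0, 1.0, 0.0, 0.0]

# Hypothetical activation at one token position.
activation = [0.5, -0.5, 2.0, 1.0]

combined = combine_steering_vectors([honesty, politeness], [2.0, 3.0])
steered = steer(activation, combined)

# Change in the projection onto each direction after steering.
honesty_shift = dot(steered, honesty) - dot(activation, honesty)
politeness_shift = dot(steered, politeness) - dot(activation, politeness)
```

In this toy case the two directions are orthogonal, so each projection shifts by exactly its own coefficient. Steering vectors extracted from a real model are generally not orthogonal, so the combined vector can shift each behaviour by more or less than intended; whether the behaviours still compose cleanly is the open empirical question raised in the thread.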


All of Gianluca Calcagni's Comments + Replies