User Comment Replies — AI Alignment Forum

Thanks for this write-up! In case it’s of interest, we have also performed some exploratory interpretability work using the SVD of model weights.

We examine convolutional layers in models trained on a couple of common vision tasks (CIFAR-10, ImageNet). In short, we similarly take the SVD of the weights in a CNN layer, $W_{l} = U S V^{T}$, and project the hidden layer activations $x_{l}$ onto the $i$th singular vector, $V[i, :] x_{l}$. These singular direction “neurons” can then be studied with interpretability methods: we use hypergraphs, feature visualizati...
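For concreteness, here is a minimal NumPy sketch of the projection described above. It is not the code from our experiments, and it rests on a few assumptions that the comment does not spell out: the convolutional weight tensor of shape `(out_c, in_c, kh, kw)` is flattened to a matrix of shape `(out_c, in_c * kh * kw)`, the activations are given as already-extracted input patches of matching length, and $V[i, :]$ is read as the $i$th row of $V^{T}$ (the `Vh` array returned by `np.linalg.svd`). The function name and the random test data are hypothetical.

```python
import numpy as np


def singular_direction_activations(weight, patches):
    """Project activation patches onto the singular directions of a conv layer.

    weight:  array of shape (out_c, in_c, kh, kw), the layer's kernel (assumed layout).
    patches: array of shape (n, in_c * kh * kw), flattened input patches (assumed layout).
    Returns U, S, Vh and the projections, where projections[:, i] = Vh[i, :] @ patch.
    """
    out_c = weight.shape[0]
    # Flatten each output filter into one row: W has shape (out_c, in_c * kh * kw).
    W = weight.reshape(out_c, -1)
    # Thin SVD: W = U @ diag(S) @ Vh, with singular values sorted in descending order.
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    # Coefficient of every patch along every right singular direction.
    projections = patches @ Vh.T
    return U, S, Vh, projections


# Hypothetical random data standing in for a trained layer and real activations.
rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 3, 3, 3))
patches = rng.standard_normal((5, 27))
U, S, Vh, projections = singular_direction_activations(weight, patches)
```

Each column of `projections` plays the role of one singular direction "neuron". Under these assumptions the layer's usual pre-activation output is recovered as `(projections * S) @ U.T`, so the projections are a change of basis for what the layer computes rather than a new quantity.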


All of Davis Brown's Comments + Replies