My guess is that this result is very sensitive to the design of the training dataset:
the input/output data pairs are $(e_i, e_i)$ for each feature $i$, where $e_i$ is the $i$th basis vector.
In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input. I'd be interested to see experiments where each feature is turned on with some (not too small) probability, independently of all other features, similarly to the original toy models setting. This would result in some inputs where feature i and j are on simultaneously. My prediction would be that polysemanticity goes down very significantly (probably to zero if the probabilities are high enough and the training is done for long enough).
I also don't understand why L1 regularization on activations is necessary to show incidental polysemanticity given your setup. Even if you remove the L1 regularization on activations, it is still the case that "benign collisions" impose no cost on the model, since feature i and feature j are never simultaneously present in a given input. So if you do get a benign collision, what causes it to go away? Overall my expectation would be that without the L1 regularization on activations (and with the training dataset as described in this post), you'd get a complicated mess where every neuron is highly polysemantic, i.e. even more polysemanticity than described in this post. Why is that wrong?
Thanks for the feedback!
In particular, I think it is likely very sensitive to the implicit assumption that feature i and feature j never co-occur on a single input.
Definitely! I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.
Overall my expectation would be that without the L1 regularization on activations (and with the training dataset as described in this post), you'd get a complicated mess where every neuron is highly polysemantic, i.e. even more polysemanticity than described in this post. Why is that wrong?
If there is no L1 regularization on activations, then every hidden neuron would indeed be highly "polysemantic" in the sense that it has nonzero weights for each input feature. But on the other hand, the whole encoding space would become rotationally symmetric, and when that's the case it feels like polysemanticity shouldn't be about individual neurons (since the canonical basis is not special anymore) and instead about the angles that different encodings form. In particular, as long as $m \geq n$, the space of optimal solutions for this setup requires the encodings to form angles of at least 90° with each other, and it's unclear whether we should call this polysemantic.
So one of the reasons why we need L1 regularization is to break the rotational symmetry and create a privileged basis: that way, it's actually meaningful to ask whether a particular hidden neuron is representing more than one feature.
Good point on the rotational symmetry, that makes sense now.
I still think that this assumption is fairly realistic because in practice, most pairs of unrelated features would co-occur only very rarely, and I expect the winner-take-all dynamic to dominate most of the time. But I agree that it would be nice to quantify this and test it out.
Agreed that's a plausible hypothesis. I mostly wish that in this toy model you had a hyperparameter for the frequency of co-occurrence of features, and identified how it affects the rate of incidental polysemanticity.
Great work! Love the push for intuitions especially in the working notes.
My understanding of the superposition hypothesis from the TMS paper has been (feel free to correct me!):
Is it possible that the features here are not sufficiently basis-aligned, and this is closer to case 1? As you already commented, demonstrating polysemanticity when the hidden layer has a non-linearity and $m > n$ would be principled imo.
This is a preliminary research report; we are still building on initial work and would appreciate any feedback.
Summary
Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks,[1] with implications for AI safety.
The classic origin story of polysemanticity is that the data contains more "features" than there are neurons, such that learning to solve a task forces the network to allocate multiple unrelated features to the same neuron, threatening our ability to understand the network's internal processing.
In this work, we present a second and non-mutually exclusive origin story of polysemanticity. Using a combination of theory and experiments, we show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this incidental polysemanticity.
Intuition
The reason why neural networks can learn anything despite starting out with completely random weights is that, just by random chance, some neurons will happen to be very slightly correlated[2] with some useful feature, and this correlation gets amplified by gradient descent until the feature is accurately represented. If in addition to this there is some incentive for activations to be sparse, then the feature will tend to be represented by a single neuron as opposed to a linear combination of neurons: this is a winner-take-all dynamic.[3] When a winner-take-all dynamic is present, then by default, the neuron that is initially most correlated with the feature will be the neuron that wins out and represents the feature when training completes.
Therefore, if at the start of training, one neuron happens to be the most correlated neuron with two unrelated features (say dogs and airplanes), then this might[4] continue being the case throughout the learning process, and that neuron will ultimately end up taking full responsibility for representing both features. We call this phenomenon incidental polysemanticity. Here "incidental" refers to the fact that this phenomenon is contingent on the random initializations of the weights and the dynamics of training, rather than being necessary in order to achieve low loss (and in fact, in some circumstances, incidental polysemanticity might cause the neural network to get stuck in a local optimum).
How often should we expect this to happen? Suppose that we have $n$ useful features to represent and $m \geq n$ neurons to represent them with (so that it is technically possible for each feature to be represented by a different neuron). By symmetry, the probability that the $i$th and $j$th feature "collide", in the sense of being initially most correlated with the same neuron, is exactly $1/m$. And there are $\binom{n}{2} = n(n-1)/2$ pairs of features, so on average we should expect
$$\underbrace{\binom{n}{2}}_{\text{number of pairs } (i,j)} \times \underbrace{\frac{1}{m}}_{\text{probability of } (i,j) \text{ colliding}} = \frac{n(n-1)}{2m} = \Theta\!\left(\frac{n^2}{m}\right)$$
collisions[5] overall. In particular, this means that
Our experiments in a toy model show that this is precisely what happens, and a constant fraction of these collisions result in polysemantic neurons, despite the fact that there would be enough neurons to avoid polysemanticity entirely.
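As a sanity check on this counting argument, here is a minimal simulation sketch (hypothetical code, not our experiment code) that assigns each feature a uniformly random "most correlated" neuron and counts colliding pairs; the empirical average closely matches $n(n-1)/2m$.

```python
import numpy as np

def average_collisions(n: int, m: int, trials: int = 1000, seed: int = 0) -> float:
    """Estimate the expected number of feature pairs whose initially
    most-correlated neuron coincides, assuming each feature's "winner"
    is an independent uniform draw among the m neurons."""
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(trials):
        winners = rng.integers(0, m, size=n)                  # winning neuron for each feature
        _, sizes = np.unique(winners, return_counts=True)
        totals.append(int((sizes * (sizes - 1) // 2).sum()))  # colliding pairs per neuron
    return float(np.mean(totals))

n, m = 256, 1024  # hypothetical sizes
print(average_collisions(n, m), n * (n - 1) / (2 * m))  # both should be close to ~31.9
```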
Outline
In the rest of this post, we
Setup
Model
We consider a model similar to the ReLU-output model in Toy Models of Superposition. It is an autoencoder with n features (inputs/outputs) which
The output is computed as $\mathrm{ReLU}(WW^T x)$:
The main difference compared to the model from Toy Models of Superposition is the l1 regularization. The role of the l1 regularization is to push for sparsity in the activations and therefore induce a winner-take-all dynamic. We picked this model because it makes incidental polysemanticity particularly easy to demonstrate and study, but we do think the story it tells is representative (see the "Discussion and future work" section for more on this).
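To make the setup concrete, here is a minimal PyTorch sketch of this model and loss (the hyperparameter values are placeholders rather than the ones used in our experiments; the actual code is linked in the "Numerical simulations" section):

```python
import torch

n, m, lam = 256, 1024, 1e-3  # placeholder values for n, m, and the l1 coefficient lambda
W = torch.nn.Parameter(torch.randn(n, m) / m**0.5)  # row W_i is the encoding of feature i

def loss_fn(x: torch.Tensor) -> torch.Tensor:
    h = x @ W                   # hidden activations W^T x (one row per input)
    out = torch.relu(h @ W.T)   # reconstruction ReLU(W W^T x)
    return ((out - x) ** 2).sum() + lam * h.abs().sum()  # squared error + l1 on activations

optimizer = torch.optim.SGD([W], lr=1e-2)
x = torch.eye(n)                # the dataset: one basis vector e_i per feature
for _ in range(10_000):
    optimizer.zero_grad()
    loss_fn(x).backward()
    optimizer.step()
```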
We make the following assumptions on parameter values:
Possible solutions
Let $W_i \in \mathbb{R}^m$ be the $i$th row of $W$. It tells us how the $i$th feature is encoded in the hidden layer. When the input is $e_i$, the output of the model can then be written as $\left(\mathrm{ReLU}(W_1 \cdot W_i), \ldots, \mathrm{ReLU}(W_n \cdot W_i)\right)$, so for this to be equal to $e_i$ we need $\|W_i\|^2 = 1$[6] and $W_i \cdot W_j \leq 0$ for $j \neq i$.
Let $f_k \in \mathbb{R}^m$ denote the $k$th basis vector of $\mathbb{R}^m$. There are both monosemantic and polysemantic solutions that satisfy these conditions:
Loss and dynamics
Let us consider the total squared error loss, which can be decomposed as
$$L = \sum_i \left( \left(1 - \|W_i\|^2\right)^2 + \sum_{j \neq i} \mathrm{ReLU}(W_i \cdot W_j)^2 + \lambda \|W_i\|_1 \right).$$
The training dynamics are
$$\frac{dW_i}{dt} := -\frac{\partial L}{\partial W_i} = \underbrace{4\left(1 - \|W_i\|^2\right) W_i}_{\text{feature benefit}} - \underbrace{4 \sum_{j \neq i} \mathrm{ReLU}(W_i \cdot W_j)\, W_j}_{\text{interference}} - \underbrace{\lambda\, \mathrm{sign}(W_i)}_{\text{regularization}},$$
where $t$ is the training time, which you can roughly think of as the number of training steps. For simplicity, we'll ignore the constants 4 going forward.[7]
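For concreteness, the three terms of this gradient flow could be computed directly as follows (a numpy sketch, dropping the constants 4 as above):

```python
import numpy as np

def forces(W: np.ndarray, lam: float):
    """Return the feature benefit, interference, and regularization
    components of dW/dt, one row per encoding W_i."""
    sq_norms = (W ** 2).sum(axis=1, keepdims=True)   # ||W_i||^2
    feature_benefit = (1 - sq_norms) * W
    overlaps = np.maximum(W @ W.T, 0.0)              # ReLU(W_i . W_j)
    np.fill_diagonal(overlaps, 0.0)                  # exclude j = i
    interference = -overlaps @ W                     # -sum_{j != i} ReLU(W_i . W_j) W_j
    regularization = -lam * np.sign(W)
    return feature_benefit, interference, regularization
```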
These dynamics can be decomposed into three intuitive "forces" acting on the encodings $W_i$:
The winning neuron takes it all
See our working notes (in particular, Feature benefit vs regularization) for a more formal treatment.
Sparsity force
For a moment, let's ignore the interference force, and figure out how (and how fast) regularization will push towards sparsity in some encoding Wi. Since we're only looking at feature benefit and regularization, the other encodings Wj have no influence at all on what happens in Wi.
Assuming $\|W_i\| < 1$, each weight $W_{ik}$ is
Crucially, the upwards push is relative to how large $W_{ik}$ is, while the downwards push is absolute. This means that weights whose absolute value is above some threshold $\theta$ will grow, while those below the threshold will shrink, creating a "rich get richer and poor get poorer" dynamic that will push for sparsity. This threshold is given by
$$\left(1 - \|W_i\|^2\right) W_{ik} = \lambda\, \mathrm{sign}(W_{ik}) \iff |W_{ik}| = \frac{\lambda}{1 - \|W_i\|^2} =: \theta,$$
so we have
$$\frac{d|W_{ik}|}{dt} = \underbrace{\left(1 - \|W_i\|^2\right) |W_{ik}|}_{\text{feature benefit}} - \underbrace{\lambda\, \mathbb{1}[W_{ik} \neq 0]}_{\text{regularization}} = \begin{cases} \underbrace{\left(1 - \|W_i\|^2\right)}_{\text{constant in } k}\, \underbrace{\left(|W_{ik}| - \theta\right)}_{\text{distance from threshold}} & \text{if } W_{ik} \neq 0 \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$
We call this combination of feature benefit and regularization force the sparsity force. It uniformly stretches the gaps between (the absolute values of) different nonzero weights.
Note that the threshold $\theta$ is not fixed: we will see that as $W_i$ gets sparser, $\|W_i\|^2$ will get closer to 1, which increases the threshold and allows it to get rid of larger and larger entries, until only one is left. But how fast will this go?
How fast does it sparsify?
The next two subsections are not critical for understanding the overall message; feel free to skip directly to the section titled "Interference arbiters collisions between features" if you're happy with just accepting the fact that Wi will progressively sparsify over some predictable length of training time.
In order to track how fast $W_i$ sparsifies, we will look at its $l_1$ norm $\|W_i\|_1 = \sum_k |W_{ik}|$ as a proxy for how many nonzero coordinates are left. Indeed, we will have $\|W_i\| \approx 1$ throughout, so if $W_i$ has $m'$ nonzero values at any point in time, their typical value will be $\pm 1/\sqrt{m'}$, which means $\|W_i\|_1 \approx m' \cdot \frac{1}{\sqrt{m'}} = \sqrt{m'}$.
Since the sparsity force is proportional to $1 - \|W_i\|^2$, we need to get a sense of what values $\|W_i\|$ will take over time. As it turns out, $\|W_i\|$ changes relatively slowly, so we can get useful information by assuming the derivative $\frac{d\|W_i\|^2}{dt}$ is 0:
$$0 \approx \frac{d\|W_i\|^2}{dt} = 2\,\frac{dW_i}{dt} \cdot W_i = 2\left( \underbrace{\left(1 - \|W_i\|^2\right)\|W_i\|^2}_{\text{from feature benefit}} - \underbrace{\lambda \|W_i\|_1}_{\text{from regularization}} \right),$$
which means $1 - \|W_i\|^2 \approx \lambda \frac{\|W_i\|_1}{\|W_i\|^2}$. Plugging this back into $\frac{d\|W_i\|_1}{dt} = \sum_k \frac{d|W_{ik}|}{dt}$ and using reasonable assumptions about the initial distribution of $W_i$ (see our working notes for details), we can prove that $\|W_i\|_1$ will decrease as $1/\lambda t$ with training time $t$:
$$\|W_i(t)\|_1 = \begin{cases} \Theta(\sqrt{m}) & t \leq \frac{1}{\lambda \sqrt{m}} \\ \Theta(1/\lambda t) & \frac{1}{\lambda \sqrt{m}} \leq t \leq \frac{1}{\lambda} \\ \Theta(1) & t \geq \frac{1}{\lambda}. \end{cases}$$
Correspondingly, if we approximate the number $m'$ of nonzero coordinates as $\|W_i\|_1^2$, it will start out at $m$, decrease as $1/(\lambda t)^2$, then reach 1 at training time $t = \Theta(1/\lambda)$.
Numerical simulations
We compared our theoretical predictions for $\|W_i\|_1$ and $m'$ (if the constants hidden in $\Theta(\cdot)$ are assumed to be 1) to their actual values over training time when the interference force is turned off. The specific parameter values are $m := 10^5$ and $\lambda := 10^{-5}$, and the standard deviation of the $W_{ik}$'s was $\frac{0.9}{\sqrt{m}}$.
Code is available here.
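For reference, a stripped-down sketch of such a simulation, tracking a single encoding $W_i$ with the interference force turned off, could look like the following (this is not the linked code; it uses smaller illustrative values of $m$ and $\lambda$ so that it runs quickly, and handles the $l_1$ term with a proximal soft-thresholding step so that weights can reach exactly zero):

```python
import numpy as np

m, lam, dt = 10_000, 1e-3, 0.1                   # illustrative values; dt is the discretization step
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.9 / np.sqrt(m), size=m)    # one encoding W_i at initialization

l1_history = []                                  # (training time, ||W_i||_1) pairs
for step in range(int(2 / (lam * dt))):          # run until t = 2 / lambda
    w = w + dt * (1 - (w ** 2).sum()) * w        # feature benefit force (no interference)
    w = np.sign(w) * np.maximum(np.abs(w) - lam * dt, 0.0)  # proximal step for the l1 term
    if step % 1000 == 0:
        l1_history.append((step * dt, np.abs(w).sum()))
```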
Interference arbiters collisions between features
What happens when you bring the interference force into this picture? In this section, we argue informally that the interference is initially weak if m≥n, and only becomes significant later on in training, in cases where two of the encodings Wi and Wj have a coordinate k such that Wik and Wjk are both large and have the same sign—when that's the case, the larger of the two wins out.
How strong is the interference?
First, observe that in the expression for the interference force on $W_i$, $-\sum_{j \neq i} \mathrm{ReLU}(W_i \cdot W_j)\, W_j$, each $W_j$ contributes only if the angle it forms with $W_i$ is less than $90°$. So the force will mostly point along the direction of $W_i$, but with the opposite sign. That means that we can get a good grasp on its strength by measuring its component in the direction of $W_i$, which we can do by taking an inner product with $W_i$.
We have
$$\left( \sum_{j \neq i} \mathrm{ReLU}(W_i \cdot W_j)\, W_j \right) \cdot W_i = \sum_{j \neq i} \mathrm{ReLU}(W_i \cdot W_j)\,(W_i \cdot W_j) = \sum_{j \neq i} \mathrm{ReLU}(W_i \cdot W_j)^2.$$
Initially, each encoding is a vector of $m$ i.i.d. normals of mean 0 and standard deviation $\Theta(1/\sqrt{m})$, so the distribution of the inner products $W_i \cdot W_j$ is symmetric around 0 and also has standard deviation $\Theta(1/\sqrt{m})$. This means that $\mathrm{ReLU}(W_i \cdot W_j)^2$ has mean $\Theta(1/m)$, and thus the sum has mean $\Theta(n/m)$. As long as $m \geq n$, this is dominated by the feature benefit force: indeed, the same computation for the feature benefit gives $\left((1 - \|W_i\|^2) W_i\right) \cdot W_i = (1 - \|W_i\|^2)\|W_i\|^2 = \Theta(1)$ as long as $\Omega(1) \leq \|W_i\|^2 \leq 1 - \Omega(1)$.
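These orders of magnitude are easy to check numerically at initialization (a sketch, using the same $0.9/\sqrt{m}$ initialization scale as in the numerical simulations above and hypothetical values of $n$ and $m$):

```python
import numpy as np

n, m = 256, 4096                           # hypothetical sizes with m >= n
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.9 / np.sqrt(m), size=(n, m))

overlaps = np.maximum(W @ W.T, 0.0)
np.fill_diagonal(overlaps, 0.0)
interference = (overlaps ** 2).sum(axis=1)       # sum_{j != i} ReLU(W_i . W_j)^2 ~ Theta(n/m)
sq_norms = (W ** 2).sum(axis=1)
feature_benefit = (1 - sq_norms) * sq_norms      # (1 - ||W_i||^2) ||W_i||^2      ~ Theta(1)

print(interference.mean(), feature_benefit.mean())  # the first is noticeably smaller than the second
```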
Moreover, over time, the positive inner products $W_i \cdot W_j > 0$ will tend to decrease exponentially. This is because the interference force on $W_i$ includes the term $-\mathrm{ReLU}(W_i \cdot W_j)\, W_j$ and the interference force on $W_j$ includes the term $-\mathrm{ReLU}(W_i \cdot W_j)\, W_i$. Together, they affect $W_i \cdot W_j$ as
$$\left(-\mathrm{ReLU}(W_i \cdot W_j)\, W_j\right) \cdot W_j + \left(-\mathrm{ReLU}(W_i \cdot W_j)\, W_i\right) \cdot W_i = -(W_i \cdot W_j)\left(\|W_i\|^2 + \|W_j\|^2\right) = -\Theta(W_i \cdot W_j)$$
as long as $\|W_i\|^2, \|W_j\|^2 = \Theta(1)$, which is definitely the case at the start and will continue to hold true throughout training.
Benign and malign collisions
On the other hand, the interference between two encodings $W_i$ and $W_j$ starts to matter significantly when it affects one coordinate much more strongly than the others (rather than affecting all coordinates proportionally, like the feature benefit force does). This is the case when $W_i$ and $W_j$ share only one nonzero coordinate: a single $k$ such that $W_{ik}, W_{jk} \neq 0$. Indeed, when that's the case, the interference force $-\mathrm{ReLU}(W_i \cdot W_j)\, W_j$
so only $W_{ik}$ can be affected by this force.
When this happens, there are two cases:
Polysemanticity will happen when the largest[8] coordinates in encodings $W_i$ and $W_j$ get into a benign collision. This happens with probability
$$\underbrace{\frac{1}{m}}_{\text{largest weight in } W_i \text{ is also largest in } W_j} \times \underbrace{\frac{1}{2}}_{\text{they have opposite signs}} = \frac{1}{2m},$$
so we should expect roughly $\binom{n}{2} \times \frac{1}{2m} \sim \frac{n^2}{4m}$ polysemantic neurons by the end.
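For instance, with $n = 256$ and $m = 1024$ (values in the range used in the experiments below), this predicts about $\frac{256^2}{4 \cdot 1024} = 16$ polysemantic neurons.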
Experiments
Training the model we described with $n := 256$ and $m$ ranging from 256 to 4096 shows that this trend of $\Theta\!\left(\frac{n^2}{m}\right)$ does hold, and the constant $\frac{1}{4}$ seems to be fairly accurate as well.
Discussion and future work
Implications for mechanistic interpretability
The fact that there are two completely different ways for polysemanticity to occur could have important consequences on how to deal with it.
To our knowledge, polysemanticity has mostly been studied in settings where the encoding space has no privileged basis: the space can be arbitrarily rotated without changing the dynamics, and in particular the corresponding layer doesn't have non-linearities or any regularization other than l2. In this setting, the features can be represented arbitrarily in the encoding space, and we usually observe interference (non-orthogonal encodings) only when there are more features than dimensions.
On the other hand, the incidental polysemanticity we have demonstrated here is inherently tied to the canonical basis, contingent on the random initialization and dynamics, and happens even when there are significantly more dimensions available than features.
This means that some tools that work against one type of polysemanticity might not work against the other. For example:
In addition, it would be interesting to find ways to distinguish incidental polysemanticity from necessary polysemanticity.
A more realistic toy model
The setup we studied is simplistic in several ways. Some of these ways are without loss of (much) generality, such as the fact that encoding and decoding matrices are tied together,[9] or the fact that the input features are basis vectors.[10]
But there are also some choices that we made for simplicity which might be more significant, and which it would be nice to investigate. In particular:
Gaps in the theory
We were able to give strong theoretical guarantees for the sparsification process by considering how the feature benefit force and regularization interact when interference is ignored, but we haven't yet been able to make confident theoretical claims about how the three forces interact together.
In particular:
Author contributions
see e.g. the "Polysemantic Neurons" section in Zoom In: An Introduction to Circuits ↩︎
When we say a neuron is correlated with a feature, what we more formally mean is that the neuron's activation is correlated with whether the feature is present in the input (where the correlation is taken over the data points). But the former is easier to say. ↩︎
Analogous phenomena are known under other names, such as "privileged basis". ↩︎
depending mostly on the specifics of the neural architecture and the data (but also on the random initializations of the weights) ↩︎
Here, we define a "collision" as the event that two features i and j collide. So for example there is a three-way collision between i, j and k, that would count as three collisions between i and j, i and k, and j and k. ↩︎
We use ∥⋅∥ to denote Euclidean length (l2 norm), and ∥⋅∥1 to denote Manhattan length (l1 norm). ↩︎
It's equivalent to making λ four times larger and making training time four times slower. ↩︎
This would not necessarily be the largest weight at initialization, since there might be significant collisions with other encodings, but the largest weight at initialization is still the most likely to win the race all things considered. ↩︎
We're referring to the fact that the encoding matrix $W^T$ is forced to be the transpose of the decoding matrix $W$. This assumption makes sense because even if they were kept independent and initialized to different values, they would naturally acquire similar values over time because of the learning dynamics. Indeed, the $i$th column of the encoding matrix and the $i$th row of the decoding matrix "reinforce each other" through the feature benefit force until they have an inner product of 1, and so as long as they start out small or if there is some weight decay, they would end up almost identical by the end of training. ↩︎
If the input features are not the canonical basis vectors but are still orthogonal (and the outputs are still basis vectors), then we could apply a fixed linear transformation to the encoding matrix and recover the same training dynamics. And in general it makes sense to consider orthogonal input features, because when the features themselves are not orthogonal (or at least approximately orthogonal), the question of what polysemanticity even is becomes more murky. ↩︎