All of Xander Davies's Comments + Replies

Fourth, and most importantly, if superposition happens more in narrower layers, and if superposition is a cause of adversarial vulnerabilities, this would predict that deep, narrow networks would be less adversarially robust than shallow, wide networks that achieve the same performance and have the same number of parameters. However, Huang et al. (2022) found the exact opposite to be the case.

I'm not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networks…

Stephen Casper
We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment. A lot of how this is interpreted depends on which exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. That is the point I think the Huang paper helps to make. The fact that deep and thin networks tend to be more robust suggests that representing information more densely w.r.t. the neurons in a layer does not make these networks less robust than wide, shallow nets.

Makes sense! Depends on whether you're thinking about the values as "estimating zero ablation" or "estimating importance."
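
A toy way to see the information-per-neuron point above, with invented layer sizes (none of these numbers come from the Huang paper): two MLPs with equal parameter budgets, where the thinner net must pack the same assumed per-layer feature count into fewer neurons.

```python
# Toy illustration (invented sizes): two MLPs with equal parameter budgets.
# Suppose each layer must encode the same fixed amount of information about
# an example, abstracted here as a feature count.
features_per_layer = 1024  # assumed, for illustration only

def mlp_params(width: int, depth: int) -> int:
    """Parameters in `depth` width-by-width linear layers (biases ignored)."""
    return depth * width * width

configs = [
    {"width": 512, "depth": 2},  # wide, shallow: 524,288 params
    {"width": 256, "depth": 8},  # deep, narrow: 524,288 params
]

for cfg in configs:
    density = features_per_layer / cfg["width"]  # "information" per neuron
    print(cfg, "params:", mlp_params(**cfg), "features/neuron:", density)
```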

Very cool work! 

  • In the attention attribution section, you use clean_pattern * clean_pattern_grad as an approximation of zero ablation; should this be -clean_pattern * clean_pattern_grad? Zero ablation's approximation is (0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad (a sketch of the sign question follows this list).
    • Currently, negative name movers end up with negative attributions, but we'd like them to be positive (since zero ablating helps performance and moves our metric towards one), right?
    • Of course, this doesn't matter when you are just looking at magnitudes.
  • Cool to …
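
For concreteness, here is a minimal sketch of the sign question above. The tensors are random stand-ins for a cached attention pattern and the gradient of the metric with respect to it; the names echo the post, but nothing here is the actual demo code.

```python
import torch

# Stand-ins for a cached attention pattern and the gradient of the metric
# with respect to it (batch, head, query_pos, key_pos); values are random.
clean_pattern = torch.rand(1, 8, 16, 16)
clean_pattern_grad = torch.randn(1, 8, 16, 16)

# Attribution patching estimates the effect of replacing an activation via
# a first-order approximation:
#   delta_metric ≈ (patched_value - clean_value) * d(metric)/d(activation)
# Zero ablation sets patched_value = 0:
zero_ablation_estimate = (0 - clean_pattern) * clean_pattern_grad

# The convention in the demo drops the minus sign, so positive = important:
importance = clean_pattern * clean_pattern_grad

# The two differ only by sign, so per-head magnitudes agree:
assert torch.allclose(zero_ablation_estimate.abs(), importance.abs())
```
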
Neel Nanda
These bugs should be fixed, thanks for flagging!
Neel Nanda
Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did. And yeah, I would be excited to see this applied to mean ablation! Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...
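
As a follow-on, a sketch of what the mean-ablation variant might look like under the same first-order formula; the reference patterns and shapes are invented for illustration.

```python
import torch

# Invented stand-ins: a clean attention pattern, its metric gradient, and
# cached patterns from a reference corpus (batch, head, query_pos, key_pos).
clean_pattern = torch.rand(1, 8, 16, 16)
clean_pattern_grad = torch.randn(1, 8, 16, 16)
reference_patterns = torch.rand(32, 8, 16, 16)

# Mean ablation patches toward the average pattern over the reference set,
# so the patched value in the first-order formula is the mean pattern:
mean_pattern = reference_patterns.mean(dim=0, keepdim=True)
mean_ablation_estimate = (mean_pattern - clean_pattern) * clean_pattern_grad
```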