Makes sense! It depends on whether you're thinking about the values as "estimating zero ablation" or "estimating importance."
Very cool work!
`clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? Zero ablation's approximation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
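To make the sign question concrete, here is a minimal sketch in plain Python. Everything in it is a made-up toy: `WEIGHTS`, `metric`, and `metric_grad` are hypothetical stand-ins for a real model's metric and its gradient with respect to the attention pattern, not anything from the original post. It compares the first-order estimate `(0 - clean_pattern) * clean_pattern_grad` against the true change in the metric when one entry is set to zero.

```python
# Toy setup (all names and numbers are illustrative assumptions).
WEIGHTS = [2.0, -1.0, 0.5]
clean_pattern = [0.5, 0.3, 0.2]


def metric(pattern):
    """A slightly nonlinear toy metric of the pattern entries."""
    linear = sum(w * p for w, p in zip(WEIGHTS, pattern))
    quadratic = 0.1 * sum(p * p for p in pattern)
    return linear + quadratic


def metric_grad(pattern):
    """Analytic gradient of `metric` with respect to each entry."""
    return [w + 0.2 * p for w, p in zip(WEIGHTS, pattern)]


def exact_zero_ablation_effect(pattern, index):
    """True change in the metric when entry `index` is set to zero."""
    ablated = list(pattern)
    ablated[index] = 0.0
    return metric(ablated) - metric(pattern)


clean_pattern_grad = metric_grad(clean_pattern)

# First-order estimate of zero ablation: (ablated - clean) * grad.
zero_ablation_estimate = [
    (0.0 - p) * g for p, g in zip(clean_pattern, clean_pattern_grad)
]

# The "importance" reading: same magnitude, opposite sign.
importance_estimate = [
    p * g for p, g in zip(clean_pattern, clean_pattern_grad)
]

exact_effects = [
    exact_zero_ablation_effect(clean_pattern, i)
    for i in range(len(clean_pattern))
]

for i, (est, exact) in enumerate(zip(zero_ablation_estimate, exact_effects)):
    print(f"entry {i}: estimate={est:+.4f} exact={exact:+.4f}")
```

In this toy, the negated product tracks the sign and rough size of the true ablation effect, while the un-negated product reads as "how much the metric relies on this entry." The two differ only by sign, so which one is right depends on which quantity is being reported.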
I'm not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networ...