Oliver Daniels

PhD Student at UMass Amherst

IMO most exciting mech-interp research since SAEs, great work. 

A few thoughts / questions:

  • curious about your optimism regarding learned masks as an attribution method - the problem of learning mechanisms that don't correspond to the model's mechanisms seems real for circuits (see InterpBench) and would plausibly bite here too (though this should be resolvable with benchmarks on downstream tasks once APD is more mature)
  • relatedly, the hidden layer "minimality" loss is really cool, excited to see how useful this is in mitigating the problem above (and diagnosing it)
  • have you tried "ablating" weights by sampling from the initialization distribution rather than zero ablating? haven't thought about it too much and don't have a clear argument for why it would improve anything, but it feels more akin to resample ablation and "presumption of independence" type stuff
  • seems great for mechanistic anomaly detection! very intuitive to map APD to surprise accounting (I was vaguely trying to get at a method like APD here)
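To make the third bullet concrete, here's a minimal sketch of the two ablation styles (the Gaussian initialization distribution and its std are my assumptions for illustration, not APD's actual init):

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_ablate(weight):
    # Zero ablation: replace the component's weights with zeros.
    return np.zeros_like(weight)

def resample_ablate(weight, init_std=0.02):
    # "Resample"-style ablation: redraw the weights from an assumed
    # initialization distribution (here Gaussian with std init_std)
    # instead of zeroing them out.
    return rng.normal(0.0, init_std, size=weight.shape)

W = rng.normal(size=(4, 4))
W_zero = zero_ablate(W)
W_resample = resample_ablate(W)
```

The intuition is that a resampled weight matrix stays "on the initialization manifold" rather than at the (possibly off-distribution) zero point, which is the analogy to resample ablation of activations.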

oh I see, by all(sensor_preds) I meant sum([logit_i for i in range(n_sensors)]) (the probability that all sensors are activated). Makes sense, thanks!

Is "individual measurement prediction AUROC" (a) or (b)?

(a) mean(AUROC(sensor_i_pred, sensor_i))

(b) AUROC(all(sensor_preds), all(sensors))
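In case it helps disambiguate, here's how I'd compute the two options (a minimal sketch: `auroc` is a plain rank-based helper, and aggregating per-sensor probabilities by product assumes independence, both my assumptions rather than the paper's stated method):

```python
import numpy as np

def auroc(scores, labels):
    # Rank-based AUROC: probability a random positive outranks a
    # random negative, with ties counted as 0.5.
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def auroc_a(sensor_preds, sensor_labels):
    # (a): per-sensor AUROC, averaged over sensors.
    # sensor_preds, sensor_labels: shape (n_examples, n_sensors)
    return np.mean([auroc(sensor_preds[:, i], sensor_labels[:, i])
                    for i in range(sensor_preds.shape[1])])

def auroc_b(sensor_preds, sensor_labels):
    # (b): AUROC of the aggregated "all sensors on" prediction
    # (product of per-sensor probabilities) against the ground-truth
    # conjunction of sensor labels.
    return auroc(sensor_preds.prod(axis=1), sensor_labels.all(axis=1))
```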

Did the paper report the accuracy of the pure prediction model (on the pure prediction task)? (I'm trying to replicate and want a sanity check.)

[This comment is no longer endorsed by its author]