User Comment Replies — AI Alignment Forum

Impact stories for model internals: an exercise for interpretability researchers

I had cached impressions that AI safety people were interested in auditing, ELK, and scalable oversight.

A few AIS people who volunteered to give feedback before the workshop (so biased towards people who were interested in the title) each named a unique top choice: scientific understanding (specifically threat models), model editing, and auditing (so 2/3 were unexpected for me).

During the workshop, attendees (again, biased, as they self-selected into the session) expressed excitement most about auditing, unlearning, MAD, ELK, and general scientific underst... (read more)

1Lawrence Chan1y

Thanks! Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).

A comparison of causal scrubbing, causal abstractions, and related methods

Jenny Nitishinskaya2y10

Not sure if I'm fully responding to your q but...

there might be no canonical topology for the original computation

This sounds right to me, and overall I mostly think of treeification as just a kind of extensional rewrite (plus adding more inputs).

these hypotheses can't be understood as making precise claims about the original computation anymore

I think of the underlying graph as providing some combination of 1) causal relationships, and 2) smaller pieces to help with search/reasoning, rather than being an object we inherently care about. (It's possibly use... (read more)

A comparison of causal scrubbing, causal abstractions, and related methods

Jenny Nitishinskaya2y40

This is a nice comparison. I particularly like the images :) and drawing the comparisons setting aside historical accidents.

A few comments that came to mind as I was reading:

Perform an interchange intervention on the treeification of L such that the corresponding intervention in the treeification of H would not change any values.

As far as I saw, you don’t mention how causal scrubbing specifies selecting the interchange intervention (the answer is: preserving the distribution of inputs to nodes in H, see e.g. the Appendix post). I think ... (read more)

1Erik Jenner2y

Thanks! Mostly agree with your comments. I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different hypotheses we can make, but these hypotheses can't be understood as making precise claims about the original computation anymore. I could even see an argument that allowing rewrites in some sense always makes hypotheses less precise, but I feel pretty confused about what rewrites even are given that there might be no canonical topology for the original computation.

AI ALIGNMENT FORUM
AF

All of jenny's Comments + Replies