Towards understanding-based safety evaluations

Here's another idea that is not quite there but could be a component of a solution here:

If a red-team finds some particular model failure (or even some particular benign model behavior), can you fix (or change/remove) that behavior exclusively by removing training data rather than adding it? Certainly I expect it to be possible to fix specific failures by fine-tuning on them, but if you can demonstrate that you can fix failures just by removing existing data, that demonstrates something meaningful about your ability to understand what your model is learning from each data point that it sees.

Assuming we're working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

[-]evhub3y30

Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.

And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

[-]evhub3y5-3

Another thought here:

If we're in a slow enough takeoff world, maybe it's fine to just have the understanding standard here be post-hoc, where labs are required to be able to explain why a failure occurred after it has already occurred. Obviously, at some point I expect us to have to deal with situations where some failures could be unrecoverable, but the hope here would be that if you can demonstrate a level of understanding that has been sufficient to explain exactly why all previous failures occurred, that's a pretty high bar, and it could plausibly be a high enough bar to prevent future catastrophic failures.

[-]evhub3y50

Thanks to Chris Olah for a helpful conversation here.

Some more thoughts on this:

One thing that seems pretty important here is to have your evaluation based around worst-case rather than average-case guarantees, and not tied to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat as you started with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal with worst-case failures in the real world. This is highly related to my discussion of best-case vs. worst-case transparency here.
Another thing worth pointing out here regarding using causal scrubbing for something like this is that causal scrubbing requires some base distribution that you're evaluating over, which means it could fall into a similar sort of trap to that in the first bullet point here. Presumably, if you wanted to build a causal-scrubbing-based safety evaluation, you'd just use the entire training distribution as the distribution you were evaluating over, which seems like it would help a lot with this problem, but it's still not completely clear that it would solve it, especially if you were just evaluating your average-case causal scrubbing loss over that distribution.

[-]Kshitij Sachan3y42

Causal Scrubbing: My main problem with causal scrubbing as a solution here is that only guarantees the sufficiency, but not the necessity, or your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.

Redwood has been experimenting with learning (via gradient descent) causal scrubbing explanations that are somewhat addressing your necessity point. Specifically:

"Larger" explanations are penalized more (size refers to the number of dimensions of the residual stream the explanation claims the model is using for a specific behavior).
Explanations must be adversarially robust: an adversary shouldn't be able to include additional parts of the model we claimed are unimportant and have a sizable effect on the scrubbed model's predictions.

This approach doesn't address all the concerns one might have with using causal scrubbing to understand models, but just wanted to flag that this is something we're thinking about as well.

[-]Zhu Xiaohu2y10

I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment

that would lead to safety or alignment goodharting problem.

^{^}

That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

73

Towards understanding-based safety evaluations

73