User Comment Replies — AI Alignment Forum

Comparing Anthropic's Dictionary Learning to Ours

This is cool! These sparse features should be easily "extractable" by the transformer's key, query, and value weights in a single layer. Therefore, I'm wondering if these weights can somehow make it easier to "discover" the sparse features?

Penalize Model Complexity Via Self-Distillation

research_prime_space · 2y

I don't really think that 1. would be true -- following DAN-style prompts is the minimum complexity solution. You want to act in accordance with the prompt.
Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model.

The only scenario where I think self-distillation is useful would be if 1) you train an LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset; the self-distilled model would then likely no longer be deceptive/power-seeking.
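To make the mechanics of step 3 concrete, here is a minimal sketch of self-distillation, assuming a toy logistic-regression "model" in place of an LLM. The names `predict` and `self_distill`, the teacher weights, and the synthetic inputs are all hypothetical and chosen for illustration. It only shows a fresh student of the same architecture being trained on the teacher's soft outputs; it does not test whether deceptive behavior is removed.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    # weights = [bias, w1, w2, ...]; returns P(y = 1 | x)
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def self_distill(teacher, inputs, steps=2000, lr=0.5):
    """Train a fresh student (same architecture as the teacher) on the
    teacher's soft labels over `inputs`, using full-batch gradient descent."""
    student = [0.0] * len(teacher)
    soft_labels = [predict(teacher, x) for x in inputs]
    n = len(inputs)
    for _ in range(steps):
        grad = [0.0] * len(student)
        for x, target in zip(inputs, soft_labels):
            # Gradient of cross-entropy w.r.t. the logit is (prediction - target).
            err = predict(student, x) - target
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        student = [w - lr * g / n for w, g in zip(student, grad)]
    return student

random.seed(0)
# Stand-in for "the original dataset": only inputs are needed, not labels.
inputs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
teacher = [0.3, 2.0, -1.5]  # hypothetical teacher weights
student = self_distill(teacher, inputs)

# Largest disagreement between teacher and student on the distillation inputs.
max_gap = max(abs(predict(teacher, x) - predict(student, x)) for x in inputs)
```

Note that the student only ever sees the teacher's behavior on `inputs`. Anything the teacher does off that distribution, such as a backdoor that never fires on the original dataset, puts no constraint on the student.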

Penalize Model Complexity Via Self-Distillation

research_prime_space · 2y

I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter.

But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)?

Stephen Casper · 2y

One idea that comes to mind is to see if a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts. I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans: https://arxiv.org/abs/2204.06974

Penalize Model Complexity Via Self-Distillation

research_prime_space · 2y

I just wanted to add another angle. Neural networks have a fundamental "simplicity bias", where they learn low frequency components exponentially faster than high frequency components. Thus, self-distillation is likely to be more efficient than training on the original dataset (the function you're learning has fewer high frequency components). This paper formalizes this claim.
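One standard way to make the frequency claim precise is the linearized (kernel) regime. This is offered as an illustrative assumption and is not necessarily the formalization used in the linked paper. Under gradient flow on squared loss with a fixed kernel $K$ whose eigenpairs are $(\lambda_i, v_i)$, the training residual $r(t)$ decays independently along each eigendirection:

$$
r(t) = \sum_i e^{-\lambda_i t}\,\bigl(v_i^\top r(0)\bigr)\, v_i .
$$

Components with large $\lambda_i$ vanish quickly and components with small $\lambda_i$ linger. The simplicity-bias claim then amounts to saying that, for typical architectures, low-frequency directions have the larger eigenvalues. A target with less high-frequency content therefore has less mass in the slow directions.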

But in practice, what this means is that training GPT-3.5 from scratch is hard but simply copying GPT-3.5 is pretty easy. Stanford was recently able to fine-tune a pretty bad 7B ...


All of research_prime_space's Comments + Replies