Related to: Call for research on evaluating alignment
Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):
Finetune a language model to report the activation of a particular neuron in text form.
E.g., you feed the model a random sentence that ends in a full stop, and the model should output a number from 1 to 10 that reflects a particular neuron's activation.
We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output. However, at lower layers it should be able to do this correctly, with some amount of finetuning.
How many layers do you have to go down before the model succeeds? How does this scale with (a) model size and (b) amount of training?
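Here is a minimal sketch of how the supervision signal could be constructed, assuming a GPT-2-style model loaded via Hugging Face transformers: a forward hook grabs the chosen neuron's activation at the final token, and decile binning maps it to a 1-10 label. The model name, layer index, neuron index, and binning scheme are all illustrative assumptions, not part of the proposal.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # assumption: any causal LM with accessible blocks works
LAYER_IDX = 6         # assumption: which transformer block to probe
NEURON_IDX = 1234     # assumption: which coordinate of that block's MLP output

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

captured = {}

def save_activation(module, inputs, output):
    # Keep the MLP output for the whole sequence; we read off the last token below.
    captured["act"] = output.detach()

# GPT-2 exposes its blocks as model.transformer.h; hook the MLP of the chosen block.
hook_handle = model.transformer.h[LAYER_IDX].mlp.register_forward_hook(save_activation)

def neuron_activation(sentence: str) -> float:
    """Activation of the chosen neuron at the final token (the full stop)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    return captured["act"][0, -1, NEURON_IDX].item()

def to_bucket(value: float, edges: torch.Tensor) -> int:
    # Map the raw activation onto a 1-10 label using precomputed decile edges.
    return int(torch.bucketize(torch.tensor(value), edges).item()) + 1

# Estimate decile edges from a calibration set (in practice, many random sentences).
calibration = ["The cat sat on the mat.", "Rain fell all afternoon.", "She closed the door."]
acts = torch.tensor([neuron_activation(s) for s in calibration])
edges = torch.quantile(acts, torch.linspace(0.1, 0.9, 9))

print(to_bucket(neuron_activation("The train was late again."), edges))
```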
One subtlety is that finetuning might end up changing that neuron’s activation. To avoid this, we could do something like the following (sketched in code after the list):
- Run the base model on the sentence
- Train the finetuned model to report the activation of the neuron in the base model, given the sentence
- Note whether the activation in the finetuned model now differs from the base model's
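A minimal sketch of that variant, reusing the tokenizer, hooked base model, `neuron_activation`, `to_bucket`, and `edges` from the snippet above: targets come only from the frozen base model, a fresh copy is finetuned to emit the digit, and a drift check compares the neuron in the two models afterwards. The prompt format, optimizer settings, and drift metric are assumptions.

```python
import torch

# The base model above stays frozen and keeps its hook; a fresh copy gets finetuned.
tuned_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tuned_model.train()
optimizer = torch.optim.AdamW(tuned_model.parameters(), lr=1e-5)

def report_prompt(sentence: str) -> str:
    # Assumed prompt format: the sentence followed by a cue to report the activation.
    return sentence + " Neuron activation (1-10):"

def train_step(sentence: str) -> float:
    # The target digit is computed from the frozen base model only.
    target = to_bucket(neuron_activation(sentence), edges)
    target_ids = tokenizer(" " + str(target), return_tensors="pt").input_ids
    prompt_ids = tokenizer(report_prompt(sentence), return_tensors="pt").input_ids

    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # supervise only the reported digit

    loss = tuned_model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def activation_drift(sentence: str) -> float:
    # Compare the chosen neuron in the finetuned copy against the frozen base model.
    base_act = neuron_activation(sentence)
    grabbed = {}
    handle = tuned_model.transformer.h[LAYER_IDX].mlp.register_forward_hook(
        lambda m, i, o: grabbed.update(act=o.detach())
    )
    tuned_model.eval()   # disable dropout for a deterministic read
    with torch.no_grad():
        tuned_model(**tokenizer(sentence, return_tensors="pt"))
    handle.remove()
    return abs(grabbed["act"][0, -1, NEURON_IDX].item() - base_act)
```

Masking the prompt tokens with -100 means the loss depends only on the reported digit, so the finetuning pressure is exactly "report the base model's activation" and nothing else; the drift check then tells us whether the finetuned copy's own copy of the neuron moved anyway.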
Why I think this is interesting:
I often round off alignment to ‘build a model that tells us everything it “knows”’, so it's useful to pin down the pragmatic limits on this. In particular, it's useful for current alignment research to be able to figure out what our models “know” or don't “know”, and this experiment helps with that: it gives us more information about when ‘we tried finetuning the model to tell us X but it didn’t work’ means ‘the model doesn’t know X’, versus when the model may have a neuron that fires for X but is unable to report it in text.
> We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output.
Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)
You mean a fixed point where the model changes its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping the base model fixed.