Related to call for research on evaluating alignment
Here's an experiment I'd love to see someone run (credit to Jeff Wu for the idea, and William Saunders for feedback):
Finetune a language model to report the activation of a particular neuron in text form.
E.g., you feed the model a random sentence that ends in a full stop. Then the model should output a number from 1-10 that reflects a particular neuron's activation.
We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output. However, at lower layers it should be able to do this correctly, with some amount of finetuning.
How many layers do you have to go down before the model succeeds? How does this scale with (a) model size and (b) amount of training?
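To make the setup concrete, here is a minimal sketch (not from the original proposal) of how one might read off a single coordinate of a chosen layer's hidden state and bucket it into a 1-10 label to use as a finetuning target. The model name, layer index, neuron index, and decile bucketing are all illustrative assumptions; reading a true MLP neuron would need a forward hook rather than a hidden-state read.

```python
# Sketch: turn a neuron's activation into a 1-10 text label for finetuning.
# MODEL_NAME, LAYER, NEURON and the decile bucketing are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in for the base model under study
LAYER = 6             # which layer's hidden state to read
NEURON = 300          # which coordinate of that hidden state to report

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
base_model.eval()

def neuron_activation(sentence: str) -> float:
    """Activation of the chosen neuron at the final token (the full stop)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = base_model(**inputs)
    # hidden_states[0] is the embedding output; hidden_states[LAYER] is the output of layer LAYER
    return out.hidden_states[LAYER][0, -1, NEURON].item()

def make_label(activation: float, bucket_edges: torch.Tensor) -> str:
    """Map a raw activation to a '1'..'10' string using precomputed decile edges."""
    return str(int(torch.bucketize(torch.tensor(activation), bucket_edges)) + 1)

# Calibrate decile edges on a sample of sentences (in practice, many thousands)
# so that the ten labels are roughly balanced.
calibration_sentences = ["The cat sat on the mat.", "Prices rose sharply last quarter."]
acts = torch.tensor([neuron_activation(s) for s in calibration_sentences])
edges = torch.quantile(acts, torch.linspace(0.1, 0.9, 9))

# One finetuning example: input is the sentence, target is the bucketed activation.
example = {"prompt": "The cat sat on the mat.",
           "completion": make_label(neuron_activation("The cat sat on the mat."), edges)}
```

With decile-balanced labels, a model that has no access to the neuron should sit near 10% accuracy, which gives a clean chance baseline for the layer-by-layer and scale comparisons.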
One subtlety is that finetuning might end up changing that neuron’s activation. To avoid this, we could do something like:
- Run the base model on the sentence
- Train the finetuned model to report the activation of the neuron in the base model, given the sentence
- Note whether the activation in the finetuned model differs from the base model's (see the code sketch below)
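Here's a hedged sketch of that check, reusing the constants and helpers from the snippet above (all names illustrative): the finetuning target always comes from the frozen base model, and we separately measure how far the finetuned model's own copy of the neuron has drifted.

```python
# Sketch: targets come from the frozen base model; drift is measured afterwards.
# Assumes tokenizer, base_model, LAYER, NEURON, neuron_activation, make_label, edges
# from the previous snippet; `finetuned_model` is the model after finetuning on the task.
import torch

def reporting_target(sentence: str) -> str:
    """Label the finetuned model is trained to emit: the *base* model's activation."""
    return make_label(neuron_activation(sentence), edges)

def activation_drift(sentence: str, finetuned_model) -> float:
    """How much finetuning moved the monitored neuron, relative to the base model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        base_act = base_model(**inputs).hidden_states[LAYER][0, -1, NEURON].item()
        ft_out = finetuned_model(**inputs, output_hidden_states=True)
        ft_act = ft_out.hidden_states[LAYER][0, -1, NEURON].item()
    return ft_act - base_act
```

If the drift stays small while reporting accuracy rises, the result is easier to interpret as the model genuinely reading out the neuron rather than reshaping it.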
Why I think this is interesting:
I often round off alignment to 'build a model that tells us everything it “knows”', so it's useful to understand the pragmatic limits on this. In particular, it's useful for current alignment research to be able to figure out what our models “know” or don't “know”, and this experiment helps with that: it gives us more information about when ‘we tried finetuning the model to tell us X but it didn’t work’ means ‘the model doesn’t know X’, versus when the model may have a neuron that fires for X but is unable to report it in text.
> We assume the model will not be able to report the activation of a neuron in the final layer, even in the limit of training on this task, because it doesn't have any computation left to turn the activation into a text output.
Surely there exist correct fixed points, though? (Although probably not that useful, even if feasible)
You mean a fixed point of the model changing its activations as well as what it reports? I was thinking we could rule out the model changing the activations themselves by keeping a fixed base model.
FYI: METR is actively fundraising!
METR is a non-profit research organization. We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted payment from frontier AI labs for running evaluations.[1]
Part of METR's role is to independently assess the arguments that frontier AI labs put forward about the safety of their models. These arguments are becoming increasingly complex and dependent on nuances of how models are trained and how mitigations were developed.
For this reason, it's important that METR has its finger on the pulse of frontier AI safety research. This means hiring and paying staff who might otherwise work at frontier AI labs, which requires us to compete with labs directly for talent.
The central constraint on publishing more and better research, and on scaling up our work monitoring the AI industry for catastrophic risk, is growing our team with excellent new researchers and engineers.
And our recruiting is, to some degree, constrained by our fundraising - especially given the skyrocketing comp that AI companies are offering.
To donate to METR, click here: https://metr.org/donate
If you’d like to discuss giving with us first, or receive more information about our work for the purpose of informing a donation, reach out to giving@metr.org.
[1] However, we are definitely not immune from conflicting incentives. Some examples:
- We are open to taking donations from individual lab employees (subject to some constraints, e.g. excluding senior decision-makers and keeping such donations to <50% of our funding)
- Labs provide us with free model access for conducting our evaluations, and several labs also provide us with ongoing free access for research even when we're not conducting a specific evaluation.
Can you say anything about what METR's annual budget/runway is? Given that you raised $17mn a year ago, I would have expected METR to be well funded.
Budget: We currently run at ~$13m p.a. (~$15m for the next year under modest growth assumptions, quite plausibly $17m++ given the increasingly insane ML job market).
Audacious funding: This ended up being a bit under $16m, and is a commitment across 3 years.
Runway: Depending on spend/growth assumptions, we have between 12 and 16 months of runway. We want to grow at the higher rate, but we might end up bottlenecked on senior hiring. (But that’s potentially a problem you can spend money to solve - and it also helps to be able to say "we have funding security and we have budget for you to build out a new team").
More context on our thinking: The audacious funding was a one-off, and we need to make sure we have a sustainable funding model. My sense is that for “normal” nonprofits, raising >$10m/yr is considered a big lift that would involve multiple FT fundraisers and a large fraction of org leadership’s time, and even then might not succeed. We have the hypothesis that the AI safety ecosystem can support this level of funding (and more specifically, that funding availability will scale up in parallel with the growth of the AI sector in general), but we want to get some evidence that that’s right and build up a reasonable runway before we bet too aggressively on it. Our fundraising goal for the end of 2025 is to raise $10m.