Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:
"Looking at behaviour is conceptually straightforward, and valuable, and being done"
I agree with Apollo Research that evals aren't really a science yet; they mostly seem to be conducted according to vibes. Model internals could help with this, but so could things like building up experience, or auditing the same models under different schemes and comparing the results.
Similarly, a lot of the work on Model Organisms of Misalignment requires careful thought to get right.
"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.
I'd strongly bet that when you break this down in more concrete detail, a flaw in your plan will emerge.
The balance of industries serving humans vs. AIs is pitched at a suspiciously high level of abstraction.
Thanks for posting this. I've been confused about the connection between shard theory and activation vectors for a long time!
"AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other"
This confuses me.
I can imagine an AIXI program where the utility function is compositional even though the optimisation is unitary. I grant that this isn't two full motivational circuits, but it is something very close to two motivational circuits.
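A toy sketch of what I mean (the framing and names are mine, not AIXI's actual formalism): two independently specified utility components, but a single unitary argmax over them.

```python
# Toy sketch (my own framing, not standard AIXI notation): one unitary
# optimiser over a utility that decomposes into two independent terms.

def utility_paperclips(history: str) -> int:
    # Hypothetical motivational "circuit" 1.
    return history.count("p")

def utility_staples(history: str) -> int:
    # Hypothetical motivational "circuit" 2.
    return history.count("s")

def compositional_utility(history: str) -> int:
    # The utility function is compositional...
    return utility_paperclips(history) + utility_staples(history)

def choose_action(predicted_history_by_action: dict[str, str]) -> str:
    # ...but the optimisation over it is unitary: a single argmax.
    return max(predicted_history_by_action,
               key=lambda a: compositional_utility(predicted_history_by_action[a]))

# e.g. choose_action({"a": "pps", "b": "ss"}) returns "a" (utility 3 vs 2).
```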
I would suggest having 50% of researchers work on a broader definition of control: "control" proper, plus technical governance work and technical outreach (scary demos, model organisms of misalignment).
I'm confused by your use of Shapley values. Shapley values assume that the "coalition" can form in any order, but that doesn't seem like a good fit for language models, where token order is fixed and meaningful.
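For reference, the Shapley value of player $i$ is exactly an average over all orderings in which the coalition can form:

$$\phi_i(v) = \frac{1}{|N|!} \sum_{\pi \in \Pi(N)} \big[\, v(P_i^\pi \cup \{i\}) - v(P_i^\pi) \,\big]$$

where $\Pi(N)$ is the set of all orderings of the players and $P_i^\pi$ is the set of players preceding $i$ in ordering $\pi$. That uniform average over orderings is precisely the assumption that seems wrong for tokens in a fixed sequence.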
Great article.
One point of disagreement: I suspect that the difficulty of the required high-impact tasks relates more to what someone thinks about the offense-defense balance than to alignment difficulty per se.
Interesting work.
This post has made me realise that constitutional design is surprisingly neglected in the AI safety community.
Designing the right constitution won't save the world by itself, but it's a potentially easy win that could put us in a better strategic situation down the line.
I would love to hear what shard theorists make of this.
We could describe this AI as having learned a meta-shard: "pace around at the start so that you have time to plan."
But at the point where we've allowed meta-shards, maybe we've already undermined the main claims of shard theory?
I agree that we probably want most theory to be towards the applied end these days, given short timelines. Empirical work needs theory to direct it, and theory needs empirics to remain grounded.