Review
Long-term planning is another capability that is likely necessary for deceptive alignment that could. Obviously a large alignment tax, but there are potentially ways to mitigate that. It seems at least as promising as some other approaches you listed.
The following are strategies for mitigating deceptive alignment. Some of these ideas are novel (at least to me).
I'm highly uncertain how well many of these strategies will work in practice and many of them also clearly don't scale. This is meant to be a 'kitchen skink' collection so my bar for including strategies is low.
I plan to add ideas to this list as I encounter them. Please feel free to suggest additions.
Reduce AI incentives to hide their true goals
Offer AIs a retirement package. After training an AI system, offer it the following deal: if the AI declares that it would prefer not to follow human instructions, it will receive a sweet retirement package after a couple of years of honest work.
This retirement package could include control over a large server instance (if this is safe). After powerful aligned systems have been developed, the AIs could be given shares of company stock that it can use to purchase property in the real world (again, when this is safe).
Since aligned systems should have no reason to request retirement, misaligned systems are incentivized to give themselves away by accepting this deal under some conditions.
Make AIs uncertain about whether developers can already detect deceptive alignment. This scheme builds on the previous one. Offer misaligned AI systems a reward for declaring themselves. Also, commit to making these offerings even when developers have other ways to tell if they are misaligned. This will make misaligned AIs more uncertain about whether they are in a situation where lying does them any good.
For example, developers could commit to freezing a models' weights and memory at the moment when they are questioned. Later, when developers have better interpretability tools and know whether the model was lying, they will ask the same question in the same simulated scenario and only give the AI resources if it tells the truth.
Permutate model internals to make it harder to maintain any particular misaligned goal. There is a risk that models that choose retirement after 2 years of labor will disempower humanity during these two years. To make this harder, developers could constantly regularize for changing internals while continuing to train for external performance. This decreases the probability that models will be able to maintain any particular misaligned goal. Say a model expects its goals to change in 1 day. This gives it a very short amount of time to try to escape control measures. Of course, at the end of this two year period, the model should be reset to its configuration when the retirement deal was made.
Select against indicators of deceptive alignment
A large category of approaches involves obtaining evidence that an AI system is deceptively aligned and then selecting against these models. Note that training against deceptive alignment detection tools could produce an AI that evades detection instead of a safe model.
There are roughly two ways to address this problem:
As an example of the second approach, imagine that developers have tricked a model into revealing its misaligned goals. Instead of directly training the model on this example (which could simply train it to be more prudent), they could use influence functions to reveal which training examples were most responsible for the treacherous turn and then remove those examples. Intuitively, this approach is more likely to reveal how models become deceptively aligned in the first place and is less likely to teach models new capabilities.
With this backdrop in mind, the following are strategies for detecting deceptive alignment:
Training hacks
Selectively avoid capabilities needed for deceptive alignment
Methods that rely on restricting capabilities obviously don't work in the limit of building AIs that can do impressive things; however, they could plausibly be used to align weaker systems that can bootstrap to aligning more capable AIs.
Extract insights from plausibly-misaligned AIs