x
Principled Interpretability of Reward Hacking in Closed Frontier Models — AI Alignment Forum