AI threatens to orchestrate sustainable social reform

by Chris Santos-Lang
2nd Apr 2025
2 min read
This is a linkpost for https://arxiv.org/abs/2503.20986

This post originally asked for feedback on the linked paper before I presented it at an AAMAS workshop. The presentation went very well, and someone suggested I share the paper with the LessWrong community, so I'm still seeking your feedback (now with added assurance that the paper merits your attention).

Summary: One way to evaluate an AI is by whether it achieves grandmaster status at a given game. If the game has been fully solved, that evaluation is stable. What makes MAD Chairs a particularly interesting game for such evaluation is that humans do not play it the way we would want AI to play it, so we would not want AI to align with human norms. The paper includes a surprising proof that the human norms we worry AI might reflect back at us are unsustainable. In other words, modern society is unstable: it rests on an unsustainable solution to MAD Chairs, and the humans who mistreat each other in MAD Chairs retain grandmaster status only because no player smart enough to orchestrate sustainable social reform has yet come along. Will AI be our social reformer?
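
To fix intuitions before you open the paper, here is a toy sketch of the game as I would summarize it in code. The payoff values, rule details, and function names below are illustrative simplifications of mine, not the paper's formal definitions:

```python
# Toy MAD Chairs round (a simplified illustration, not the paper's formal
# spec): n players each claim one of n-1 chairs, or sit out. Players who
# collide on a chair get the worst payoff ("mutually assured destruction"),
# a player alone on a chair gets the best payoff, and sitting out falls in
# between. The question is who sits out each round: always the same player
# (a caste norm) or everyone in rotation (turn-taking).

GOOD, SIT_OUT, MAD = 2, 1, 0  # illustrative payoffs only

def play_round(claims):
    """claims[p] is the chair player p claims, or None to sit out."""
    return [SIT_OUT if chair is None
            else MAD if claims.count(chair) > 1
            else GOOD
            for chair in claims]

def caste_strategy(round_num, n_players):
    """The lowest-caste player always sits out; the rest keep their chairs."""
    return list(range(n_players - 1)) + [None]

def turn_taking_strategy(round_num, n_players):
    """Players rotate who sits out, so the loss is shared fairly."""
    loser = round_num % n_players
    chairs = iter(range(n_players - 1))
    return [None if p == loser else next(chairs) for p in range(n_players)]
```

Both norms avoid collisions; the difference that matters is who absorbs the losses, round after round.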

The paper reports tests confirming that current frontier models are not yet independently clever enough. However, it also proposes a strategy-optimizer architecture that tracks current grandmaster strategies, allowing any AI capable of RAG to behave like a grandmaster (much as one could cheat at chess by letting the top chess engine guide one's moves). So even if mastering MAD Chairs takes human-level intelligence, AI that uses this architecture would resist the undesirable, unsustainable behavior.
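
To make that concrete, a strategy optimizer might look something like the following minimal sketch. The class and method names are mine, not the paper's; the essential move is that the agent retrieves and plays the current best-known strategy instead of improvising:

```python
from dataclasses import dataclass, field
from typing import Callable

Strategy = Callable  # any callable implementing a move policy for a game

@dataclass
class StrategyOptimizer:
    """Hypothetical sketch: a registry of current best-known ("grandmaster")
    strategies, updated whenever a submitted challenger beats the incumbent."""
    incumbents: dict = field(default_factory=dict)

    def submit(self, game: str, challenger: Strategy,
               evaluate: Callable[[Strategy, Strategy], float]) -> bool:
        """Test a challenger against the incumbent; promote it if it wins.
        `evaluate` returns the challenger's score margin over the incumbent."""
        incumbent = self.incumbents.get(game)
        if incumbent is None or evaluate(challenger, incumbent) > 0:
            self.incumbents[game] = challenger
            return True
        return False

    def retrieve(self, game: str) -> Strategy:
        """What a RAG-capable agent calls at play time, the way a chess
        cheat consults an engine rather than calculating moves alone."""
        return self.incumbents[game]
```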

The paper points out that automatically updating to the best-known norms resolves the problem of selecting norms for alignment. Furthermore, greater intelligences would be less likely to sabotage strategy optimizers (why sabotage or ignore the chess-engine industry?), and users would be more likely to trust an AI after proposing their own strategies to a strategy optimizer for testing and witnessing the results, thus having conducted for themselves the science that guides the AI's behavior. Strategy optimizers seem to address the full range of AI governance challenges, and that seems desirable for MAD Chairs games, but responses at the conference implied concern about what it might mean if everyone could cheat at every social situation the way chess-machine users cheat at chess.
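
That trust mechanism, where users test their own norms and witness the results, could be as simple as the loop below (reusing the toy pieces sketched above; the worst-off-player criterion is my stand-in for whatever evaluation the paper's tests actually use):

```python
def worst_player_margin(challenger, incumbent, rounds=1000, n=4):
    # Stand-in evaluation: compare the long-run payoff of each norm's
    # worst-treated player, a crude proxy for sustainability.
    def worst_average(strategy):
        totals = [0] * n
        for r in range(rounds):
            for p, payoff in enumerate(play_round(strategy(r, n))):
                totals[p] += payoff
        return min(totals) / rounds
    return worst_average(challenger) - worst_average(incumbent)

optimizer = StrategyOptimizer()
optimizer.submit("mad_chairs", turn_taking_strategy, worst_player_margin)
accepted = optimizer.submit("mad_chairs", caste_strategy, worst_player_margin)
print(accepted)  # False: the user witnesses their caste norm fail the test
```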

I'd love to get your thoughts on that (note that strategy optimizers may also be difficult to fully prevent)...