Many thanks to Bogdan-Ionut Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio.
Summary
In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others, while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans, and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue...
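To make the stated objective concrete, here is a minimal, hypothetical sketch in PyTorch of how an overlap term could be combined with a task loss. The function names, the choice of layer, and the weighting are my own illustrative assumptions, not the implementation from the post.

```python
# Hypothetical sketch only: pull a model's internal activations on a "self" prompt
# toward its activations on a matched "other" prompt, while a task loss preserves
# performance. Names, layer choice, and the weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_other_overlap(hidden_self: torch.Tensor, hidden_other: torch.Tensor) -> torch.Tensor:
    # hidden_*: (batch, d_model) activations at some chosen layer for
    # self-referencing vs. other-referencing inputs.
    return F.mse_loss(hidden_self, hidden_other)

def combined_loss(task_loss: torch.Tensor,
                  hidden_self: torch.Tensor,
                  hidden_other: torch.Tensor,
                  overlap_weight: float = 0.1) -> torch.Tensor:
    # overlap_weight trades off representational overlap against task performance;
    # the value here is arbitrary.
    return task_loss + overlap_weight * self_other_overlap(hidden_self, hidden_other)
```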
I continue to be excited about this class of approaches. To explain why is roughly to give an argument for why I think self-other overlap is relevant to normative reasoning, so I will sketch that argument here:
Just an interesting philosophical argument
Why can an ML model learn from part of a distribution or data set, and generalize to the rest of it? Why can I learn some useful heuristics or principles in a particular context, and later apply them in other areas of my life?
The answer is obvious: because there are some underlying regularities between the parts I train on and the ones I test on. In the ML example, generalization won't work when approximating a function which is a completely random jumble of points.
Also, quantitatively, the more regular the function is, the better generalization works. For example, polynomials of lower degree require fewer data points to pin down; the same goes for periodic functions. Similarly, a function with a lower Lipschitz constant will...
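As a concrete instance of the "fewer points pin down more regular functions" claim, here is a small sketch (my example, not from the comment): a degree-d polynomial has d + 1 coefficients, so d + 1 distinct samples determine it exactly, while fewer samples leave it underdetermined.

```python
# Minimal illustration (assumed example): a degree-2 polynomial is pinned down
# exactly by 3 distinct sample points, since it has 3 unknown coefficients.
import numpy as np

true_coeffs = [2.0, -3.0, 1.0]            # 2x^2 - 3x + 1 (hypothetical target function)
xs = np.array([0.0, 1.0, 2.0])            # 3 distinct points suffice for degree 2
ys = np.polyval(true_coeffs, xs)

recovered = np.polyfit(xs, ys, deg=2)     # solves the exactly determined system
assert np.allclose(recovered, true_coeffs)

# With only 2 of these points, infinitely many quadratics fit the data,
# so behaviour on unseen x is not determined by the data alone: generalization fails.
```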
This post still stands out to me as making an important and straightforward point about the observer dependence of knowledge that is still, in my view, underappreciated (enough so that I wrote a book about it and related epistemological ideas!). I continue to think this is quite important for understanding AI, and in particular for addressing interpretability concerns as they relate to safety: lacking a general theory of why and how generalization happens, we may risk mistakes in building aligned AIs if they categorize the world in unusual ways that we don't anticipate or understand.
What problem is Thought Anchors solving for you (or future users)? I feel like I don't quite understand.
Looks like it once again isn't specified in the Opus 4.5 system card, despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
Thank you to Neel for writing this. Most people pivot quietly.
I've been mostly skeptical of mechanistic interpretability for years. I excluded interpretability from Unsolved Problems in ML Safety for this reason. Other fields like d/acc (Systemic Safety) were included, though, all the way back in 2021.
Here are some earlier criticisms: https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Transparency
More recent commentary: https://ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability
I think the communi...
A hand-drawn presentation on the idea of an 'Untrollable Mathematician' - a mathematical agent that can't be manipulated into believing false things.
I agree it's kind of difficult.
Have you seen Nicholas Carlini's Game of Life series? It starts by building up logic gates and works up to a microprocessor that factors 15 into 3 x 5.
Depending on the adversarial robustness model (e.g., every second the adversary can make one square behave the opposite of lawfully), it might be possible to make robust logic gates and circuits. In fact, the existing circuits are already a little robust...
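To make that threat model concrete, here is a small sketch (mine, not from Carlini's series or the comment above): a standard Game of Life update on a wrap-around grid, plus an adversary that can force one chosen cell to take the opposite of its lawful next value each generation.

```python
# Assumed illustration of the threat model: a standard Game of Life step, plus an
# adversary that flips one chosen cell away from its lawful next value each generation.
import numpy as np

def life_step(grid):
    # grid: 2D array of 0/1; toroidal (wrap-around) neighborhood for simplicity.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbors,
    # or if it is alive now and has exactly 2 live neighbors.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

def adversarial_step(grid, flip_cell=None):
    # flip_cell: (row, col) the adversary corrupts this generation, or None.
    nxt = life_step(grid)
    if flip_cell is not None:
        r, c = flip_cell
        nxt[r, c] = 1 - nxt[r, c]   # cell behaves "the opposite of lawfully"
    return nxt
```

A circuit built from gliders and still lifes is "robust" in this model if its logical outputs are unchanged no matter which single cell the adversary corrupts each step.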