Recently, the focus of mechanistic interpretability work has shifted to thinking about "representations", rather than strictly about entire algorithms
Recently? From what I can tell, this seems to have been a focus from the early days (1, 2).
That said, great post! I really appreciated your conceptual frames.
Collapsible boxes are amazing. You should consider using them in your posts.
They're a particularly nice way of providing a skippable aside: for example, filling in background information, answering an FAQ, or including evidence to support an assertion.
Compared to footnotes, collapsible boxes are more prominent and better suited to containing paragraphs or formatted text.
Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.
One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.
You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control".
I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table.
One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour: human-AI cybernetic systems, or centaur labour. I think this is likely to widen the sweet spot, though we have to make sure we do it in a way that differentially benefits safety.
You do discuss the possibility of using AI to unlock enhanced human labour. It would also be possible to classify such centaur systems under this designation.
More broadly, I think there's merit to the cyborgism approach even if some of the arguments are less compelling in light of recent advances in capabilities.
Lots of interesting ideas here, but the connection to alignment still seems a bit vague.
Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI is extremely sensitive to context, just in the service of its own goals.
Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want the AI to hack a computer to turn the entire screen green and it skips a pixel so as to avoid completing the task, it would still have demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see whether it has a capability that you wish it to use, it can still sandbag.
First of all, it tackles one of the main core difficulties of AI safety in a fairly direct way — namely, the difficulty of how to specify what we want AI systems to do (aka “outer alignment”)
I wouldn't quite go so far as to say it "tackles" the problem of outer alignment, but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I'm misunderstanding you?
I really liked the analogy of taking actions, falling asleep, then waking up (possibly with some modifications) and continuing.
I was already aware of your main point, but the way you've described it is a much clearer way of thinking about this.