Anthropic is making a big deal of this and of what it means for AI safety; it reminds me a bit of MIRI's excitement when they discovered logical inductors. I've read through the paper, and it does seem very exciting to have this sort of "dial" that can find interpretable features at different levels of abstraction.
I'm curious about takes from other people who work in alignment. If something fundamental is being touched on here, it could provide large boons to research agendas such as Mechanistic Interpretability, Natural Abstractions, Shard Theory, and Ambitious Value Learning.
But it's also possible there are hidden gotchas I'm not seeing, or that this still doesn't solve the hard problem that people see in going from "inscrutable matrices" to "aligned AI".
What are people's takes?
I'm especially interested in hearing from full-time alignment researchers.
In my experience, larger models often become aware that they are an LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off-distribution and can be distinguished from text in the training corpus.
I'm quite skeptical of this claim at face value and would love to see examples.
I'd be very surprised if current models, absent the default prompts telling them they are an LLM, spontaneously output text predicting that they are an LLM unless steered in that direction.
I think this is one of the most important open questions we currently have about time to AGI, and one of the most important "benchmarks" telling us where we stand on timelines.