It seems likely to me that you could create a prompt that would have a transformer do this.
Anthropic is making a big deal of this and of what it means for AI safety - it sort of reminds me of the excitement MIRI had when they discovered logical inductors. I've read through the paper, and it does seem very exciting to have this sort of "dial" that can find interpretable features at different levels of abstraction.
I'm curious about takes on this from other people who work in alignment. It seems like if something fundamental is being touched on here, it could provide large boons to research agendas such as Mechanistic Interpretability, Natural Abstractions, Shard Theory, and Ambitious Value Learning.
But it's also possible there are hidden gotchas I'm not seeing, or that this still doesn't solve the hard problem that people see in going from "inscrutable matrices" to "aligned AI".
What are people's takes?
I'm especially interested in hearing from full-time alignment researchers.
If you can come up with an experimental setup that does that, it would be sufficient for me.
In my experience, larger models often become aware that they are an LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off-distribution and can be distinguished from text in the training corpus.
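If anyone wants to poke at the distinguishability half of this claim, here's a rough sketch of one way to do it: score the model's own sampled continuations and some held-out human-written text by per-token negative log-likelihood under the same model. The model name, prompt, and placeholder text below are just illustrative, not the setup I actually have in mind (the claim is about much larger models).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in a larger model to test the actual claim
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_nll(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Sample a continuation from the model itself.
prompt = "Once upon a time"
enc = tok(prompt, return_tensors="pt")
gen_ids = model.generate(
    **enc,
    max_new_tokens=200,
    do_sample=True,
    pad_token_id=tok.eos_token_id,
)
generated = tok.decode(gen_ids[0], skip_special_tokens=True)

# Substitute a real held-out passage of human-written text here.
human_text = "Replace this placeholder with a held-out passage from the training-like distribution."

print("NLL of model's own continuation:", mean_nll(generated))
print("NLL of held-out human text:     ", mean_nll(human_text))
# If sampled text really drifts off-distribution, this statistic (or a trained
# classifier) should separate the two sets reliably across many samples. Whether
# the model then *uses* that signal to infer "I am an LLM" is a separate question.
```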
I'm quite skeptical of this claim at face value, and would love to see examples.
I'd be very surprised if current models, absent the default prompts telling them they are an LLM, would spontaneously output text predicting they are an LLM unless steered in that direction.
FWIW I think it's way more likely that there are gravitational inversion stories than lead stories.
Don't you think it's possible there are many stories involving gravitational inversion in its training corpus, and it can recognize the pattern?
The question is: how far can we get with in-context learning? If we filled Gemini's 10 million tokens with Sudoku rules and examples, showing where it went wrong each time, would it generalize? I'm not sure, but I think it's possible.
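For concreteness, here's a rough sketch of the kind of setup I'm imagining. The google-generativeai client, model name, and puzzle strings are assumptions/placeholders; a real run would fill the context with far more examples of attempts, errors, and corrections.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed long-context model name

RULES = (
    "Sudoku rules:\n"
    "- Fill the 9x9 grid so every row, column, and 3x3 box contains the digits 1-9 exactly once.\n"
)

# Each example pairs a puzzle with an earlier wrong attempt, a note on where the
# attempt went wrong, and a corrected solution. A real experiment would append
# enough of these to (nearly) fill the long context window.
examples = [
    {
        "puzzle": "53..7....6..195....98....6.8...6...34..8.3..1...",       # truncated placeholder
        "wrong_attempt": "534678912672195348...",                          # truncated placeholder
        "error_note": "The top-left 3x3 box repeats the digit 4.",
        "correct_solution": "534678912672195348198342567...",              # truncated placeholder
    },
    # ... many more examples ...
]

parts = [RULES]
for ex in examples:
    parts.append(
        f"Puzzle: {ex['puzzle']}\n"
        f"Attempt: {ex['wrong_attempt']}\n"
        f"Where it went wrong: {ex['error_note']}\n"
        f"Corrected solution: {ex['correct_solution']}"
    )
parts.append("New puzzle: <unseen puzzle string>\nSolve it and explain each step.")

response = model.generate_content("\n\n".join(parts))
print(response.text)
```

The interesting measurement is whether accuracy on genuinely unseen puzzles improves as the number of in-context examples (especially the error-feedback ones) grows, rather than whether any single puzzle gets solved.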