What you are doing is training the AI to have an accurate model of itself, associated with language like "I" and "you". You can use your brain to figure out what will happen if you ask "are you conscious?" without having previously trained it toward any position on similarly nebulous questions. Training text was written overwhelmingly by conscious things, so maybe it says yes because that's so favored by the training distribution. Or maybe you trained it to answer "you" questions as being about nonfiction computer hardware, and it makes the association that nonfiction computer hardware is rarely conscious.
Basically, I don't think you can start out confused about consciousness and cheat by "just asking it." You'll still be confused about consciousness and the answer won't be useful.
I'm worried this is going to lead, either directly or indirectly, to training foundation models to have situational awareness, which we shouldn't be doing.
And perhaps you should be worried that having an accurate model of oneself, associated with language like "I" and "you", is in fact one of the ingredients in human consciousness, and maybe we shouldn't be making AIs more conscious.
TLDR: In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness.
We think it's possible to start preliminary experiments testing for moral status in language models now, so if you're interested in working with us, please reach out and/or apply to the Astra fellowship or SERI MATS (deadline November 17).
Tweet thread paper summary: Link
Abstract:
See also this earlier post for more discussion on the relevance to alignment (AI systems that are suffering might be more likely to take catastrophic actions), as well as some initial criticisms of a preliminary version of our proposed test in the comments. We've added a significant amount of content to our updated proposal to address several reservations people had about why the initial proposal might not work, so we're excited to get additional feedback and criticism on the latest version as well.