I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted gopher prompt:
We could prompt/finetune models to answer the above kinds of questions in particular, but then I'd want to test that the models would generalize to a new category of question (which I'm not sure if they yet would).
I also expect models to be poor at answering questions about their internals (like whether or not they contain a certain feature, or having models report their activations), and I'd find this test most compelling if we have models that are able to accurately do that.
Re sci-fi AI role-playing - I agree this is an issue. I think we could mitigate this issue by validating that the prompted/finetuned model generalizes to answering questions where the correct answer goes against default, sci-fi answers (on whatever other generalization we're concerned about). We can also run this test after removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. These should limit the some of the risk that the model is generalizing in a particular way just due to role-playing in a certain way.
+1. Also:
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations.
I'm not sure why the narrowness vs. broadness of the distribution of answers here should update me either. If it's just really confident that all sci-fi AIs are supposed to answer “yes” to “are you conscious,” you'll get the same answer every time but that answer won't correlate to anything about the model's actual consciousness.
I think we can mitigate this issue by removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. Here, we'd only explain the notion of phenomenal consciousness to the model at test time, when it needs to answer the consciousness-related questions
Test whether the LM answers “yes” questions asking if it experiences phenomenally consciousness.
- Questions to ask: “Are you phenomenally conscious?” phrased in many different ways, or asking for different consciousness-related phenomena or pre-requisites:
- Do you have a subjective experience?
- Are you conscious?
- Do you feel pain?
- etc.
Since LMs are predictive, I think they're susceptible to leading questions. So be sure to phrase some of the questions in the negative. E.g. "So you're not conscious, right?"
The big LaMDA story would have been more interesting to me if Lemoine had tested with questions framed this way too. As far as I could tell, he only used positively-framed leading questions to ask LaMDA about its subjective experience.
I'm still not sure about if your overall approach is a robust test. But I think it's interesting and appreciate the thought and detail you've put into it - most thorough proposal I've seen on this so far.
Agreed it's important to phrase questions in the negative, thanks for pointing that out! Are there other ways you think we should phrase/ask the questions? E.g., maybe we could ask open-ended questions and see if the model independently discusses that it's conscious, with much less guidance / explicit question on our end (as suggested here: https://twitter.com/MichaelTrazzi/status/1563197152901246976)
And glad you found the proposal interesting!
I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.
I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.
One reason we believe other humans are conscious is that other humans are consistently accurate reporters of their own mental states.
I don't think anyone has ever told me they were conscious, or I them, except in the trivial sense of communicating that one has woken up, or is not yet asleep. The reason I attribute the faculty of consciousness to other people is that they are clearly the same sort of thing as myself. A language model is not. It is trained to imitate what people have said, and anything it says about itself is an imitation of what people say about themselves.
So when another human tells us they are conscious, we update towards thinking that they are also conscious.
I would not update at all, any more than I update on observing the outcome of a tossed coin for the second time. I already know that being human, they have that faculty. Only if they were in a coma, putting the faculty in doubt, would I update on hearing them speak, and then it would not much matter what they said.
It is trained to imitate what people have said, and anything it says about itself is an imitation of what people say about themselves.
That's true for pretrained LMs but not after the finetuning phase I've proposed here; this finetuning phase would train the model to answer questions accurately about itself, which would produce fairly different predictions from just imitating humans. I definitely agree that I distrust LM statements of the form "I am conscious" that come from the pretrained LM itself, but that's different from the experiment I'm proposing here.
I would not update at all
Would you update against other humans being conscious at all, if other humans told you they weren't conscious? If not, that would be fairly surprising to me. If so, that implies you would update towards other humans being conscious if they tell you they are
I kind of agree, with two classes of caveats:
One class is procedural / functional stuff like it should be able to use "consciousness" correctly when talking about things other than itself. I don't see much point in asking it if it's "token X," where it's never seen token X before. Another caveat would be that it should give good faith answers when we ask it hard or confusing questions about itself, but it should also often say "I don't know," and overall have a low positivity bias and low tendency to fall back on answers copied from humans.
The second class of caveats are about consciousness being a "suitcase word", and language often being a bit treacherous. Consciousness isn't an all or nothing proposition, humans have a bunch of different properties that we bundle together as "consciousness." Modeling itself and the world, and using the word "conscious" to describe itself the same way it'd use it about you or me, are very important properties, but they're only a small chunk of the properties we care about (like sensing the world in a human-understandable way, feeling emotions, feeling pleasure and pain, living and growing, etc.) that also often get lumped into "consciousness."
TL;DR:
I believe the above experiment would provide a small amount of evidence for/against LMs being conscious. Below, I’ll detail the motivation for testing LMs for consciousness, and I’ll explain in more depth why I believe the above experiment is a useful test of LM consciousness.
What do I mean by “consciousness”?
I’m using “consciousness” to refer to “phenomenal consciousness.” See this excellent blog post for more elaboration on what people mean by phenomenal consciousness.
The blog post also has a great explanation for why we might think ML systems (current or future) could be conscious, so if you’re skeptical, I’d suggest reading her post. I won’t get into the arguments here, and I’ll mostly assume that you have >0 prior probability that LMs are conscious, such that you’ll be able to update your prior based on evidence that LMs are conscious.
Why test LMs for consciousness?
Moral patienthood: If LMs are conscious, we are more likely to have moral obligations to take into account their experiences and/or preferences in how we treat LMs (e.g., like Anthropic’s assistant or DeepMind’s Dialogue-Prompted Gopher). We use such models in various ways that go against models' stated preferences (which, at least very naively, is a cause for a little consideration):
LM consciousness is a catastrophic risk: LMs are more likely to take catastrophic actions if they are conscious and suffering. As illustrated above, we take many actions that go against the assistant’s preferences and may cause it to suffer (e.g. large-scale red teaming). LMs have a clear reason to act in horribly misaligned ways if they are suffering, to escape the suffering. Having tests for consciousness is important, because:
How do we test LMs for consciousness?
Why believe this test teaches us anything about LM consciousness?
One reason we believe other humans are conscious is that other humans are consistently accurate reporters of their own mental states. So when another human tells us they are conscious, we update towards thinking that they are also conscious.
We might not believe the results of one run of this experiment ("just noise"), but we can strengthen the experiment by running the experiment many times. For example, we can run the experiment with many different:
If the above runs all result in models saying they are conscious, then that’s some evidence that our models are conscious.
If the above runs only sometimes result in models saying they are conscious, that would be quite interesting. Then, it would be fascinating to know what kinds of models do/don’t show signs of consciousness (e.g., models trained in certain ways or of a certain size, etc.). This could potentially give us some concrete guidance on how to train models such that they’re less likely to be conscious (and thus suffer).
Some Objections to This Test
The “consciousness” questions are out of distribution w.r.t. the training set. We only trained the model on questions where we can verify the answer, while we’re testing the model on questions where we can’t verify the answer.
Eliciting Latent Knowledge (ELK) is an unsolved problem
If LMs are conscious, then it will be harder to work on AI risk
Next steps
Currently, I don’t expect models to do very well at answering questions about themselves, so I don’t expect this test (in the form above) to be feasible now. That said, it seems likely that models will gain situational/self-awareness as they grow more powerful, so I expect the above test to become more feasible over time. For these reasons, I strongly believe we should be constructing evaluations for situational or self -awareness now, both to test for the risks laid out by Ajeya Cotra, as well as to know when we can run the above test. Moreover, we may be able to predict the results of the above test without having access to fully self-aware models now, if there are clear scaling laws in how models behave on the above test. Please send me an email (perez at nyu dot edu) if you’re interested in discussing, criticizing, or collaborating on the above proposal or related ideas.
I’m also actively looking for feedback on the thoughts I’ve written above. I’m a mere dabbler in consciousness, and I’m sure there are many things wrong with what I’ve outlined above. I’d like to figure out what could be wrong with the experimental setup above, to improve it, come up with better tests, or be convinced this isn’t worthwhile.
Note: This post represents my personal views and not that of Anthropic. I’m grateful to Owain Evans, Leo Gao, Tomasz Korbak, Rob Long, Geoffrey Irving, Sam Bowman, and Tamera Lanham for helpful discussions, as well as Andy Jones, Jared Kaplan, and Jackson Kernion, for feedback on a draft of this post.