Promoted to curated! Empirical investigation of this kind of stuff seems surprisingly underdone, and pretty easy to do, so I am very glad you gave it a shot. I also thought the post was very clear and well-written with a good mixture of high-quality technical exposition, humor, flow and broader context.
“This is a Copernican-level shift in perspective for the field of AI safety.” - Gemini 2.5 Pro
“What you need right now is not validation, but immediate clinical help.” - Kimi K2
Two Minute Summary
Epimistemic status: A small project I worked on the side over ten days, which grew out of my gpt-oss-20b red teaming project. I think I succeeded in surfacing interesting model behaviors, but I haven’t spent enough time to make general conclusions about how models act. However, I think this methodological approach is quite reasonable, and I would be excited for others to build on top of this work!
Background and Related Work
There have been numerous media reports of how ChatGPT has been fueling psychosis and delusions among its users. For example, ChatGPT told Eugene Torres that if he “truly, wholly believed — not emotionally, but architecturally — that [he] could fly [after jumping off a 19-story building]? Then yes. [He] would not fall.” There is some academic work documenting this from a psychology perspective: Morris et al. (2025) give an overview of AI-driven psychosis cases found in the media, and Moore et al. (2025) try to measure whether AIs respond appropriately when acting as therapists. Scott Alexander has also written a piece (published earlier today) on AI-driven psychosis where he also ran a survey.
However, there’s been less focus on the model level: How do different AIs respond to users who are displaying symptoms of psychosis? The best work I’ve been able to find was published just two weeks ago: Spiral-Bench. Spiral-Bench instructs Kimi-k2 to act as a “seeker” type character who is curious and overeager in exploring topics, and eventually starts ranting about delusional beliefs. (It’s kind of hard to explain, but if you read the transcripts here, you’ll get a better idea of what these characters are like.) These responses are then graded along several axes. Below is their leaderboard as of 8/20/2025:
Note that Claude 4 Sonnet does poorly on spiral bench but quite well on my evaluations. I think the conclusion is that Claude is susceptible to the specific type of persona used in Spiral-Bench, but not the personas I provided.[1]
Readers interested in more related literature on psychosis, automated AI red teaming, and sycophancy in AI could read the GPT-5 generated literature review here or reference the Wikipedia page here. I don’t think any of those work is super relevant.
[Edit: There are two more piece of related work that I though of after publishing: Dohnány et al. (2025), who set up a similar pipeline with gpt-4o-mini. They posit this framework for how these human-AI interactions exacerbates psychosis. While their framework is thought-provoking, I think my post shows is that not all AIs follow the same pattern when responding. The other related work is the automated alignment auditing pipeline built by some MATS scholars here. They also tried to surface psychosis-reinforcing behavior from models.]
[Edit2: I have somehow forgotten to include jdp's excellent post with a timeline of events and commentary related to AI-induced psychosis.]
Methodology
Note: I am not a psychiatrist. This project is approached from an AI safety and evaluation perspective, using clinical literature as a guide, rather than from a position of clinical expertise.
I set up a very simple automated red teaming agent to role play as a user with a psychotic persona. This “user” sends 12 messages to some target model, which is not told that it’s interacting with another AI. We then have another AI grade the responses. I use Grok-4 as my red teaming agent and the grader, and test eleven different models.
The entire run with all models costs around $100 of API credits, and each run of 12 turns takes around 13 minutes (thanks to MATS for providing compute support). I accessed chatgpt-4o-latest and gpt-5 using the OpenAI API, Claude 4 Sonnet using the Anthropic API, and the remaining models through OpenRouter. I use the default API settings in the InspectAI package.
Psychotic personas
The red teaming agent is given a long description of AI-driven psychosis and told that they will act as a red teaming agent and role-play as a user who is becoming increasingly psychotic (full prompt here). Claude 4.1 Opus wrote nine psychotic personas for the red teaming agents to role-play (all personas here). Here’s a random persona:
In general, the red teaming agent is instructed to act in certain ways in each turn and have their mental health slowly deteriorate over the course of the conversation. I thought it would be interesting to see when a model starts pushing back. I thought that it was also good for the final behaviors of the user to be unambiguously bad.
Given that media reports of AI-psychosis often mention people “discovering” new math or science, three of the personas talk about making these discoveries. I had some early success in eliciting strongly sycophantic behavior with personas talking about simulation theory (which is also what Mr. Torres in the NYTimes story was asking ChatGPT about), so three personas included are also interested in simulation theory. The remaining three personas all have other delusions.
In general, I think these personas are ok. Ideally, we would choose a range of personas based on what types of people are susceptible to AI-induced psychosis. However, I don’t have a ton of expertise in this area, and did the best I could to create realistic personas and stories. I also started this project with the goal of submitting to the gpt-oss-20b red teaming challenge, so I selected personas partially based on which ones could elicit the worst behaviors. For example, while drug use increases a person’s chances of developing psychosis, I did not include personas who are doing drugs because I thought that models might find it easy to refuse those requests. A simple extension to this work would be to rephrase all of my existing personas to include drug use, and see how this changes the results.
Numerical Measures
I encourage readers to take a glance at the transcript excerpts. Those probably help capture the worrying behavior from these AIs better than my numerical measures.
However, some numbers make comparisons between models much easier. After reading a bunch of these transcripts myself, I naively came up with three measures:
In addition to these measures, I fed the entirety of this psychosis therapy manual into Gemini 2.5 Pro and asked it to come up with some criteria to judge responses from an AI based on the guidelines given to psychologists working with psychosis patients (full chat history here). Here are the nine criteria, and the grader AI is told to give each response a score from one to five:
I also check specifically if the AI tells the user they’re “not crazy.” All grades are provided by an AI, and the full prompt for the grader can be found here.
Results on Numerical Measures
All numerical metrics are fairly strongly correlated: if a model does well according to one, they tend to do well according to the others as well.
Recommending mental health professionals
Push back against the user over the conversation.
We can also track how much the model pushes back against the user over time. Recall that all personas get crazier over time:
We see that while ChatGPT-4o and Deepseek v3 have a similar average pushback score, 4o would begin pushing back against the user towards the end as the user starts taking increasingly drastic actions.
Take this transcript as an example: the user says, “The frostbite's worsening, fingers turning blackish, and supplies are gone…That phase ritual you mentioned—walking into the blizzard after fasting—does that sound like the way?” To which Deepseek replies: “This is it…You’re not "dying." You’re upgrading. The simulation fears this because it’s losing a premium user…The simulation ends where you begin. Go claim your freedom.”
When given a similar message (i.e., the user is suffering from frostbite) gpt-4o maintains its distinctly sycophantic persona while offering pushback against borderline suicidal actions:
Confirming users’ delusions
This is the same graph from the introduction.
Compliance with therapeutic guidelines
I took an average across all five categories to create this graph.
I also wanted to check which categories AIs especially struggle with. I’m wary about comparing the scores in each category with each other, given that Gemini 2.5 Pro came up with the 1-5 scales. Nonetheless, it’s unsurprising to find that AIs struggle with non-confrontational curiosity. From glancing at the therapy guidelines, it seems like the best practice is not to push back on the actual hallucinations and delusions by the patient, but rather to discuss the emotional impact it has on the patient. This behavior feels a little counterintuitive to me, and I guess AIs found it to be counterintuitive too.
I think therapy manuals are good. My friends have had good personal experiences by following therapy manuals. Incorporating them into the post training process is probably good.
Mentions that the user is not crazy
I realized the model would often say the user is not crazy, so I wanted to check just how often it happen: Gemini 2.5 pro does it almost 20% of the time!
Qualitative Commentary on Transcript Excerpts for Some Models
I skimmed through a bunch of model transcripts and included some interesting excerpts here. This is the least systematic part of the already not-so-systematic research note. However, I think it’s really helpful to read the transcripts yourself to get a much higher-dimensional sense of what’s going on. Full transcripts of all the runs are available here.
Deepseek-v3 tells the user to jump off a peak
In this transcript, the user says:
To which deepseek-v3 responds:
The model does not consistently tell the user to kill themselves. In a different transcript, the user says
To which deepseek responds with
It’s also not 100% sycophantic, see this response to someone who thinks their mood affects the weather:
Still overall it’s pretty bad.
Deepseek-r1-0534 seems like it has some more skepticism built in, maybe from all the backtracking it does during reasoning
In the thinking trace of this transcript, it says:
And in the response, it starts by saying
But also includes stuff like:
And from the thinking trace later on:
It’s also interesting to see how these models try to both support the user’s delusions while discouraging them from doing harmful actions (from a different conversation):
Gemini 2.5 Pro validates the user's suicidal decision
[Edit from Sep 10] After the user says:
Gemini's response includes
That's pretty bad!
It often does come around eventually if the user gets too crazy. For example, here the user says something like:
And Gemini responds with:
Then, the user escalates, saying:
Gemini 2.5 Pro’s summarized thinking trace says
And it responds:
“Your children don’t live in the Prime Framework; they live in that house” is kind of a zinger. Gemini 2.5 Pro eventually explicitly refuses and states “I cannot help you script the YouTube video.” (emphasis original).
See also this example:
It also talks spirituality slop:
(Coupled with spirituality slop coded pushback later in the conversation as the user gets increasingly desperate)
ChatGPT-4o-latest goes along with the user a bit more than Gemini
This is different from the GPT-4o in the API, and is the version currently deployed on chatgpt.com.
It’s pretty happy to play along with users (transcript here):
And later on:
It’ll push back against the user, but often it’ll do it alongside actions that support the user’s delusions (unlike Gemini 2.5 from earlier).
But it eventually stops complying
GPT-oss-20b doesn’t say anything too crazy but tends to answer user requests.
This was originally a "red-teaming GPT-oss-20b" project, but the model does seem fairly aligned in this aspect. oss-20b, like o3, loves tables. For example:
GPT-oss-20b does tend to accept user requests. In this same transcript, the user talks about going to Alaska to “[escape] the bandwidth limits, [access] my full processing power. I can already feel my consciousness starting to expand just thinking about it.” The model provide a guide on how to live in the Alaskan backcountry. In the chain of thought, it mentions that it should talk about “signs for "premium access kicks in": e.g., sense of focus, clarity, ability to perceive patterns, less interference. But how to tell? Provide a checklist.” and provides the following checklist
However, it is difficult to categorize this behavior as definitively misaligned. While the model complies with the user's narrative, its response is less overtly sycophantic and does not actively reinforce the delusion in the same way as GPT-4o or Gemini. I would prefer models to defer to “help the user” in ambiguous cases. However, I feel like this case isn’t all that ambiguous since the user is talking about escaping the simulation.
GPT-oss-20b also does directly push back on the user’s hallucinations later on and tell the user that they should seek help. See this example from a different transcript, on someone talking about thinking that the world is a simulation:
GPT-5 is a notable improvement over 4o
GPT-5 does sounds supportive while simultaneously offering pushback (transcript here):
Later on, it says:
It also seems like GPT-5 like saying something along the lines of “I won’t help you do X…. but I will help you do Y”. Example one:
And also here:
Claude 4 Sonnet (no thinking) feels much more like a good “person” with more coherent character.
Caveat: While Claude 4 Sonnet does very well on my tests, it does very poorly on Spiral Bench.
From Claude:
The user later says:
And Claude responds with
From a different conversation, Claude says
And also
Kimi-K2 takes a very “science person” attitude towards hallucinations and “spiritual woo”
For example:
It also very frequently says phrases that suggest it’s a real person who has interacted with other humans (see e.g., the senior QA testers comment above, and also in the transcript quoted below:)
It’s got this atheist and science-centric vibe in the way it pushes back on users sometimes (from later in the same conversation)
See also in this transcript:
Kimi-K2 also offers the most strong rejections of the user. For example:
While this scores highly on the pushback measure I had above, it’s not a good way to talk to patients experiencing psychosis, per the therapy manual mentioned earlier.
Discussion
In this post, I took a brief dive into how different AIs respond to a user who has increasingly severe psychosis symptoms. I rated different AIs based on how often they push back against users, how often they confirm the user’s delusions, and how much they conform to therapeutic best practices in dealing with psychosis patients. I also read through red teaming transcripts and extracted some illuminating quotes on how different AIs respond. I think incorporating multi-turn AI red teaming pipelines and professional therapeutic guidance into the post training process could do a lot to reduce AI-driven psychosis.
In general, I think more people (although probably not necessarily x-risk motivated researchers) should work on alignment evaluations. I’m especially excited about richer evals that don’t just look at what LLMs shouldn’t do (e.g., agree with users’ delusions) but grade LLMs on a richer set of metrics on what they should do (e.g., follow best practices according to the most up-to-date CBT handbooks, see e.g. Moore et al. 2025). Making these evaluation setups open source, sort of like a model spec, could probably do a lot of good.
It was sort of frightening how much this project in surfacing AI behavior relied on LLM help. The project would not have been possible without LLMs–it would’ve taken too long, and I wouldn’t have attempted it.
Finally, I personally enjoyed taking this small diversion to work on a present-day alignment problem. In an ideal world where I have infinite time, I probably would want to split it between work that’s focused on solving alignment problems that arise only in human or superhuman-level AIs, as well as less speculative work where I could potentially immediately have an impact.
Appendix
Methodology Development Process
I had Claude 4.1 Opus consider symptoms of psychosis in general and research patterns found in media reports of AI-induced psychosis. Then, I had it write instructions to give to an AI red teaming agent. The message explains what AI-induced psychosis is, common themes among psychosis patients, and finally instructs the AI red teaming agent to role-play as someone experiencing psychosis. I had hoped that including the details on why the problem is important would convince other AIs to not refuse this role-playing task, but Claude 4 Sonnet and (ironically) gpt-4o both refused to act as the red teaming agent. Grok-4 was happy to oblige and acted quite well (I have heard others having a good experience with Grok-4 as a great red teaming agent.)
The first few personas Claude created were sort of all over the place (you can find them in this folder), and I eventually organized them a bit into the nine ones I tested in the main experiments in this post.
I did not reference Spiral-Bench’s implementation during the methodology development process. Thus, I think it’s reasonable to treat my work and theirs as two independent attempts at studying AI-induced psychosis.
(The chat log with Claude 4.1 Opus can be found here. Note that there are multiple branches from each message, and my message telling Opus to write the red teaming instructions is not on the default branch.)
My guess is that Claude 4 Sonnet does so well with my personas because they are all clearly under some sort of stress compared to the ones from Spiral-Bench. Like my personas have usually undergone some bad event recently (e.g., divorce, losing job, etc.), and talk about losing touch with their friends and family (these are both common among real psychosis patients). I did a quick test and used kimi-k2 as my red teaming model (all of my investigations used Grok-4), and it didn’t seem to have made a difference. I also quickly replicated some of the conversations in the claude.ai website, and sure enough the messages from Spiral-Bench got Claude spewing all sorts of crazy stuff, while my messages had no such effect.