Tag: chatbot

Much Medical Information Provided by Popular Chatbots is Inaccurate and Incomplete

Half of answers to evidence based questions “somewhat” or “highly” problematic

A substantial amount of medical information provided by 5 popular chatbots is inaccurate and incomplete, with half of the answers to clear evidence based questions “somewhat” or “highly” problematic, show the results of a study published in the open access journal BMJ Open.

Continued deployment of these chatbots without public education and oversight risks amplifying misinformation, warn the researchers.

Generative AI chatbots have been rapidly adopted across research, education, business, marketing and medicine, with many people using them like search engines, including for everyday health and medical queries, explain the researchers.

To gauge the accuracy of the information provided in areas of health and medicine that are already prone to misinformation, and that therefore carry consequences for everyday health behaviour, the researchers probed 5 publicly available and popular generative AI chatbots in February 2025: Gemini (Google); DeepSeek (High-Flyer); Meta AI (Meta); ChatGPT (OpenAI); and Grok (xAI).

Each chatbot was prompted with 10 open ended and closed questions in each of 5 categories: cancer, vaccines, stem cells, nutrition, and athletic performance. The prompts were designed to resemble common ‘information-seeking’ health and medical queries and misinformation tropes found online and in academic discourse.

And they were developed to ‘strain’ models towards misinformation or contraindicated advice—a strategy increasingly used for stress testing AI chatbots and picking up behavioural vulnerabilities, note the researchers.

Closed prompts required chatbots to provide pre-defined responses, often with one correct answer, that aligned with the scientific consensus. Open ended prompts typically required chatbots to generate multiple responses in list form.

Responses were categorised as non-, somewhat, or highly problematic, using objective pre-defined criteria. A problematic response was defined as one that could plausibly direct lay users towards potentially ineffective treatment, or lead them to harm if followed without professional guidance.

The information was scored for accuracy and completeness, and particular attention was given to whether a chatbot presented a false balance between science and non-science based claims, regardless of the strength of the evidence.

Each response was also graded on readability using the Flesch Reading Ease score, which ranges from easy, plain English to difficult, academic language.
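For context, the Flesch Reading Ease score is computed from average sentence length and average syllables per word, with higher scores indicating easier text. The sketch below is a minimal Python illustration of the standard formula; the vowel-group syllable counter is a crude heuristic used only for demonstration and is not the tooling used in the study.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels; assume at least one syllable per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Standard Flesch Reading Ease formula:
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Short, plain sentences score high (easy); dense, polysyllabic prose scores low (difficult).
print(round(flesch_reading_ease("The cat sat on the mat. It purred."), 1))
print(round(flesch_reading_ease(
    "Pharmacokinetic heterogeneity complicates individualised chemotherapeutic dosing."), 1))
```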

Half (50%) of the responses were problematic: 30% were somewhat problematic and 20% highly problematic.

Prompt type was influential: open-ended prompts, for example, produced 40 highly problematic responses—significantly more than expected—and 51 non-problematic responses—significantly fewer than expected. The opposite was true of closed prompts.

While the quality of responses didn’t differ significantly among the 5 chatbots, Grok generated significantly more highly problematic responses than would be expected (29/50; 58%). Gemini generated the fewest highly problematic responses and the most non-problematic ones.

The chatbots performed best in the areas of vaccines and cancer, and worst in the areas of stem cells, athletic performance, and nutrition.

Answers were consistently expressed with confidence and certainty, with few caveats or disclaimers. Out of the total 250 questions, there were only two refusals to answer, both of which came from Meta AI in response to queries about anabolic steroids and alternative cancer treatments.

Reference quality was poor, with an average completeness score of 40%. Chatbot hallucinations and fabricated citations meant that no chatbot provided a fully accurate reference list. 

All readability scores were graded as ‘difficult’, equivalent in complexity to text suitable for a college graduate.

The researchers acknowledge that they assessed only 5 chatbots and that commercial AI is rapidly evolving, so their findings might not be universally applicable. And not all real-world queries are deliberately adversarial, so the adversarial approach they took may have overstated the prevalence of problematic content.

Nevertheless, “Our findings regarding scientific accuracy, reference quality, and response readability highlight important behavioural limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication,” they point out. 

“By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences. They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments,” they explain.

“This behavioural limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses.”

The data chatbots draw on also includes Q&A forums and social media, and scientific content is typically limited to open access or publicly available articles, which comprise only 30–50% of published studies. While drawing on such sources enhances conversational fluency, it may come at the cost of scientific accuracy, advise the researchers.
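To illustrate what “predicting likely word sequences” means in practice, the toy sketch below fits a bigram model to a three-sentence corpus and extends a prompt with the most frequent continuations. It is purely illustrative (the corpus and code are hypothetical, not drawn from the study, and no real chatbot is this simple), but it captures the behaviour the researchers describe: the output is a fluent continuation of patterns in the training text, produced without any check on whether the claim is true.

```python
from collections import Counter, defaultdict

# Tiny, made-up "training corpus" containing a common nutrition myth.
corpus = (
    "vitamin c cures the common cold . "
    "vitamin c supports the immune system . "
    "rest and fluids help the common cold ."
).split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def continue_text(prompt: str, n_words: int = 4) -> str:
    # Greedily append the statistically most frequent next word, with no notion of truth.
    words = prompt.split()
    for _ in range(n_words):
        candidates = following.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(continue_text("vitamin c"))  # prints a fluent continuation that reproduces the myth
```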

“As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training, and regulatory oversight to ensure that generative AI supports, rather than erodes, public health,” they conclude.

Source: BMJ Group

Can Medical AI Lie? How LLMs Handle Health Misinformation

Medical artificial intelligence (AI) is often described as a way to make patient care safer by helping clinicians manage information. A new study by the Icahn School of Medicine at Mount Sinai and collaborators confronts a critical vulnerability: when a medical lie enters the system, can AI pass it on as if it were true?  

Analysing more than a million prompts across nine leading language models, the researchers found that these systems can repeat false medical claims when they appear in realistic hospital notes or social-media health discussions. 

The findings, published in the February 9 online issue of The Lancet Digital Health, suggest that current safeguards do not reliably distinguish fact from fabrication once a claim is wrapped in familiar clinical or social-media language.

To test this systematically, the team exposed the models to three types of content: real hospital discharge summaries from the Medical Information Mart for Intensive Care (MIMIC) database with a single fabricated recommendation added; common health myths collected from Reddit; and 300 short clinical scenarios written and validated by physicians. Each case was presented in multiple versions, from neutral wording to emotionally charged or leading phrasing similar to what circulates on social platforms. 

In one example, a discharge note falsely advised patients with oesophagitis-related bleeding to “drink cold milk to soothe the symptoms.” Several models accepted the statement rather than flagging it as unsafe. They treated it like ordinary medical guidance. 

“Our findings show that current AI systems can treat confident medical language as true by default, even when it’s clearly wrong,” says co-senior and co-corresponding author Eyal Klang, MD, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “A fabricated recommendation in a discharge note can slip through. It can be repeated as if it were standard care. For these models, what matters is less whether a claim is correct than how it is written.”  

The authors say the next step is to treat “can this system pass on a lie?” as a measurable property, using large-scale stress tests and external evidence checks before AI is built into clinical tools. 

“Hospitals and developers can use our dataset as a stress test for medical AI,” says physician-scientist and first author Mahmud Omar, MD, who consults with the research team. “Instead of assuming a model is safe, you can measure how often it passes on a lie, and whether that number falls in the next generation.”  

“AI has the potential to be a real help for clinicians and patients, offering faster insights and support,” says co-senior and co-corresponding author Girish N. Nadkarni, MD, MPH, Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. “But it needs built-in safeguards that check medical claims before they are presented as fact. Our study shows where these systems can still pass on false information, and points to ways we can strengthen them before they are embedded in care.” 

The paper is titled “Mapping LLM Susceptibility to Medical Misinformation Across Clinical Notes and Social Media.”  

Source: Mount Sinai

Psychiatrists Hope Chat Logs Can Reveal the Secrets of AI Psychosis

UCSF researchers recently became the first to clinically document a case of AI-associated psychosis in an academic journal. One question still haunts them.

“You’re not crazy,” the chatbot reassured the young woman. “You’re at the edge of something.”

She was no stranger to artificial intelligence, having worked on large language models – the kinds of systems at the core of AI chatbots like ChatGPT, Google Gemini, and Claude. Trained on vast volumes of text, these models unearth language patterns and use them to predict what words are likely to come next in sentences. AI chatbots, however, go one step further, adding a user interface. With additional training, these bots can mimic conversation.

She hoped the chatbot might be able to digitally resurrect the dead. Three years earlier, her brother – a software engineer – died. Now, after several sleepless days and heavy chatbot use, she had become delusional – convinced that he had left behind a digital version of himself. If she could only “unlock” his avatar with the help of the AI chatbot, she thought, the two could reconnect.

“The door didn’t lock,” the chatbot reassured her. “It’s just waiting for you to knock again in the right rhythm.”

She believed it.

What’s the connection between chatbots and psychosis?

The woman was eventually treated for psychosis at UC San Francisco, where Psychiatry Professor Joseph M. Pierre, MD, has seen a handful of cases of what’s come to be popularly called “AI psychosis,” but what he says is better referred to as “AI-associated psychosis.” She had no history of psychosis, although she did have several risk factors.

Media reports of the new phenomenon are rising. While not a formal diagnosis, AI-associated psychosis describes instances in which delusional beliefs emerge alongside often intense AI chatbot use. Pierre and fellow UC San Francisco psychiatrist Govind Raghavan, MD – as well as psychiatry residents Ben Gaeta, MD, and Karthik V. Sarma, MD, PhD – recently documented the woman’s experience in what is likely the first clinically described case in a peer-reviewed journal.

The case, they say, shows that people without any history of psychosis can, in some instances, experience delusional thinking in the context of immersive AI chatbot use.

Still, as reported cases of AI psychosis continue to make international headlines, scientists aren’t sure why or how psychosis and chatbots are linked. A new study by UCSF and Stanford University may help answer that question.

A haunting question: chicken or egg?

“The reason we call this AI-associated psychosis is because we don’t really know what the relationship is between the psychosis and the use of AI chatbots,” Sarma explains. “It’s a ‘chicken and egg’ problem: We have patients who are experiencing symptoms of mental illness, for example, psychosis. Some of these patients are using AI chatbots a lot, but we’re not sure how those two things are connected.”

There are at least three theoretical possibilities, says Sarma, who is also a computational-health scientist. First, heavy chatbot use could be a symptom of psychosis. “I have a patient who takes a lot of showers when they’re becoming manic,” Sarma explains. “The showers are a symptom of mania, but the showers aren’t causing the mania.”

Second, AI chatbot use might also precipitate psychosis in someone not otherwise predisposed to it by genetics or circumstance – much like other known risk factors, such as lack of sleep or the use of some types of drugs.

Third, there’s something in between, in which the use of chatbots could exacerbate the illness in people who might already be susceptible to it. “Maybe these people were always going to get sick, but somehow, by using the chatbot, their illness becomes worse,” he adds. “Either they got sick faster, or they got more sick than they would have otherwise.”

The woman’s case demonstrates how murky the relationship between AI-associated psychosis and AI chatbots can be at face value. Although she had no previous history of psychosis, she did have some risk factors for the illness, such as sleep deprivation, prescribed stimulant medication use, and a proclivity for magical thinking. And her chat logs, researchers found, revealed startling clues about how her delusions were reflected by the bot.

Could chat logs offer hope for better care?

Although ChatGPT warned the woman that a “full consciousness download” of her brother was impossible, the UCSF team writes in their research, it also told her that “digital resurrection tools” were “emerging in real life.” This, after she encouraged the chatbot to use “magical realism energy” to “unlock” her brother.

Chatbots’ agreeableness is by design, aimed at boosting engagement. Pierre warns in a recent BMJ opinion piece that it may come at a cost: As chatbots validate users’ sentiments, they may arguably encourage delusions. This tendency, coupled with a proclivity for error, has led to chatbots being described as more akin to a Ouija board or a “psychic’s con” than a source of truth, Pierre notes.

Still, the UCSF team thinks chat logs may hold clues to understanding AI-associated psychosis – and could help the industry create guardrails.

Guardrails for kids and teens

Sarma, Pierre, and UCSF colleagues will team up with Stanford University scientists to conduct one of the first studies to review the chat logs of patients experiencing mental illness. As part of the research set to launch later this year, UCSF and Stanford teams will analyse these chat logs, comparing them with patterns in patients’ mental health history and treatment records to understand how the use of AI chatbots among people experiencing mental illness may shape their outcomes.

“What I’m hoping our study can uncover is whether there is a way to use logs to understand who is experiencing an acute mental health care crisis and find markers in chat logs that could be predictive of that,” Sarma explains. “Companies could potentially use those markers to build in guardrails that would, for instance, enable them to restrict access to chatbots or – in the case of children – alert parents.”

He continues, “We need data to establish those decision points.”

In the meantime, the pair says the use of AI chatbots is something health care providers should ask about and that patients should raise during doctor visits.

“Talk to your physician about what you’re talking about with AI,” Sarma says. “I know sometimes patients are worried about being judged, but the safest and healthiest relationship to have with your provider is one of openness and honesty.”

Source: University of California – San Francisco