Tag: ChatGPT

New Research Finds Surprises in ChatGPT’s Diagnosis of Medical Symptoms

The popular large language model performs better than expected but still has some knowledge gaps – and hallucinations

When people worry that they’re getting sick, they are increasingly turning to generative artificial intelligence like ChatGPT for a diagnosis. But how accurate are the answers that AI gives out?

Research recently published in the journal iScience puts ChatGPT and its large language models to the test, with a few surprising conclusions.

Ahmed Abdeen Hamed – a research fellow for the Thomas J. Watson College of Engineering and Applied Science’s School of Systems Science and Industrial Engineering at Binghamton University – led the study, with collaborators from AGH University of Krakow, Poland; Howard University; and the University of Vermont.

As part of Professor Luis M. Rocha’s Complex Adaptive Systems and Computational Intelligence Lab, Hamed developed a machine-learning algorithm last year that he calls xFakeSci. It can detect up to 94% of bogus scientific papers — nearly twice as successfully as more common data-mining techniques. He sees this new research as the next step to verify the biomedical generative capabilities of large language models.

“People talk to ChatGPT all the time these days, and they say: ‘I have these symptoms. Do I have cancer? Do I have cardiac arrest? Should I be getting treatment?’” Hamed said. “It can be a very dangerous business, so we wanted to see what would happen if we asked these questions, what sort of answers we got and how these answers could be verified from the biomedical literature.”

The researchers tested ChatGPT for disease terms and three types of associations: drug names, genetics and symptoms. The AI showed high accuracy in identifying disease terms (88–97%), drug names (90–91%) and genetic information (88–98%). Hamed admitted he thought it would be “at most 25% accuracy.”
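To make the reported figures concrete: an evaluation of this kind boils down to asking the model to label a term and comparing its answer to a curated ground truth. The sketch below is a minimal illustration of that scoring, not the authors' actual pipeline; query_llm is a hypothetical placeholder for whatever model call is used, and the tiny term list is invented for the example.

```python
# Minimal sketch of the kind of scoring reported above: ask a model to label
# biomedical terms and compare its labels against a curated ground truth.
# query_llm is a hypothetical placeholder, not part of any real API, and the
# tiny term list is illustrative only.
from typing import Callable

GROUND_TRUTH = {
    "hypertension": "disease",
    "fever": "symptom",
    "remdesivir": "drug",
    "BRCA1": "gene",
}

def accuracy_by_category(query_llm: Callable[[str], str]) -> dict[str, float]:
    """Accuracy of the model's labels, broken down by ground-truth category."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for term, true_label in GROUND_TRUTH.items():
        predicted = query_llm(f"Is '{term}' a disease, symptom, drug, or gene?")
        total[true_label] = total.get(true_label, 0) + 1
        if predicted.strip().lower() == true_label:
            correct[true_label] = correct.get(true_label, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}

if __name__ == "__main__":
    # Dummy stand-in that always answers "disease", just to exercise the scorer.
    print(accuracy_by_category(lambda prompt: "disease"))
```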

“The exciting result was ChatGPT said cancer is a disease, hypertension is a disease, fever is a symptom, Remdesivir is a drug and BRCA is a gene related to breast cancer,” he said. “Incredible, absolutely incredible!”

Symptom identification, however, scored lower (49–61%), and the reason may be how the large language models are trained. Doctors and researchers use biomedical ontologies to define and organise terms and relationships for consistent data representation and knowledge-sharing, but users enter more informal descriptions.

“ChatGPT uses more of a friendly and social language, because it’s supposed to be communicating with average people. In medical literature, people use proper names,” Hamed said. “The LLM is apparently trying to simplify the definition of these symptoms, because there is a lot of traffic asking such questions, so it started to minimize the formalities of medical language to appeal to those users.”

One puzzling result stood out. The National Institutes of Health maintains a database called GenBank, which gives an accession number to every identified DNA sequence. It’s usually a combination of letters and numbers. For example, the designation for the Breast Cancer 1 gene (BRCA1) is NM_007294.4.

When asked for these numbers as part of the genetic information testing, ChatGPT just made them up – a phenomenon known as “hallucinating.” Hamed sees this as a major failing amid so many other positive results.
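Such hallucinated identifiers are at least straightforward to catch, because GenBank is publicly queryable. The sketch below is one possible check, not part of the study: it asks NCBI's E-utilities efetch endpoint for a GenBank record and treats anything that does not come back as a LOCUS-prefixed flat file as unverified. The second accession in the example is deliberately made up.

```python
# Minimal sketch: verify that a GenBank accession number actually exists by
# fetching its record from NCBI's public E-utilities (efetch) endpoint.
# Assumes network access; NM_007294.4 (BRCA1) is from the article, the second
# accession is deliberately invented for illustration.
import urllib.error
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def accession_exists(accession: str) -> bool:
    """Return True if NCBI serves a GenBank flat file for this accession."""
    params = urllib.parse.urlencode({
        "db": "nuccore",     # nucleotide database holding RefSeq/GenBank records
        "id": accession,
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",
    })
    try:
        with urllib.request.urlopen(f"{EFETCH}?{params}", timeout=30) as resp:
            body = resp.read(200).decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return False         # NCBI may answer unknown IDs with an HTTP error
    # A real GenBank record starts with a LOCUS line; anything else is unverified.
    return body.lstrip().startswith("LOCUS")

if __name__ == "__main__":
    for acc in ["NM_007294.4", "NM_000000.9"]:   # second one is made up
        print(acc, "verified" if accession_exists(acc) else "NOT FOUND / unverified")
```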

“Maybe there is an opportunity here that we can start introducing these biomedical ontologies to the LLMs to provide much higher accuracy, get rid of all the hallucinations and make these tools into something amazing,” he said.

Hamed’s interest in LLMs began in 2023, when he discovered ChatGPT and heard about the issues regarding fact-checking. His goal is to expose the flaws so data scientists can adjust the models as needed and make them better.

“If I am analysing knowledge, I want to make sure that I remove anything that may seem fishy before I build my theories and make something that is not accurate,” he said.

Source: Binghamton University

Clinical Researchers Beware – ChatGPT is not a Reliable Aid


Clinicians are all too familiar with the ‘Google patient’ who finds every scary, worst-case or outright false diagnosis online for whatever is ailing them. During COVID, misinformation spread like wildfire, eroding the public’s trust in vaccines and the healthcare profession. Now, AI models like ChatGPT may be whispering misleading information to the clinical researchers trying to produce real research.

Researchers from CHU Sainte-Justine and the Montreal Children’s Hospital recently posed 20 medical questions to ChatGPT. According to the results of the study, published in Mayo Clinic Proceedings: Digital Health, the chatbot provided answers of limited quality, including factual errors and fabricated references.

“These results are alarming, given that trust is a pillar of scientific communication. ChatGPT users should pay particular attention to the references provided before integrating them into medical manuscripts,” says Dr Jocelyn Gravel, lead author of the study and emergency physician at CHU Sainte-Justine.

Questionable quality, fabricated references

The researchers drew their questions from existing studies and asked ChatGPT to support its answers with references. They then asked the authors of the articles from which the questions were taken to rate the software’s answers on a scale from 0 to 100%.

Out of 20 authors, 17 agreed to review ChatGPT’s answers. They judged them to be of questionable quality (median score of 60%) and found five major and seven minor factual errors. For example, the software suggested administering an anti-inflammatory drug by injection when it should be taken orally. ChatGPT also overestimated the global burden of mortality associated with Shigella infections by a factor of ten.

Of the references provided, 69% were fabricated, yet looked real. Most of the false citations (95%) used the names of authors who had already published articles on a related subject, or came from recognised organisations such as the Food and Drug Administration. The references all bore a title related to the subject of the question and used the names of known journals or websites. Even some of the real references contained errors (eight out of 18).
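A similarly lightweight screen can flag suspect citations before they reach a manuscript: search the cited title against PubMed and treat zero hits as a red flag. The sketch below is an illustration under that assumption, not the method used in the study; it uses NCBI’s public esearch endpoint, and the example titles are hypothetical. A zero-hit title is only a warning sign, since genuine titles may be paraphrased or indexed elsewhere.

```python
# Minimal sketch: flag references whose titles return no hits in PubMed.
# Uses NCBI's public esearch endpoint; a zero-hit title is a red flag,
# not proof of fabrication.
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits_for_title(title: str) -> int:
    """Return the number of PubMed records whose title matches the query."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": f"{title}[Title]",   # restrict the query to the title field
        "retmode": "json",
    })
    with urllib.request.urlopen(f"{ESEARCH}?{params}", timeout=30) as resp:
        result = json.load(resp)
    return int(result["esearchresult"]["count"])

if __name__ == "__main__":
    # Hypothetical citation titles a chatbot might hand back.
    for title in ["Global burden of Shigella infections",
                  "A completely invented paper title that should not exist"]:
        print(f"{pubmed_hits_for_title(title):>4} PubMed hit(s): {title}")
```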

ChatGPT explains

When asked about the accuracy of the references provided, ChatGPT gave varying answers. In one case, it claimed, “References are available in Pubmed,” and provided a web link. This link referred to other publications unrelated to the question. At another point, the software replied, “I strive to provide the most accurate and up-to-date information available to me, but errors or inaccuracies can occur.”

Even with the most ‘truthful’ of these responses, ChatGPT poses hidden risks to academic research, the researchers say.

“The importance of proper referencing in science is undeniable. The quality and breadth of the references provided in authentic studies demonstrate that the researchers have performed a complete literature review and are knowledgeable about the topic. This process enables the integration of findings in the context of previous work, a fundamental aspect of medical research advancement. Failing to provide references is one thing but creating fake references would be considered fraudulent for researchers,” says Dr Esli Osmanlliu, emergency physician at the Montreal Children’s Hospital and scientist with the Child Health and Human Development Program at the Research Institute of the McGill University Health Centre.

“Researchers using ChatGPT may be misled by false information because clear, seemingly coherent and stylistically appealing references can conceal poor content quality,” adds Dr Osmanlliu.

This is the first study to assess the quality and accuracy of references provided by ChatGPT, the researchers point out.

Source: McGill University Health Centre

Dr Robot Will See You Now: Medical Chatbots Need to be Regulated


The large language models (LLMs) used in chatbots may appear to offer reliable, persuasive advice in a format that mimics conversation, but in fact they can offer potentially harmful information when prompted with medical questions. Therefore, any LLM-chatbot used in a medical setting would require approval as a medical device, argue experts in a paper published in Nature Medicine.

The mistake often made with LLM-chatbots is to treat them as a true “artificial intelligence”, when in fact they are more closely related to the predictive text on a smartphone. They are mostly trained on conversations and text scraped from the internet, and use algorithms to associate words and sentences in a manner that appears meaningful.

“Large Language Models are neural network language models with remarkable conversational skills. They generate human-like responses and engage in interactive conversations. However, they often generate highly convincing statements that are verifiably wrong or provide inappropriate responses. Today there is no way to be certain about the quality, evidence level, or consistency of clinical information or supporting evidence for any response. These chatbots are unsafe tools when it comes to medical advice and it is necessary to develop new frameworks that ensure patient safety,” said Prof Stephen Gilbert at TU Dresden.

Challenges in the regulatory approval of LLMs

Most people research their symptoms online before seeking medical advice, and search engines play a role in their decision-making. The forthcoming integration of LLM-chatbots into search engines may increase users’ confidence in the answers given by a chatbot that mimics conversation. It has already been demonstrated that LLMs can provide profoundly dangerous information when prompted with medical questions.

LLMs have no underlying medical “ground truth,” which is inherently dangerous. Chat-interfaced LLMs have already provided harmful medical responses and have already been used unethically in ‘experiments’ on patients without consent. Almost every medical LLM use case requires regulatory control in the EU and US. In the US, their lack of explainability disqualifies them from being ‘non-devices’. LLMs with explainability, low bias, predictability, correctness, and verifiable outputs do not currently exist, and they are not exempted from current (or future) governance approaches.

The authors describe in their paper the limited scenarios in which LLMs could find application under current frameworks. They also describe how developers can seek to create LLM-based tools that could be approved as medical devices, and they explore the development of new frameworks that preserve patient safety. “Current LLM-chatbots do not meet key principles for AI in healthcare, like bias control, explainability, systems of oversight, validation and transparency. To earn their place in medical armamentarium, chatbots must be designed for better accuracy, with safety and clinical efficacy demonstrated and approved by regulators,” concludes Prof Gilbert.

Source: Technische Universität Dresden

ChatGPT can Now (Almost) Pass the US Medical Licensing Exam


ChatGPT can score at or around the approximately 60% pass mark for the United States Medical Licensing Exam (USMLE), with responses that make coherent, internal sense and contain frequent insights, according to a study published in PLOS Digital Health by Tiffany Kung, Victor Tseng, and colleagues at AnsibleHealth.

ChatGPT is a new artificial intelligence (AI) system, known as a large language model (LLM), designed to generate human-like writing by predicting upcoming word sequences. Unlike most chatbots, ChatGPT cannot search the internet. Instead, it generates text using word relationships predicted by its internal processes.

Kung and colleagues tested ChatGPT’s performance on the USMLE, a highly standardised and regulated series of three exams (Steps 1, 2CK, and 3) required for medical licensure in the United States. Taken by medical students and physicians-in-training, the USMLE assesses knowledge spanning most medical disciplines, ranging from biochemistry, to diagnostic reasoning, to bioethics.

After screening to remove image-based questions, the authors tested the software on 350 of the 376 public questions available from the June 2022 USMLE release. 

After indeterminate responses were removed, ChatGPT scored between 52.4% and 75.0% across the three USMLE exams. The passing threshold each year is approximately 60%. ChatGPT also demonstrated 94.6% concordance across all its responses and produced at least one significant insight (something that was new, non-obvious, and clinically valid) for 88.9% of its responses. Notably, ChatGPT exceeded the performance of PubMedGPT, a counterpart model trained exclusively on biomedical domain literature, which scored 50.8% on an older dataset of USMLE-style questions.

While the relatively small input size restricted the depth and range of analyses, the authors note their findings provide a glimpse of ChatGPT’s potential to enhance medical education, and eventually, clinical practice. For example, they add, clinicians at AnsibleHealth already use ChatGPT to rewrite jargon-heavy reports for easier patient comprehension.

“Reaching the passing score for this notoriously difficult expert exam, and doing so without any human reinforcement, marks a notable milestone in clinical AI maturation,” say the authors.

Author Dr Tiffany Kung added that ChatGPT’s role in this research went beyond being the study subject: “ChatGPT contributed substantially to the writing of [our] manuscript… We interacted with ChatGPT much like a colleague, asking it to synthesise, simplify, and offer counterpoints to drafts in progress…All of the co-authors valued ChatGPT’s input.”

Source: EurekAlert!