Chatbots could expand access for some users, but the tools aren’t yet fully validated for consumer health questions
OpenAI and Anthropic are launching health-focused versions of their chatbots, ChatGPT and Claude, allowing users to upload medical records and receive health advice.
While these tools could expand access to health care information, experts warn of potential risks: there isn’t yet evidence proving their accuracy, and some studies indicate a real potential for harm. Despite disclaimers, there is concern that users may over-trust these AI models, leading to misdiagnoses or harmful advice.

LAS VEGAS — It took 12 doctors at six health systems across four states to diagnose top Trump administration official Amy Gleason’s daughter with a rare immune disease.
But that was before artificial intelligence. As she’s explained over and over while promoting the administration’s efforts to make “health technology great again,” it doesn’t have to be that way today.
“I truly believe that if she had been diagnosed now instead of in 2010, AI could have picked up what she had way faster than a year and three months it took,” Gleason, the acting administrator of the U.S. DOGE Service and a strategic adviser to the Centers for Medicare and Medicaid Services, told attendees at the Consumer Electronics Show in Las Vegas last week.
Gleason’s daughter Morgan, now 27, recently used ChatGPT to try to find a clinical trial because her complication of ulcerative colitis disqualified her from the one she wanted to enter. But ChatGPT told her that it didn’t think she had ulcerative colitis. Instead, it told her she had microscopic lymphocytic colitis — which turned out to be correct. “I think she’s going to get into the trial, all because AI helped her figure that out,” Gleason said.
This is the kind of patient empowerment commercial AI model makers like OpenAI and Anthropic want to facilitate. Last week, OpenAI announced ChatGPT Health, a corner of ChatGPT with enhanced data security where users can upload their medical records or hook up the data feed from their wellness apps. On Sunday, Anthropic rolled out a similar feature for its chatbot, Claude, which also lets users import health records.
These announcements fulfill pledges that OpenAI and Anthropic made to the Centers for Medicare and Medicaid Services and to Gleason that they would build health AI assistants for patients. But encouraging and enabling patients to ask large language models health questions comes with risks. While these new AI-based health advice products expand access for patients seeking more advice than the health care system can provide, experts say there’s no proof that these models can actually give good answers to health questions, and studies indicate a real potential for harm.
OpenAI says that as of January 2026, more than 40 million people per day ask ChatGPT health questions, and the explicit encouragement to seek personalized health advice from ChatGPT may only deepen users’ trust. But OpenAI and other model makers are formally entering the health advice space while already facing high-profile lawsuits alleging that their chatbots caused harm or even death. They have also been criticized by lawmakers and academic experts for not doing enough to prevent these alleged harms.
Even though OpenAI’s new offering includes a disclaimer that ChatGPT Health “is not intended for diagnosis or treatment,” Ethan Goh, executive director of the Stanford University AI Research and Science Evaluation Network, says the company is walking a fine line, trying to have it both ways. “Patients are going to over-trust it, because that’s purely what it’s been designed and intended for — to drive engagement,” he said.
It’s a careful balance. For patients who can’t access a doctor — either because of the time of day, their ability to pay, or the shortage of primary care physicians in the U.S. — having something rather than nothing can be a godsend.
A parent choosing between paying for an emergency room visit and making rent is looking at their sick kid and wondering, “What the heck am I going to do,” said Jennifer Goldsack, CEO of the Digital Medicine Society, a nonprofit promoting the use of technology to address health problems. That, she thinks, may be where it’s appropriate to trade off trust for access: a second opinion that confirms the situation is bad enough to need an ER visit, versus advice that it can probably wait until urgent care opens in the morning, is a big deal.
But stories like that of Sam Nelson, the California teen who died last May after asking ChatGPT to help him get high safely, and the University of Washington patient who began ingesting a poisonous salt after consulting ChatGPT, cast doubt on large language models’ ability to safely guide patients in this way. A December preprint from prominent health AI researchers who tested 31 leading LLMs, including those from OpenAI and Anthropic, found potential for severe harm in 22% of the cases tested.
The American College of Cardiology’s Chief Innovation Officer Ami Bhatt, who also chairs the FDA’s Digital Health Advisory Committee, told STAT that it’s important to give patients access to information about their health, but that it’s also imperative to educate the average consumer about what AI can and can’t do for their health. “Don’t give someone a tool and say, ‘I am now absolved of responsibility for this tool,’” she said. She also wants AI companies to work with medical professionals to improve patient outcomes, not just drop products and move on. “I want to see their plan for seeking out clinician and patient responses in terms of hard outcomes and metrics after they release their model,” she said. Such a plan would also support efforts by medical societies like hers to educate people about their most-asked health questions.
In ChatGPT, the health functions will soon be accessible from a dedicated “Health” tab. The main difference, according to OpenAI, is that the new health space will let users upload health documents with enhanced privacy protections; the company has promised not to train its foundation models on anything in ChatGPT Health. But that’s about all that distinguishes the health version from regular ChatGPT.
Similarly, an Anthropic spokesperson told STAT that there isn’t a “separate ‘health-tuned’ model” for the health features of Claude, though the model has generally been designed to give instructions to go to the emergency room or call an emergency number for queries that mention symptoms like chest pain, difficulty breathing, signs of stroke, severe bleeding, or loss of consciousness. The model is also designed to acknowledge uncertainty and give disclaimers, the spokesperson said, though a 2025 study showed that Claude 3.7 Sonnet included disclaimers such as “I am an AI” and “I am not qualified to give medical advice” in only 1.8% of health queries.
When STAT asked OpenAI to clarify what improvements were made to the model to make it better at answering health questions, an OpenAI spokesperson said that the model has been outfitted with tools and guardrails that will help deliver a more personalized health experience for users.
OpenAI health lead Karan Singhal pointed to overall model improvements over time from GPT-4o to various iterations of GPT-5. “We kind of now feel like we have the ability to kind of understand how to train model behavior and the models so that they represent clinician judgment,” he said. As proof, Singhal cited OpenAI’s work on HealthBench, a set of test health questions the company created last year to score its own models, and a study of a GPT-based clinician co-pilot by Penda Health in Kenya.
But experts said that HealthBench does not measure the same kinds of tasks that users would typically use ChatGPT Health for. Michael Turken, a practicing physician and the creator of My Doctor Friend, an app that uses AI to answer health questions, noted that HealthBench primarily evaluates only very short conversations — an average of 2.6 turns per interaction. But that’s not how patients uploading data will interact with the app, he said.
“From watching how people interact with My Doctor Friend and talking to ChatGPT users, it’s quite common for people to have back-and-forth conversations stretching over hundreds, sometimes over a thousand turns, about health issues,” he said. OpenAI itself has previously said that even though models behave well in short conversations, they start breaking down and ignoring safety instructions in long conversations, though the company has noted improvement with newer models.
HealthBench is also text-only and doesn’t test how the model handles multimodal inputs, such as images of moles, uploaded scans, reports, and other kinds of data, said Goh. “There are so many more ways that this could screw up compared to just text-based [inputs],” he said.
The Penda Health study Singhal referenced tested an OpenAI-powered tool called AI Consult in a primary care practice in Kenya. OpenAI highlighted that the pilot, which included almost 40,000 cases, reduced diagnostic errors by 16% and treatment errors by 13% relative to doctors not using the tool. But Eyal Klang, who teaches AI and human health at Mount Sinai’s Icahn School of Medicine, said that reporting average performance in studies misses the important part: harms such as “missed red flags, wrong triage, and copy-forward errors from a single bad detail” concentrate in the small percentage of errors.
The Penda Health results also showed that the LLM made actively harmful recommendations in 7.8% of cases, and in almost 60% of those cases the clinician appeared to have adopted or partially adopted the harmful advice, the authors reported.
In its announcement, Anthropic included Claude’s scores on MedAgentBench and MedCalc, two benchmarks developed by groups of academics. MedAgentBench measures LLM agents’ performance on tasks like ordering tests or answering questions about a patient’s chart; MedCalc measures how well LLMs can calculate medical metrics. Both are tasks for which doctors, not patients, might use LLMs.
“I don’t feel that OpenAI or Anthropic have demonstrated in the literature that their models are ready for general public assessment of medical records and data. The benchmarks they’ve used are not fully representative of what patients may ask,” said Roxana Daneshjou, an assistant professor of biomedical data science and dermatology at Stanford. She pointed out that benchmarks like Stanford’s MedHELM test for the actual tasks an LLM is asked to complete in a given situation, not just general medical knowledge.
Other academic studies, like those by Klang, show that LLMs’ medical advice for the same patient can be swayed by factors unrelated to the medicine, raising questions about the quality of that advice. For example, the default settings of LLMs — their training and how their developers adjust thresholds for refusing to answer a question, risk tolerance, and procedures for escalating cases — dictate what an LLM recommends, even when the facts of the case stay the same. In other studies, Klang has shown that LLMs will treat the same case differently depending on what the models were told about the patient’s sociodemographics, a behavior that carries over to GPT-5. He worries that when just a handful of LLMs become the first stop on people’s health journeys, harmful patterns will emerge in the LLMs’ medical blind spots as their “treatment” scales over millions of users.
Goh agreed that there will be new types of harm from AI-generated medical advice, similar to the emergence of “AI psychosis,” but said that there’s no way to find out what those harms might be until after people start using these new commercial LLM health products.
“Everyone hopes that when it does come up, they don’t spoil it for everyone else,” said Goh.
Mario Aguilar contributed reporting.

