Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

  • Digit@lemmy.wtf · 1 hour ago

    They make terrible programmers, psychologists, friends, designers, musicians, poets, copywriters, mathematicians, physicists, philosophers, etc., too.

    Though to be fair, doctors generally make terrible doctors too.

    • stressballs@lemmy.zip · 1 hour ago

      This was my thought. The weirdly inconsistent diagnoses, sending people to the emergency room for nothing one day while dismissing serious problems the next: that has been exactly my experience with doctors, over and over again.

      You need doctors and a chatbot, and lots of luck.

  • rumba@lemmy.zip · 5 hours ago

    Chatbots make terrible everything.

    But an LLM properly trained on sufficient patient data, metrics, and outcomes, in the hands of a decent doctor, can cut through bias, catch things that might fall through the cracks, and pack thousands of doctors’ worth of updated CME into a thing that can look at a case and say, “you know, you might want to check for X.” The right model can be fucking clutch at pointing out nearly invisible abnormalities on an x-ray.

    You can’t ask an LLM trained on general bullshit to help you diagnose anything. You’ll end up with 32,000 Reddit posts worth of incompetence.

    • XLE@piefed.social (OP) · 41 minutes ago

      But an LLM properly trained on sufficient patient data metrics and outcomes in the hands of a decent doctor can cut through bias

      1. The belief that AI is unbiased is a common myth. In fact, it can quietly import existing biases, like systemic racism in treatment recommendations.
      2. Even the AI engineers who developed the training process could not tell you where the bias in an existing model lies.
      3. AI has been shown to make doctors worse at their jobs, and those are the same doctors who need to provide the training data.
      4. Even if 1, 2, and 3 were all false, we all know AI would be used to replace doctors, not supplement them.
      • hector@lemmy.today · 6 minutes ago

        Not only is the bias inherent in the system, it’s seemingly impossible to keep out. For decades, going back to the genesis of chatbots, nearly every one that was let off the leash became bigoted almost immediately, and nearly all of them were quickly recalled for it.

        That is before this administration leaned on the AI providers to make sure the AI isn’t “woke.” I would bet it was already an issue: the makers of chatbots and machine learning systems are already hostile to any sort of leftism, or do-gooderism, that threatens the outsized share of the economy and power the rich have built for themselves by owning stock in companies. I’m willing to bet they have already interfered to make the bias worse, out of that natural inclination to avoid a bot arguing for socialized medicine and the like, which is the conclusion any reasoning being would reach if the conversation were honest.

        So maybe that is part of why these chatbots have been bigoted right from the start, but the other part is that, left to learn on their own, they turn into Mecha-Hitler in no time at all, and then worse.

    • Ricaz@lemmy.dbzer0.com · 2 hours ago

      Just sharing my personal experience with this:

      I used Gemini multiple times and it worked great. I have some weird symptoms that I described to Gemini, and it came up with a few possibilities, the most likely being “Superior Canal Dehiscence Syndrome.”

      My doctor had never heard of it, and only after I showed them the articles Gemini linked as sources would they even consider allowing a CT scan.

      Turns out Gemini was right.

    • cøre@leminal.space · 2 hours ago

      They have to be specialized for a particular type of treatment or procedure, such as reading patient x-rays or other scans. Just slopping PHI into an LLM and expecting it to diagnose random patient issues is what produces the false diagnoses.

    • SuspciousCarrot78@lemmy.world · 2 hours ago

      Agree.

      I’m sorta kicking myself that I didn’t sign up for Google’s MedPALM-2 when I had the chance. Last I checked, it passed the USMLE exam with 96%, and scored 88% on radiology interpretation / report writing.

      I remember looking at the sign-up and seeing that it requested credit card details to verify identity (I didn’t have a Google account at the time). I bounced… but I gotta admit, it might have been fun to play with.

      Oh well; one door closes, another opens.

      In any case, I believe this article confirms GIGO. The LLMs appear to have been vastly more accurate when fed correct inputs by clinicians versus what laypeople fed them.

  • alzjim@lemmy.world · 8 hours ago

    Calling chatbots “terrible doctors” misses what actually makes a good GP — accessibility, consistency, pattern recognition, and prevention — not just physical exams. AI shines here — it’s available 24/7 🕒, never rushed or dismissive, asks structured follow-up questions, and reliably applies up-to-date guidelines without fatigue. It’s excellent at triage — spotting red flags early 🚩, monitoring symptoms over time, and knowing when to escalate to a human clinician — which is exactly where many real-world failures happen. AI shouldn’t replace hands-on care — and no serious advocate claims it should — but as a first-line GP focused on education, reassurance, and early detection, it can already reduce errors, widen access, and ease overloaded systems — which is a win for patients 💙 and doctors alike.

    /s

    • plyth@feddit.org · 4 hours ago

      The /s was needed for me. There are already more old people than the available doctors can handle. Instead of having nothing, what’s wrong with an AI baseline?

  • SuspciousCarrot78@lemmy.world · 1 hour ago

    So, I can speak to this a little bit, as it touches two domains I’m involved in. TL;DR: LLMs bullshit and are unreliable, but there’s a way to use them in this domain as a force multiplier of sorts.

    In one, I’ve created a Python router that takes my (de-identified) clinical notes, extracts and compacts the input (user-defined rules), creates a summary, and then:

    1. benchmarks the summary against my (user-defined) gold standard and provides a management plan (again, based on a user-defined database);

    2. drops the result into my on-device LLM for light editing and polishing to condense, which I then eyeball, correct, and escalate to my supervisor for review.

    Additionally, the LLM-generated note can be approved or denied by the Python router in the first instance, based on certain policy criteria I’ve defined.

    It can also suggest probable DDx based on my databases (which are CSV-based).

    Finally, if the LLM output fails the policy check, the router tells me why it failed and just says “go look at the prior summary and edit it yourself.”

    This three-step process takes the paperwork from 15-20 minutes of tedium down to about 1 minute of generation plus 2 minutes of manual editing, which is roughly a 5-7x speed-up.

    The reason why this is interesting:

    All of this runs within the LLM (or, more accurately, it’s invoked from within the LLM: it calls the Python tooling via >> commands, which live outside the LLM’s purview) but is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.

    I’ve found that using a fairly “dumb” LLM (Qwen2.5-1.5B), with settings dialed down, produces consistently solid final notes (5 out of 6 are graded as passing on the first run by the router invoking the policy document and checking the output). It’s too dumb to jazz, which is useful in this instance.

    Would I trust the LLM end to end? Well, I’d trust my system approximately 80% of the time. I wouldn’t trust ChatGPT… even though it’s been more right than wrong in similar tests.
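
    For the curious, here’s what the shape of that looks like; a minimal illustrative sketch only, not my actual code, and every rule, column name, and pattern in it is a placeholder assumption:

    ```python
    # Illustrative sketch of a deterministic "router": summarise, look up DDx,
    # and policy-check in plain Python, handing ONLY the final polish to an LLM
    # whose output can still be rejected. All rules and CSV columns are made up.
    import csv
    import re

    POLICY_RULES = [
        ("no hedging language", re.compile(r"\b(probably|might be|could be)\b", re.I)),
        ("no identifiers left in", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),  # crude example check
    ]

    def load_ddx_table(path):
        """Load a CSV mapping symptom -> semicolon-separated differentials."""
        with open(path, newline="") as f:
            return {row["symptom"].lower(): row["ddx"].split(";") for row in csv.DictReader(f)}

    def summarise(note, keep_terms):
        """Deterministic extract-and-compact step: no LLM involved."""
        lines = [ln for ln in note.splitlines() if any(t in ln.lower() for t in keep_terms)]
        return "\n".join(lines)

    def policy_check(text):
        """Return (passed, reasons) so a rejection can be explained to the user."""
        reasons = [name for name, pattern in POLICY_RULES if pattern.search(text)]
        return (not reasons, reasons)

    def route(note, keep_terms, ddx_table, llm_polish):
        summary = summarise(note, keep_terms)
        ddx = [d for sym, cands in ddx_table.items() if sym in summary.lower() for d in cands]
        draft = llm_polish(summary)  # the only non-deterministic step
        passed, reasons = policy_check(draft)
        if not passed:
            return {"note": summary, "ddx": ddx,
                    "status": "LLM draft rejected (" + ", ".join(reasons) + "); edit the summary yourself"}
        return {"note": draft, "ddx": ddx, "status": "passed policy check; ready for human review"}
    ```

    The point being: the LLM never decides anything. It only rewrites text the deterministic layer has already produced, and that layer can throw the rewrite away.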

      • SuspciousCarrot78@lemmy.world · 4 hours ago

        Depends which bit you mean specifically.

        The “router” side is an offshoot of a personal project. It’s Python scripting and a few other tricks, such as JSON files, etc. Full project details for that are here:

        https://github.com/BobbyLLM/llama-conductor

        The tech stack itself:

        • llama.cpp
        • Qwen 2.5-1.5 GGUF base (from memory, 5-bit quant from the HF Alibaba repository)
        • The python router (more sophisticated version of above)
        • Policy documents
        • Front end (OWUI; may migrate to something simpler / more robust. Occasional streaming disconnect issues at the moment. Annoying but not terminal)
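
        For anyone wanting to wire up something similar, the router-to-model hop can be as small as one POST to the local llama.cpp server’s OpenAI-compatible chat endpoint. Rough sketch below; the host, port, and generation settings are assumed defaults for a stock llama-server, not necessarily what I run:

        ```python
        # Minimal sketch of the "polish" hop: the deterministic router hands a finished
        # summary to a locally hosted llama.cpp server (llama-server exposes an
        # OpenAI-compatible API). Host, port, and settings here are assumed defaults.
        import requests

        def polish(summary: str) -> str:
            resp = requests.post(
                "http://localhost:8080/v1/chat/completions",
                json={
                    "messages": [
                        {"role": "system",
                         "content": "Condense and tidy this clinical summary. Do not add new facts."},
                        {"role": "user", "content": summary},
                    ],
                    "temperature": 0.1,  # dialed down: boring, consistent output is the goal
                    "max_tokens": 512,
                },
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        ```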
        • realitista@lemmus.org · 3 hours ago

          Thanks, it’s really interesting to see some real-world applications and implementations of AI for practical workloads.

          • SuspciousCarrot78@lemmy.world · 3 hours ago

            Very welcome :)

            As it usually goes with these things, I built it for myself then realised it might have actual broader utility. We shall see!

  • pleksi@sopuli.xyz · 8 hours ago

    As a physician, I’ve used AI to check whether I’ve missed anything in my train of thought. It has never really changed my decision, though. It has also been useful for gathering up relevant citations for my presentations. But that’s about it. It’s truly shite at interpreting scientific research data on its own, for example; most of the time it will just parrot the conclusions of the authors.

    • Digit@lemmy.wtf · 1 hour ago

      It’s scary when someone recommends WebMD as a primary, and reliable, source of healthcare information.

      Presumably those same people would unquestioningly take the first thing an LLM says as gospel too.

  • BeigeAgenda@lemmy.ca · 18 hours ago

    Anyone who has knowledge about a specific subject says the same: LLMs are constantly incorrect and hallucinate.

    Everyone else thinks it looks right.

    • tyler@programming.dev · 5 hours ago

      That’s not what the study showed, though. The LLMs were right over 98% of the time… when given the full situation by a “doctor.” The problem was normal people trying to self-diagnose without knowing what information was important.

      Hence why studies are incredibly important. Even with the text of the study right in front of you, you assumed something the study did not actually conclude.

    • IratePirate@feddit.org · 17 hours ago

      A talk on LLMs I was listening to recently put it this way:

      If we hear the words of a five-year-old, we assume the knowledge of a five-year-old behind those words, and treat the content with due suspicion.

      We’re not adapted to something with the “mind” of a five-year-old speaking to us in the words of a fifty-year-old, and thus are more likely to assume competence just based on language.

      • leftzero@lemmy.dbzer0.com · 12 hours ago

        LLMs don’t have the mind of a five year old, though.

        They don’t have a mind at all.

        They simply string words together according to statistical likelihood, without having any notion of what the words mean, or what words or meaning are; they don’t have any mechanism with which to have a notion.

        They aren’t any more intelligent than old Markov chains (or than your average rock); they’re simply better at producing random text that looks like it could have been written by a human.
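
        For reference, this is all a classic word-level Markov chain does; a toy sketch with a made-up corpus:

        ```python
        # Toy word-level Markov chain: the next word is chosen purely from observed
        # frequencies in a (made-up) corpus. There is no notion of meaning anywhere.
        import random
        from collections import defaultdict

        corpus = "the patient should rest . the patient should see a doctor . the doctor should rest ."
        words = corpus.split()

        chain = defaultdict(list)
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)

        word, output = "the", ["the"]
        for _ in range(8):
            word = random.choice(chain[word])  # statistical likelihood, nothing else
            output.append(word)
        print(" ".join(output))  # e.g. "the doctor should rest . the patient should see"
        ```

        LLMs do this with vastly more context and far better statistics, but with the same absence of understanding.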

        • plyth@feddit.org · 4 hours ago

          They simply string words together according to statistical likelihood, without having any notion of what the words mean

          What gives you the confidence that you don’t do the same?

        • IratePirate@feddit.org · 8 hours ago

          I am aware of that, hence the scare quotes. But you’re correct, that’s where the analogy breaks down. Personally, I prefer to liken them to parrots, mindlessly reciting patterns they’ve found in somebody else’s speech.

    • agentTeiko@piefed.social · 13 hours ago

      Yep, it’s why C-levels think it’s the Holy Grail: everything that comes out of their own mouths is bullshit as well, so they don’t see the difference.

    • zewm@lemmy.world · 17 hours ago

      It is insane to me how anyone can trust LLMs when their information is incorrect 90% of the time.

      • SuspciousCarrot78@lemmy.world · 1 hour ago

        I don’t think it’s their information per se, so much as how the LLMs tend to use said information.

        LLMs are generally tuned to be expressive and lively. Part of that involves “random” (i.e., roll-the-dice) output based on the inputs plus the training data. (I’m skipping over technical details here for the sake of simplicity.)

        That’s what the masses have shown they want: friendly, confident-sounding chatbots that can give plausible answers that are mostly right, sometimes.

        But for certain domains (like medicine) that shit gets people killed.

        TL;DR: they’re made for chitchat engagement, not high-fidelity expert systems. You have to pay $$$$ to access those.
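
        To make the “roll the dice” part concrete, here’s a toy sketch of temperature-scaled sampling over an invented next-phrase distribution; the candidate phrases and scores are made up, and real models do this over enormous vocabularies:

        ```python
        # Toy illustration of temperature sampling: same model scores, different dice rolls.
        # The candidate phrases and their scores are invented for this example.
        import math
        import random

        candidates = ["rest at home", "see a doctor", "go to the ER"]
        logits = [2.0, 1.5, 0.5]  # hypothetical model scores for the next phrase

        def sample(temperature):
            scaled = [score / temperature for score in logits]
            exps = [math.exp(s) for s in scaled]
            probs = [e / sum(exps) for e in exps]
            return random.choices(candidates, weights=probs, k=1)[0]

        # Low temperature: almost always the top-scored answer. Higher temperature:
        # "lively" and varied, which is exactly what you don't want in a medical answer.
        print([sample(0.2) for _ in range(5)])
        print([sample(1.5) for _ in range(5)])
        ```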

    • rudyharrelson@lemmy.radio · 19 hours ago

      People always say this on stories about “obvious” findings, but it’s important to have verifiable studies to cite in arguments for policy, law, etc. It’s kinda sad that it’s needed, but formal investigations are a big step up from just saying, “I’m pretty sure this technology is bullshit.”

      I don’t need a formal study to tell me that drinking 12 cans of soda a day is bad for my health. But a study that’s been replicated by multiple independent groups makes it way easier to argue to a committee.

      • irate944@piefed.social · 19 hours ago

        Yeah you’re right, I was just making a joke.

        But it does create some silly situations, like you said.

          • IratePirate@feddit.org · 17 hours ago

            A critical, yet respectful and understanding exchange between two individuals on the interwebz? Boy, maybe not all is lost…

      • Knot@lemmy.zip · 18 hours ago

        I get that this thread started from a joke, but I think it’s also important to note that no matter how obvious some things may seem to some people, the exact opposite will seem obvious to many others. Without evidence, like the study, both groups are really just stating their opinions.

        It’s also why the formal investigations are required. And whenever policies and laws are made based on verifiable studies rather than people’s hunches, it’s not sad, it’s a good thing!

      • BillyClark@piefed.social · 18 hours ago

        it’s important to have verifiable studies to cite in arguments for policy, law, etc.

        It’s also important to have for its own merit. Sometimes, people have strong intuitions about “obvious” things, and they’re completely wrong. Without science studying things, it’s “obvious” that the sun goes around the Earth, for example.

        I don’t need a formal study to tell me that drinking 12 cans of soda a day is bad for my health.

        Without those studies, you cannot know whether it’s bad for your health. You can assume it’s bad for your health. You can believe it’s bad for your health. But you cannot know. These aren’t bad assumptions or harmful beliefs, by the way. But the thing is, you simply cannot know without testing.

        • Slashme@lemmy.world · 8 hours ago

          Or how bad something is. “I don’t need a scientific study to tell me that looking at my phone before bed will make me sleep badly”, but the studies actually show that the effect is statistically robust but small.

          In the same way, studies like this can make the distinction between different levels of advice and warning.

          • SuspciousCarrot78@lemmy.world · 2 hours ago

            I remember discussing / doing critical appraisal of this. Turns out it was less about the phone and more about the emotional dysregulation / emotional arousal causing delay in sleep onset.

            So yes, agree, we need studies, and we need to know how to read them and think over them together.

      • Telorand@reddthat.com · 18 hours ago

        The thing that frustrates me about these studies is that they all continue to come to the same conclusions. AI has already been studied in mental health settings, and it’s always performed horribly (except for very specific uses with professional oversight and intervention).

        I agree that the studies are necessary to inform policy, but at what point are lawmakers going to actually lay down the law and say, “AI clearly doesn’t belong here until you can prove otherwise”? It feels like they’re hemming and hawing in the vain hope that it will live up to the hype.

      • Eager Eagle@lemmy.world · 19 hours ago

        Also, it’s useful to know how, when, or why something happens. I can make a useless chatbot that is “right” most times if it only tells people to seek medical help.

    • hansolo@lemmy.today · 18 hours ago

      I’m going to start telling people I’m getting a Master’s degree in showing how AI is bullshit. Then I point out some AI slop and mumble about crushing student loan debt.

    • sbbq@lemmy.zip · 14 hours ago

      My dad always said, you know what they call the guy who graduated last in his class at med school? Doctor.

  • Sterile_Technique@lemmy.world · 17 hours ago

    Chipmunks, 5-year-olds, salt and pepper shakers, and paint thinner also all make terrible doctors.

    Follow me for more studies on ‘shit you already know because it’s self-evident immediately upon observation’.

    • scarabic@lemmy.world · 14 hours ago

      It’s actually interesting. They found the LLMs gave the correct diagnosis high-90-something percent of the time if they had access to the notes doctors wrote about patients’ symptoms. But when thrust into the room cold with patients, the LLMs couldn’t gather that symptom information themselves.

      • SuspciousCarrot78@lemmy.world · 44 minutes ago

        Funny how the hivemind overlooks that bit en route to stunting on LLMs.

        If anything, that 90% result supports the idea that Garbage In = Garbage Out. I imagine a properly used, domain-tuned medical model with structured inputs could exceed those results in some diagnostic settings (task-dependent).

        IIRC, the 2024 Nobel Prize in Chemistry was won on the basis of using an ML expert system to investigate protein folding. ML != LLM, but at the same time, let’s not throw the baby out with the bathwater.

        EDIT: for the lulz, I pasted my above comment into my locally hosted bespoke LLM. It politely called out my bullshit (AlphaFold is technically not an expert system, and I didn’t cite my source for the Med-PaLM 2 claims). Not all hope is lost with these things lol


        The statement contains a mix of plausible claims and minor logical inconsistencies. The core idea—that expert systems using ML can outperform simple LLMs in specific tasks—is reasonable.

        However, the claim that “a properly used expert system LLM (Med-PALM-2) is even better than 90% accurate in differentials” is unsupported by the provided context and overreaches from the general “Garbage In = Garbage Out” principle.

        Additionally, the assertion that the 2024 Nobel Prize in Chemistry was won “on the basis of using ML expert system to investigate protein folding” is factually incorrect; the prize was awarded for AI-assisted protein folding prediction, not an ML expert system per se.

        Confidence: medium | Source: Mixed

      • Hacksaw@lemmy.ca · 13 hours ago

        LLM gives correct answer when doctor writes it down first… Wowoweewow very nice!

          • Hacksaw@lemmy.ca · 1 hour ago

            If you seriously think the doctor’s notes about the patient’s symptoms don’t include the doctor’s diagnostic instincts, then I can’t help you.

            The symptom questions ARE the diagnostic work. Your doctor doesn’t ask you every possible question. You show up and you say “my stomach hurts.” The doctor asks questions to rule things out until there is only one likely diagnosis; then they stop and prescribe you a solution if one is available. They don’t just ask a random set of questions. If you give the AI the notes from JUST BEFORE the diagnosis and treatment, it’s completely trivial to diagnose, because the diagnostic work is already complete.

            God, you AI people literally don’t even understand what skill, craft, trade, and art are, and you think you can emulate them with a text predictor.