Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

  • SuspciousCarrot78@lemmy.world · 3 hours ago

    So, I can speak to this a little bit, as it touches two domains I’m involved in. TL;DR - LLMs bullshit and are unreliable, but there’s a way to use them in this domain as a force multiplier of sorts.

    In one, I’ve created a Python router that takes my (deidentified) clinical notes, extracts and compacts the input (per user-defined rules), creates a summary, then -

    1. benchmarks the summary against my (user-defined) gold standard and provides a management plan (again, based on a user-defined database).

    2. drops the result into my on-device LLM for light editing and polishing to condense it, which I then eyeball, correct and escalate to my supervisor for review. (Rough sketch of the flow below.)
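    The deterministic part looks roughly like this - a sketch only, with rule formats and function names invented for illustration (the real thing is the project linked in my reply further down):

    ```python
    import re

    def compact_note(raw_note: str, rules: dict) -> str:
        """Apply user-defined extraction rules (section label -> regex) to a deidentified note."""
        sections = []
        for label, pattern in rules.items():
            match = re.search(pattern, raw_note, re.IGNORECASE | re.DOTALL)
            if match:
                text = match.group(1) if match.groups() else match.group(0)
                sections.append(f"{label}: {text.strip()}")
        return "\n".join(sections)

    def benchmark_summary(summary: str, required_elements: list) -> dict:
        """Deterministic check: which gold-standard elements are missing from the summary?"""
        missing = [e for e in required_elements if e.lower() not in summary.lower()]
        return {"pass": not missing, "missing": missing}

    def polish_with_llm(summary: str, llm_call) -> str:
        """Final, non-deterministic step: light editing only, human-reviewed afterwards."""
        prompt = f"Condense and tidy this clinical summary. Do not add new facts:\n\n{summary}"
        return llm_call(prompt)
    ```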

    Additionally, the LLM-generated note can be approved / denied by the Python router, in the first instance, based on certain policy criteria I’ve defined.

    It can also suggest probable DDx based on my databases (which are CSV-based), as sketched below.
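    Conceptually that’s just a scoring pass over the CSVs - something along these lines (column names and file name made up for the example):

    ```python
    import csv

    def suggest_ddx(findings: set, csv_path: str = "ddx_table.csv") -> list:
        """Rank candidate diagnoses by how many of their listed findings appear in the note."""
        wanted = {f.lower() for f in findings}
        scores = {}
        with open(csv_path, newline="") as fh:
            for row in csv.DictReader(fh):  # expects columns: diagnosis, findings
                listed = {x.strip().lower() for x in row["findings"].split(";")}
                overlap = len(listed & wanted)
                if overlap:
                    scores[row["diagnosis"]] = overlap
        return sorted(scores, key=scores.get, reverse=True)
    ```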

    Finally, if the LLM output fails the policy check, the router tells me why it failed and just says “go look at the prior summary and edit it yourself”.
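    The approve / deny step is deliberately boring. The real criteria live in my policy documents; the two checks here are invented stand-ins just to show the shape:

    ```python
    def policy_check(note: str, max_words: int = 250) -> tuple:
        """Return (approved, reasons) for an LLM-polished note."""
        reasons = []
        if len(note.split()) > max_words:
            reasons.append(f"note exceeds {max_words} words")
        if "plan" not in note.lower():
            reasons.append("no management plan section found")
        return (not reasons, reasons)

    llm_note = "Assessment: ... Plan: review bloods, follow up in 2 weeks."  # stand-in note
    approved, reasons = policy_check(llm_note)
    if not approved:
        print("Policy check failed:", "; ".join(reasons))
        print("Go look at the prior summary and edit it yourself.")
    ```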

    This three-step process takes the tedium of paperwork from 15-20 mins down to about 1 min of generation plus 2 mins of manual editing, which is approx a 5-7x speed-up.

    The reason why this is interesting:

    All of this runs within the LLM (or more accurately, it’s invoked from within the LLM: it calls the Python tooling via >> commands, which live outside the LLM’s purview) but is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.
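    I won’t reproduce the actual command syntax here, but the general pattern is just “scan the model output for >> commands and hand anything that matches to plain Python”. Something in this spirit, with invented command names:

    ```python
    def dispatch(model_output: str, handlers: dict) -> list:
        """Route any '>>command arg' lines in the model output to deterministic Python handlers."""
        results = []
        for line in model_output.splitlines():
            line = line.strip()
            if not line.startswith(">>"):
                continue
            cmd, _, arg = line.partition(" ")
            if cmd in handlers:
                results.append(handlers[cmd](arg))
        return results

    # Trivial stand-in handlers, purely to show the wiring:
    handlers = {
        ">>wordcount": lambda arg: len(arg.split()),
        ">>shout": lambda arg: arg.upper(),
    }
    print(dispatch(">>wordcount check this summary please", handlers))  # -> [4]
    ```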

    I’ve found that using a fairly “dumb” LLM (Qwen2.5-1.5B), with the sampling settings dialed down, produces consistently solid final notes (5 out of 6 are graded as passing on the first run by the router invoking the policy document and checking the output). It’s too dumb to jazz, which is useful in this instance.
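    “Settings dialed down” just means conservative sampling. I run llama.cpp itself, but if you drove the same model through the llama-cpp-python bindings it would look roughly like this (model path / quant name illustrative):

    ```python
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen2.5-1.5b-instruct-q5_k_m.gguf",  # illustrative filename
        n_ctx=4096,
        seed=42,          # fixed seed for repeatability
        verbose=False,
    )

    resp = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Condense this summary without adding facts: ..."}],
        temperature=0.1,  # low temperature = much less "jazz"
        top_p=0.9,
        max_tokens=300,
    )
    print(resp["choices"][0]["message"]["content"])
    ```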

    Would I trust the LLM end to end? Well, I’d trust my system approx 80% of the time. I wouldn’t trust ChatGPT … even though it’s been more right than wrong in similar tests.

      • SuspciousCarrot78@lemmy.world · 5 hours ago

        Depends which bit you mean specifically.

        The “router” side is an offshoot of a personal project. It’s Python scripting and a few other tricks, such as JSON files, etc. Full project details for that are here:

        https://github.com/BobbyLLM/llama-conductor

        The tech stack itself (rough sketch of the wiring after the list):

        • llama.cpp
        • Qwen2.5-1.5B GGUF base (from memory, a 5-bit quant from the Alibaba repository on HF)
        • The Python router (a more sophisticated version of the above)
        • Policy documents
        • Front end (OWUI - may migrate to something simpler / more robust; occasional streaming disconnect issues at the moment, annoying but not terminal)
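        Wiring-wise: llama.cpp’s server exposes an OpenAI-compatible endpoint, and both OWUI and the router just talk to it over HTTP. Schematically (host / port and payload details illustrative):

        ```python
        import requests

        def call_local_llm(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
            """POST a chat request to the local llama.cpp server's OpenAI-compatible endpoint."""
            payload = {
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.1,
                "max_tokens": 300,
            }
            r = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
            r.raise_for_status()
            return r.json()["choices"][0]["message"]["content"]
        ```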

        • realitista@lemmus.org · 4 hours ago

          Thanks, it’s really interesting to see some real-world applications and implementations of AI for practical workloads.

          • SuspciousCarrot78@lemmy.world · 4 hours ago

            Very welcome :)

            As it usually goes with these things, I built it for myself, then realised it might have actual broader utility. We shall see!