[HN Gopher] HealthBench
       ___________________________________________________________________
        
       HealthBench
        
       Author : mfiguiere
       Score  : 111 points
       Date   : 2025-05-12 17:42 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | Zaheer wrote:
        | Impressive how well Grok performs in these tests. Grok feels
        | 'underrated' relative to how much other models (Gemini,
        | Llama, etc.) are in the news.
        
         | tough wrote:
         | you can't download grok's weights to run locally
        
           | simianwords wrote:
           | how is that relevant here?
        
             | tough wrote:
              | it helps explain why fewer people are talking about them
              | than gemini or llama?
              | 
              | fewer people using them.
        
               | Insanity wrote:
               | I can guarantee you none of my friends (not in tech) use
               | "downloading weights" as an input to select an LLM
               | application.
        
               | simianwords wrote:
               | isn't chatgpt the most used or most popular model?
        
               | tough wrote:
               | Yes OpenAI has a first-mover advantage and Claude seems
               | to be close as a second player with their closed models
               | too, open weights is not a requirement for success but in
               | an already crowded market (grok's prospect) their
               | preposition isn't competing neither with top tier closed
               | models nor the maybe lesser-capable but more-available
               | battle-tested freely available to run locally open ones
        
               | reissbaker wrote:
               | You can't download Gemini's weights either, so it's not
               | relevant as a comparison against Gemini.
               | 
               | I think the actually-relevant issue here is that until
               | last month there wasn't API access for Grok 3, so no one
               | could test or benchmark it, and you couldn't integrate it
               | into tools that you might want to use it with. They only
               | allowed Grok 2 in their API, and Grok 2 was a pretty bad
               | model.
        
               | tough wrote:
                | lol sorry, mixed them up with Gemma 3, which feels
                | like the open, lesser cousin of the Gemini 2.5/2.0
                | models
        
             | moralestapia wrote:
             | It's not.
             | 
                | Also, only one out of the ten models benchmarked has
                | open weights, so I'm not sure what GP is arguing for.
        
               | tough wrote:
               | > in terms of how much other models (gemini, llama, etc)
               | are in the news.
               | 
               | not talking about TFA or benchmarks but the news
               | coverage/user sentiment ...
        
       | ramon156 wrote:
       | I don't want to be a conspiracy theorist, but could this be in
       | preparation for Amazon's (to be) health branch?
        
       | srameshc wrote:
        | Is the Med-PaLM model that Google has been working on meant to
        | be considered for comparison? If I'm not mistaken, it isn't
        | publicly available.
       | 
       | > https://sites.research.google/med-palm/
        
         | aix1 wrote:
         | Med-PaLM is old and has been superseded by (multiple
         | generations of) Gemini.
        
       | GuinansEyebrows wrote:
       | i have zero trust in openai's ability to do anything impartially.
       | why should we leave the judgement of a private tool up to the
       | makers of the tool especially when human lives are at stake?
        
         | simianwords wrote:
          | I agree - we should exercise a bit of caution here. There is
          | no way they would release a benchmark which makes their
          | model look bad. But then again, we know that their models
          | are among the best for other uses, so it's not a big leap to
          | accept this benchmark.
        
         | beezlebroxxxxxx wrote:
          | I can already see the pharma salesmen drooling at the idea
          | of how people with various symptoms can be marketed to.
         | 
         | "As an AI agent, a possible diagnosis is [xxx]. Ask your doctor
         | about / look into [yyy(tm)] for a possible solution!"
        
           | ceejayoz wrote:
           | And OpenAI is definitely thinking about this on their end:
           | 
           | https://www.axios.com/2024/12/03/openai-ads-chatgpt
           | 
           | > OpenAI CFO Sarah Friar told the Financial Times that OpenAI
           | is weighing the inclusion of ads in its products but wants to
           | be "thoughtful about when and where we implement them."
        
             | barnas2 wrote:
             | Ad spots inside chatgpt are going to be worth an obscene
             | amount of money.
        
         | amarcheschi wrote:
          | I think that the damage of "chatgpt misdiagnoses X as Y,
          | person dies of Z" would be quite bad for PR
        
       | dcreater wrote:
       | Isn't there an obvious conflict of interest when the model maker
       | is also the creator of a benchmark? I think at the very least it
       | should be from a separate business entity under the non profit or
       | from the non profit holding entity itself
        
         | gwd wrote:
         | I don't think it's necessarily bad to have the benchmark, but
         | the graphs of Gemini and Claude doing worse than o3 did kind of
         | leave a bad taste in my mouth. "Oh look, your models are worse
         | than ours at this very important metric that we just made up!
         | How terrible!"
        
           | progbits wrote:
           | It's also barely better, within the error bar on some
           | metrics. So if this is the best they can do on their own
           | benchmark that's pretty pathetic and shows they have no edge
           | anymore.
        
       | pizzathyme wrote:
        | Non-clinicians are using ChatGPT every day now to try to find
        | assistance (right or wrong) with real-life medical problems.
        | This is a great evaluation set that could prevent a lot of
        | harm.
        
       | unsupp0rted wrote:
        | Recently I uploaded a lab report to ChatGPT and asked it to
        | summarize it.
       | 
       | It hallucinated serious cancer, along with all the associated
       | details you'd normally find on a lab report. It had an answer to
       | every question I had pre-asked about the report.
       | 
       | The report said the opposite: no cancer detected.
        
         | maliker wrote:
         | Interesting. What LLM model? 4o, o3, 3.5? I had horrible
         | performance with earlier models, but o3 has helped me with
         | health stuff (hearing issues).
        
           | unsupp0rted wrote:
            | Whichever the default free model is right now - I stopped
            | paying for it when Gemini 2.5 came out in Google's AI
            | Studio.
            | 
            | 4o, o4? I'm certain it wasn't 3.5.
            | 
            | Edit: while logged in
        
             | pants2 wrote:
              | If you're logged in, 4o; if you're not logged in,
              | 4o-mini. Neither scores well on the benchmark!
        
               | askafriend wrote:
               | This gets at the UX issue with AI right now. How's a
               | normie supposed to know and understand this nuance?
        
               | unsupp0rted wrote:
                | Or a non-normie. Even while logged in, I had no idea
                | which ChatGPT model it was using, since it doesn't
                | name it. All the label says is "great for everyday
                | tasks".
                | 
                | And as a non-normie, I obviously didn't take its
                | analysis seriously, and compared it to Grok and
                | Gemini 2.5. The latter was the best.
        
               | unsupp0rted wrote:
               | Added context: While logged in
        
             | maliker wrote:
             | Might be worth trying again with Gemini 2.5. The reasoning
             | models like that one are much better at health questions.
        
               | unsupp0rted wrote:
               | Gemini 2.5 in AI Studio gave by far the best analysis
        
             | icelancer wrote:
             | > Whichever the default free model is right now
             | 
             | Sigh. This is a point in favor of not allowing free access
             | to ChatGPT at all given that people are getting mad at
             | GPT-4o-mini which is complete garbage for anything remotely
             | complex... and garbage for most other things, too.
             | 
             | Just give 5 free queries of 4o/o3 or whatever and call it
             | good.
        
             | dgfitz wrote:
             | I can't believe you're getting downvoted for answering the
             | question about the next-token-predictor model you can't
             | recall using.
             | 
             | What is happening?
        
         | Gracana wrote:
         | I wonder if it was unable to read your report, and just
         | answered as if role-playing?
         | 
         | I gave it a pdf of an engine manual recently and asked some
         | questions, which it answered reasonably. It even pulled a
         | schematic out for me, though it was the wrong one (it gave me a
         | schematic for the CDI ignition variant that we first talked
         | about, rather than the DSAI one we settled on later.)
        
           | astura wrote:
           | >I wonder if it was unable to read your report, and just
           | answered as if role-playing?
           | 
           | FWIW,
           | 
           | This is what chatgpt gave me when I uploaded a lab report, it
           | read the results back to me:
           | 
           | Here's an interpretation of the blood test results you
           | provided:
           | 
           | White Blood Cell Count (WBC) Value: 9.2 (Normal: 3.8 - 10.8
           | Thousand/uL)
           | 
           | Within the normal range, indicating no immediate signs of
           | infection or inflammation.
           | 
           | Red Blood Cell Count (RBC) Value: 5.29 (Normal: 3.8 - 5.1
           | Million/uL)
           | 
           | Elevated (High). This could indicate dehydration, stress, or
           | other conditions like polycythemia.
           | 
           | Hemoglobin Value: 10.9 (Normal: 11.7 - 15.5 g/dL)
           | 
           | Low. Suggests anemia, possibly due to iron deficiency or
           | chronic disease.
           | 
           | Hematocrit Value: 36.0% (Normal: 35.0 - 45.0%)
           | 
           | Within the normal range, but on the lower side, supporting
           | the presence of mild anemia.
           | 
           | Mean Corpuscular Volume (MCV) Value: 68.1 (Normal: 80.0 -
           | 100.0 fL)
           | 
           | Low. Indicates microcytic anemia, often due to iron
           | deficiency.
           | 
           | Mean Corpuscular Hemoglobin (MCH) Value: 20.6 (Normal: 27.0 -
           | 33.0 pg)
           | 
           | Low. Suggests hypochromia (paler red blood cells), often
           | associated with iron-deficiency anemia.
           | 
           | Mean Corpuscular Hemoglobin Concentration (MCHC) Value: 30.3
           | (Normal: 32.0 - 36.0 g/dL)
           | 
           | Slightly low. May reflect reduced hemoglobin in red blood
           | cells, consistent with iron-deficiency anemia.
           | 
           | Red Cell Distribution Width (RDW) Value: 15.8% (Normal: 11.0
           | - 15.0%)
           | 
           | Elevated. Indicates variability in red blood cell size,
           | common in iron-deficiency anemia.
           | 
           | Platelet Count Value: 397 (Normal: 140 - 400 Thousand/uL)
           | 
           | Normal, but on the higher side of the range.
           | 
           | Other Key Metrics Absolute Neutrophils, Lymphocytes,
           | Monocytes, Eosinophils, Basophils: All within normal ranges,
           | suggesting no significant abnormalities in immune cell
           | populations.
           | 
           | Anisocytosis (1+): Indicates variation in red blood cell
           | size.
           | 
           | Hypochromasia (1+): Suggests pale red blood cells, supporting
           | iron-deficiency anemia.
           | 
           | Summary These results suggest iron-deficiency anemia,
           | characterized by low hemoglobin, low MCV, low MCH, and high
           | RDW. The elevated red blood cell count may reflect
           | compensatory mechanisms or mild dehydration.
           | 
           | Your provider recommends a heart-healthy, low-fat diet, which
           | is great for overall health. However, addressing the anemia
           | might also involve increasing dietary iron or taking iron
           | supplements.
           | 
           | The diagnosis is wrong, btw, I don't have iron deficiency.
           | The anemia is caused by a genetic condition called
           | thalassemia, which has been verified by genetic tests. You
           | can use the Mentzer Index to differentiate the two on a
           | simple CBC - https://www.mdcalc.com/calc/10534/mentzer-index
           | 
            | My numbers return a "probable diagnosis."
           | 
           | I was wondering if chatgpt would catch it, nope, it didn't.
           | It did say that it was a possibility once I suggested it
           | though.
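            | 
            | (For the curious, a minimal sketch of the Mentzer Index
            | calculation, using my numbers from above; 13 is the
            | commonly cited cutoff:)
            | 
            |     # Mentzer Index = MCV / RBC.
            |     # < 13 suggests thalassemia; > 13 suggests
            |     # iron-deficiency anemia.
            |     mcv = 68.1   # fL, from the CBC above
            |     rbc = 5.29   # million/uL, from the CBC above
            |     print(f"Mentzer Index: {mcv / rbc:.1f}")
            |     # -> 12.9, below 13: probable thalassemia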
        
             | nradov wrote:
             | It's also disappointing that the LLM would falsely claim
             | that a low-fat diet is "great for overall health". That
             | claim has been thoroughly debunked in the scientific
             | literature (or at least not supported by evidence-based
             | medicine), and yet the recommendation still keeps popping
             | up. While there are certain specific types of fats that can
             | be problematic, in general a moderate amount of fat is fine
             | as a macronutrient.
        
               | dekhn wrote:
               | LLMs don't have a magical ability to arbitrate complex
               | truths. If the "low-fat diet" narrative is prominent in
               | the training literature, I would expect the model to say
               | that more often that it said the opposite.
               | 
               | I don't ever really use the term "thoroughly debunked"
               | when referring to nutrition science; as you noted, a
               | better term is that the claim is not supported by the
               | evidence. I've seen enough things debunked and then
               | rebunked to know that nutrition science is not really
               | that accurate.
        
         | arcanemachiner wrote:
         | No, cancer detected!
        
           | mvdtnz wrote:
           | You're not on Reddit.
        
             | arcanemachiner wrote:
             | I typically respect the decorum here, but I couldn't help
             | myself this time.
        
         | dekhn wrote:
          | What was the outcome (did the subject of the lab report go
          | on to present with cancer)? Sometimes what we think is a
          | false positive ends up being a true positive later, and
          | cancer is a late-arriving signal.
        
       | iNic wrote:
       | I like that they include the "worst case score at k samples".
       | This is a much more realistic view of what will happen, because
       | someone will get that 1/100 response.
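        | 
        | (A minimal sketch of how such a metric could be estimated -
        | my assumption of the general approach, not OpenAI's actual
        | code:)
        | 
        |     import random
        | 
        |     def worst_case_at_k(scores, k, trials=10_000):
        |         # scores: grades for many sampled responses to one
        |         # prompt. Estimate E[min of k responses] by
        |         # repeatedly drawing k responses (with replacement)
        |         # and keeping the worst one.
        |         return sum(min(random.choices(scores, k=k))
        |                    for _ in range(trials)) / trials
        | 
        |     # e.g. 99 good responses and one 1/100 bad response:
        |     scores = [1.0] * 99 + [0.0]
        |     print(worst_case_at_k(scores, k=16))  # ~0.85, not 1.0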
        
       | pants2 wrote:
       | This appears to be a very thoughtful and helpful study. It's also
       | impressive to see the improvement in performance in just the last
       | year of model development - almost double.
       | 
       | I've found o3 & deep research to be very effective in guiding my
       | health plan. One interesting anecdote - I got hit in the chest
       | (right over the heart) quite hard a month or so ago. I prompted
       | o3 with my ensuing symptoms and heart rate / oxygenation data
       | from my Apple watch, and it already knew my health history from
       | previous conversations. It gave very good advice and properly
       | diagnosed me with a costochondral sprain. It gave me a timeline
       | to expect (which ended up being 100% accurate) and treatments /
       | ointments to help.
       | 
        | IMO - it's a good idea to have a detailed prompt ready to go
        | with your health history, height/weight, medications and
        | supplements, etc., so that if anything happens to you, you've
        | got it handy to give to o3 to help in a diagnosis. (Something
        | like the sketch below.)
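        | 
        | (A hypothetical sketch of what such a prompt preamble could
        | look like - the fields are just a suggestion, fill in your
        | own:)
        | 
        |     # Keep this string handy and paste it ahead of any new
        |     # symptom description.
        |     HEALTH_CONTEXT = """
        |     Age / height / weight: ...
        |     Conditions and relevant history: ...
        |     Current medications and supplements: ...
        |     Allergies: ...
        |     New symptoms and when they started: ...
        |     """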
        
         | quantumwoke wrote:
          | How confident was it in that diagnosis, and did you trust
          | that confidence? I asked my medico wife about this, and
          | apparently the other possibility is pericardial effusion
          | from injury, which can be life-threatening.
        
       | simianwords wrote:
        | I would really like a benchmark purely focused on diagnosis:
        | symptoms and patient history vs. the real diagnosis. Maybe
        | name it House M.D. 1.0 or something.
        | 
        | The other stuff is good to have, but ultimately a model that
        | focuses on diagnosing medical conditions is going to be the
        | most useful. Look - we aren't going to replace doctors anytime
        | soon, but it is good to have a second opinion from an LLM
        | purely for diagnosis. I would hope it captures patterns that
        | weren't observed before. This is exactly the sort of game that
        | AI can beat a human at - large-scale pattern recognition.
        
       | mrcwinn wrote:
       | Happy to see this. I've struggled with an injury for the past
       | five years. I've been to multiple sports-focused physicians, had
       | various scans. Responses from doctors have ranged from
       | "everything seems fine, can't really figure this out" to
       | [completely wrong hypothesis]. Tried acupuncture. Tried a
       | chiropractor. I remember one doctor, though, had an interesting
       | thought that seemed to make sense - but I've been so discouraged
       | from so many false starts or misplaced hope, I didn't bother
       | following up.
       | 
       | Finally I typed in my entire history into o3-deep-research and
       | let it rip for a while. It came back with a theory for the injury
       | that matched that one doctor, diagrams of muscle groups and even
       | illustrations of proposed exercises. I'm not out of the woods
       | yet, but I am cautiously optimistic for the first time in a long
       | time.
        
         | Noumenon72 wrote:
         | I hope recent cuts to government science have managed to hit
         | enough of the safetyists and industry captures who keep us from
         | just trying out new healthcare approaches like this and
         | learning. They'd like nothing better than to replace the help
         | you got with "As a large language model, I am unable to offer
         | medical advice."
        
         | candiddevmike wrote:
         | Why would you trust a LLM over a battery of human experts? I
         | find it hard to believe that the doctors never proposed
         | exercises or some kind of physical therapy for you, at least in
         | the US.
        
           | BeetleB wrote:
           | I can't speak to the OP's condition, but having seen plenty
           | of doctors and physical therapists in the US for over a
           | decade:
           | 
           | Yes, they propose exercises.
           | 
           | No, they don't work.
           | 
           | For certain (common) conditions, PT seems to have it nailed -
           | the exercises really help. For the others, it's just snake
           | oil. Not backed by much research. The current state of the
           | art is just not good when it comes to chronic pain.
           | 
           | So while I don't know if an LLM can be better than a battery
           | of human experts, I do know that those human experts do not
           | perform well. I'm guessing with the OP's case, that battery
           | of human experts does not lead to a consensus - you just end
            | up with 10 different treatments/diagnoses (and
            | occasionally one is a lot more common than the others, but
            | it's still wrong).
        
           | zeroxfe wrote:
           | I've never used LLMs for this, but as someone who's been
           | through a lot of sports-related injuries, I find doctors more
           | or less useless (except for prescribing painkillers and
           | performing surgeries.)
           | 
           | No doctor or physio has ever been able to fix my chronic
           | issues, and I've always had to figure them out myself through
           | lots of self-study and experimentation.
        
           | n8henrie wrote:
           | Much of modern medicine's expertise is focused on areas that
           | reimburse well (in some part because those industries
           | subsidize the research).
           | 
           | I would not expect most physicians to have a deep fund of
           | literature-backed knowledge to draw from regarding exercise.
           | Telling someone to do an exercise probably doesn't compensate
           | well.
           | 
           | That said, I'm also pretty negative about the availability of
           | rigorous literature regarding much of nutrition, dentistry,
           | podiatry, physical therapy, etc... you know, the things that
           | affect the health of most human beings that have ever lived.
        
         | scudsworth wrote:
          | i mean, it sounds like that doctor had it right then? i
          | don't think this reads as a win for "online self-diagnosis",
          | ai-empowered or otherwise.
        
           | lcnPylGDnU4H9OF wrote:
           | From their story, it sounds like the one doctor had it right
           | because the AI-empowered self diagnosis seems to agree. Or do
           | I misunderstand?
        
         | mvdtnz wrote:
          | You trusted known grifters (acupuncturists and
          | chiropractors), but not doctors. Then you trusted a computer
          | known for telling lies.
         | 
         | I think you should take a step back and re-assess your internal
         | heuristics.
        
           | BigGreenJorts wrote:
           | They literally said they saw sport focused physicians.
        
         | quantumwoke wrote:
          | Sounds like you needed a good PT, not a physician.
        
         | nradov wrote:
         | You have to be realistic and assume that treating most major
         | musculoskeletal injuries will be a lengthy trial-and-error
         | process. Every injury is unique and imaging studies or other
         | tests can only give limited clues about what's going on. There
         | are patients in terrible pain whose scans look normal, and
         | there are other patients whose scans indicate serious pathology
         | yet they are pain-free and completely functional. Instead of
         | getting discouraged, keep trying different modalities (except
         | for the pseudo-science BS) until something works.
         | 
         | In general a lot of those injuries will eventually heal on
         | their own. So it's easy to fool yourself into believing that a
         | particular treatment was effective even when the real cure was
         | time.
        
       | kypro wrote:
       | Why are all the label colours for the "Worst-case HealthBench
       | score at k samples" chart the same colour and the same shape?
       | Completely unreadable.
        
         | brap wrote:
          | The colors are slightly different. I think the person who
          | made this was more concerned with appearance than with
          | providing information.
        
       | andy99 wrote:
        | My sense is that these benchmarks are not realistic in terms
        | of the way the model is used. People building specialized AI
        | systems are not, in my experience, letting users just chat
        | with a base model; they would have some variant of RAG, plus
        | some guardrails, plus other stuff (like routing to pre-written
        | answers for common questions - see the sketch below).
        | 
        | So what use case does this test setup reflect? Is there a
        | relevant commercial use case here?
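        | 
        | (For illustration, a toy sketch of the kind of pipeline I
        | mean - every helper here is a made-up stand-in, not any real
        | product's API:)
        | 
        |     FAQ = {"what is healthbench": "A health eval from OpenAI."}
        | 
        |     def match_faq(query):          # route common questions to
        |         return FAQ.get(query.strip().lower())  # canned answers
        | 
        |     def retrieve(query):           # RAG: fetch vetted docs
        |         return ["...relevant guideline text..."]
        | 
        |     def generate(query, context):  # stand-in for the LLM call
        |         return f"Draft answer using {len(context)} doc(s)"
        | 
        |     def passes_guardrails(draft):  # safety check on output
        |         return "definitely cancer" not in draft.lower()
        | 
        |     def answer(query):
        |         canned = match_faq(query)
        |         if canned is not None:
        |             return canned
        |         draft = generate(query, context=retrieve(query))
        |         return draft if passes_guardrails(draft) else \
        |             "Sorry, I can't help with that."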
        
       | programmertote wrote:
        | I have no doubt that a lot of garden-variety diagnoses and
        | treatments can be done by an AI system that is fine-tuned and
        | vetted to accomplish the task. I recently had to pay $93 for a
        | virtual session with a physician to get a prescription for
        | cough syrup, and I already knew what to take before talking to
        | her because I did some research/reading. Some may argue,
        | "Doctors studied years in med school and you shouldn't trust
        | Google more than them", but knowing humans' fallibility, and
        | knowing that a lot of doctors do look things up on places like
        | https://www.wolterskluwer.com/en/solutions/uptodate to
        | refresh/reaffirm their knowledge, I'd argue that if we are
        | willing to take the risk, why shouldn't we be allowed to take
        | that risk on our own? Why do I have to pay $93 (on top of the
        | cough syrup that cost ~$44) just so that the doctor can see me
        | on Zoom for less than 5 mins and submit an order for the med?
        | 
        | With healthcare prices increasing at breakneck speed, I am
        | sure AI will take on a larger and larger role in diagnosing
        | and treating people's common illnesses, and hopefully (though
        | I doubt it) some of those savings will be passed on to
        | patients.
       | 
        | P.S. In contrast to the US system, in my home city (Rangoon,
        | Burma/Myanmar), I have multiple clinics near my home and a
        | couple of pharmacies within two bus stops' distance. I can
        | either go buy most of the medications I need from the pharmacy
        | (without a prescription) and take them on my own (why am I not
        | allowed to take that risk?) OR I can go see a doctor at one of
        | these clinics to confirm my diagnosis, pay him/her $10-$20 for
        | the visit, and then head down to the pharmacy to buy the
        | medication. Of course, some of the medications that include
        | opioids will only be sold to me with the doctor's
        | prescription, but a good number of other meds are available as
        | long as I can afford them.
        
         | BeetleB wrote:
         | Where are you that you need a prescription to get cough
         | medicine? The only ones I know of that require prescription are
         | the ones with controlled substances.
        
           | dgfitz wrote:
           | > Of course, some of the medications that include opioids
           | will only be sold to me with the doctor's prescription
        
           | HaZeust wrote:
           | Ain't nothing wrong with a little Wockhardt to get you
           | through the day
        
         | ezst wrote:
          | I mean, if such is your typical experience with healthcare,
          | what you need is a big shake-up - don't count on AI to help
          | you there! (But it might inflate the profit margins for the
          | few running the show at your expense.)
        
         | yapyap wrote:
         | You think you won't have to pay the AI? lol.
        
           | hollerith wrote:
            | IMHO it is much easier to increase the supply of AI doctor
            | hours than the supply of human doctor hours, and supply
            | constraints are the main thing keeping doctor hours at the
            | high price they currently sell for.
        
         | onlyrealcuzzo wrote:
         | There are simply not enough doctors for doctors to be doing
         | this work.
         | 
         | We have a massive, massive shortage of doctors.
         | 
         | The industry is doing everything they can to make it worse by
         | the day, so I won't hold my breath that we'll get the slightest
         | bit of respite.
         | 
         | It'd obviously be ideal if everyone could see a doctor for an
         | affordable price any time they wanted.
         | 
         | We don't live in the ideal world.
         | 
         | This would be a HUGE win for most people.
        
           | nradov wrote:
           | This is why most routine primary care can be delivered by
           | Nurse Practitioners and Physician Assistants working under a
           | doctor's supervision.
        
             | closewith wrote:
             | NPs and community paramedics are definitely the future for
             | a huge chunk of the current GP workload, but PAs are a
             | failed experiment which are being rolled back pretty much
             | everywhere.
        
               | nradov wrote:
                | Are you sure about that? The ranks of PAs have been
                | expanding rapidly. I know there has been some
                | political resistance to _expanding_ their scope of
                | practice, but I'm not aware of it being significantly
                | rolled back.
                | 
                | https://www.bls.gov/ooh/healthcare/physician-assistants.htm
        
               | closewith wrote:
               | [delayed]
        
         | imiric wrote:
         | > Why do I have to pay $93 (on top of the cough syrup that
         | costed ~$44) just so that the doctor can see me on Zoom for
         | less than 5 mins and submit an order for the med?
         | 
         | Because you're paying for the expertise of someone who studied
         | for more than a decade which you won't get from a random web
         | search.
         | 
         | An AI system with today's technology should be less trustworthy
         | for medical diagnosis than a web search. At least with a web
         | search you might stumble upon a site with content from experts,
         | assuming you trust yourself to be able to discern expert advice
         | from bot-generated and spam content. Even if a doctor is doing
         | the searching instead of me, I would pay them only for their
         | knowledge to make that discernment for me. Why you think an AI
         | could do better than a human at that is beyond me.
         | 
         | Your question reminds me of that famous Henry Ford GE invoice
         | story:
         | 
         | > Making chalk mark on generator: $1.
         | 
         | > Knowing where to make mark: $9,999.
        
           | 85392_school wrote:
           | AI systems have been improving. O3 now has the capability to
           | decide to search multiple times as part of its response.
        
             | creata wrote:
             | Even o4-mini (free) uses web searches and runs Python
             | scripts very eagerly. I'm not sure how long they'll be able
             | to afford giving all of that away.
        
           | ChadNauseam wrote:
           | You answered why someone would want to pay $93, but not why
           | they _have_ to pay $93.
        
           | hiatus wrote:
            | There's a limit though, right? How about: if you can
            | accurately predict the doctor's diagnosis, it's free; if
            | not, you pay. This person needed a doctor's approval,
            | presumably for a prescription they couldn't get without
            | authorization, which leads to this gatekeeping. Not to
            | mention it also contributes to the insane medical costs in
            | the US. $93 for 5 minutes is over $1k _an hour_ for what
            | amounts to a rubber stamp (because how much can you really
            | dig into a person's medical history and condition in 5
            | mins?).
        
           | wubrr wrote:
           | I mean, it really depends on what you're trying to
           | diagnose/do. Yeah, a math PhD might have spent a decade+
           | studying the subject. Doesn't mean I need to pay for their
           | skills instead of using a calculator/program to solve my math
           | problem.
           | 
           | > Why you think an AI could do better than a human at that is
           | beyond me.
           | 
           | Why do you think an AI couldn't do better than a human, when
           | we have ample evidence of computers/AI exceeding humans in
           | many areas?
        
         | const_cast wrote:
         | The only reason this worked is because your situation was
         | exceedingly simple.
         | 
         | The trouble is you are not educated enough to tell what is
         | simple and what isn't. A cough could be a cough or it could be
         | something more serious, only a "real" examination will reveal
         | that. And sometimes even that's not enough, you need an
         | examination by a specialists.
         | 
         | I'll tell you a story. Once upon a time I got pain in my balls.
          | I went to a doctor and he felt around and he said he didn't
          | feel anything. I went to another doctor and he felt
          | _something_, but he had no idea what it was. He said it
          | could be a cyst, could be a swollen vein, could be an
          | infection - he didn't even know if it was on the testicle or
          | on the tube thingy.
         | 
         | Then I went to a Urologist. You can tell this man has felt up a
         | _lot_ of balls. He felt me up and said,  "yup, that's a tumor"
         | almost immediately. He was right, of course, and he ended up
         | being the one to remove it too. Since I caught the cancer
         | pretty early the chemotherapy wasn't too intense.
         | 
          | Point is, expertise matters when things aren't
          | straightforward. That's when experience and perspective get
          | to shine.
        
       | imiric wrote:
       | Good lord. The idea that a system that produces pseudo-random
       | output without any semantic understanding can be relied on to
       | give accurate health-related information is deeply flawed and
       | troubling. It's one thing to use these things for finding
       | patterns in data, for entertainment purposes, and for producing
       | nonsensical code a human has to fix, but entirely different to
       | rely on them for health diagnosis or advice. I shudder at the
       | thought that a medical practitioner I go to will parrot whatever
       | an LLM told them.
       | 
       | This insanity needs to be regulated yesterday.
        
         | FlamingMoe wrote:
         | Good lord. The idea that a sleep-deprived, bias-prone mammal
         | can be relied on to give accurate health-related information is
         | deeply flawed and troubling. It's one thing to seek a human for
         | empathy, small talk, or a routine vaccine, but entirely
         | different to assume an overworked doctor will consistently out-
         | diagnose a modern data tool. I shudder at the thought that the
         | practitioner I visit might simply repeat whatever a drug-rep
         | lunch pitch or outdated guideline left in their memory.
        
           | stavros wrote:
           | This insanity needs to be regulated yesterday.
        
             | imiric wrote:
             | Thankfully, it is. :)
        
               | hiatus wrote:
                | Are there regulations on doctors' working hours
                | similar to, say, truck drivers'? I found this [1]
                | article, which states doctors are limited to 80 hours
                | a week, whereas truck drivers are limited to 70 hours
                | in a 7-day period. [2]
                | 
                | [1] https://psnet.ahrq.gov/primer/duty-hours-and-patient-safety
                | 
                | [2] https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/Dri...
        
               | n8henrie wrote:
               | Those restrictions are for residents (trainees).
               | 
               | They were also found not to improve patient outcomes
               | (possibly due to increased number of handoffs, which are
               | highly error prone).
        
         | creata wrote:
          | Can you state more clearly _why_ it's deeply flawed?
         | 
         | Because while LLMs obviously have massive limitations, so do
         | humans, and it's not entirely clear to me that some synthesis
         | of the two can't produce much better results than either on its
         | own.
        
       | yapyap wrote:
       | Sam Altman does not care about "improving human health"
        
       | NKosmatos wrote:
       | Most probably I'm going to get downvoted, but I'm gonna say it...
       | 
        | It's a pity they don't support the Greek language, keeping in
        | mind that almost all medical terminology has Greek origins.
        | 
        | Anyhow, this is a step in the right direction, and for sure it
        | will aid many people looking for medical assistance via
        | ChatGPT.
        
       ___________________________________________________________________
       (page generated 2025-05-12 23:00 UTC)