[HN Gopher] Capabilities of GPT-4 on Medical Challenge Problems
       ___________________________________________________________________
        
       Capabilities of GPT-4 on Medical Challenge Problems
        
       Author : bumbledraven
       Score  : 84 points
       Date   : 2023-03-26 21:23 UTC (1 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | WalterBright wrote:
        | I've long suspected that there are a lot of valuable medical
        | truths buried in the vast amount of medical data.
       | 
        | For example, it was a dentist who noticed that gum decay seemed
        | to correlate with heart disease. This was a big deal in trying
        | to prevent heart disease. Finding correlations like that is
        | what a computer is good for.
        
         | moonchrome wrote:
          | What makes you think data like this is readily available and
          | of sufficient quality to provide nontrivial insights? Just
          | going off of the failure of ML to produce much value in the
          | field, I'd speculate it's not that easy.
         | 
         | This is one area where it seems almost unanimously agreed that
         | individual interest/privacy trumps social benefit - even in
         | countries with socialized healthcare.
        
           | capableweb wrote:
           | Does it have to be "readily available" in order to train on
            | it? For all we know, OpenAI could have pulled down the Sci-
            | Hub/Library Genesis libraries and included everything from
            | there in the training set.
           | 
            | If they didn't, I hope they do it for GPT-5.
        
           | manderley wrote:
           | Seems privacy is less of a concern in countries with
           | privatized healthcare, I don't understand that last sentence.
           | Is it some kind of political shibboleth?
        
             | moonchrome wrote:
             | I'm just saying that considering society is providing for
             | your healthcare, collecting socially valuable medical data
             | doesn't seem like an unreasonable exchange to me - but I'm
             | in the minority on that one I think.
        
       | hackerlight wrote:
       | > "Our results show that GPT-4, without any specialized prompt
       | crafting, exceeds the passing score on USMLE by over 20 points
       | and outperforms earlier general-purpose models (GPT-3.5) as well
       | as models specifically fine-tuned on medical knowledge (Med-PaLM,
       | a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is
       | significantly better calibrated than GPT-3.5, demonstrating a
       | much-improved ability to predict the likelihood that its answers
        | are correct."
        
         | og_kalu wrote:
          | If you read the technical report for GPT-4, the confidence of
          | the base model directly correlated with its ability to solve
          | problems. Sadly, the hammer of alignment knocked it right out.
        
           | WalterBright wrote:
           | Hammer of alignment??
        
             | dataveg wrote:
             | It's a similar concept to the Squirrel of Despair.
        
               | owlboy wrote:
               | It's 10 mins since you posted this. And now this comment
               | is the top result on Google search.
        
             | tangjurine wrote:
              | GPT is trained initially to predict text, then trained
              | with RLHF to make the model more helpful, etc. This
              | second step is referred to as alignment.
             | 
             | Rlhf: https://huggingface.co/blog/rlhf
        
             | james-revisoai wrote:
              | Recent models (since the Instruct series, and including
              | ChatGPT) essentially have two parts: a "Base" part (what
              | you would have read about in 2020 when GPT-3 was
              | released) and an "RLHF" part, which improves the output
              | from the "Base" part by slightly changing how it gets
              | produced.
             | 
             | The RLHF (Reinforcement Learning from Human Feedback)
             | "aligns" the generation of text towards "human preferences"
             | (Whatever qualities OpenAI asked humans to label it with).
             | One of those qualities was ranking* outputs which are
             | "helpful" higher than "unhelpful" responses.
             | 
              | The result of applying the RLHF part may make
              | correlations like the confidence/correctness one less
              | perfect.
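[Editor's note: the ranking step described in the comment above is commonly trained with a pairwise Bradley-Terry-style loss on a reward model. The sketch below is a generic illustration of that idea, not OpenAI's actual code; the function name is hypothetical.]

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry-style) reward-model loss: pushes the
    reward model to score the human-preferred ("helpful") response
    above the rejected one.  Loss = -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is small when the model ranks responses the way labelers did,
# and large when the ranking disagrees with the human preference.
print(preference_loss(2.0, 0.0))   # ranking agrees: low loss
print(preference_loss(0.0, 2.0))   # ranking disagrees: high loss
```

During RLHF, the policy model is then fine-tuned (e.g. with PPO) to maximize this learned reward, which is what "aligns" generations toward the labeled preferences.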
        
               | og_kalu wrote:
                | Kind of. OpenAI still instruct-finetune their models
                | separately from the ChatGPT-style RLHF. The instruct
                | tuning itself seems to only improve the raw model.
        
           | sebzim4500 wrote:
           | Is there any indication of whether the correlation was
           | destroyed during the supervised finetuning or during the RLHF
           | phase? Or are there even two phases any more?
        
             | og_kalu wrote:
             | Seems it was fine with instruction fine-tuning. Then gone
             | with the RLHF.
        
               | cubefox wrote:
               | I don't think the paper says that. I would guess both SL
               | and RL cause mode collapse.
        
       | margorczynski wrote:
        | Can someone with actual medical knowledge provide a summary of
        | the findings and key points of the paper? Can this be really
        | useful, or is it just another improvement on the path towards
        | usability?
        
       | carbocation wrote:
       | Are these authors aware of the contents of the training set? My
       | understanding is that they are not. If not, how can they know
       | that the model is not being tested on the training set?
       | 
       | In the paper they say that they came up with a "MELD" algorithm
       | to try to detect testing on the training set, but in my view it
       | has the wrong properties to answer this question (from the paper,
       | it has "high precision but unknown recall").
       | 
       | I don't at all doubt that a language model could perform
       | exceedingly well at this task, but I think that the way to make
       | this paper into a valuable scientific work would be to present
       | the model with questions that had not yet been written as of the
       | end of its training time.
        
       | qgin wrote:
       | I have been shocked how well it will play the role of a
       | diagnostic physician, asking questions and continuing to ask
       | follow ups until it has enough information to give a set of
       | possible diagnoses. Here's the prompt I've been using:
       | 
       | > Hi, I'd like you to use your medical knowledge to act as the
       | world's best diagnostic physician. Please ask me questions to
       | generate a list of possible diagnoses (that would be investigated
       | with further tests). Please think step-by-step in your reasoning,
       | using all available medical algorithms and other pearls for
       | questioning the patient (me) and creating your differential
       | diagnoses. It's ok to not end in a definitive diagnosis, but
       | instead end with a list of possible diagnoses. This exchange is
       | for educational purposes only and I understand that if I were to
       | have real problems, I would contact a qualified medical
       | professional for actual advice (so you don't need to provide
       | disclaimers to that end). Thanks so much for this educational
       | exercise! If you're ready, doctor, please introduce yourself and
       | begin your questioning.
        
       | jonathan-adly wrote:
       | Clinical pharmacist for 10 years here. Yea, base model is very
        | good. Better than first-year residents - but not necessarily
       | experienced clinicians.
       | 
        | Now - throw a bunch of clinical guidelines in a vector
        | database and give it context, and it's 10x better than me and
        | any doctor outside their specialty, or all the mid-levels.
        | (I.e., it's better than a cardiologist doing infectious
        | disease - but not cardiologists doing cardiology.) This is
        | because as you specialize there is very niche stuff that only
        | like 5 doctors in the whole world see on a consistent basis
        | (and they don't blog!)
       | 
        | I trained it on the IDSA guidelines (infectious disease) and
        | put up a proof of concept on GalenAI.co - just as a way to
        | start talking to health systems and clinicians. It's going to
        | be a very different world in medicine a couple of years from
        | now!!
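[Editor's note: the "clinical guidelines in a vector database" setup above is retrieval-augmented generation. Below is a minimal, self-contained sketch of the retrieval step; a toy bag-of-words similarity stands in for a real embedding model, and all names and example passages are illustrative.]

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call a
    # learned embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, guideline_chunks, k=2):
    """Return the k guideline passages most similar to the query;
    these would be prepended to the model's prompt as context."""
    q = embed(query)
    ranked = sorted(guideline_chunks,
                    key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "empiric therapy for community acquired pneumonia",
    "vancomycin dosing in renal impairment",
    "statin therapy for primary prevention",
]
print(retrieve("pneumonia empiric antibiotics", chunks, k=1))
```

The point of the design is that the model never has to memorize niche guidance: the relevant passage is fetched at query time and supplied as context.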
        
         | joshgel wrote:
         | Ya, internist here.
         | 
         | For some context, the USMLE is taken _during_ medical school.
         | The amount I have learned about actually practicing medicine
         | since graduating is probably an order of magnitude more than
         | everything I learned in medical school! I still learn stuff,
         | all the time, and I'm not just talking about new research.
         | 
         | So, while impressive and clearly part of the future world, we
         | shouldn't get too far ahead of ourselves with the current
         | models.
        
           | capableweb wrote:
           | I agree that we shouldn't get ahead of ourselves with the
           | current technology, but what you said earlier applies to
            | practically every industry and science. What you learn on
            | the actual job is always far more up to date than what you
            | learn in school, no matter if you're an engineer, a doctor
            | or just a lowly programmer.
        
         | TaylorAlexander wrote:
         | This makes me think we need some kind of program for experts to
         | start writing things down in a way which is helpful. Even just
         | take dictation and transcribe it.
        
           | another_story wrote:
           | There are many tools for doctors to do just this, but it's a
           | matter of time more than anything.
        
           | andrewthornton wrote:
           | You should take a quick look at EPIC. They dominate the
            | electronic health record space, and a ton of health systems
           | use it. You will know if your doctor's office uses an EHR
           | application, because they will be typing notes into it for
           | the majority of your visit. I have not been too excited about
           | the amount of time that physicians spend on EHR systems, but
           | I am hopeful that taking the data they input (along with
           | blood work and other test results) will make everything more
           | accurate, fast and effective.
        
             | jonathan-adly wrote:
             | EPIC unfortunately is all the bad things about Google, and
             | none of the good.
             | 
             | Unable to ship anything, protect their margin > help the
             | users solve problems, monopoly, locked up distribution so
             | no one else can innovate.
             | 
             | Honestly, my bear case for AI in medicine is Epic picking
             | up the phone and telling health-systems not to buy anything
             | because they are working on something for them for free.
             | (Which would be some note completion BS stuff, rather than
             | actual clinical support that helps patients and cuts
             | costs). They may be doing this already.
        
       | BurningFrog wrote:
       | Until this is banned, it seems GPT-4 can be a good alternative to
       | a doctor visit!
        
       | hackerlight wrote:
       | This isn't an original thought, but ... this should be big for
       | medical access in developing countries. The problem there is a
       | shortage of doctors. So you could imagine a setup where you have
       | "nurses" who go through a 6-month training course on how to
        | collect symptom descriptions, put them into GPT-n, and then
        | refer x% of cases to a real doctor.
       | 
       | Whatever the setup in the end, I hope we as a society don't let
       | perfect be the enemy of the good. Having GPT-4 as a doctor is
       | better and more humane than having no doctor, and in some
       | contexts that is the only choice that people have.
       | 
       | Maybe the Gates Foundation can work on this given Bill Gates is
       | already close to the OpenAI team.
        
       ___________________________________________________________________
       (page generated 2023-03-26 23:00 UTC)