[HN Gopher] GPT-4V(ision) Unsuitable for Clinical Care and Education: An Evaluation
       ___________________________________________________________________
        
       GPT-4V(ision) Unsuitable for Clinical Care and Education: An
       Evaluation
        
       Author : PaulHoule
       Score  : 55 points
       Date   : 2024-03-26 19:19 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | zettabomb wrote:
       | I question the competence of anyone using any modern AI (not just
       | GPTs) for medical decisions. Beyond being capable of passing a
        | multiple-choice exam (which can be done by trained monkeys),
       | they're not ready for this, and they won't be for years. I guess
       | confirmation of this is still good to know.
        
       | valedan wrote:
        | This seems like a pretty useless study: they don't collect
        | any results from human doctors, so there is nothing to compare
        | their GPT-4V results to.
        
         | acchow wrote:
         | Instead of comparing against some "average doctor", they used a
         | few doctors as "source of truth"
         | 
         | > All images were evaluated by two senior surgical residents
         | (K.R.A, H.S.) and a board-certified internal medicine physician
         | (A.T.). ECGs and clinical photos of dermatologic conditions
         | were additionally evaluated by a board-certified cardiac
         | electrophysiologist (A.H.) and dermatologist (A.C.),
         | respectively
        
           | hangsi wrote:
           | I think the parent comment was referring to something else.
           | 
           | In the paper the tasks are only completed by GPT-4V. For a
           | valid scientific investigation, there should be a control set
           | completed by e.g. qualified doctors. When the panel of
           | experts does their evaluation, they should rate both sets of
           | responses so that the difference in score can be compared in
           | the paper.
        
             | acchow wrote:
              | Agreed. Those are different evaluations (which is what
              | I meant by "Instead of comparing against"). The paper
              | cannot conclude that "doctors are better/more correct".
              | 
              | It assumes "here are 5 doctors who are always correct",
              | then measures GPT's correctness against them.
        
       | z2 wrote:
        | This is expected, right? It would be surprising if something
        | as general as GPT-4V had been trained on a diverse and nuanced
        | set of radiology images, versus, say, a traditional CNN
        | trained and refined to detect specific diseases. It feels akin
        | to concluding that a picnic basket doesn't make a good fishing
        | toolbox after all. Worse would be if someone in power were
        | actually enthusiastically recommending plain GPT-4V as a
        | realistic solution for specialized vision tasks.
        
       | shmatt wrote:
        | This feels like it was done for the clicks and not actual
        | research. There are plenty of AI startups in the radiology
        | game, and GPT-4V isn't free, so why is that the one being
        | tested?
        | 
        | Run research on models that are actually trained to solve
        | these problems; that is relevant research.
        | 
        | Just as an example, Viz.ai applied[1] and received FDA
        | clearance/approval for their model in hospitals.
        | 
        | Has OpenAI ever submitted a request for clinical use of
        | GPT-4V? What's next, trying autopilot driving with GPT-4V?
       | 
       | [1] https://www.viz.ai/news/viz-ai-receives-
       | fda-510k-clearance-f...
        
       | minimaxir wrote:
        | Last week, a tweet went viral in which a Claude user fought
        | with radiologists, claiming the LLM found a tumor when the
        | radiologists did not (most of the replies were rightfully
        | dunking on it):
       | 
       | > A friend sent me MRI brain scan results and I put it through
       | Claude.
       | 
       | > No other AI would provide a diagnosis, Claude did.
       | 
       | > Claude found an aggressive tumour.
       | 
       | > The radiologist report came back clean.
       | 
       | > I annoyed the radiologists until they re-checked. They did so
       | with 3 radiologists and their own AI. Came back clean, so looks
       | like Claude was wrong.
       | 
        | > But look how convincing Claude sounds! We're still early...
       | 
       | https://twitter.com/misha_saul/status/1771019329737462232
        
         | binarymax wrote:
          | This is super dangerous. It's WebMD all over again, but much
          | worse. It's hard enough to diagnose as it is, but now you
          | have to fight against some phantom model that the patient
          | thinks is smarter than their doctor.
        
         | mateo1 wrote:
         | After reading the diagnosis I caught myself wanting to examine
         | the MRI to see if the bright area really exists, which means I
         | fell for this too. Imagine being the person who received this
         | diagnosis. Of course you're going to be concerned, even if it
         | is LLM garbage.
        
       | simonw wrote:
       | Have there been many studies like this one that have been judged
       | blind?
       | 
       | I'd trust a study like this a little more if the human evaluators
       | were presented with the output of GPT-4 mixed together with the
       | output from human experts, such that they didn't know if the
       | explanation they were evaluating came from a human or an LLM.
       | 
       | This would reduce the risk that participants in the study,
       | consciously or subconsciously, marked down AI results because
       | they knew them to be AI results.
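        | 
        | A minimal sketch of that blinding setup (the data and the
        | rating step here are made up): pool the answers, shuffle them,
        | and only re-attach the source after scoring.
        | 
        |     import random
        | 
        |     # Hypothetical free-text answers to the same cases.
        |     human_answers = ["...", "..."]
        |     model_answers = ["...", "..."]
        | 
        |     blinded = [("human", a) for a in human_answers] + \
        |               [("gpt4v", a) for a in model_answers]
        |     random.shuffle(blinded)
        | 
        |     # Raters score only the text; the source label is
        |     # revealed after all scores are in. len() stands in
        |     # for a human rating here.
        |     scores = [(src, len(txt)) for src, txt in blinded]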
        
       | km3r wrote:
        | Healthcare is stupid expensive. Even for those with
        | government-provided coverage, it still costs a lot of your tax
        | dollars.
        | 
        | Clearly we are not there yet, but for certain conditions we
        | are getting close. If we can make nurses and doctors more
        | productive with these tools, we absolutely should explore
        | that. Keep a doctor in the loop, as AI is still error-prone,
        | but you can line things up so that doctors make the decision
        | with AI-gathered information.
        
         | iknowstuff wrote:
         | Yeah, I don't think we're more than a decade off from GPs being
         | replaceable with AI. But you can bet they will lobby hard
         | against it.
        
           | Etheryte wrote:
            | This is deeply wishful thinking. We're not even remotely
            | close to self-driving cars, and that is numerous orders of
            | magnitude easier than trying to diagnose a human being
            | with anything.
        
         | cmiles74 wrote:
         | IMHO, we need structural change in the way US healthcare works.
         | Slathering some AI on top won't solve our cost problems but it
         | will negatively impact our already declining quality of patient
         | care.
        
       | atleastoptimal wrote:
       | I'm not specifically referring to this article, but in general
       | I've noticed a frustrating pattern:
       | 
       | > AI company releases generalist model for
       | testing/experimentation
       | 
       | > Users unwisely treat it like a universal oracle and give it
       | tasks far outside its training domain
       | 
       | > It doesn't perform well
       | 
       | > People are shocked and warn about the "dangers of AI"
       | 
        | This happens every time. Why can't we treat AI tools as what
        | they actually are: interesting demonstrations of emergent
        | intelligent properties that are a few versions away from
        | production-ready capabilities?
        
         | pmontra wrote:
          | Not in human nature. A friend of mine just asked me how to
          | make ChatGPT create a PowerPoint presentation. He meant the
          | pptx file. It can't. Googling for him, I learned that of
          | course it can create the text and, a little more
          | surprisingly, even the VBA program that creates the slides.
          | That's out of scope for that friend of mine. He was very
          | surprised: with all it can do, why not the slides?
        
           | jsight wrote:
            | If you don't mind Marp (markdown for slides), it can do a
            | pretty good job of generating that.
        
           | exe34 wrote:
           | Can it output LaTeX? If so, you could try beamer.
        
         | minimaxir wrote:
          | Because the average user of AI is not a Hacker News user
          | who understands its limitations, and the ones who _do_
          | understand its limitations tend to exaggerate and overhype
          | it to make people think it can do anything. The only real
          | fix is for companies like OpenAI to encourage better usage
          | (e.g. tutorials), but there's no incentive for them to do so
          | yet.
          | 
          | I wrote a rant a few months back about how the greatest
          | threat to generative AI is people using it poorly:
          | https://minimaxir.com/2023/10/ai-sturgeons-law/
        
           | PaulHoule wrote:
            | Then there are the people who should know better but get
            | seduced anyway; seducing people is something these things
            | are really good at.
        
         | nerdponx wrote:
         | Same reason we can't trust people to drive 35 MPH when the road
         | is straight and wide, no matter how many signs are posted to
         | declare the speed limit. It's just too tempting and easy to
         | become complacent.
         | 
         | That, and these companies have a substantial financial interest
         | in pushing the omniscience/omnipotence narrative. OpenAI trying
          | to encourage responsible AI usage is like Philip Morris trying
         | to encourage responsible tobacco use. Fundamental conflict of
         | interest.
        
         | PaulHoule wrote:
         | I spoke w/ Marvin Minsky once back in the 1990s and he told me
         | that he thought "emergent properties" were bunk.
         | 
         | As for the future I am certain LLMs will become more efficient
         | in terms of resource consumption and easier to train, but I am
         | not so certain that they're going to fundamentally solve the
         | problems that LLMs have now.
         | 
         | Try to train one to tell you what kind of shoes a person is
         | wearing in an image and it will likely "short circuit" and
         | conclude that a person with fancy clothes is wearing fancy
         | shoes (true much more often than not) even if you can't see
         | their shoes at all. (Is a person wearing a sports jersey and
         | holding a basketball on a basketball court necessarily wearing
          | sneakers?) This is one of those cases where bias looks like
          | it is giving better performance, but show a system like that
          | a basketball player wearing combat boots and it will look
          | stupid. So much of the apparent high performance of LLMs
          | comes out of this bias, and I'm not sure the problem can
          | really be fixed.
        
           | jsight wrote:
           | I doubt it would take very many suitable examples in the
           | training dataset to fix problems like that.
        
             | PaulHoule wrote:
             | If it sees 90% of photos of basketball players wearing
             | sneakers it is still going to get the idea that basketball
             | players wear sneakers. I guess I could make up some
             | synthetic data where I mix and match people's shoes but
             | it's a strange way to get at the problem: train on some
             | image set that has completely unrealistic statistics to try
             | to make it pay attention to particular features as opposed
             | to other ones.
             | 
             | Problems like that have been bugging me for a while, if you
             | had some Bayesian model you could adjust the priors to make
             | the statistics match a particular set of cases and it would
             | be nice if you could do the same with a wide range of
             | machine learning approaches. For instance you might find
             | that 90% of the cases are "easy" cases that seem like a
              | waste to include in the training data; keeping them
              | there gives the model the right idea about the
              | statistics but may burn up its learning capacity such
              | that it can't really learn from the other 10%.
             | 
              | I talked w/ some contractors who made intelligence
              | models for three-letter agencies, and the ways they
              | described for dealing with that come down to building
              | multi-stage models: build a model that separates the
              | hard cases from the easy cases, with specialized
              | training sets for each one. It's one of those things
              | that some people have forgotten in the Research Triangle
              | Park area but Silicon Valley never knew.
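              | 
              | (A toy sketch of both ideas with scikit-learn; the
              | data, weights, and thresholds are all made up:)
              | 
              |     import numpy as np
              |     from sklearn.linear_model import LogisticRegression
              | 
              |     # Stand-in data: one informative feature and a
              |     # roughly 90/10 class skew.
              |     rng = np.random.default_rng(0)
              |     X = rng.normal(size=(2000, 5))
              |     y = (X[:, 0] > 1.28).astype(int)  # ~10% positives
              | 
              |     # Class reweighting is a crude analogue of
              |     # adjusting a Bayesian prior: "balanced" trains as
              |     # if both classes were equally common.
              |     base = LogisticRegression(class_weight="balanced")
              |     base.fit(X, y)
              | 
              |     # Staged models: route low-confidence ("hard")
              |     # cases to a second model trained on that slice.
              |     p = base.predict_proba(X)[:, 1]
              |     hard = (p > 0.2) & (p < 0.8)
              |     hard_model = LogisticRegression()
              |     hard_model.fit(X[hard], y[hard])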
        
           | CamperBob2 wrote:
            | _I spoke w/ Marvin Minsky once back in the 1990s and he
            | told me that he thought "emergent properties" were bunk._
            | 
            | He said the same thing about perceptrons in general. When
            | it comes to bunk, Minsky was... let's just say he was a
            | subject matter expert.
            | 
            | He caused a lot of people to waste a lot of time. I've got
            | your XOR _right here_, Marvin...
        
             | PaulHoule wrote:
             | Hard to say. _Perceptrons_ was an early mathematically
             | rigorous book in CS that asked questions about what you
             | could accomplish with a family of algorithms. That said, it
              | is easy to imagine that neural networks could have
              | gained 5 years or more of progress if people had had
              | some good luck
             | early on or taken them more seriously.
        
             | QuesnayJr wrote:
             | Minsky is an example of the power of credentialism. He was
             | an MIT professor who researched AI. I think history has
             | pretty much demonstrated he didn't have any special insight
             | into the question of AI, but for decades he was the most
             | famous researcher in the field.
        
           | CryptoNoNo wrote:
            | Your example is really simple to fix.
            | 
            | Just add another model that asks whether shoes are visible
            | or not.
        
             | PaulHoule wrote:
             | That's the right track but you have to go a little further.
             | 
             | You probably need to segment out the feet and then have a
             | model that just looks at the feet. Just throwing out images
              | without feet isn't going to tell the system that it is
              | only supposed to look at the feet. And even if you do
              | that,
             | there is also the inconvenient truth that a system that
             | looks at everything else could still beat a "proper" model
             | because there are going to be cases where the view of the
             | feet is not so good and exploiting the bias is going to
             | help.
             | 
             | This weekend I might find the time to mate my image sorter
              | to a CLIP-then-classical-ML classifier and I'll see how
              | well it does. I expect it to do really well on "indoors vs
             | outdoors" but would not expect it to do well on shoes
             | (other than by cheating) unless I put a lot of effort into
             | something better.
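              | 
              | (A minimal sketch of the CLIP-then-classical-ML idea,
              | assuming OpenAI's open-source CLIP package and a
              | made-up labeled image set:)
              | 
              |     import clip  # github.com/openai/CLIP
              |     import torch
              |     from PIL import Image
              |     from sklearn.linear_model import LogisticRegression
              | 
              |     device = "cuda" if torch.cuda.is_available() else "cpu"
              |     model, preprocess = clip.load("ViT-B/32", device=device)
              | 
              |     def embed(paths):
              |         # One CLIP embedding row per image file.
              |         batch = torch.stack(
              |             [preprocess(Image.open(p)) for p in paths]
              |         ).to(device)
              |         with torch.no_grad():
              |             return model.encode_image(batch).cpu().numpy()
              | 
              |     # Hypothetical labels: 0 = indoors, 1 = outdoors.
              |     paths, labels = ["a.jpg", "b.jpg"], [0, 1]
              |     clf = LogisticRegression(max_iter=1000)
              |     clf.fit(embed(paths), labels)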
        
         | ben_w wrote:
         | Because none of us really know what we mean by "intelligent".
         | 
         | When we see a thing with coherent writing about any subject
         | we're not experts in, _even when we notice the huge and
         | dramatic "wet pavements cause rain" level flaws when it's
         | writing about our own speciality_, we forget all those examples
         | of flaws the moment the page changes and we revert once more to
         | thinking it is a font of wisdom.
         | 
         | We've been doing this with newspapers for a century or two
         | before Michael Crichton coined the Gell-Mann Amnesia effect.
        
         | shmatt wrote:
          | There are production-ready AI tools in hospitals, with FDA
          | approval. The writers of this article just decided to try a
          | non-FDA-approved tool and ignore the approved ones.
        
         | cmiles74 wrote:
         | There are lots of papers about using some kind of LLM to cut
         | costs and staffing in hospitals. Big companies believe there is
         | a lot of money to be made here, despite the obvious dangers.
         | 
         | A quick Google search found this paper, here's a quote:
         | 
         | "Our evaluation shows that GPT-4V excels in understanding
         | medical images and is able to generate high-quality radiology
         | reports and effectively answer questions about medical images.
         | Meanwhile, it is found that its performance for medical visual
         | grounding needs to be substantially improved. In addition, we
         | observe the discrepancy between the evaluation outcome from
         | quantitative analysis and that from human evaluation. This
         | discrepancy suggests the limitations of conventional metrics in
         | assessing the performance of large language models like GPT-4V
         | and the necessity of developing new metrics for automatic
         | quantitative analysis."
         | 
         | https://arxiv.org/html/2310.20381v5
         | 
         | For sure there's some waffling at the end, but many people will
         | come away with the feeling that this is something GPT-4V can
         | do.
        
           | CryptoNoNo wrote:
            | There is actually a lot of work involved in tracking
            | everything for an operation.
            | 
            | If an AI is able to record an operation with 9x% accuracy
            | while saving human labor, the insurers might just accept
            | this.
            | 
            | Nonetheless, the chance that AI will continually become
            | better and better at this is very high.
            | 
            | Our society will switch. Instead of rewriting or updating
            | software, we will fine-tune models and add more examples.
            | 
            | Because this is actually sustainable (you can reuse the
            | old data), this approach will win in the end.
            | 
            | The only thing changing in the future will be the model
            | architecture; training data will only be added.
        
         | romeros wrote:
         | because everybody is shit scared of the implications of it
         | being good enough to replace them. This is just a protective
         | defense mechanism because it threatens the status quo upon
         | which entire careers have been built.
         | 
         | "It is difficult to get a man to understand something when his
         | salary depends on his not understanding it."
        
         | 0xdeadbeefbabe wrote:
         | It's called Artificial Intelligence.
        
         | shafyy wrote:
         | Because companies like OpenAI market the shit out of them to
         | get people to believe that ChatGPT can do anything.
        
           | __loam wrote:
            | Every piece of marketing coming out of Google and
            | Microsoft is about how AI is coming and it's the future,
            | and there are still people asking why the public has
            | unrealistic expectations of these models.
        
         | to11mtm wrote:
          | Two reasons:
          | 
          | 1. Because you've got one or more of the below spinning it
          | into either a butterfly to chase or a product to buy:
          | 
          | - 'Research Groups', e.g. Gartner
          | 
          | - Startups with an 'AI' product
          | 
          | - Startups that add an 'AI' feature
          | 
          | - OpenAI [0]
          | 
          | 2. I'm currently working on a theory that a reasonable
          | portion of the population in certain circles is viewing
          | ChatGPT and its ilk as the perfect way to mask their long
          | COVID symptoms, and is thus embracing it blindly. [1]
          | 
          | [0] - The level of hype in some articles about ChatGPT3
          | reminded me a little too much of the fake viral news around
          | the launch of Pokemon Go, adjusted for fake viral news
          | producers improving the quality of their tactics. Especially
          | because it flares up when -they- do things... but others?
          | 
          | [1] - Whoever needs to read this probably won't, but: I know
          | when you had ChatGPT write the JIRA requirements, and more
          | importantly I know when you didn't sanity-check what it spit
          | out.
        
       | binarymax wrote:
        | The ECG stood out to me because my wife is a cardiologist who
        | worked with a company, iCardiac, to look for specific
        | anomalies in ECGs. They were looking for long QT to ensure
        | clinical trials didn't mess with the heart. There was a team
        | of data scientists that worked to help automate this; they
        | couldn't, so they just augmented the UI for experts - there
        | was always a person in the loop.
        | 
        | To a layperson, reading an ECG is a problem that seems easy if
        | you know about some tools in your math toolbox, but it's
        | deceptively hard, and a false negative might mean death for
        | the patient. So I'm not going to trust a generic vision
        | transformer model with this task, and until I see overwhelming
        | evidence I won't trust a specifically trained model for it
        | either.
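        | 
        | (To illustrate the "seems easy" part: the textbook heart-rate
        | correction is a one-liner, Bazett's formula QTc = QT/sqrt(RR).
        | The deceptively hard part is reliably measuring the QT
        | interval in a noisy trace at all. The interval values below
        | are made up:)
        | 
        |     import math
        | 
        |     def qtc_bazett(qt_ms, rr_ms):
        |         # Heart-rate-corrected QT (Bazett); inputs in ms,
        |         # RR converted to seconds for the square root.
        |         return qt_ms / math.sqrt(rr_ms / 1000.0)
        | 
        |     # Example: QT of 430 ms at 60 bpm (RR = 1000 ms) gives
        |     # QTc = 430 ms; values above roughly 450-470 ms raise
        |     # concern for long QT.
        |     print(qtc_bazett(430, 1000))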
        
       | Workaccount2 wrote:
        | Soon enough we will train models with the firepower of GPT4
        | (5?) that are purpose-built and trained from the ground up to
        | be medical diagnostic tools: a hard focus on medicine, with
        | thousands of hours of licensed-physician RLHF. It will happen,
        | and it is almost certainly already underway.
       | 
       | But until it comes to fruition, I think it's largely a waste for
       | people to spend time studying the viability of general models for
       | medical tasks.
        
         | threecheese wrote:
          | Definitely. I saw a video recently mentioning the increase
          | in well-paid gigs for therapists in a metro area, which ask
          | only that all therapist-patient interactions be recorded and
          | treated as IP. It seems likely that the data would become
          | part of a corpus to train specialist models for
          | psychotherapy AI, and if this kind of product can actually
          | work I don't see why every other analytical profession
          | wouldn't be targeted and well underway. Lots of guesses
          | there though, and personally I _hope_ we aren't rushing into
          | this.
        
       | ekms wrote:
       | Wish we'd get more articles from actual practitioners using
       | generative AI to do things. Nearly all the articles you see on
       | the subject are on the level of existential threats or press
       | releases, or dunking on mistakes made by LLMs. I'd really rather
       | hear a detailed writeup from professional people who used
       | generative AI to accomplish something. The only such article I've
       | run across in the wild is this one [0] from jetbrains. Anyway, if
       | anyone has any article suggestions like this please share!
       | 
       | https://blog.jetbrains.com/blog/2023/10/16/ai-graphics-at-je...
        
         | CryptoNoNo wrote:
          | I'm using ChatGPT right now to create a Hugo page.
          | 
          | For whatever reason, Hugo's documentation is hard to get
          | into, while ChatGPT is shockingly good at telling me what
          | I'm actually looking for.
        
       ___________________________________________________________________
       (page generated 2024-03-26 23:02 UTC)