[HN Gopher] GPT-4V(ision) Unsuitable for Clinical Care and Educa...
___________________________________________________________________
GPT-4V(ision) Unsuitable for Clinical Care and Education: An
Evaluation
Author : PaulHoule
Score : 55 points
Date : 2024-03-26 19:19 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| zettabomb wrote:
| I question the competence of anyone using any modern AI (not just
| GPTs) for medical decisions. Beyond being capable of passing a
| multiple-choice exam (which can be done by trained monkeys)
| they're not ready for this, and they won't be for years. I guess
| confirmation of this is still good to know.
| valedan wrote:
| This seems like a pretty useless study, as they don't collect any
| results from human doctors, so there is nothing to compare their
| GPT-4V results to.
| acchow wrote:
| Instead of comparing against some "average doctor", they used a
| few doctors as "source of truth"
|
| > All images were evaluated by two senior surgical residents
| (K.R.A, H.S.) and a board-certified internal medicine physician
| (A.T.). ECGs and clinical photos of dermatologic conditions
| were additionally evaluated by a board-certified cardiac
| electrophysiologist (A.H.) and dermatologist (A.C.),
| respectively
| hangsi wrote:
| I think the parent comment was referring to something else.
|
| In the paper the tasks are only completed by GPT-4V. For a
| valid scientific investigation, there should be a control set
| completed by e.g. qualified doctors. When the panel of
| experts does their evaluation, they should rate both sets of
| responses so that the difference in score can be compared in
| the paper.
| acchow wrote:
| Agreed. Those are different evaluations (which is what I meant
| by "instead of comparing against"). The paper cannot conclude
| that "doctors are better/more correct".
|
| It assumes that "here are 5 doctors who are always correct",
| then measures GPT's correctness against them.
| z2 wrote:
| This is expected, right? It would be surprising if something as
| general as GPT-4V was trained on a diverse and nuanced set of
| radiology images, vs., say, a traditional CNN trained and
| refined for the detection of specific diseases. It feels akin
| to concluding that a picnic basket doesn't make a good fishing
| toolbox after all. Worse would be if someone in power were
| actually enthusiastically recommending plain GPT-4V as a
| realistic solution for specialized vision tasks.
| shmatt wrote:
| This feels like it was done for the clicks and not actual
| research. There are plenty of AI startups in the radiology game,
| GPT-4V isn't free - so why is that the one being tested?
|
| Run research on models that are actually trained to solve these
| issues; that is the relevant research.
|
| Just as an example, Viz.ai applied[1] and received FDA
| clearance/approval for their model in hospitals.
|
| Has OpenAI ever submitted a request for use of GPT-4V? What's
| next, trying autopilot driving with GPT-4V?
|
| [1] https://www.viz.ai/news/viz-ai-receives-
| fda-510k-clearance-f...
| minimaxir wrote:
| Last week, a tweet went viral of a Claude user fighting with
| radiologists claiming the LLM found a tumor when the radiologists
| did not (most of the replies were rightfully dunking on it):
|
| > A friend sent me MRI brain scan results and I put it through
| Claude.
|
| > No other AI would provide a diagnosis, Claude did.
|
| > Claude found an aggressive tumour.
|
| > The radiologist report came back clean.
|
| > I annoyed the radiologists until they re-checked. They did so
| with 3 radiologists and their own AI. Came back clean, so looks
| like Claude was wrong.
|
| > But looks how convincing Claude sounds! We're still early...
|
| https://twitter.com/misha_saul/status/1771019329737462232
| binarymax wrote:
| This is super dangerous. It's WebMD all over again but much
| worse. It's hard enough to diagnose but now you have to fight
| against some phantom model that the patient thinks is smarter
| than their doctor.
| mateo1 wrote:
| After reading the diagnosis I caught myself wanting to examine
| the MRI to see if the bright area really exists, which means I
| fell for this too. Imagine being the person who received this
| diagnosis. Of course you're going to be concerned, even if it
| is LLM garbage.
| simonw wrote:
| Have there been many studies like this one that have been judged
| blind?
|
| I'd trust a study like this a little more if the human evaluators
| were presented with the output of GPT-4 mixed together with the
| output from human experts, such that they didn't know if the
| explanation they were evaluating came from a human or an LLM.
|
| This would reduce the risk that participants in the study,
| consciously or subconsciously, marked down AI results because
| they knew them to be AI results.
| km3r wrote:
| Healthcare is stupid expensive. Even for those with government
| provided coverage, it still costs a lot of your tax dollars.
|
| Clearly we are not there yet, but for certain conditions we are
| getting close. If we can enable nurses and doctors to be more
| productive with these tools, we absolutely should explore that.
| Keep a doctor in the loop since AI is still error-prone, but you
| can line things up so that doctors make the decision using
| AI-gathered information.
| iknowstuff wrote:
| Yeah, I don't think we're more than a decade off from GPs being
| replaceable with AI. But you can bet they will lobby hard
| against it.
| Etheryte wrote:
| This is deeply wishful thinking. We're not even remotely
| close to self driving cars and that is numerous orders of
| magnitude easier than trying to diagnose a human being with
| anything.
| cmiles74 wrote:
| IMHO, we need structural change in the way US healthcare works.
| Slathering some AI on top won't solve our cost problems but it
| will negatively impact our already declining quality of patient
| care.
| atleastoptimal wrote:
| I'm not specifically referring to this article, but in general
| I've noticed a frustrating pattern:
|
| > AI company releases generalist model for
| testing/experimentation
|
| > Users unwisely treat it like a universal oracle and give it
| tasks far outside its training domain
|
| > It doesn't perform well
|
| > People are shocked and warn about the "dangers of AI"
|
| This happens every time. Why can't we treat AI tools like they
| actually are: interesting demonstrations of emergent intelligent
| properties that are a few versions away from production-ready
| capabilities?
| pmontra wrote:
| Not in human nature. A friend of mine just asked me how to make
| ChatGPT create a PowerPoint presentation. He meant the pptx
| file. It can't. Googling for him, I learned that of course it
| can create the text and, a little more surprisingly, even the
| VBA program that creates the slides. That's out of scope for
| that friend of mine. He was very surprised. He was like, with
| all it can do, why not the slides?
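|
| For anyone curious, this is roughly the kind of script it will
| write for you; a minimal sketch using the python-pptx library
| rather than VBA (the slide text and file name here are made up,
| not what it generated for him):
|
|     # pip install python-pptx
|     from pptx import Presentation
|
|     prs = Presentation()
|     title_slide = prs.slides.add_slide(prs.slide_layouts[0])
|     title_slide.shapes.title.text = "Hello from an LLM"
|     title_slide.placeholders[1].text = "python-pptx built this"
|
|     bullets = prs.slides.add_slide(prs.slide_layouts[1])
|     bullets.shapes.title.text = "Key points"
|     bullets.placeholders[1].text = "First point"
|
|     prs.save("deck.pptx")   # the actual .pptx file he wanted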
| jsight wrote:
| If you don't mind marp (markdown for slides), it can do a
| pretty good job of generating that.
| exe34 wrote:
| Can it output LaTeX? If so, you could try beamer.
| minimaxir wrote:
| Because the average user of AI is not a Hacker News user who
| understands its limitations, and the ones who _do_ understand
| its limitations tend to exaggerate and overhype it to make
| people think it can do anything. The only real fix is for
| companies like OpenAI to encourage better usage (e.g.
| tutorials), but there's no incentive for them to do so yet.
|
| I wrote a rant a few months back about how the greatest threat
| to generative AI is people using it poorly:
| https://minimaxir.com/2023/10/ai-sturgeons-law/
| PaulHoule wrote:
| Then there are the people who should know better but get
| seduced by the things, which they are really good at.
| nerdponx wrote:
| Same reason we can't trust people to drive 35 MPH when the road
| is straight and wide, no matter how many signs are posted to
| declare the speed limit. It's just too tempting and easy to
| become complacent.
|
| That, and these companies have a substantial financial interest
| in pushing the omniscience/omnipotence narrative. OpenAI trying
| to encourage responsible AI usage is like Philip Morris trying
| to encourage responsible tobacco use. Fundamental conflict of
| interest.
| PaulHoule wrote:
| I spoke w/ Marvin Minsky once back in the 1990s and he told me
| that he thought "emergent properties" were bunk.
|
| As for the future I am certain LLMs will become more efficient
| in terms of resource consumption and easier to train, but I am
| not so certain that they're going to fundamentally solve the
| problems that LLMs have now.
|
| Try to train one to tell you what kind of shoes a person is
| wearing in an image and it will likely "short circuit" and
| conclude that a person with fancy clothes is wearing fancy
| shoes (true much more often than not) even if you can't see
| their shoes at all. (Is a person wearing a sports jersey and
| holding a basketball on a basketball court necessarily wearing
| sneakers?) This is one of those cases where bias looks like it
| is giving better performance but show a system like that a
| basketball player wearing combat boots and it will look stupid.
| So much of the apparent high performance of LLMs comes out of
| this bias and I'm not sure the problem can really be fixed.
| jsight wrote:
| I doubt it would take very many suitable examples in the
| training dataset to fix problems like that.
| PaulHoule wrote:
| If it sees 90% of photos of basketball players wearing
| sneakers it is still going to get the idea that basketball
| players wear sneakers. I guess I could make up some
| synthetic data where I mix and match people's shoes but
| it's a strange way to get at the problem: train on some
| image set that has completely unrealistic statistics to try
| to make it pay attention to particular features as opposed
| to other ones.
|
| Problems like that have been bugging me for a while. If you had
| some Bayesian model you could adjust the priors to make the
| statistics match a particular set of cases, and it would be
| nice if you could do the same with a wide range of machine
| learning approaches. For instance you might find that 90% of
| the cases are "easy" cases that seem like a waste to include in
| the training data; keeping them there gives the model the right
| idea about the statistics but may burn up the learning capacity
| of the model such that it can't really learn from the other
| 10%.
|
| I talked w/ some contractors who made intelligence models for
| three-letter agencies, and they told me all about ways of
| dealing with that, which come down to building multi-stage
| models: you build a model that separates the hard cases from
| the easy cases, with specialized training sets for each one.
| It's one of those things that some people have forgotten in the
| Research Triangle Park area but Silicon Valley never knew.
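|
| A minimal sketch of that kind of two-stage setup, on synthetic
| data (the confidence cutoff, model choices, and the notion of
| "easy" below are placeholders, not what the contractors
| described):
|
|     # two-stage routing: a gate splits easy cases from hard
|     # ones, and each branch gets its own specialist model
|     import numpy as np
|     from sklearn.datasets import make_classification
|     from sklearn.ensemble import GradientBoostingClassifier
|     from sklearn.linear_model import LogisticRegression
|
|     X, y = make_classification(n_samples=5000, n_features=20,
|                                random_state=0)
|
|     # call a case "easy" if a cheap baseline is already confident
|     baseline = LogisticRegression(max_iter=1000).fit(X, y)
|     is_easy = baseline.predict_proba(X).max(axis=1) > 0.9
|
|     gate = LogisticRegression(max_iter=1000).fit(X, is_easy)
|     easy_model = LogisticRegression(max_iter=1000).fit(
|         X[is_easy], y[is_easy])
|     hard_model = GradientBoostingClassifier().fit(
|         X[~is_easy], y[~is_easy])
|
|     def predict(x):
|         # route each case to the specialist trained on its kind
|         easy = gate.predict(x).astype(bool)
|         out = np.empty(len(x), dtype=int)
|         out[easy] = easy_model.predict(x[easy])
|         out[~easy] = hard_model.predict(x[~easy])
|         return out
|
|     print("train accuracy:", (predict(X) == y).mean())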
| CamperBob2 wrote:
| _I spoke w/ Marvin Minsky once back in the 1990s and he told
| me that he thought "emergent properties" were bunk._
|
| He said the same thing about perceptrons in general. When it
| comes to bunk, Minsky was... let's just say he was a subject
| matter expert.
|
| He caused a lot of people to waste a lot of time. I've got
| your XOR _right here_, Marvin...
| PaulHoule wrote:
| Hard to say. _Perceptrons_ was an early mathematically
| rigorous book in CS that asked questions about what you
| could accomplish with a family of algorithms. That said, it
| is easy to imagine that neural networks could have gained 5
| years or more of progress if people had had some good luck
| early on or taken them more seriously.
| QuesnayJr wrote:
| Minsky is an example of the power of credentialism. He was
| an MIT professor who researched AI. I think history has
| pretty much demonstrated he didn't have any special insight
| into the question of AI, but for decades he was the most
| famous researcher in the field.
| CryptoNoNo wrote:
| Your example is really simple to fix.
|
| Just add another model asking it if there are shoes visible
| or not.
| PaulHoule wrote:
| That's the right track but you have to go a little further.
|
| You probably need to segment out the feet and then have a
| model that just looks at the feet. Just throwing out images
| without feet isn't going to tell the system that it is only
| supposed to look at the feet. And even if you do that,
| there is also the inconvenient truth that a system that
| looks at everything else could still beat a "proper" model
| because there are going to be cases where the view of the
| feet is not so good and exploiting the bias is going to
| help.
|
| This weekend I might find the time to mate my image sorter
| to a CLIP-then-classical-ML classifier and I'll see how well
| it does. I expect it to do really well on "indoors vs
| outdoors" but would not expect it to do well on shoes
| (other than by cheating) unless I put a lot of effort into
| something better.
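|
| For reference, one common shape for a CLIP-then-classical-ML
| classifier is roughly this (a sketch assuming the
| sentence-transformers CLIP wrapper plus scikit-learn; the file
| names and the indoors/outdoors labels are placeholders, not my
| actual image sorter):
|
|     # CLIP embeddings as fixed features, classical model on top
|     from PIL import Image
|     from sentence_transformers import SentenceTransformer
|     from sklearn.linear_model import LogisticRegression
|
|     encoder = SentenceTransformer("clip-ViT-B-32")  # CLIP encoder
|
|     # placeholder labeled set: 1 = indoors, 0 = outdoors
|     paths = ["kitchen.jpg", "office.jpg", "beach.jpg", "trail.jpg"]
|     labels = [1, 1, 0, 0]
|
|     embeddings = encoder.encode([Image.open(p) for p in paths])
|     clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
|
|     # classify a new photo by embedding it the same way
|     query = encoder.encode([Image.open("new_photo.jpg")])
|     print("indoors" if clf.predict(query)[0] == 1 else "outdoors")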
| ben_w wrote:
| Because none of us really know what we mean by "intelligent".
|
| When we see a thing with coherent writing about any subject
| we're not experts in, _even when we notice the huge and
| dramatic "wet pavements cause rain" level flaws when it's
| writing about our own speciality_, we forget all those examples
| of flaws the moment the page changes and we revert once more to
| thinking it is a font of wisdom.
|
| We've been doing this with newspapers for a century or two
| before Michael Crichton coined the Gell-Mann Amnesia effect.
| shmatt wrote:
| There are production-ready AI tools in hospitals, with FDA
| approval. The writers of this article just decided to try a
| non-FDA-approved tool and ignore the approved ones.
| cmiles74 wrote:
| There are lots of papers about using some kind of LLM to cut
| costs and staffing in hospitals. Big companies believe there is
| a lot of money to be made here, despite the obvious dangers.
|
| A quick Google search found this paper, here's a quote:
|
| "Our evaluation shows that GPT-4V excels in understanding
| medical images and is able to generate high-quality radiology
| reports and effectively answer questions about medical images.
| Meanwhile, it is found that its performance for medical visual
| grounding needs to be substantially improved. In addition, we
| observe the discrepancy between the evaluation outcome from
| quantitative analysis and that from human evaluation. This
| discrepancy suggests the limitations of conventional metrics in
| assessing the performance of large language models like GPT-4V
| and the necessity of developing new metrics for automatic
| quantitative analysis."
|
| https://arxiv.org/html/2310.20381v5
|
| For sure there's some waffling at the end, but many people will
| come away with the feeling that this is something GPT-4V can
| do.
| CryptoNoNo wrote:
| There is actually a lot of work involved in tracking
| everything for an operation.
|
| If an AI is able to record an operation with 9x% accuracy but
| you save on human labor, insurers might just accept this.
|
| Nonetheless, the chance that AI will be able to continually
| become better and better at it is very high.
|
| Our society will switch. Instead of rewriting or updating
| software, we will fine-tune models and add more examples.
|
| Because this is actually sustainable (you can reuse the old
| data), this will win in the end.
|
| The only thing that will change in the future is the model
| architecture; training data will only be added to.
| romeros wrote:
| because everybody is shit scared of the implications of it
| being good enough to replace them. This is just a protective
| defense mechanism because it threatens the status quo upon
| which entire careers have been built.
|
| "It is difficult to get a man to understand something when his
| salary depends on his not understanding it."
| 0xdeadbeefbabe wrote:
| It's called Artificial Intelligence.
| shafyy wrote:
| Because companies like OpenAI market the shit out of them to
| get people to believe that ChatGPT can do anything.
| __loam wrote:
| Every piece of marketing coming out of Google and Microsoft is
| about how AI is coming and it's the future, and there are still
| people asking why people have unrealistic expectations for
| these models.
| to11mtm wrote:
| Two reasons:
|
| 1. Because you've got one or more of the below spinning it into
| either a butterfly to chase or a product to buy:
|
| - 'Research Groups', e.g. Gartner
|
| - Startups with an 'AI' product
|
| - Startups that add an 'AI' feature
|
| - OpenAI [0]
|
| 2. I'm currently working on a theory that a reasonable portion
| of the population in certain circles is viewing ChatGPT and its
| ilk as the perfect way to mask their long COVID symptoms and
| thus embracing it blindly. [1]
|
| [0] - The level of hype in some articles about ChatGPT3
| reminded me a little too much of the fake viral news around the
| launch of Pokemon Go, adjusted for fake viral news producers
| improving the quality of their tactics. Especially because it
| flares up when -they- do things... but others?
|
| [1] - Whoever needs to read this probably won't, but: I know
| when you had ChatGPT write the JIRA requirements, and more
| importantly I know when you didn't sanity check what it spit
| out.
| binarymax wrote:
| The ECG stood out to me because my wife is a cardiologist and
| worked with a company called iCardiac to look for specific
| anomalies in ECGs. They were looking for long QT to ensure
| clinical trials didn't mess with the heart. There was a team of
| data scientists that worked to help automate this, and they
| couldn't, so they just augmented the UI for experts - there was
| always a person in the loop.
|
| Looking at an ECG as a layperson, it's a problem that seems
| easy if you know about some tools in your math toolbox, but
| it's deceptively hard, and a false negative might mean death
| for the patient. So I'm not going to trust a generic vision
| transformer model with this task, and until I see overwhelming
| evidence I won't trust a specifically trained model for it
| either.
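|
| To illustrate the "seems easy" part: the naive version is a few
| lines of scipy (the synthetic signal, sampling rate, and
| thresholds below are made up; real ECGs bring noise, baseline
| wander, and pathological morphologies, which is exactly where
| this falls apart):
|
|     # naive R-peak detection: fine on a clean synthetic trace,
|     # which is precisely why it looks easier than it is
|     import numpy as np
|     from scipy.signal import find_peaks
|
|     fs = 250                                 # sampling rate, Hz
|     t = np.arange(0, 10, 1 / fs)
|     ecg = np.sin(2 * np.pi * 1.2 * t) ** 63  # fake R waves, ~72 bpm
|
|     # call anything tall and at least 400 ms apart an R peak
|     peaks, _ = find_peaks(ecg, prominence=0.5,
|                           distance=int(0.4 * fs))
|     rr = np.diff(peaks) / fs                 # seconds between beats
|     print("heart rate:", round(60 / rr.mean()), "bpm")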
| Workaccount2 wrote:
| Soon enough we will train models with the firepower of GPT-4
| (5?) that are purpose-built and trained from the ground up to
| be medical diagnostic tools. A hard focus on medicine with
| thousands of hours of licensed-physician RLHF. It will happen,
| and it is almost certainly already underway.
|
| But until it comes to fruition, I think it's largely a waste for
| people to spend time studying the viability of general models for
| medical tasks.
| threecheese wrote:
| Definitely. Saw a video recently mentioning the increase in
| well-paid gigs for therapists in a metro area, which ask only
| that all therapist-patient interactions be recorded and treated
| as IP. It seems likely that the data would be part of a corpus
| to train specialist models for psychotherapy AI, and if this
| kind of product can actually work I don't see why every other
| analytical profession wouldn't be targeted, and already well
| underway. Lots of guesses there though, and personally I _hope_
| we aren't rushing into this.
| ekms wrote:
| Wish we'd get more articles from actual practitioners using
| generative AI to do things. Nearly all the articles you see on
| the subject are on the level of existential threats or press
| releases, or dunking on mistakes made by LLMs. I'd really rather
| hear a detailed writeup from professional people who used
| generative AI to accomplish something. The only such article I've
| run across in the wild is this one [0] from jetbrains. Anyway, if
| anyone has any article suggestions like this please share!
|
| https://blog.jetbrains.com/blog/2023/10/16/ai-graphics-at-je...
| CryptoNoNo wrote:
| I'm using ChatGPT right now to create a Hugo page.
|
| For whatever reason Hugo's documentation is weird to get into,
| while ChatGPT is shockingly good at telling me what I'm
| actually looking for.
___________________________________________________________________
(page generated 2024-03-26 23:02 UTC)