[HN Gopher] Vision language models are blind
___________________________________________________________________
Vision language models are blind
Author : taesiri
Score : 428 points
Date   : 2024-07-10 13:35 UTC (1 day ago)
(HTM) web link (vlmsareblind.github.io)
(TXT) w3m dump (vlmsareblind.github.io)
| taesiri wrote:
| This paper examines the limitations of current vision-based
| language models, such as GPT-4 and Sonnet 3.5, in performing low-
| level vision tasks. Despite their high scores on numerous
| multimodal benchmarks, these models often fail on very basic
| cases. This raises a crucial question: are we evaluating these
| models accurately?
| rezaghanbari1 wrote:
| Some of these samples are shocking. How do these models answer
| chart-based questions, I mean when they can't even count the
| intersections between two lines?
| RodgerTheGreat wrote:
| Same way they answer any question: piece together a
| statistically probable sequence of words to follow the prompt.
| All they know about an image is a handful of words a classifier
| might choose to describe it. If those words have nothing to do
| with the question being asked, they can't nudge the model in
| the general direction of a correct answer, so it's a crapshoot,
| even more so than usual.
| imtringued wrote:
| The dataset most likely contains chart descriptions that
| describe the raw data, but not the visual interactions of the
| individual pixels.
| dheera wrote:
| Current approaches to multi-modal models work on embeddings and
| tokenizations of images, which is the fundamental problem: you
| are feeding blurry, non-precise data into the model. Yes, they
| are blind because of exactly this.
|
| An embedding isn't conceptually that much different from feeding
| a 1024-word description of an image instead of the actual image.
|
| At the moment compute power isn't good enough to feed high-res
| pixel data into these models, unless we discover a vastly
| different architecture, which I am also convinced likely exists.
| jayd16 wrote:
| Doesn't Gemini have a 2 million token limit for exactly this?
| diwank wrote:
| The number of tokens _per image_ is actually fairly small,
| ranging from 85 to ~500.
| visarga wrote:
| > An embedding isn't conceptually that much different from
| feeding a 1024-word description of an image instead of the
| actual image.
|
| An embedding needs fewer words. You can embed individual words,
| phrases, whole prompts, and longer paragraphs. You don't
| need 1024 words for a text embedding. For example, a famous
| library is called Sentence-BERT (SBERT).
|
| When you embed images, on the other hand, you cut them up into
| little squares on the order of 32x32 px and embed each of them
| separately. ChatGPT uses something like 250 tokens for smaller
| images. So a small image costs about as much as 200 words if
| represented graphically, and maybe far fewer words if you embed
| a text description of it instead.
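|
| Roughly, the patch scheme looks like this (patch size and
| embedding width here are just illustrative assumptions, not any
| vendor's actual values):
|
|       # ViT-style image tokenization sketch: cut the image into
|       # fixed-size patches, project each patch to one token.
|       import torch
|       import torch.nn as nn
|
|       patch = 32     # assumed patch size, "on the order of 32x32"
|       dim = 1024     # assumed embedding width
|
|       image = torch.rand(1, 3, 512, 512)   # dummy 512x512 RGB image
|       to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
|
|       grid = to_tokens(image)                  # (1, dim, 16, 16)
|       tokens = grid.flatten(2).transpose(1, 2)
|       print(tokens.shape)                      # (1, 256, 1024)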
| dheera wrote:
| > needs fewer words
|
| Yes I'm aware of this, and work in ML -- the thing is
| embeddings are not designed for faithful image
| reconstruction, and aren't even trained that way. You can
| easily find two images that have substantially similar CLIP
| (or whatever) embeddings that are visually very different. If
| you query the LLM about that difference, the LLM wouldn't
| even have the information to differentiate answers for the
| two images if you only supply it with the embedding.
|
| On the other hand, SDXL autoencoder latents passed into an
| LLM alongside the embedding _might_ be a step up from just an
| image embedding, since they are designed for image
| reconstruction, but I don't have access to the compute or
| data resources to attempt training this.
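|
| As a rough sketch of the kind of check I mean (using the public
| openai/clip-vit-base-patch32 checkpoint via HuggingFace
| transformers; the two image paths are placeholders):
|
|       # Compare CLIP embeddings of two images. Visually very
|       # different images can still land close together here,
|       # which is the information loss I'm describing.
|       import torch
|       from PIL import Image
|       from transformers import CLIPModel, CLIPProcessor
|
|       name = "openai/clip-vit-base-patch32"
|       model = CLIPModel.from_pretrained(name)
|       processor = CLIPProcessor.from_pretrained(name)
|
|       images = [Image.open("image_a.png"), Image.open("image_b.png")]
|       inputs = processor(images=images, return_tensors="pt")
|       with torch.no_grad():
|           feats = model.get_image_features(**inputs)   # (2, 512)
|       feats = feats / feats.norm(dim=-1, keepdim=True)
|       print((feats[0] @ feats[1]).item())   # cosine similarity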
| cs702 wrote:
| Wow, that is _embarrassingly bad performance_ for current SOTA
| models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5), which are
| advertised and sold as being able to understand images, e.g., for
| guiding the blind or tutoring children in geometry!
|
| The tasks at which they fail are ridiculously simple for human
| beings, including, for example:
|
| * counting the number of times two lines intersect;
|
| * detecting whether two circles overlap;
|
| * selecting which letter is being circled in a word;
|
| * counting the number of circles in an Olympic-like logo.
|
| This should be at the top of the front page.
| tensor wrote:
| I don't see how this is "embarrassing" in the slightest. These
| models are not human brains, and the fact that people equate
| them with human brains is an embarrassing failure of the humans
| more than anything about the models.
|
| It's entirely unsurprising that there are numerous cases that
| these models can't handle that are "obvious to humans." Machine
| learning has had this property since its invention and it's a
| classic mistake humans make dealing with these systems.
|
| Humans assume that because a machine learning model has
| above-human accuracy on task X, it must also have that ability
| at all the other tasks. While a human with amazing ability at X
| would indeed have amazing abilities at other tasks, this is not
| true of machine learning models. The opposite thinking is also
| wrong: that because the model can't do well on task Y, it must
| be unreliable and its ability on task X is somehow an illusion
| and not to be trusted.
| cs702 wrote:
| It is embarrassingly, shockingly bad, because these models
| are _advertised_ and _sold_ as being capable of understanding
| images.
|
| Evidently, all these models still fall short.
| kristjansson wrote:
| It's surprising because these models are pretty ok at some
| vision tasks. The existence of a clear failure mode is
| interesting and informative, not embarrassing.
| knowaveragejoe wrote:
| Not only are they capable of understanding images (the kind
| people might actually feed into such a system -
| photographs), but they're pretty good at it.
|
| A modern robot would struggle to fold socks and put them in
| a drawer, but they're great at making cars.
| pixl97 wrote:
| I mean, with some of the recent demos, robots have got a
| lot better at folding stuff and putting it up. Not saying
| it's anywhere close to human level, but it has taken a
| pretty massive leap from being a joke just a few years
| ago.
| startupsfail wrote:
| Humans are also shockingly bad on these tasks. And guess
| where the labeling was coming from...
| simonw wrote:
| I see this complaint about LLMs all the time - that they're
| advertised as being infallible but fail the moment you give
| them a simple logic puzzle or ask for a citation.
|
| And yet... every interface to every LLM has a "ChatGPT can
| make mistakes. Check important info." style disclaimer.
|
| The hype around this stuff may be deafening, but it's often
| not entirely the direct fault of the model vendors
| themselves, who even put out lengthy papers describing
| their many flaws.
| jazzyjackson wrote:
| There's evidently a large gap between what researchers
| publish, the disclaimers a vendor makes, and what gets
| broadcast on CNBC, no surprise there.
| jampekka wrote:
| A bit like how Tesla Full Self-Driving is not to be used
| as self-driving. Or any other small print. Or ads in
| general. Lying by deliberately giving the wrong
| impression.
| verdverm wrote:
| It would have to be called ChatAGI to be like TeslaFSD,
| where the company named it something it is most
| definitely not
| fennecbutt wrote:
| Why do people expect these models, designed to be humanlike
| in their training, to be 100% perfect?
|
| Humans fuck up all the time.
| TeMPOraL wrote:
| They're hardly being advertised or sold on that premise.
| They advertise and sell themselves, because _people try
| them out and find out they work_, and tell their friends
| and/or audiences. ChatGPT is probably the single biggest
| bona-fide organic marketing success story in recorded
| history.
| foldr wrote:
| This is fantastic news for software engineers. Turns out
| that all those execs who've decided to incorporate AI
| into their product strategy have already tried it out and
| ensured that it will actually work.
| ben_w wrote:
| > Turns out that all those execs who've decided to
| incorporate AI into their product strategy have already
| tried it out and ensured that it will actually work.
|
| The 2-4-6 game comes to mind. They may well have
| _verified_ the AI will work, but it's hard to learn the
| skill of thinking about how to _falsify_ a belief.
| mrbungie wrote:
| These models are marketed as being able to guide the blind or
| tutor children using direct camera access.
|
| Promoting those use cases when models fail in these ways is
| irresponsible. So, yeah, maybe the models are not embarrassing,
| but the hype definitely is.
| cs702 wrote:
| _> Promoting those use cases when models fail in these ways
| is irresponsible._
|
| Yes, _exactly_.
| scotty79 wrote:
| You'd expect them to be trained on simple geometry since you
| can create an arbitrarily large synthetic training set for that.
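|
| Something like this churns out labelled examples for, say, the
| circle-overlap task all day (just a sketch, not the paper's
| actual generator):
|
|       # Synthetic "do these two circles overlap?" examples:
|       # random centers/radii, ground truth from a distance check.
|       import math, random
|       import matplotlib
|       matplotlib.use("Agg")
|       import matplotlib.pyplot as plt
|
|       def make_example(path):
|           (x1, y1), (x2, y2) = [(random.random(), random.random())
|                                 for _ in range(2)]
|           r1 = random.uniform(0.05, 0.2)
|           r2 = random.uniform(0.05, 0.2)
|           overlap = math.hypot(x2 - x1, y2 - y1) < r1 + r2
|           fig, ax = plt.subplots(figsize=(2, 2))
|           ax.add_patch(plt.Circle((x1, y1), r1, fill=False, color="b"))
|           ax.add_patch(plt.Circle((x2, y2), r2, fill=False, color="r"))
|           ax.set_xlim(-0.2, 1.2); ax.set_ylim(-0.2, 1.2)
|           ax.set_aspect("equal"); ax.axis("off")
|           fig.savefig(path); plt.close(fig)
|           return overlap     # the label comes for free
|
|       labels = [make_example(f"circles_{i}.png") for i in range(10)]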
| sfink wrote:
| Well said.
|
| It doesn't matter how they are marketed or described or held
| up to some standard generated by wishful thinking. And it
| especially doesn't matter what it would mean if a human were
| to make the same error.
|
| It matters what they are, what they're doing, and how they're
| doing it. Feel free to be embarrassed if _you_ are claiming
| they can do what they can't and are maybe even selling them
| on that basis. But there's nothing embarrassing about their
| current set of capabilities. They are very good at what they
| are very good at. Expecting those capabilities to generalize
| as they would if they were human is like getting embarrassed
| that your screwdriver can't pound in a nail, when it is ever
| so good at driving in screws.
| insane_dreamer wrote:
| > is an embarrassing failure of the humans more than anything
| about the models
|
| No, it's a failure of the companies who are advertising them
| as capable of doing something which they are not (assisting
| people with low vision)
| simonw wrote:
| But they CAN assist people with low vision. I've talked to
| someone who's been using a product based on GPT-4o and
| absolutely loves it.
|
| Low vision users understand the limitations of
| accessibility technology better than anyone else. They will
| VERY quickly figure out what this tech can be used for
| effectively and what it can't.
| drodgers wrote:
| I can't help but read comments like this as:
|
| "My talking dog always makes mistakes on calculus problems: how
| embarrassingly bad!"
|
| Has the expectation treadmill really advanced so quickly that
| sub-human performance on any category of problems is now an
| embarrassment?
| aezart wrote:
| To me I guess it suggests that these models are not using the
| correct approach. We keep finding new types of tasks the
| models are bad at, then the next model fixes those issues
| because those specific tasks are added to the training set.
| But that approach never results in a generalized problem
| solving ability, just an ability to solve all the problems
| we've thought of so far.
| sweezyjeezy wrote:
| Entertaining, but I think the conclusion is way off.
|
| > their vision is, at best, like that of a person with myopia
| seeing fine details as blurry
|
| is a crazy thing to write in an abstract. Did they try to probe
| that hypothesis at all? I could (well actually I can't) share
| some examples from my job of GPT-4v doing some pretty difficult
| fine-grained visual tasks that invalidate this.
|
| Personally, I rate this paper [1], which makes the argument that
| these huge GenAI models are pretty good at things - _assuming
| that it has seen a LOT of that type of data during training_
| (which is true of a great many things). If you make up tasks like
| this, then yes, they can be REALLY bad at them, and initial
| impressions of AGI get harder to justify. But in practice, we
| aren't just
| making up tasks to trip up these models. They can be very
| performant on some tasks and the authors have not presented any
| real evidence about these two modes.
|
| [1] https://arxiv.org/abs/2404.04125
| diwank wrote:
| Yeah I think their findings are def interesting but the title
| and the strong claims are a tad hyperbolic.
| SomaticPirate wrote:
| There are quite a few "ai apologists" in the comments but I
| think the title is fair when these models are marketed towards
| low vision people ("Be my eyes"
| https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent
| to human vision. These models are implied to be human level
| equivalents when they are not.
|
| This paper demonstrates that there are still some major gaps
| where simple problems confound the models in unexpected ways.
| This is important work to elevate; otherwise people may start
| to believe that these models are suitable for general
| application when they still need safeguards and copious
| warnings.
| sweezyjeezy wrote:
| The paper I linked should hopefully mark me out as far from
| an AI apologist; it's actually really bad news for GenAI if
| correct. All I mean to say is the clickbait conclusion and
| the evidence do not match up.
| Melomololotolo wrote:
| We have started the era of AI.
|
| It really doesn't matter how good current LLMs are.
|
| They have been good enough to start this era.
|
| And no, it's not and never has been just LLMs. Look at what
| Nvidia is doing with ML.
|
| Whisper: a huge advance. Segment Anything: again huge. AlphaFold
| 2: again huge.
|
| All the robot announcements -> huge
|
| I doubt we will reach AGI just through LLMs. We will reach
| AGI through multimodality, mixture of experts, some kind of
| feedback loop, etc.
|
| But the stone has started to roll.
|
| And you know, I prefer to hear about AI advances for the
| next 10-30 years. That's a lot better than the crypto shit
| we had the last 5 years.
| zwnow wrote:
| We won't reach agi in our lifetimes.
| pixl97 wrote:
| Yea, really, if you look at human learning/seeing/acting there
| is a feedback loop that an LLM, for example, isn't able to
| complete and train on.
|
| You see an object. First you have to learn how to control all
| your body functions to move toward it and grasp it. This
| teaches you about the 3 dimensional world and things like
| gravity. You may not know the terms, but it is baked in your
| learning model. After you get an object you start building a
| classification list "hot", "sharp", "soft and fuzzy",
| "tasty", "slick". Your learning model builds up a list of
| properties of objects and "expected" properties of objects.
|
| Once you have this 'database' you create as a human, you can
| apply the logic to achieve tasks. "Walk 10 feet forward, but
| avoid the sharp glass just to the left". You have to have
| spatial awareness, object awareness, and prediction ability.
|
| Models 'kind of' have this, but it's seemingly haphazard, kind
| of like a child that doesn't know how to put all the pieces
| together yet. I think a lot of embodied robot testing where
| the embodied model feeds back training to the LLM/vision
| model will have to occur before this is even somewhat close
| to reliable.
| TeMPOraL wrote:
| Embodied is useful, but I think not necessary _even if_ you
| need learning in a 3D environment. _Synthesized embodiment_
| should be enough. While in some cases[0] it may have
| problems with fidelity, simulating embodied experience _in
| silico_ scales much better, and more importantly, _we have
| control over time flow_. Humans always learn in real-time,
| while with simulated embodiment, we could cram _years_ of
| subjective-time experiences into a model in seconds, and
| then for novel scenarios, spend an hour per each second of
| subjective time running a high-fidelity physics
| simulation[1].
|
| --
|
| [0] - Like if you plugged a 3D game engine into the
| training loop.
|
| [1] - Results of which we could hopefully reuse in training
| later. And yes, a simulation could itself be a recording of
| carefully executed experiment in real world.
| pegasus wrote:
| > Like if you plugged a 3D game engine into the training
| loop
|
| Isn't this what _synthesized embodiment_ basically always
| is? As long as the application of the resulting
| technology is in a restricted, well controlled
| environment, as is the case for example for an assembly-
| line robot, this is a great strategy. But I expect
| fidelity problems will make this technique ultimately a
| bad idea for anything that's supposed to interact with
| humans. Like self-driving cars, for example. Unless,
| again, those self-driving cars are segregated in
| dedicated lanes.
| taneq wrote:
| > Humans always learn in real-time
|
| In the sense that we can't fast-forward our offline
| training, sure, but humans certainly "go away and think
| about it" after gaining IRL experience. This process
| seems to involve both consciously and subconsciously
| training on this data. People often consciously think
| about recent experiences, run through imagined scenarios
| to simulate the outcomes, plan approaches for next time
| etc. and even if they don't, they'll often perform better
| at a task after a break than they did at the start of the
| break. If this process of replaying experiences and
| simulating variants of them isn't "controlling the flow
| of (simulated) time" I don't know what else you'd call
| it.
| Melomololotolo wrote:
| Ah yes the blind person who constantly needs to know if two
| lines intersect.
|
| Let's just ignore what a blind person normally needs to know.
|
| You know what blind people ask? Sometimes their daily routine
| is broken because there is some type of construction and
| models can tell you this.
|
| Sometimes they need to read a basic sign and models can do
| this.
|
| Those models help people already and they will continue to
| get better.
|
| I'm not sure if I'm more frustrated by how condescending the
| authors are or by your ignorance.
|
| Valid criticism doesn't need to be shitty
| shagie wrote:
| As an aside... from 2016 this is what was a valid use case
| for a blind person with an app.
|
| Seeing AI 2016 Prototype - A Microsoft research project -
| https://youtu.be/R2mC-NUAmMk
|
| https://www.seeingai.com are the actual working apps.
|
| The version from 2016 I recall showing (pun not intended)
| to a coworker who had some significant vision impairments
| and he was _really_ excited about what it could do back
| then.
|
| ---
|
| I still remain quite impressed with its ability to parse
| the picture and likely reason behind it
| https://imgur.com/a/JZBTk2t
| benreesman wrote:
| If we're throwing "citation needed" tags on stuff, how about
| the first sentence?
|
| "Large language models with vision capabilities (VLMs), e.g.,
| GPT-4o and Gemini-1.5 Pro are powering countless image-text
| processing applications"
|
| I don't know how many a "countless" is, but I think we've
| gotten really sloppy in terms of what counts for LLMs as a
| demonstrated, durable win in a concrete task attached to
| well-measured outcomes and holding up over even modest
| periods of time.
|
| This stuff is really promising and lots of builders are
| making lots of nifty things, so if that counts as an
| application then maybe we're at countless, but in the
| enterprise and in government and in refereed academic
| literature we seem to be at the proof-of-concept phase.
| Impressive chat bots as a use case are pretty dialed in,
| enough people claim that they help with coding that I tend to
| believe it's a real thing (I never seem to come out ahead of
| going directly to the source, StackOverflow).
|
| The amount of breathless press on this seems "countless", so
| maybe I missed the totally rigorous case study on how X
| company became Y percent more profitable by doing Z thing
| with LLMs (or similar), and if so I'd be grateful for
| citations, but neither Google nor any of the big models seem
| to know about it.
| dsr_ wrote:
| "maybe I missed the totally rigorous case study on how X
| company became Y percent more profitable by doing Z thing
| with LLMs (or similar), and if so I'd be grateful for
| citations, but neither Google nor any of the big models
| seem to know about it."
|
| Goldman Sachs recently issued a report.
|
| https://www.goldmansachs.com/intelligence/pages/gs-
| research/...
|
| "We estimate that the AI infrastructure buildout will cost
| over $1tn in the next several years alone, which includes
| spending on data centers, utilities, and applications. So,
| the crucial question is: What $1tn problem will AI solve?
| Replacing low-wage jobs with tremendously costly
| technology is basically the polar opposite of the prior
| technology transitions I've witnessed in my thirty years of
| closely following the tech industry"
| Lerc wrote:
| I disagree. I think the title, abstract, and conclusion not
| only misrepresent the state of the models but also
| misrepresent their own findings.
|
| They have identified a class of problems that the models
| perform poorly at and have given a good description of the
| failure. They portray this as a representative example of the
| behaviour in general. This has not been shown and is probably
| not true.
|
| I don't think that models have been portrayed as equivalent
| to humans. Like most AI, it has been shown as vastly
| superior in some areas and profoundly ignorant in others.
| Media can overblow things and enthusiasts can talk about
| future advances as if they have already arrived, but I don't
| think these are typical portrayals by the AI field in general.
| youssefabdelm wrote:
| Exactly... I've found GPT-4o to be good at OCR for
| instance... doesn't seem "blind" to me.
| spookie wrote:
| You don't really need an LLM for OCR. Hell, I suppose they
| just run a Python script in its VM and rephrase the
| output.
|
| At least that's what I would do. Perhaps the script would
| be a "specialist model" in a sense.
| Spivak wrote:
| It's not that you need an LLM for OCR but the fact that
| an LLM can do OCR (and handwriting recognition which is
| much harder) despite not being made specifically for that
| purpose is indicative of something. The jump from knowing
| "this is a picture of a paper with writing on it" like
| what you get with CLIP to being able to reproduce what's
| on the paper is, to me, close enough to seeing that the
| difference isn't meaningful anymore.
| acheong08 wrote:
| GPT-4v is provided with OCR
| letmevoteplease wrote:
| No reason to believe that. Open source VLMs can do
| OCR.[1]
|
| [1] https://huggingface.co/spaces/opencompass/open_vlm_le
| aderboa...
| simonw wrote:
| That's a common misconception.
|
| Sometimes if you upload an image to ChatGPT and ask for
| OCR it will run Python code that executes Tesseract, but
| that's effectively a bug: GPT-4 vision works much better
| than that, and it will use GPT-4 vision if you tell it
| "don't use Python" or similar.
| Foobar8568 wrote:
| I know that GPT-4o is fairly poor at recognizing sheet
| music and notes. Totally off the mark: more often than
| not, even the first note is not recognized in a first-week
| solfege book.
|
| So unless I missed something, as far as I am
| concerned, they are optimized for benchmarks.
|
| So while I enjoy gen AI, image-to-text is highly subpar.
| youssefabdelm wrote:
| Useful to know, thank you!
| stavros wrote:
| Most adults with 20/20 vision will also fail to recognize
| the first note on a first week solfege book.
| prmoustache wrote:
| Well maybe not blind but the analogy with myopia might
| stand.
|
| For example, in the case of OCR, a person with myopia will
| usually be able to make out letters and words even without
| his glasses, based on his expectation (similar to VLM
| training) of seeing letters and words in, say, a sign. He
| might not see them all clearly and make some errors, but
| might recognize some letters easily and make up the rest
| based on context, word recognition, etc. Basically
| experience.
|
| I also have a funny anecdote about my partner, who has
| severe myopia, and once found herself outside her house
| without her glasses on and saw something on the grass
| right in front. She told her then brother-in-law, "Look, a
| squirrel," only for the "squirrel" to take off while
| letting out its typical caws. It was a crow. This is typical
| of VLM hallucinations.
| subroutine wrote:
| I think the conclusion of the paper is far more mundane.
| It's curious that VLMs can recognize complex novel objects
| in a trained category, but cannot perform basic visual
| tasks that human toddlers can perform (e.g. recognizing
| when two lines intersect or when two circles overlap).
| Nevertheless I'm sure these models can explain in great
| detail what intersecting lines are, and even what they look
| like. So while LLMs might have image processing
| capabilities, they clearly do not see the way humans _see_.
| That, I think, would be a more apt title for their
| abstract.
| kenjackson wrote:
| Simple is a relative statement. There are vision problems
| where monkeys are far better than humans. Some may look at
| human vision and memory and think that we lack basic skills.
|
| With AI we are creating intelligence, but with different
| strengths and weaknesses. I think we will continue to be
| surprised at how well they work on some problems and how poorly
| they do at some "simple" ones.
| lynx23 wrote:
| Be My Eyes user here. I disagree with your uninformed
| opinion. Be My Eyes is more often than not more useful than a
| human. And I am reporting from personal experience. What
| experience do you have?
| brookst wrote:
| I don't see Be My Eyes or other similar efforts as "implied"
| to be equivalent to humans at all. They're just new tools
| which can be very useful for some people.
|
| "These new tools aren't perfect" is the dog bites man story
| of technology. It's certainly true, but it's no different
| than GPS ("family drives car off cliff because GPS said to").
| dartos wrote:
| Based take
| FrenchDevRemote wrote:
| > their vision is, at best, like that of a person with myopia
| seeing fine details as blurry
|
| It's not that far from reality: most models see images in very
| low resolution/limited colors, so not so far from this
| description.
| blackmesaind wrote:
| My thoughts as well. I too would have trouble with the
| overlapping lines tests if all the images underwent
| convolution.
| vikramkr wrote:
| They didn't test that claim at all though. Vision isn't some
| sort of 1D sliding scale with every vision condition lying
| along one axis.
|
| First of all myopia isn't 'seeing fine details as blurry' -
| it's nearsightedness - and whatever else this post tested it
| definitely didn't test depth perception.
|
| And second - inability to see fine details is a
| distinct/different thing from not being able to count
| intersections and the other things tested here. That
| hypothesis, if valid, would imply that improving the
| resolution of the image that the model can process would
| improve its performance on these tasks even if reasoning
| abilities were the same. That - does not make sense. Plenty
| of the details in these images that these models are tripping
| up on are perfectly distinguishable at low resolutions.
| Counting rows and columns of blank grids is not going to
| improve with more resolution.
|
| I mean, I'd argue that the phrasing of the hypothesis ("At
| best, like that of a person with myopia") doesn't make sense
| at all. I don't think a person with myopia would have any
| trouble with these tasks if you zoomed into the relevant
| area, or held the image close. I have a very strong feeling
| that these models would continue to suffer on these tasks if
| you zoomed in. Nearsighted != unable to count squares.
| necovek wrote:
| It seems to me they've brought up myopia only to make it
| more approachable to people how blurry something is,
| implying they believe models work with a blurry image just
| like a nearsighted person sees blurry images at a distance.
|
| While myopia is common, it's not the best choice of analogy
| and "blurry vision" is probably clear enough.
|
| Still, I'd only see it as a bad choice of analogy -- I
| can't imagine anyone mistaking optical focus problems for
| static image processing problems -- so in the usual HN
| recommendation, I'd treat their example in the most
| favourable sense.
| jrflowers wrote:
| >I could (well actually I can't)
|
| I like the idea that these models are so good at some sort of
| specific and secret bit of visual processing that things like
| "counting shapes" and "beating a coin toss for accuracy"
| shouldn't be considered when evaluating them.
| vikramkr wrote:
| Those don't really have anything to do with fine
| detail/nearsightedness. What they measured is
| valid/interesting - what they concluded is unrelated.
| valine wrote:
| LLMs are bad at counting things just in general. It's hard to
| say whether the failures here are vision based or just an
| inherent weakness of the language model.
| godelski wrote:
| > Did they try to probe that hypothesis at all?
|
| I think this is a communication issue and you're being a bit
| myopic in your interpretation. It is clearly an analogy meant
| for communication and is not an actual hypothesis. Sure, they
| could have used a better analogy and they could have done other
| tests, but the paper still counters quite common claims (from
| researchers) about VLMs.
|
| > I could (well actually I can't) share some examples from my
| job of GPT-4v doing some pretty difficult fine-grained visual
| tasks that invalidate this.
|
| I find it hard to believe that there is no example you can
| give. It surely doesn't have to be exactly your training data.
| If it is this good, surely you can create an example no
| problem. If you just don't want to, that's okay, but then don't
| say it.
|
| But I have further questions. Do you have complicated
| prompting? Or any prompt engineering? It sure does matter how
| robust these models are to prompting. There's a huge difference
| between a model being able to accomplish a task and a model
| being able to perform a task in a non-very-specific
| environment. This is no different than something working in a
| tech demo and not in the hand of the user.
|
| > But in practice, we aren't just making up tasks to trip up
| these models.
|
| I see this sentiment quite often and it is baffling to me.
|
| First off, these tasks are not clearly designed to trick these
| models. A model failing at a task is not suddenly "designed to
| trick a model." Its common with the river crossing puzzles
| where they're rewritten to be like "all animals can fit in the
| boat." If that is "designed to trick a model", then the model
| must be a stochastic parrot and not a generalist. It is very
| important that we test things where we do know the answer to
| because, unfortunately, we're not clairvoyant and can't test
| questions we don't know the answer to. Which is the common case
| in the real world usage.
|
| Second, so what if a test was designed to trip up a model?
| Shouldn't we be determining when and where models fail? Is that
| not a critical question in understanding how to use them
| properly? This seems doubly important if they are tasks that
| humans don't have challenges with them.
|
| > They can be very performant on some tasks and the authors
| have not presented any real evidence about these two modes.
|
| I don't think people are claiming that large models can't be
| performant on some tasks. If they are, they're rejecting
| trivially verifiable reality. But not every criticism has
| to also contain positive points. There's plenty of papers and a
| lot of hype already doing that. And if we're going to be
| critical of anything, shouldn't it be that the companies
| creating these models -- selling them, and even charging
| researchers to perform these types of experiments that can be
| and are used to improve their products -- should be much more
| clear about the limitations of their models? If we need
| balance, then I think there's bigger fish to fry than Auburn
| and Alberta Universities.
| orbital-decay wrote:
| _> I think this is a communication issue and you 're being a
| bit myopic in your interpretation. It is clearly an analogy
| meant for communication and is not an actual hypothesis._
|
| I don't know, words have meanings. If that's a communication
| issue, it's on the part of the authors. To me, this wording in
| what is supposed to be a research paper abstract clearly
| suggests insufficient resolution as the cause. How else
| should I interpret it?
|
| _> The shockingly poor performance of four state-of-the-art
| VLMs suggests their vision is, at best, like that of a person
| with myopia seeing fine details as blurry_
|
| And indeed, increasing the resolution is expensive, and the
| best VLMs have something like 1000x1000. But the low
| resolution is clearly not the issue here, and the authors
| don't actually talk about it in the paper.
|
| _> I find it hard to believe that there is no example you
| can give._
|
| I'm not the person you're answering to, but I actually lazily
| tried two of authors' examples in a less performant VLM
| (CogVLM), and was surprised it passed those, making me wonder
| whether I can trust their conclusions until I reproduce their
| results. LLMs and VLMs have all kinds of weird failure modes,
| it's not a secret they fail at some trivial tasks and their
| behavior is still not well understood. But working with these
| models and narrowing it down is notoriously like trying to
| nail a jelly to the wall. If I was able to do this in a
| cursory check, what else is there? More than one research
| paper in this area is wrong from the start.
| godelski wrote:
| > I don't know, words have meanings.
|
| That's quite true. Words mean exactly what people agree
| upon them meaning. Which does not require everyone, or else
| slang wouldn't exist. Nor the dictionary, which
| significantly lags. Regardless, I do not think this is even
| an unusual use of the word, though I agree the mention of
| myopia is. The usage makes sense if you consider that both
| myopic and resolution have more than a singular meaning.
| Myopic: lacking in foresight or __discernment__: narrow in
| perspective and without concern for broader implications
|
| Resolution: the process or capability of making
| distinguishable the individual parts of an object, closely
| adjacent optical images, or sources of light
|
| I agree that there are far better ways to communicate. But
| my main gripe is that they said it was "their hypothesis."
| If reading the abstract as a whole, I find it an odd
| conclusion to come to. It doesn't pair with the words that
| follow with blind guessing (and I am not trying to defend
| the abstract. It is a bad abstract). But if you read the
| intro and look at the context of their landing page, I find
| it quite difficult to come to this conclusion. It is poorly
| written, but it is still not hard to decode the key
| concepts the authors are trying to convey.
|
| I feel the need to reiterate that language has 3 key
| aspects to it: the concept attempted to be conveyed, the
| words that concept is lossily encoded into, and the lossy
| decoding of the person interpreting it. Communication
| doesn't work by you reading/listening to words and looking
| up those words in a dictionary. Communication is a problem
| where you use words (context/body language/symbols/etc) to
| decrease the noise and get the receiver to reasonably
| decode your intended message. And unfortunately we're in a
| global world and many different factors, such as culture,
| greatly affect how one encodes and/or decodes language. It
| only becomes more important to recognize the fuzziness
| around language here. Being more strict and leaning into
| the database view of language only leads to more errors.
|
| > But the low resolution is clearly not the issue here, and
| the authors don't actually talk about it in the paper.
|
| Because they didn't claim that image size and sharpness was
| an issue. They claimed the VLM cannot resolve the images
| "as if" they were blurry. Determining what the VLM actually
| "sees" is quite challenging. And I'll mention that arguably
| they did test some factors that relate to blurriness. Which
| is why I'm willing to overlook the poor analogy.
|
| > I actually lazily tried two of authors' examples in a
| less performant VLM (CogVLM), and was surprised it passed
| those
|
| I'm not. Depending on the examples you pulled, 2 random
| ones passing isn't unlikely given the results.
|
| Something I generally do not like about these types of
| papers is that they often do not consider augmentations,
| since these models tend to be quite sensitive to both the
| text (prompt) inputs and image inputs. This is quite common
| in generators in general. Even the way you load in and
| scale an image can have significant performance
| differences. I've seen significant differences in simple
| things like loading an image from numpy, PIL, tensorflow,
| or torch have different results. But I have to hand it to
| these authors, they looked at some of this. In the appendix
| they go through with confusion matrices and look at the
| factors that determine misses. They could have gone deeper
| and tried other things, but it is a more than reasonable
| amount of work for a paper.
| ClumsyPilot wrote:
| > Second, so what if a test was designed to trick up a model?
| Shouldn't we be determining when and where models fail? Is
| that not a critical question in understanding how to use them
| properly?
|
| People are rushing to build this AI into all kinds of
| products, and they actively don't want to know where the
| problems are.
|
| The real world outside is designed to trip up the model.
| Strange things happen all the time.
|
| Because software developers have no governing body, no oaths
| of ethics and no spine someone will end up dead in a ditch
| from malfunctioning AI.
| TeMPOraL wrote:
| > _The real world outside is designed to trip up the model.
| Strange things happen all the time._
|
| Counterpoint: real world is heavily sanitized towards
| things that don't trip human visual perception up too much,
| or otherwise inconvenience us. ML models are trained on
| that, and for that. They're _not_ trained for dealing with
| synthetic images, that couldn't possibly exist in reality,
| _and_ designed to trip visual processing algorithms up.
|
| Also:
|
| > _People are rushing to build this AI into all kinds of
| products, and they actively don't want to know where the
| problems are._
|
| Glass half-full (of gasoline) take: those products will
| trip over real-world problems, identifying them in the
| process, and the models will get better walking over the
| corpses of failed AI-get-rich-quick companies. The people
| involved may not want to know where the problems are, but
| by deploying the models, they'll reveal those problems to
| all.
|
| > _Because software developers have no governing body, no
| oaths of ethics and no spine someone will end up dead in a
| ditch from malfunctioning AI._
|
| That, unfortunately, I 100% agree with. Though AI isn't
| special here - not giving a fuck kills people regardless of
| the complexity of software involved.
| godelski wrote:
| > They're not trained for dealing with synthetic images,
| that couldn't possibly exist in reality, and designed to
| trip visual processing algorithms up
|
| Neither of these claims are true. ML is highly trained on
| synthetic images. In fact, synthetic data generation is
| the way forward for the scale is all you need people. And
| there are also loads of synthetic images out in the wild.
| Everything from line art to abstract nonsense. Just take
| a walk down town near the bars.
|
| > not giving a fuck kills people regardless of the
| complexity of software involved.
|
| What has me the most frustrated is that this "move fast
| break things and don't bother cleaning up" attitude is
| not only common in industry but also in academia. But
| these two are incredibly intertwined these days and it's
| hard to publish without support from industry because
| people only evaluate on benchmarks. And if you're going
| to hack your benchmarks, you just throw a shit ton of
| compute at it. Who cares where the metrics fail?
| ben_w wrote:
| > Because software developers have no governing body, no
| oaths of ethics and no spine someone will end up dead in a
| ditch from malfunctioning AI.
|
| The conclusion and the premise are both true, but not the
| causality. On AI, the Overton window is mostly filled with
| people going "this could be very bad if we get it wrong".
|
| Unfortunately, there's _enough_ people who think "unless I
| do it first" (Musk, IMO) or "it can't possibly be harmful"
| (LeCun) that it will indeed kill more people than it
| already has.
|
| The number who are already (and literally) "dead in a
| ditch" is definitely above zero if you include all the
| things that used to be AI when I was a kid e.g. "route
| finding": https://www.cbsnews.com/news/google-sued-
| negligence-maps-dri...
| itkovian_ wrote:
| I think GPT-4o is probably doing some OCR as preprocessing. It's
| not really controversial to say the VLMs today don't pick up
| fine-grained details - we all know this. You can just look at the
| output of a VAE to know this is true.
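|
| A quick way to do that check (the public SDXL VAE via diffusers;
| the image path is a placeholder):
|
|       # Encode an image to latents and decode it back; thin
|       # lines and small text come back visibly softened.
|       import torch
|       from PIL import Image
|       from torchvision.transforms.functional import (
|           to_tensor, to_pil_image)
|       from diffusers import AutoencoderKL
|
|       vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
|       vae.eval()
|
|       img = Image.open("input.png").convert("RGB").resize((512, 512))
|       x = to_tensor(img).unsqueeze(0) * 2 - 1    # scale to [-1, 1]
|       with torch.no_grad():
|           latents = vae.encode(x).latent_dist.sample()  # 64x64 grid
|           recon = vae.decode(latents).sample
|       out = (recon[0].clamp(-1, 1) + 1) / 2
|       to_pil_image(out).save("roundtrip.png")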
| thomasahle wrote:
| If so, it's better than any other OCR on the market.
|
| I think they just train it on a bunch of text.
|
| Maybe counting squares in a grid was probably not considered
| important enough to train for.
| _flux wrote:
| Why do you think it's probable? The much smaller llava that I
| can run in my consumer GPU can also do "OCR", yet I don't
| believe anyone has hidden any OCR engine inside llama.cpp.
| TeMPOraL wrote:
| Entertaining is indeed the right word. Nice job identifying
| corner cases of models' visual processing; curiously, they're
| not far conceptually from some optical illusions that reliably
| trip humans up. But to call the models "blind" or imply their
| low performance in general? That's _trivially invalidated_ by
| just _taking your phone out and feeding a photo to ChatGPT
| app_.
|
| Like, seriously. One poster below whines about "AI apologists"
| and BeMyEyes, but again, it's all trivially testable with your
| phone and $20/month subscription. It works spectacularly well
| on _real world tasks_. Not perfectly, sure, but good enough to
| be useful _in practice_ and better than alternatives (which
| often don't exist).
| csomar wrote:
| > these huge GenAI models are pretty good at things
|
| Is this the sales pitch though? Because 15 years ago, I had a
| scanner with an app that could scan a text document and produce
| the text on Windows. The machine had something like 256MB of
| RAM.
|
| Tech can be _extremely_ good at niches in isolation. You could
| have an OCR system 10 years ago and it'd be extremely
| reliable at the single task it was configured to do.
|
| AI is supposed to bring a new paradigm, where the tech is not
| limited to the specific niche the developers have scoped it to.
| However, if it _reliably_ fails to detect simple things a
| regular person should not get wrong, then the whole value
| proposition is kicked out of the window.
| kalkin wrote:
| There's definitely something interesting to be learned from the
| examples here - it's valuable work in that sense - but "VLMs
| are blind" isn't it. That's just clickbait.
| diwank wrote:
| I think that while the findings are interesting, the title "Vision
| language models are blind" is hyperbolic and misleading about the
| findings. It is true that VLMs have a very different way of
| sensing and working with image inputs: at low resolutions,
| each image is split up into blocks and then mapped into discrete
| tokens. They really don't have access to fine-grained
| details because this mapping is quite lossy. So, in that sense,
| these results completely make sense and were not surprising to me,
| but "blind" is a strong implication and the work isn't really able
| to back that up.
|
| Take the first example of the four 2D line plots for instance. I
| asked Sonnet 3.5 for an answer and retried it 5 times and 2 out
| of 5 times, it said something like this:
|
| _Me_ :
|
| here's an image with four 2-D line plots. Can you tell me A) how
| many lines are in each plot? and B) how many times do they
| intersect each other in each plot?
|
| Write your answers as: Plot 1 (leftmost): ...
|
| _Sonnet_ :
|
| I'll analyze each plot from left to right and provide the
| requested information:
|
| Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0
| intersections (the lines do not cross)
|
| Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the
| lines cross once near the right edge)
|
| Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the
| lines cross once near the center)
|
| Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2
| intersections (the lines cross twice, once near the left side and
| once near the right side)
|
| Each plot consistently shows two lines, one red and one blue,
| with varying intersection patterns across the four plots.
|
| ---
|
| Which isn't completely correct (the 3rd plot has 2 intersections)
| but is still pretty good; the other 3 times, it said plot 3 has 1
| intersection.
| Jack000 wrote:
| This is kind of the visual equivalent of asking an LLM to count
| letters. The failure is more related to the tokenization scheme
| than the underlying quality of the model.
|
| I'm not certain about the specific models tested, but some VLMs
| just embed the image modality into a single vector, making these
| tasks literally impossible to solve.
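|
| For the text analogy it's easy to see why, once you look at what
| the model actually receives (tiktoken's public cl100k_base
| encoding, just as an example):
|
|       # The model never sees letters, only subword token ids.
|       import tiktoken
|
|       enc = tiktoken.get_encoding("cl100k_base")
|       ids = enc.encode("intersection")
|       print([enc.decode([i]) for i in ids])  # multi-letter chunks
|       # The image analogue: patch/embedding tokens stand in for
|       # pixels, so pixel-level questions get answered from a
|       # lossy summary.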
| JeremyHerrman wrote:
| VLMs so far have never been good at counting objects or spatial
| relationships (e.g. the coffee is to the right of the microwave).
|
| There are ways to help the VLM out - Set of Marks [0] from
| Microsoft being the most prominent, which uses segmentation to
| outline and label sections of the image before sending to the
| VLM.
|
| Providing "speakable" labels to regions helps ground the visual
| abilities of VLMs and is why in this paper the performance is so
| much better when words are present in the grid for "Task 6:
| Counting the rows and columns of a grid"
|
| 0: https://github.com/microsoft/SoM
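|
| A toy version of the idea, just to make it concrete (assumes you
| already have boolean masks from some segmenter; this is not the
| actual SoM code):
|
|       # Draw a numbered box on each segmented region so the VLM
|       # can refer to "region 3" in plain text.
|       import numpy as np
|       from PIL import Image, ImageDraw
|
|       def mark_regions(image, masks):
|           # masks: list of boolean HxW numpy arrays, one per region
|           out = image.copy()
|           draw = ImageDraw.Draw(out)
|           for i, mask in enumerate(masks, start=1):
|               ys, xs = np.nonzero(mask)
|               if len(xs) == 0:
|                   continue
|               box = [int(xs.min()), int(ys.min()),
|                      int(xs.max()), int(ys.max())]
|               draw.rectangle(box, outline="red", width=3)
|               draw.text((box[0] + 4, box[1] + 4), str(i), fill="red")
|           return out
|
|       # marked = mark_regions(Image.open("kitchen.png"), my_masks)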
| jazzyjackson wrote:
| I didn't know counting objects was a problem. That's pretty
| ironic because the very first implementation of a neural net
| (AFAIK) is the numa-rete artificial retina developed at the
| Biological Computer Lab [0] circa 1960. It was a parallel
| analog computer composed of "neurons", each with a photocell
| that could be arranged in a grid and count "the number of
| objects independent of their size, location and form, and
| independent of strength of illumination" [1] - this paper may
| be of interest to those in the field, "Perception of Form in
| Biological and Man Made Systems" Heinz Von Foerster 1962
|
| [0]
| https://distributedmuseum.illinois.edu/exhibit/biological_co...
|
| [1] https://sites.evergreen.edu/arunchandra/wp-
| content/uploads/s...
| empath75 wrote:
| It really shouldn't be surprising that these models fail to
| do anything that _they weren't trained to do_. It's trivially
| easy to train a model to count stuff. The wild thing about
| transformer based models is that their capabilities are _way_
| beyond what you'd expect from token prediction. Figuring out
| what their limitations actually are is interesting because
| nobody fully knows what their limitations are.
| jazzyjackson wrote:
| I agree that these open ended transformers are way more
| interesting and impressive than a purpose built count-the-
| polygons model, but if the model doesn't generalize well
| enough to figure out how to count the polygons, I can't be
| convinced that they'll perform usefully on a more
| sophisticated task.
|
| I agree this research is really interesting, but I didn't
| have an a priori expectation of what token prediction could
| accomplish, so my reaction to a lot of the claims and
| counterclaims of this new tech is that it's good at fooling
| people and giving plausible but baseless results. It makes
| for good research but dangerous in the hands of a market
| attempting to exploit it.
| empath75 wrote:
| > I agree that these open ended transformers are way more
| interesting and impressive than a purpose built count-
| the-polygons model, but if the model doesn't generalize
| well enough to figure out how to count the polygons, I
| can't be convinced that they'll perform usefully on a
| more sophisticated task.
|
| I think people get really wrapped into the idea that a
| single model needs to be able to do all the things, and
| LLMs can do a _lot_, but there doesn't actually need to
| be a _one model to rule them all_. If VLMs are kind of
| okay at image interpretation but not great at details, we
| can supplement them with something that _can_ handle the
| details.
| Eisenstein wrote:
| Vision models use CLIP or something similar, which has no
| conception of anything specific in the image. It sees
| embeddings which correlate similarly to text embeddings. Take
| an image then describe it 'there are birds sitting on a power
| line in front of a blue sky with some clouds', get the
| embeddings from that and the embeddings from that picture and
| line them up. If you ask if there are birds in it, it would
| know, but not how many, unless it was common to describe the
| number of birds sitting on things and it happened often enough
| that the number counted was the number in the image
| descriptions it trained on. If you want to count objects you
| want something like YOLO.
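|
| For example (YOLOv8 via the ultralytics package, just one
| off-the-shelf option; the image path is a placeholder):
|
|       # Count objects with a detector rather than a VLM.
|       from ultralytics import YOLO
|
|       model = YOLO("yolov8n.pt")      # small pretrained COCO model
|       result = model("birds_on_wire.jpg")[0]
|       bird_id = [k for k, v in result.names.items()
|                  if v == "bird"][0]
|       count = int((result.boxes.cls == bird_id).sum())
|       print(f"birds detected: {count}")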
| JeremyHerrman wrote:
| VLMs like PaliGemma and Florence-2 support object detection
| and segmentation, so it's becoming more common to have YOLO
| like capabilities built into VLMs.
|
| Another benefit of VLMs which support object detection is
| that they are open vocabulary, meaning you don't have to
| define the classes ahead of time. Additionally fine tuning
| tends to keep the previous detection capabilities instead of
| erasing all previous classes like fine tuning YOLO.
| GaggiX wrote:
| Well, all the models (especially Claude 3.5 Sonnet) seem to
| perform much better than random, so they are clearly not blind.
| The only task where Claude Sonnet 3.5 does not perform better
| than random is the one where you have to follow many different
| paths (the ones where the answer from A to C is 3), something
| that would take me several seconds to solve.
|
| I have the feeling that they first chose the title of the paper
| and then ran the evaluation on the new Claude 3.5 Sonnet on these
| abstract images.
|
| >their vision is, at best, like that of a person with myopia
| seeing fine details as blurry
|
| This also makes no sense, since the images evaluate the abstract
| capabilities of the models, not their eyesight.
| randcraw wrote:
| OK. They're _legally_ blind.
| GaggiX wrote:
| This really has nothing to do with vision impairment.
| iamleppert wrote:
| This could easily be fixed with training and fine tuning. Simply
| generate 100,000 examples or so, and train with ground truth
| for however long you want and it's a solved problem.
| kristjansson wrote:
| Solved for this benchmark... and at what cost to the rest of
| the system?
|
| These tasks are interesting because they're existence proofs of
| generalization failure. Like the haystack problem, direct
| solutions here are much less interesting than structural
| improvements that address the class of failure.
| imtringued wrote:
| Ok, but most of the data is just captions for images. You're
| going to have to invest some time into building this dataset at
| your own expense.
| _vaporwave_ wrote:
| It's really interesting that there's a huge performance
| discrepancy between these SOTA models. In the Olympic logo
| example, GPT-4o is below the baseline accuracy of 20% (worse than
| randomly guessing) while Sonnet-3.5 was correct ~76% of the time.
|
| Does anyone have any technical insight or intuition as to why
| this large variation exists?
| ec109685 wrote:
| The question wasn't "yes or no" but instead required an exact
| number:
| https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
|
| Playing around with GPT-4o, it knows enough to make a copy of
| an image that is reasonable but it still can't answer the
| questions.
|
| ChatGPT went down a rabbit hole of trying to write python code,
| but it took lots of prompting for it to notice its mistake when
| solving one of the intersecting line questions.
| londons_explore wrote:
| Could some of the "wrong" answers be the LLM attempting to give
| an explanation rather than the answer? E.g., instead of answering
| 'X', the LLM answers 'The letter is partially hidden by the oval,
| so I cannot be certain, but it appears to be the English letter X'.
|
| The scoring criteria would score this answer as 'T', which is
| wrong.
| simonw wrote:
| I've been generally frustrated at the lack of analysis of vision
| LLMs generally.
|
| They're clearly a very exciting category of technology, and a
| pretty recent one - they only got good last October with GPT-4
| Vision, but since then we've had more vision models from
| Anthropic and Google Gemini.
|
| There's so much more information out there about text prompting
| compared to image prompting. I feel starved for useful
| information about their capabilities: what are vision models good
| and bad at, and what are the best ways to put them to work?
| r2_pilot wrote:
| Why not use them yourself if you have access? I have been using
| Claude 3.5 Sonnet for gardening recently, and while it's not
| perfect(and can be a little blind unless you tell it to focus
| on a specific thing), it's helped me understand how to keep my
| plants alive in some challenging conditions(for me; this is my
| second or third attempt at gardening so it's all challenging
| lol). But just a experiment with it and see where the
| capabilities lie. I do agree that certain classes of visual
| data are challenging for it.
| simonw wrote:
| I've used them a bunch. I want to learn from other people's
| experiences as well.
|
| Some of my notes so far:
|
| - https://simonwillison.net/2024/Apr/17/ai-for-data-
| journalism... - my datasette-extract plugin, for structured
| data from both text and images
|
| - https://simonwillison.net/2024/Apr/17/ai-for-data-
| journalism... - where they failed to extract data from a
| handwritten scanned document in various weird ways
|
| - https://simonwillison.net/2024/Feb/21/gemini-pro-video/
| talks about video inputs to Gemini Pro (which are actually
| image inputs, it splits them up to one frame per second)
| simonw wrote:
| Anthropic have some interesting cookbook examples that provide
| advice on using their multimodal models here:
| https://github.com/anthropics/anthropic-cookbook/tree/main/m...
|
| I've assembled a bunch more notes here:
| https://simonwillison.net/tags/vision-llms/
| mglz wrote:
| I taught some computational geometry courses, and efficiently
| computing the intersections of N line segments is not as
| straightforward as you might initially think. Since somewhere
| some computation must be done to recognize this, and LLMs are not
| specifically trained for this task, it's not surprising they
| struggle.
|
| In general, basic geometry seems under-explored by learning.
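|
| Even the naive version takes a little care; the sketch below is
| the O(n^2) pairwise check, and the classic efficient answer is a
| Bentley-Ottmann style sweep line, which is considerably more
| fiddly to get right:
|
|       # Count pairwise crossings of line segments (brute force,
|       # ignoring collinear edge cases).
|       from itertools import combinations
|
|       def ccw(a, b, c):
|           return (c[1]-a[1]) * (b[0]-a[0]) > (b[1]-a[1]) * (c[0]-a[0])
|
|       def crosses(s, t):
|           (p1, p2), (p3, p4) = s, t
|           return (ccw(p1, p3, p4) != ccw(p2, p3, p4)
|                   and ccw(p1, p2, p3) != ccw(p1, p2, p4))
|
|       segs = [((0, 0), (2, 2)), ((0, 2), (2, 0)), ((3, 0), (3, 2))]
|       print(sum(crosses(a, b) for a, b in combinations(segs, 2)))  # 1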
| jordan_bonecut wrote:
| Yes, but so is telling if a photo contains a dog or
| understanding sentiment in a paragraph of text. Complexity
| isn't quite the issue, I think it is that there is a
| distinction between the type of reasoning which these models
| have learnt and that which is necessary for concrete
| mathematical reasoning.
| slashdave wrote:
| The models do not reason. They have learned associations,
| because these associations have appeared in their training
| sets.
| samatman wrote:
| > _Since somewhere some computation must be done to recognize
| this_
|
| Humans don't have a "compute intersections" ability (other than
| a few who have learned it laboriously through algebra), we have
| a "see things and count them" mechanism. We aren't visually
| taking lines in a planar space and determining where they
| cross. We know what an intersection looks like, we see one,
| increment a counter, and find the next one. If it's less than
| around five, we do this all at once. Otherwise we literally
| count, sometimes in small groups, sometimes one at a time.
| orbital-decay wrote:
| That's not anything like "myopia", though.
|
| FWIW I tried the line intersection and the circled letter test
| from the article with CogVLM (which is far from reaching the
| current SotA) and it correctly passed both. I haven't tried it
| with Sonnet/4o but I suspect there might be something wrong with
| how the author did their tests. Don't get me wrong, but too many
| "the model can't do that" claims ended up with demonstrations of
| the model doing exactly that...
| nyxtom wrote:
| I wonder how well Alpha Geometry would do on this
| nybsjytm wrote:
| AlphaGeometry is a hyper-specific system trained to add
| auxiliary geometric objects, like extra lines, to existing
| Euclidean geometry configurations. These prompts are not even
| sensible inputs to AlphaGeometry.
| hi_dang_ wrote:
| I was hoping that someone in the comments talking the paper down
| would have published a paper or have had relevant publications of
| their own to point to. You know, meet the lads halfway sort of
| thing.
|
| So what I'm left with to judge instead is anonymous online
| commenters vs. the publication of 2 prestigious universities.
| Whose word do I take on this? Decisions, decisions.
|
| You could swap "LM" out for "Web3", "NFT", or "Crypto" in this
| case.
| warkdarrior wrote:
| > I'm left with [...] is anonymous online commenters vs. the
| publication of 2 prestigious universities. Whose word do I take
| on this?
|
| Maybe you need to judge the contents of those online comments
| and the contents of the publication, instead of relying on
| argument from authority.
| vessenes wrote:
| A few comments below talk about how tokenizing images using stuff
| like CLIP de-facto yields blurry image descriptions, and so these
| are 'blind' by some definitions. Another angle of blurring not
| much discussed is that the images are rescaled down; different
| resolutions for different models. I wouldn't be surprised if
| Sonnet 3.5 had a higher-res base image that it feeds into the
| model.
|
| Either way, I would guess that we'll need new model architectures
| for multimodal to get really good at some of this, and even then
| some of these tasks are adjacent to things that we know LLMs are
| already bad at (numeric logic, for instance).
|
| As context lengths get longer, devoting more tokens to the image
| tokenization should help a bit here as well. Anyway, I'd
| anticipate next year we'd see 80s and 90s for most of these
| scores with next gen models.
| imtringued wrote:
| The problem with the current crop of projectors, such as the one
| in LLaVA, is that, as far as I know, they do not take the
| previous conversation into account. You only really get zero-shot
| responses. This means that you cannot steer the model towards
| paying attention to specific instruction-related details. The
| projector simply
| creates a token representation of the visuals (not necessarily
| human language tokens) and the LLM just processes that as
| usual.
| vessenes wrote:
| The original GPT-4 did this too; it had almost no memory of
| anything before or after the image provided. I haven't tested
| GPT-4o on
| this directly, but my feeling is that it's better from casual
| usage.
|
| I do think some of these thin line drawings are likely extra
| hard to tokenize depending on the image scaling sizes for
| tokenization. I'd wager thicker lines would help, although
| obviously not all of this is just 'poor tokenization'.
| ec109685 wrote:
| At least for gpt 4o, it can create a facsimile of images that
| it still can't analyze properly, so I think it's more than just
| its "eyes" that are broken.
|
| It clearly wasn't trained on this task and suffers accordingly.
|
| However, with ChatGPT, it will write Python to do the analysis
| and gets better results.
| spullara wrote:
| in other news, vision models are bad at things they aren't
| trained to do
| akavi wrote:
| Speaking as someone with only a tenuous grasp of how VLMs work,
| this naively feels like a place where the "embodiment" folks
| might have a point: Humans have the ability to "refine" their
| perception of an image iteratively, focusing in on areas of
| interest, while VLMs have to process the entire image at the same
| level of fidelity.
|
| I'm curious if there'd be a way to emulate this (have the visual
| tokens be low fidelity at first, but allow the VLM to emit tokens
| that correspond to "focusing" on a region of the image with
| greater resolution). I'm not sure if/how it's possible to
| performantly train a model with "interactive" data like that,
| though
| efskap wrote:
| Isn't this the attention mechanism, the reason we're using
| transformers for these things? Maybe not greater resolution per
| se, but focusing on a region with greater neural connectivity
| akavi wrote:
| Ah, good point!
|
| But the model is downstream of the "patch" tokenization, so
| the cut-down in resolution (compression) of the image has
| already occurred _prior_ to the point where the model can
| direct greater "attention".
|
| I think the synthesis is that I'm proposing a per-pixel
| tokenization with a transformer block whose purpose is to
| output information at a compression level "equivalent" to
| that of the patch tokens (is this what an autoencoder is?),
| but where the attention vector is a function of the full
| state of the LLM (ie, inclusive of the text surrounding the
| image).
|
| Naively, I'd think a layer like this that is agnostic to the
| LLM state needn't be any more computationally costly than the
| patching computation (both are big honks of linear algebra?),
| but idk how expensive the "full context attention" feedback
| is...
|
| (I apologize to anyone who actually understands transformers
| for my gratuitous (ab|mis)use of terminology)
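|
| (To make the "compression happens before attention" point
| concrete, here's a minimal sketch of a standard ViT-style patch
| embedding in PyTorch; the sizes are illustrative, not any
| particular model's:)
|
|     import torch
|     import torch.nn as nn
|
|     patch, d_model = 16, 768
|     # every 16x16 pixel patch becomes exactly one token *before*
|     # any attention sees the image, so attention can only re-weight
|     # whole patches, never individual pixels
|     to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
|
|     img = torch.randn(1, 3, 224, 224)  # dummy image
|     tokens = to_tokens(img).flatten(2).transpose(1, 2)
|     print(tokens.shape)  # torch.Size([1, 196, 768])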
| Brechreiz wrote:
| >Humans have the ability to "refine" their perception of an
| image iteratively
|
| That's not related to embodied cognition.
| akavi wrote:
| Is embodied cognition not at least in part about
| interactivity? I perform action (emit tokens) and receive
| feedback (non-self-generated tokens)
| kromem wrote:
| Lots and lots of eye tracking data paired with what was being
| looked at in order to emulate human attention processing might
| be one of the lower hanging fruits for improving it.
| caddemon wrote:
| Humans are actually born with blurry vision as the eye takes
| time to develop, so human learning starts with low resolution
| images. There is a theory that this is not really a limitation
| but a benefit in developing our visual processing systems.
| People in poorer countries who get cataracts removed when they
| are a bit older, and who should at that point have perfect
| vision hardware-wise, still seem to have some lifelong deficits.
|
| It's not entirely known how much early learning in low
| resolution makes a difference in humans, and obviously that
| could also relate more to our specific neurobiology than a
| general truth about learning in connectionist systems. But I
| found it to be an interesting idea that maybe certain outcomes
| with ANNs could be influenced a lot by training paradigms s.t.
| not all shortcomings could be addressed with only updates to
| the core architecture.
| slashdave wrote:
| These models have learned to focus on specific portions of an
| image (after all, this is the stated purpose of a transformer).
| tantalor wrote:
| Are the "random-baseline accuracy" numbers correct?
|
| In the "Two circles" test, do they really have 50% chance of
| overlapping? I think this comes from "Distances between circle
| perimeters: -0.15 to 0.5 times the diameter", but it doesn't say
| what distribution they use.
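|
| A quick back-of-the-envelope check, assuming the perimeter
| distance is drawn uniformly (which the paper doesn't state):
|
|     # fraction of samples where the circles actually overlap,
|     # i.e. perimeter distance < 0, if distance ~ Uniform(-0.15, 0.5)
|     lo, hi = -0.15, 0.5
|     p_overlap = (0 - lo) / (hi - lo)
|     print(p_overlap)  # ~0.23, so not 50% under that assumption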
| jdlshore wrote:
| They asked the AI a question with a yes/no response. If the AI
| chose randomly, it would be correct 50% of the time. That's
| what "random baseline accuracy" means.
| jeromeparadis wrote:
| One use-case I always try is to have an AI read a school
| calendar image where days off or days of interest are
| highlighted using a legend, i.e. days marked with a square,
| circle, triangle, or a different color, etc.
|
| When asking for specific days of interest for the school
| year, AIs always struggle. They get some days right but forget
| some or fabricate new days. They fare a bit better if you remove
| some of the noise and give them only a picture of a month, but
| even then, it's unreliable.
| verbalstoner wrote:
| It's virtually impossible to take a paper seriously when the
| title has an emoji.
| axblount wrote:
| Would you say they have _Blindsight_?
| pjs_ wrote:
| I don't like this paper for the following reasons:
|
| - The language is unnecessarily scathing
|
| - They repeatedly show data where the models are getting things
| _right_ 70, 80, 90% of the time, and then show a list of what
| they call "qualitative samples" (what does "qualitative" mean?
| "cherry-picked"?) which look very bad. But it got the answer
| right 70/80/90% of the time! That's hardly "blind"...
|
| - Several of the tasks hinge on the distinction between two
| objects "exactly touching" vs. "very nearly touching" vs. "very
| slightly overlapping", a problem which (i) is hard for humans and
| (ii) is particularly (presumably deliberately) sensitive to
| resolution/precision, where we should not be surprised that
| models fail
|
| - The main fish-shaped example given in task 1 seems genuinely
| ambiguous to me - do the lines "intersect" once or twice? The
| tail of the fish clearly has a crossing, but the nose of the fish
| seems a bit fishy to me... is that really an intersection?
|
| - AFAIC deranged skepticism is just as bad as deranged hype, the
| framing here is at risk of appealing to the former
|
| It's absolutely fair to make the point that these models are not
| perfect, fail a bunch of the time, and to point out the edge
| cases where they suck. That moves the field forwards. But the
| hyperbole (as pointed out by another commenter) is very annoying.
| neuronet wrote:
| To be fair, the paper has an emoji in the _title_, so I
| wouldn't read it as a particularly serious
| academic study as much as the equivalent of the Gawker of AI
| research. It is a "gotcha" paper that exploits some blind spots
| (sorry) that will easily be patched up with a few batches of
| training. I do think it highlights the lack of AGI in these
| things, which some people lacking situational awareness might
| need to see.
| numeri wrote:
| I'm also confused about some of the figures' captions, which
| don't seem to match the results:
|
| - "Only Sonnet-3.5 can count the squares in a majority of the
| images", but Sonnet-3, Gemini-1.5 and Sonnet-3.5 all have
| accuracy of >50%
|
| - "Sonnet-3.5 tends to conservatively answer "No" regardless of
| the actual distance between the two circles.", but it somehow
| gets 91% accuracy? That doesn't sound like it tends to answer
| "No" regardless of distance.
| schneehertz wrote:
| I am not sure where their experimental data came from. I tested
| it on GPT-4o using the prompt and images they provided, and the
| success rate was quite high, with significant differences from
| the results they provided.
| ec109685 wrote:
| Their examples are here: https://huggingface.co/datasets/XAI/
| vlmsareblind/viewer/defa...
|
| ChatGPT whiffs completely on very obvious images.
| cpill wrote:
| I wonder how they would score if they used all 4 models and took
| a majority vote...?
| aaroninsf wrote:
| The title for this page and argument should be qualified with the
| specific generation of tools.
|
| That's in the abstract, but it's bad not to be specific. In this
| case, because current public-facing models are WIWEB: the worst
| it will ever be.
|
| And there are trillion-dollar prizes at stake, so, improvement is
| happening as quickly as it possibly can.
| make3 wrote:
| Hugged to death from my perspective. Here is a backup:
| https://archive.ph/kOE3Q
| simonw wrote:
| That's weird - GitHub Pages serves static content and rarely
| (in my experience) fails to load.
| jetrink wrote:
| I had a remarkable experience with GPT-4o yesterday. Our garage
| door started to fall down recently, so I inspected it and found
| that our landlord had installed the wire rope clips incorrectly,
| leading to the torsion cables losing tension. I didn't know what
| that piece of hardware was called, so I asked ChatGPT and it
| identified the part as I expected it to. As a test, I asked if
| there was anything notable about the photo. ChatGPT correctly
| identified that the cables were installed backwards, with the
| side of the cable that was (previously) under tension on top of
| the slack end, instead of sandwiched securely in the middle. To
| diagnose that requires tracing the cable through space and
| inferring which end is under tension from the geometry, though I
| can't rule out an educated guess.
|
| What was really remarkable though was that it failed to notice
| that one of the two nuts was obviously missing, even after I told
| it there was a second problem with the installation.
|
| Screenshot: https://imgur.com/a/QqCNzOM
| sfink wrote:
| A _human_ would need to trace the cable. An LLM may just be
| responding based on (1) the fact that you're asking about the
| clip in the first place, and that commonly happens when there's
| something wrong; and (2) that this is a very common failure
| mode. This is supported by it bringing up the "never saddle a
| dead horse" mnemonic, which suggests the issue is common.
|
| After you fix it, you should try asking the same questions!
| fn-mote wrote:
| As a human, I was unable to see enough in that picture to infer
| which side was supposed to be under tension. I'm not trained,
| but I know what I expected to see from your description.
|
| Like my sister post, I'm skeptical that the LLM didn't just get
| lucky.
| nmca wrote:
| please use this opportunity to reflect on whether ARC measures
| reasoning skills :)
| gnutrino wrote:
| My guess is that the systems are running image recognition
| models, and maybe OCR on images, and then just piping that data
| as tokens into an LLM. So you are only ever going to get results
| as good as existing image models with the results filtered
| through an LLM.
|
| To me, this is only interesting if compared with results of image
| recognition models that can already answer these types of
| questions (if they even exist, I haven't looked).
|
| Maybe the service is smart enough to look at the question, and
| then choose one or more models to process the image, but I'm not
| sure, as I can't find anything on their sites about how it works.
| Eisenstein wrote:
| > My guess is that the systems are running image recognition
| models
|
| Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an
| idea of how they work.
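|
| A minimal sketch of the idea, using the Hugging Face transformers
| library and the standard public CLIP checkpoint (production VLMs
| generally pass patch-level features to the LLM rather than this
| single pooled vector, but the lossy-compression point is the
| same; "two_circles.png" is a hypothetical local file):
|
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     name = "openai/clip-vit-base-patch32"
|     model = CLIPModel.from_pretrained(name)
|     processor = CLIPProcessor.from_pretrained(name)
|
|     image = Image.open("two_circles.png")
|     inputs = processor(images=image, return_tensors="pt")
|
|     # the whole image is collapsed into one 512-dim vector: good
|     # enough for "what is in this picture", lossy for exact geometry
|     embedding = model.get_image_features(**inputs)
|     print(embedding.shape)  # torch.Size([1, 512])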
| gnutrino wrote:
| Will do, thank you.
| simonw wrote:
| That's not how they work. The original GPT-4 paper has some
| detail: https://cdn.openai.com/papers/gpt-4.pdf
|
| Or read up on PaliGemma: https://github.com/google-
| research/big_vision/blob/main/big_...
| gnutrino wrote:
| Thanks, I'll read up on this.
| nichohel wrote:
| Vision language models are blind because they lack the Cartesian
| Theater, which you and I have. Which you and I say we have.
| codeulike wrote:
| Does the part of you that 'looks at' your cartesian theatre
| also have a cartesian theatre?
| fleshmonad wrote:
| [citation needed]
| viraptor wrote:
| I love some of the interpretations there. For example "Fig. 10:
| Only Sonnet-3.5 can count the squares in a majority of the
| images.", when that model simply returns "4" for every question
| and happens to be right.
| jackblemming wrote:
| Ask it to draw any of those things and it can.
| mkoubaa wrote:
| They interact with pixel buffers as a mathematical array. To call
| them blind is to confuse what they are doing with the experience
| of sight...
| codeulike wrote:
| Humans 'see' by tightly packed rods and cones in the retina
| sending signals up the optic nerve. Not as tidy as a
| mathematical array but nonetheless not all that different.
| Ultimately what comes to the brain from the retina can be
| thought of as a data structure of sorts.
| Rebuff5007 wrote:
| In fairness, Mira Murati said GPT-4 is only high school level
| [1]. Maybe it takes PhD level to understand basic shapes?
|
| [1] https://www.ccn.com/news/technology/openais-gpt-5-phd-
| level-...
| jordan_bonecut wrote:
| This is an interesting article and fits with my understanding of
| how such models interpret input data. I'm not sure I
| would characterize the results as blurry vision, but maybe an
| inability to process what they see in a concrete manner.
|
| All the LLMs and multi-modal models I've seen lack concrete
| reasoning. For instance, ask ChatGPT to perform two tasks: to
| summarize a chunk of text and to count how many words are in that
| chunk. ChatGPT will do a very good job summarizing the text and
| an awful job at counting the words. ChatGPT and all the
| transformer based models I've seen fail at similar
| concrete/mathematical reasoning tasks. This is the core problem
| of creating AGI and it generally seems like no one has made any
| progress towards synthesizing something with both a high and low
| level of intelligence.
|
| My (unproven and probably incorrect) theory is that under the
| hood these networks lack information-processing loops, which
| makes recursive tasks, like solving a math problem, very
| difficult.
| scarface_74 wrote:
| Out of curiosity, I tried your test with ChatGPT 4o
|
| https://chatgpt.com/share/79c5c6e1-e6a9-441b-acb3-54882303a8...
|
| Of course as usual, LLMs are horrible with Math.
|
| Funny enough, the next time it verified the word count by
| counting it out until I specifically told it to use Python
|
| https://chatgpt.com/share/79e7b922-9b0f-4df9-98d0-2cd72d7041...
| infiar wrote:
| This counting words task reminded me of a youtube video:
| https://www.youtube.com/watch?v=-9XKiOXaHlI Maybe LLMs are
| somehow more like monkeys.
| empiricus wrote:
| I hope you are aware of the fact that LLMs do not have direct
| access to the stream of words/characters. It is one of the most
| basic things to know about their implementation.
| jordan_bonecut wrote:
| Yes, but it could learn to associate tokens with word counts
| as it could with meanings.
|
| Even so, if you ask it for a token count it would still
| fail. My point is that it can't count; the circuitry required
| to do so seems absent in these models.
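|
| A quick illustration of that mismatch (a sketch using the
| tiktoken library; exact token counts vary by model and encoding):
|
|     import tiktoken
|
|     text = "Vision language models are blind"
|     enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
|
|     words = text.split()
|     token_ids = enc.encode(text)
|
|     # the model only ever "sees" the token IDs, never the words,
|     # so word counting has to be learned indirectly
|     print(len(words), len(token_ids))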
| randomtree wrote:
| I guess I know what's coming to every captcha tomorrow.
| michaelhoney wrote:
| This says to me that there are huge opportunities for improvement
| in providing vision modules for LLMs. Human minds aren't made of
| just one kind of thing: we have all sorts of hacky modular
| capabilities - there's no reason to think that a future AGI
| wouldn't also.
| joelburget wrote:
| Vision Transformers do a shocking amount of compression in the
| tokenizer. In the [Chameleon
| paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer
| "encodes a 512 x 512 image into 1024 discrete tokens from a
| codebook of size 8192". That's 256 pixels per token (512 * 512 /
| 1024). If we assume that a pixel is 24 bits (3x 8 bit channels),
| this implies that they've compressed 256 * 24 = 6144 bits into
| log2(8192) = 13 bits. [An Image is Worth 32 Tokens for
| Reconstruction
| and Generation](https://yucornetto.github.io/projects/titok.html)
| pushes this even further. If these models work similarly, it's no
| wonder they struggle with some vision tasks.
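|
| The arithmetic spelled out, using the numbers quoted from the
| Chameleon paper (raw pixel bits overstate the useful information,
| so treat the ratio as illustrative):
|
|     import math
|
|     pixels = 512 * 512   # image size
|     tokens = 1024        # discrete tokens per image
|     codebook = 8192      # codebook size
|
|     pixels_per_token = pixels / tokens          # 256
|     bits_per_token = math.log2(codebook)        # 13
|     raw_bits_per_token = pixels_per_token * 24  # 6144 (3 x 8-bit channels)
|
|     print(raw_bits_per_token / bits_per_token)  # roughly a 470x reduction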
| energy123 wrote:
| GPT-4o is very good at some visual tasks like optical character
| recognition. So the selective blindness might just be what you
| say here -- all of its capacity is dedicated to minimizing loss
| on a few narrow tasks that had the most training data (like
| OCR). So it's not necessarily an inherent failure of the
| architecture to generalize, it could just be a capacity issue
| that will naturally be resolved with more scale.
| sushid wrote:
| Is that not just traditional OCR applied on top of LLM?
| energy123 wrote:
| It's possible they have a software layer that does that.
| But I was assuming they don't, because the open source
| multimodal models don't.
| maxlamb wrote:
| No it's not, it's a multimodal transformer model.
| buryat wrote:
| For some reason I started thinking about trying to describe the
| taste of a fruit to someone who hasn't tried it, which seems
| like a similar problem in a non-visual sensory modality.
| ec109685 wrote:
| It's not as simple as that. If you ask GPT-4o to create a copy
| of these images, it generally creates one faithfully (e.g. an
| image with 5 squares will be produced), so it's "seeing" things
| reasonably enough.
|
| It doesn't seem to have the logic though to answer these
| questions.
|
| The complete data set is here to play around with it yourself:
| https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
| kristianpaul wrote:
| We see through thoughts and memories. We see when we desire; the
| vision just adds on a world of thoughts and consciousness of
| being conscious.
|
| Vision links thoughts with reality
| navaed01 wrote:
| Is there a good primer on how these vision LLMs work?
| yantrams wrote:
| Tested these problems with llava-v1.6-mistral-7b and the results
| aren't bad. Maybe I just got lucky with these samples
|
| Intersecting Lines
| https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc
|
| Overlapping Circles
| https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384
|
| Touching Circles
| https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc
|
| Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc
|
| Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
| simonw wrote:
| These are really interesting examples, thanks for sharing.
| yantrams wrote:
| You're welcome. I recently noticed I get better performance
| with VLMs when the queries are phrased this way - descriptive
| keys instead of explaining the problem in sentences. Similar
| to CoT reasoning, which many people claim gives better results,
| I personally found that querying in this sequence -
| existenceOfEntity, numberOfEntities, followed by
| propertiesOfEntities, etc. - tends to give better results. I
| haven't verified any of this rigorously so please do take it
| with a pinch of salt :)
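|
| A rough example of what I mean, with the key names from above
| (the exact wording is not important and I haven't benchmarked
| this particular prompt):
|
|     prompt = """Answer as JSON with these keys:
|     - existenceOfEntity: are there any circles in the image? (yes/no)
|     - numberOfEntities: how many circles are there?
|     - propertiesOfEntities: for each circle, does it overlap or
|       touch another circle?
|     """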
| poikroequ wrote:
| It's ironic: they fail these seemingly simple tests that are
| trivial even for a child to solve. Yet, I used Gemini to read a
| postcard containing handwritten Russian cursive text with lots of
| visual noise (postmarks and whatnot). It was able to read the
| text and translate it into English. I didn't even need to tell it
| the text is Russian.
|
| On the one hand, it's incredible what these LLMs are capable of.
| On the other hand, they often fall flat on their face with
| seemingly simple problems like this. We are seeing the same from
| self driving cars, getting into accidents in scenarios that
| almost any human driver could have easily avoided.
| slashdave wrote:
| Simple for a child, yes. Because we have evolved our vision to
| recognize patterns like this, because they are important for
| survival. Reading Russian is not.
|
| From an algorithmic point of view, these vision tasks are
| actually quite difficult to explicitly program.
| nothrowaways wrote:
| The next version will solve all of it.
| childintime wrote:
| Claude 3.5 does remarkably well though on many tasks, compared to
| the others, and on those it's not at all blind. It's getting
| there.
___________________________________________________________________
(page generated 2024-07-11 23:02 UTC)