[HN Gopher] Vision language models are blind
       ___________________________________________________________________
        
       Vision language models are blind
        
       Author : taesiri
       Score  : 428 points
       Date   : 2024-07-10 13:35 UTC (1 days ago)
        
 (HTM) web link (vlmsareblind.github.io)
 (TXT) w3m dump (vlmsareblind.github.io)
        
       | taesiri wrote:
       | This paper examines the limitations of current vision-based
       | language models, such as GPT-4 and Sonnet 3.5, in performing low-
       | level vision tasks. Despite their high scores on numerous
       | multimodal benchmarks, these models often fail on very basic
       | cases. This raises a crucial question: are we evaluating these
       | models accurately?
        
       | rezaghanbari1 wrote:
        | Some of these samples are shocking. How do these models answer
        | chart-based questions when they can't even count the
        | intersections between two lines?
        
         | RodgerTheGreat wrote:
         | Same way they answer any question: piece together a
         | statistically probable sequence of words to follow the prompt.
         | All they know about an image is a handful of words a classifier
         | might choose to describe it. If those words have nothing to do
         | with the question being asked, they can't nudge the model in
          | the general direction of a correct answer, so it's a
          | crapshoot, even more so than usual.
        
         | imtringued wrote:
         | The dataset most likely contains chart descriptions that
         | describe the raw data, but not the visual interactions of the
         | individual pixels.
        
       | dheera wrote:
        | Current approaches to multi-modal models work on embeddings and
       | tokenizations of images, which is the fundamental problem: you
       | are feeding blurry, non-precise data into the model. Yes, they
       | are blind because of exactly this.
       | 
       | An embedding isn't conceptually that much different from feeding
       | a 1024-word description of an image instead of the actual image.
       | 
       | At the moment compute power isn't good enough to feed high-res
       | pixel data into these models, unless we discover a vastly
       | different architecture, which I am also convinced likely exists.
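        | 
        | A rough back-of-the-envelope sketch of the information budget
        | (the 1024-dim embedding width is just an assumed example, not
        | any particular model's):
        | 
        |     # Raw pixels vs. a single image embedding, in bytes.
        |     # The embedding width (1024 float32s) is an assumption.
        |     width, height, channels = 1024, 1024, 3
        |     raw_bytes = width * height * channels        # 8-bit RGB
        |     embedding_bytes = 1024 * 4                   # float32
        | 
        |     print(f"raw image: {raw_bytes:,} bytes")        # 3,145,728
        |     print(f"embedding: {embedding_bytes:,} bytes")  # 4,096
        |     print(f"ratio: ~{raw_bytes // embedding_bytes}x")  # ~768x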
        
         | jayd16 wrote:
         | Doesn't Gemini have a 2 million token limit for exactly this?
        
           | diwank wrote:
            | The number of tokens _per image_ is actually fairly small,
           | ranging from 85 to ~500.
        
         | visarga wrote:
         | > An embedding isn't conceptually that much different from
         | feeding a 1024-word description of an image instead of the
         | actual image.
         | 
          | An embedding needs fewer words. You can embed individual
          | words, phrases, a whole prompt, or longer paragraphs; you
          | don't need 1024 words for a text embedding. For example, a
          | famous library for this is Sentence BERT (sbert).
          | 
          | When you embed images, on the other hand, you cut them up into
          | little squares on the order of 32x32 px and embed each of them
          | separately. ChatGPT uses something like 250 tokens for smaller
          | images. So a smaller image costs about as much as 200 words if
          | represented graphically, and maybe far fewer words if you
          | embed a text description of it.
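          | 
          | A minimal sketch of that arithmetic (32x32 px patches is the
          | figure above; real models differ in patch size and in how
          | they resize or tile the input):
          | 
          |     # Patch tokens a ViT-style encoder produces per image,
          |     # assuming 32x32 px patches.
          |     import math
          | 
          |     def patch_tokens(width, height, patch=32):
          |         return math.ceil(width / patch) * math.ceil(height / patch)
          | 
          |     for w, h in [(224, 224), (512, 512), (1024, 1024)]:
          |         print(f"{w}x{h}: {patch_tokens(w, h)} tokens")
          |     # 224x224: 49, 512x512: 256, 1024x1024: 1024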
        
           | dheera wrote:
           | > needs less words
           | 
           | Yes I'm aware of this, and work in ML -- the thing is
           | embeddings are not designed for faithful image
           | reconstruction, and aren't even trained that way. You can
           | easily find two images that have substantially similar CLIP
           | (or whatever) embeddings that are visually very different. If
           | you query the LLM about that difference, the LLM wouldn't
           | even have the information to differentiate answers for the
           | two images if you only supply it with the embedding.
           | 
           | On the other hand, SDXL autoencoder latents passed into an
           | LLM alongside the embedding _might_ be a step up from just an
           | image embedding, since they are designed for image
            | reconstruction, but I don't have access to the compute or
           | data resources to attempt training this.
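            | 
            | A minimal sketch of checking this yourself with the openly
            | available CLIP weights via Hugging Face transformers (the
            | two file names are placeholders for any pair of images that
            | differ only in fine detail):
            | 
            |     # Visually different images can still sit close together
            |     # in CLIP embedding space. File names are placeholders.
            |     import torch
            |     from PIL import Image
            |     from transformers import CLIPModel, CLIPProcessor
            | 
            |     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            |     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
            | 
            |     def embed(path):
            |         inputs = processor(images=Image.open(path), return_tensors="pt")
            |         with torch.no_grad():
            |             feats = model.get_image_features(**inputs)
            |         return feats / feats.norm(dim=-1, keepdim=True)
            | 
            |     a = embed("circles_overlapping.png")
            |     b = embed("circles_apart.png")
            |     print("cosine similarity:", (a @ b.T).item())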
        
       | cs702 wrote:
       | Wow, that is _embarrassingly bad performance_ for current SOTA
       | models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5), which are
       | advertised and sold as being able to understand images, e.g., for
       | guiding the blind or tutoring children in geometry!
       | 
       | The tasks at which they fail are ridiculously simple for human
       | beings, including, for example:
       | 
       | * counting the number of times two lines intersect;
       | 
       | * detecting whether two circles overlap;
       | 
       | * selecting which letter is being circled in a word;
       | 
       | * counting the number of circles in an Olympic-like logo.
       | 
       | This should be at the top of the front page.
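        | 
        | For contrast, a minimal sketch (plain Python, made-up
        | coordinates) of how little deterministic code the first two of
        | those tasks take:
        | 
        |     # Segment intersection and circle overlap, solved exactly.
        |     # Coordinates below are made-up examples.
        |     def segments_intersect(p1, p2, p3, p4):
        |         def cross(o, a, b):
        |             return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
        |         d1, d2 = cross(p3, p4, p1), cross(p3, p4, p2)
        |         d3, d4 = cross(p1, p2, p3), cross(p1, p2, p4)
        |         return d1 * d2 < 0 and d3 * d4 < 0
        | 
        |     def circles_overlap(c1, r1, c2, r2):
        |         dx, dy = c1[0] - c2[0], c1[1] - c2[1]
        |         return (dx*dx + dy*dy) ** 0.5 < r1 + r2
        | 
        |     print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # True
        |     print(circles_overlap((0, 0), 1.0, (3, 0), 1.0))           # False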
        
         | tensor wrote:
         | I don't see how this is "embarrassing" in the slightest. These
         | models are not human brains, and the fact that people equate
         | them with human brains is an embarrassing failure of the humans
         | more than anything about the models.
         | 
         | It's entirely unsurprising that there are numerous cases that
         | these models can't handle that are "obvious to humans." Machine
         | learning has had this property since its invention and it's a
         | classic mistake humans make dealing with these systems.
         | 
         | Humans assume that because a machine learning model has above
         | human accuracy on task X that it implies that it must also have
         | that ability at all the other tasks. While a human with amazing
         | ability at X would indeed have amazing abilities at other
          | tasks, this is not true of machine learning models. The
          | opposite thinking is also wrong: that because the model can't
          | do well on task Y, it must be unreliable and its ability on
          | task X is somehow an illusion, not to be trusted.
        
           | cs702 wrote:
           | It is embarrassingly, shockingly bad, because these models
           | are _advertised_ and _sold_ as being capable of understanding
           | images.
           | 
           | Evidently, all these models still fall short.
        
             | kristjansson wrote:
             | It's surprising because these models are pretty ok at some
             | vision tasks. The existence of a clear failure mode is
             | interesting and informative, not embarrassing.
        
             | knowaveragejoe wrote:
              | Not only are they capable of understanding images (the kind
             | people might actually feed into such a system -
             | photographs), but they're pretty good at it.
             | 
             | A modern robot would struggle to fold socks and put them in
             | a drawer, but they're great at making cars.
        
               | pixl97 wrote:
               | I mean, with some of the recent demos, robots have got a
               | lot better at folding stuff and putting it up. Not saying
               | it's anywhere close to human level, but it has taken a
               | pretty massive leap from being a joke just a few years
               | ago.
        
             | startupsfail wrote:
             | Humans are also shockingly bad on these tasks. And guess
             | where the labeling was coming from...
        
             | simonw wrote:
             | I see this complaint about LLMs all the time - that they're
             | advertised as being infallible but fail the moment you give
             | them a simple logic puzzle or ask for a citation.
             | 
             | And yet... every interface to every LLM has a "ChatGPT can
             | make mistakes. Check important info." style disclaimer.
             | 
             | The hype around this stuff may be deafening, but it's often
             | not entirely the direct fault of the model vendors
             | themselves, who even put out lengthy papers describing
             | their many flaws.
        
               | jazzyjackson wrote:
               | There's evidently a large gap between what researchers
               | publish, the disclaimers a vendor makes, and what gets
               | broadcast on CNBC, no surprise there.
        
               | jampekka wrote:
               | A bit like how Tesla Full Self-Driving is not to be used
               | as self-driving. Or any other small print. Or ads in
               | general. Lying by deliberately giving the wrong
               | impression.
        
               | verdverm wrote:
               | It would have to be called ChatAGI to be like TeslaFSD,
               | where the company named it something it is most
               | definitely not
        
             | fennecbutt wrote:
             | Why do people expect these models, designed to be humanlike
             | in their training, to be 100% perfect?
             | 
             | Humans fuck up all the time.
        
             | TeMPOraL wrote:
             | They're hardly being advertised or sold on that premise.
             | They advertise and sell themselves, because _people try
              | them out and find out they work_, and tell their friends
             | and/or audiences. ChatGPT is probably the single biggest
             | bona-fide organic marketing success story in recorded
             | history.
        
               | foldr wrote:
               | This is fantastic news for software engineers. Turns out
               | that all those execs who've decided to incorporate AI
               | into their product strategy have already tried it out and
               | ensured that it will actually work.
        
               | ben_w wrote:
               | > Turns out that all those execs who've decided to
               | incorporate AI into their product strategy have already
               | tried it out and ensured that it will actually work.
               | 
               | The 2-4-6 game comes to mind. They may well have
                | _verified_ the AI will work, but it's hard to learn the
               | skill of thinking about how to _falsify_ a belief.
        
           | mrbungie wrote:
           | These models are marketed as being able to guide the blind or
            | tutor children using direct camera access.
           | 
           | Promoting those use cases and models failing in these ways is
            | irresponsible. So, yeah, maybe the models are not embarrassing
           | but the hype definitely is.
        
             | cs702 wrote:
             | _> Promoting those use cases and models failing in these
             | ways is irresponsible._
             | 
             | Yes, _exactly_.
        
           | scotty79 wrote:
           | You'd expect them to be trained on simple geometry since you
            | can create an arbitrarily large synthetic training set for that.
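            | 
            | A minimal sketch of such a generator (assuming matplotlib;
            | real training data would of course need far more variety):
            | 
            |     # Render two random lines and compute the ground-truth
            |     # "do they intersect inside the plot?" label analytically.
            |     import random
            |     import matplotlib.pyplot as plt
            | 
            |     def make_sample(path):
            |         m1, b1 = random.uniform(-2, 2), random.uniform(0, 1)
            |         m2, b2 = random.uniform(-2, 2), random.uniform(0, 1)
            |         xs = [0.0, 1.0]
            |         fig, ax = plt.subplots(figsize=(2, 2))
            |         ax.plot(xs, [m1 * x + b1 for x in xs], "b")
            |         ax.plot(xs, [m2 * x + b2 for x in xs], "r")
            |         ax.axis("off")
            |         fig.savefig(path, dpi=100)
            |         plt.close(fig)
            |         # intersection at x = (b2 - b1) / (m1 - m2)
            |         hit = abs(m1 - m2) > 1e-9 and 0 <= (b2 - b1) / (m1 - m2) <= 1
            |         return path, int(hit)
            | 
            |     print(make_sample("sample_0.png"))  # e.g. ('sample_0.png', 1)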
        
           | sfink wrote:
           | Well said.
           | 
           | It doesn't matter how they are marketed or described or held
           | up to some standard generated by wishful thinking. And it
           | especially doesn't matter what it would mean if a human were
           | to make the same error.
           | 
           | It matters what they are, what they're doing, and how they're
           | doing it. Feel free to be embarrassed if _you_ are claiming
            | they can do what they can't and are maybe even selling them
           | on that basis. But there's nothing embarrassing about their
           | current set of capabilities. They are very good at what they
           | are very good at. Expecting those capabilities to generalize
           | as they would if they were human is like getting embarrassed
           | that your screwdriver can't pound in a nail, when it is ever
           | so good at driving in screws.
        
           | insane_dreamer wrote:
           | > is an embarrassing failure of the humans more than anything
           | about the models
           | 
           | No, it's a failure of the companies who are advertising them
           | as capable of doing something which they are not (assisting
           | people with low vision)
        
             | simonw wrote:
             | But they CAN assist people with low vision. I've talked to
             | someone who's been using a product based on GPT-4o and
             | absolutely loves it.
             | 
             | Low vision users understand the limitations of
             | accessibility technology better than anyone else. They will
             | VERY quickly figure out what this tech can be used for
             | effectively and what it can't.
        
         | drodgers wrote:
         | I can't help but read comments like this as:
         | 
         | "My talking dog always makes mistakes on calculus problems: how
         | embarrassingly bad!"
         | 
         | Has the expectation treadmill really advanced so quickly that
         | sub-human performance on any category of problems is now an
         | embarrassment?
        
           | aezart wrote:
           | To me I guess it suggests that these models are not using the
           | correct approach. We keep finding new types of tasks the
           | models are bad at, then the next model fixes those issues
           | because those specific tasks are added to the training set.
           | But that approach never results in a generalized problem
           | solving ability, just an ability to solve all the problems
           | we've thought of so far.
        
       | sweezyjeezy wrote:
       | Entertaining, but I think the conclusion is way off.
       | 
       | > their vision is, at best, like that of a person with myopia
       | seeing fine details as blurry
       | 
       | is a crazy thing to write in an abstract. Did they try to probe
       | that hypothesis at all? I could (well actually I can't) share
       | some examples from my job of GPT-4v doing some pretty difficult
       | fine-grained visual tasks that invalidate this.
       | 
       | Personally, I rate this paper [1], which makes the argument that
       | these huge GenAI models are pretty good at things - _assuming
       | that it has seen a LOT of that type of data during training_
        | (which is true of a great many things). If you make up tasks
        | like this, then yes, they can be REALLY bad at them, and initial
        | impressions of AGI get harder to justify. But in practice, we
        | aren't just
       | making up tasks to trip up these models. They can be very
       | performant on some tasks and the authors have not presented any
       | real evidence about these two modes.
       | 
       | [1] https://arxiv.org/abs/2404.04125
        
         | diwank wrote:
         | Yeah I think their findings are def interesting but the title
         | and the strong claims are a tad hyperbolic.
        
         | SomaticPirate wrote:
         | There are quite a few "ai apologists" in the comments but I
         | think the title is fair when these models are marketed towards
         | low vision people ("Be my eyes"
         | https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent
         | to human vision. These models are implied to be human level
         | equivalents when they are not.
         | 
         | This paper demonstrates that there are still some major gaps
         | where simple problems confound the models in unexpected ways.
          | This is important work to elevate; otherwise people may start
          | to believe that these models are suitable for general
          | application when they still need safeguards and copious
          | warnings.
        
           | sweezyjeezy wrote:
           | The paper I linked should hopefully mark me out as far from
           | an AI apologist, it's actually really bad news for GenAI if
           | correct. All I mean to say is the clickbait conclusion and
           | the evidence do not match up.
        
             | Melomololotolo wrote:
              | We have started the era of AI.
              | 
              | It really doesn't matter how good current LLMs are.
              | 
              | They have been good enough to start this era.
              | 
              | And no, it's not and never has been just LLMs. Look at what
              | Nvidia is doing with ML.
              | 
              | Whisper was a huge advance; Segment Anything, again huge;
              | AlphaFold 2, again huge.
              | 
              | All the robot announcements -> huge
              | 
              | I doubt we will reach AGI just through LLMs. We will reach
              | AGI through multimodal models, mixture of experts, some
              | kind of feedback loop, etc.
              | 
              | But the stone has started to roll.
              | 
              | And you know, I prefer to hear about AI advances for the
              | next 10-30 years. That's a lot better than the crypto shit
              | we had the last 5 years.
        
               | zwnow wrote:
               | We won't reach agi in our lifetimes.
        
           | pixl97 wrote:
            | Yeah, really, if you look at human learning/seeing/acting,
            | there is a feedback loop that an LLM, for example, isn't able
            | to complete and train on.
           | 
           | You see an object. First you have to learn how to control all
           | your body functions to move toward it and grasp it. This
           | teaches you about the 3 dimensional world and things like
           | gravity. You may not know the terms, but it is baked in your
           | learning model. After you get an object you start building a
           | classification list "hot", "sharp", "soft and fuzzy",
           | "tasty", "slick". Your learning model builds up a list of
           | properties of objects and "expected" properties of objects.
           | 
           | Once you have this 'database' you create as a human, you can
           | apply the logic to achieve tasks. "Walk 10 feet forward, but
           | avoid the sharp glass just to the left". You have to have
           | spatial awareness, object awareness, and prediction ability.
           | 
            | Models 'kind of' have this, but it's seemingly haphazard, kind
           | of like a child that doesn't know how to put all the pieces
           | together yet. I think a lot of embodied robot testing where
           | the embodied model feeds back training to the LLM/vision
           | model will have to occur before this is even somewhat close
           | to reliable.
        
             | TeMPOraL wrote:
             | Embodied is useful, but I think not necessary _even if_ you
             | need learning in a 3D environment. _Synthesized embodiment_
             | should be enough. While in some cases[0] it may have
             | problems with fidelity, simulating embodied experience _in
             | silico_ scales much better, and more importantly, _we have
             | control over time flow_. Humans always learn in real-time,
             | while with simulated embodiment, we could cram _years_ of
             | subjective-time experiences into a model in seconds, and
              | then, for novel scenarios, spend an hour for each second of
             | subjective time running a high-fidelity physics
             | simulation[1].
             | 
             | --
             | 
             | [0] - Like if you plugged a 3D game engine into the
             | training loop.
             | 
             | [1] - Results of which we could hopefully reuse in training
             | later. And yes, a simulation could itself be a recording of
             | carefully executed experiment in real world.
        
               | pegasus wrote:
               | > Like if you plugged a 3D game engine into the training
               | loop
               | 
               | Isn't this what _synthesized embodiment_ basically always
               | is? As long as the application of the resulting
               | technology is in a restricted, well controlled
               | environment, as is the case for example for an assembly-
               | line robot, this is a great strategy. But I expect
               | fidelity problems will make this technique ultimately a
                | bad idea for anything that's supposed to interact with
               | humans. Like self-driving cars, for example. Unless,
               | again, those self-driving cars are segregated in
               | dedicated lanes.
        
               | taneq wrote:
               | > Humans always learn in real-time
               | 
               | In the sense that we can't fast-forward our offline
               | training, sure, but humans certainly "go away and think
               | about it" after gaining IRL experience. This process
               | seems to involve both consciously and subconsciously
               | training on this data. People often consciously think
               | about recent experiences, run through imagined scenarios
               | to simulate the outcomes, plan approaches for next time
               | etc. and even if they don't, they'll often perform better
               | at a task after a break than they did at the start of the
               | break. If this process of replaying experiences and
               | simulating variants of them isn't "controlling the flow
               | of (simulated) time" I don't know what else you'd call
               | it.
        
           | Melomololotolo wrote:
           | Ah yes the blind person who constantly needs to know if two
           | lines intersect.
           | 
           | Let's just ignore what a blind person normally needs to know.
           | 
            | You know what blind people ask? Sometimes their daily routine
           | is broken because there is some type of construction and
           | models can tell you this.
           | 
           | Sometimes they need to read a basic sign and models can do
           | this.
           | 
           | Those models help people already and they will continue to
           | get better.
           | 
            | I'm not sure if I'm more frustrated by how condescending the
           | authors are or your ignorance.
           | 
           | Valid criticism doesn't need to be shitty
        
             | shagie wrote:
             | As an aside... from 2016 this is what was a valid use case
             | for a blind person with an app.
             | 
             | Seeing AI 2016 Prototype - A Microsoft research project -
             | https://youtu.be/R2mC-NUAmMk
             | 
             | https://www.seeingai.com are the actual working apps.
             | 
             | The version from 2016 I recall showing (pun not intended)
             | to a coworker who had some significant vision impairments
             | and he was _really_ excited about what it could do back
             | then.
             | 
             | ---
             | 
             | I still remain quite impressed with its ability to parse
              | the picture and the likely reason behind it:
             | https://imgur.com/a/JZBTk2t
        
           | benreesman wrote:
           | If we're throwing "citation needed" tags on stuff, how about
           | the first sentence?
           | 
           | "Large language models with vision capabilities (VLMs), e.g.,
           | GPT-4o and Gemini-1.5 Pro are powering countless image-text
           | processing applications"
           | 
           | I don't know how many a "countless" is, but I think we've
           | gotten really sloppy in terms of what counts for LLMs as a
           | demonstrated, durable win in a concrete task attached to
           | well-measured outcomes and holding up over even modest
           | periods of time.
           | 
           | This stuff is really promising and lots of builders are
           | making lots of nifty things, so if that counts as an
           | application then maybe we're at countless, but in the
           | enterprise and in government and in refereed academic
           | literature we seem to be at the proof-of-concept phase.
            | Impressive chat bots as a use case are pretty dialed in, and
            | enough people claim that they help with coding that I tend to
           | believe it's a real thing (I never seem to come out ahead of
           | going directly to the source, StackOverflow).
           | 
           | The amount of breathless press on this seems "countless", so
           | maybe I missed the totally rigorous case study on how X
           | company became Y percent more profitable by doing Z thing
           | with LLMs (or similar), and if so I'd be grateful for
           | citations, but neither Google nor any of the big models seem
           | to know about it.
        
             | dsr_ wrote:
             | "maybe I missed the totally rigorous case study on how X
             | company became Y percent more profitable by doing Z thing
             | with LLMs (or similar), and if so I'd be grateful for
             | citations, but neither Google nor any of the big models
             | seem to know about it."
             | 
             | Goldman Sachs recently issued a report.
             | 
             | https://www.goldmansachs.com/intelligence/pages/gs-
             | research/...
             | 
             | "We estimate that the AI infrastructure buildout will cost
             | over $1tn in the next several years alone, which includes
             | spending on data centers, utilities, and applications. So,
             | the crucial question is: What $1tn problem will AI solve?
              | Replacing low-wage jobs with tremendously costly
             | technology is basically the polar opposite of the prior
             | technology transitions I've witnessed in my thirty years of
             | closely following the tech industry"
        
           | Lerc wrote:
           | I disagree. I think the title, abstract, and conclusion not
           | only misrepresents the state of the models but it
            | misrepresents their own findings.
           | 
           | They have identified a class of problems that the models
           | perform poorly at and have given a good description of the
           | failure. They portray this as a representative example of the
           | behaviour in general. This has not been shown and is probably
           | not true.
           | 
           | I don't think that models have been portrayed as equivalent
           | to humans. Like most AI in it has been shown as vastly
           | superior in some areas and profoundly ignorant in others.
           | Media can overblow things and enthusiasts can talk about
           | future advances as if they have already arrived, but I don't
            | think these are typical portrayals by the AI field in general.
        
             | youssefabdelm wrote:
             | Exactly... I've found GPT-4o to be good at OCR for
             | instance... doesn't seem "blind" to me.
        
               | spookie wrote:
                | You don't really need an LLM for OCR. Hell, I suppose they
                | just run a Python script in its VM and rephrase the
                | output.
               | 
               | At least that's what I would do. Perhaps the script would
               | be a "specialist model" in a sense.
        
               | Spivak wrote:
               | It's not that you need an LLM for OCR but the fact that
               | an LLM can do OCR (and handwriting recognition which is
               | much harder) despite not being made specifically for that
               | purpose is indicative of something. The jump from knowing
               | "this is a picture of a paper with writing on it" like
               | what you get with CLIP to being able to reproduce what's
               | on the paper is, to me, close enough to seeing that the
               | difference isn't meaningful anymore.
        
               | acheong08 wrote:
               | GPT-4v is provided with OCR
        
               | letmevoteplease wrote:
               | No reason to believe that. Open source VLMs can do
               | OCR.[1]
               | 
               | [1] https://huggingface.co/spaces/opencompass/open_vlm_le
               | aderboa...
        
               | simonw wrote:
               | That's a common misconception.
               | 
               | Sometimes if you upload an image to ChatGPT and ask for
               | OCR it will run Python code that executes Tesseract, but
               | that's effectively a bug: GPT-4 vision works much better
               | than that, and it will use GPT-4 vision if you tell it
               | "don't use Python" or similar.
        
               | Foobar8568 wrote:
                | I know that GPT-4o is fairly poor at recognizing sheet
                | music and notes. Totally off the mark; more often than
                | not, even the first note is not recognized in a first-week
                | solfege book.
                | 
                | So unless I missed something, as far as I am concerned,
                | they are optimized for benchmarks.
                | 
                | So while I enjoy gen AI, image-to-text is highly subpar.
        
               | youssefabdelm wrote:
               | Useful to know, thank you!
        
               | stavros wrote:
               | Most adults with 20/20 vision will also fail to recognize
               | the first note on a first week solfege book.
        
               | prmoustache wrote:
                | Well, maybe not blind, but the analogy with myopia might
                | stand.
                | 
                | For example, in the case of OCR, a person with myopia will
                | usually be able to make out letters and words even without
                | his glasses, based on his expectation (similar to VLM
                | training) of seeing letters and words in, say, a sign. He
                | might not see them all clearly and make some errors, but
                | might recognize some letters easily and make up the rest
                | based on context, word recognition, etc. Basically
                | experience.
                | 
                | I also have a funny anecdote about my partner, who has
                | severe myopia, and who once found herself outside her
                | house without her glasses on and saw something on the
                | grass right in front. She told her then brother-in-law,
                | "Look, a squirrel," only for the "squirrel" to take off
                | while letting out its typical caws. It was a crow. This is
                | typical of VLM hallucinations.
        
             | subroutine wrote:
             | I think the conclusion of the paper is far more mundane.
              | It's curious that VLMs can recognize complex novel objects
             | in a trained category, but cannot perform basic visual
             | tasks that human toddlers can perform (e.g. recognizing
             | when two lines intersect or when two circles overlap).
             | Nevertheless I'm sure these models can explain in great
             | detail what intersecting lines are, and even what they look
             | like. So while LLMs might have image processing
             | capabilities, they clearly do not see the way humans _see_.
             | That, I think, would be a more apt title for their
             | abstract.
        
           | kenjackson wrote:
           | Simple is a relative statement. There are vision problems
           | where monkeys are far better than humans. Some may look at
           | human vision and memory and think that we lack basic skills.
           | 
            | With AI we are creating intelligence, but with different
            | strengths and weaknesses. I think we will continue to be
            | surprised at how well they work on some problems and how
            | poorly they do at some "simple" ones.
        
           | lynx23 wrote:
           | Be My Eyes user here. I disagree with your uninformed
            | opinion. Be My Eyes is more often than not more useful than a
           | human. And I am reporting from personal experience. What
           | experience do you have?
        
           | brookst wrote:
           | I don't see Be My Eyes or other similar efforts as "implied"
           | to be equivalent to humans at all. They're just new tools
           | which can be very useful for some people.
           | 
           | "These new tools aren't perfect" is the dog bites man story
           | of technology. It's certainly true, but it's no different
           | than GPS ("family drives car off cliff because GPS said to").
        
             | dartos wrote:
             | Based take
        
         | FrenchDevRemote wrote:
         | > their vision is, at best, like that of a person with myopia
         | seeing fine details as blurry
         | 
          | It's not that far from reality: most models see images in very
          | low resolution with limited colors, so not so far from this
          | description.
        
           | blackmesaind wrote:
           | My thoughts as well. I too would have trouble with the
           | overlapping lines tests if all the images underwent
           | convolution.
        
           | vikramkr wrote:
           | They didn't test that claim at all though. Vision isn't some
           | sort of 1D sliding scale with every vision condition lying
           | along one axis.
           | 
           | First of all myopia isn't 'seeing fine details as blurry' -
           | it's nearsightedness - and whatever else this post tested it
           | definitely didn't test depth perception.
           | 
           | And second - inability to see fine details is a
           | distinct/different thing from not being able to count
           | intersections and the other things tested here. That
           | hypothesis, if valid, would imply that improving the
           | resolution of the image that the model can process would
           | improve its performance on these tasks even if reasoning
           | abilities were the same. That - does not make sense. Plenty
           | of the details in these images that these models are tripping
           | up on are perfectly distinguishable at low resolutions.
           | Counting rows and columns of blank grids is not going to
           | improve with more resolution.
           | 
           | I mean, I'd argue that the phrasing of the hypothesis ("At
           | best, like that of a person with myopia") doesn't make sense
           | at all. I don't think a person with myopia would have any
           | trouble with these tasks if you zoomed into the relevant
           | area, or held the image close. I have a very strong feeling
           | that these models would continue to suffer on these tasks if
           | you zoomed in. Nearsighted != unable to count squares.
        
             | necovek wrote:
             | It seems to me they've brought up myopia only to make it
             | more approachable to people how blurry something is,
             | implying they believe models work with a blurry image just
             | like a nearsighted person sees blurry images at a distance.
             | 
             | While myopia is common, it's not the best choice of analogy
             | and "blurry vision" is probably clear enough.
             | 
             | Still, I'd only see it as a bad choice of analogy -- I
             | can't imagine anyone mistaking optical focus problems for
             | static image processing problems -- so in the usual HN
             | recommendation, I'd treat their example in the most
             | favourable sense.
        
         | jrflowers wrote:
         | >I could (well actually I can't)
         | 
         | I like the idea that these models are so good at some sort of
         | specific and secret bit of visual processing that things like
         | "counting shapes" and "beating a coin toss for accuracy"
         | shouldn't be considered when evaluating them.
        
           | vikramkr wrote:
           | Those don't really have anything to do with fine
           | detail/nearsightedness. What they measured is
           | valid/interesting - what they concluded is unrelated.
        
           | valine wrote:
           | LLMs are bad at counting things just in general. It's hard to
           | say whether the failures here are vision based or just an
           | inherent weakness of the language model.
        
         | godelski wrote:
         | > Did they try to probe that hypothesis at all?
         | 
         | I think this is a communication issue and you're being a bit
         | myopic in your interpretation. It is clearly an analogy meant
         | for communication and is not an actual hypothesis. Sure, they
         | could have used a better analogy and they could have done other
         | tests, but the paper still counters quite common claims (from
         | researchers) about VLMs.
         | 
         | > I could (well actually I can't) share some examples from my
         | job of GPT-4v doing some pretty difficult fine-grained visual
         | tasks that invalidate this.
         | 
         | I find it hard to believe that there is no example you can
         | give. It surely doesn't have to be exactly your training data.
         | If it is this good, surely you can create an example no
         | problem. If you just don't want to, that's okay, but then don't
         | say it.
         | 
         | But I have further questions. Do you have complicated
         | prompting? Or any prompt engineering? It sure does matter how
         | robust these models are to prompting. There's a huge difference
         | between a model being able to accomplish a task and a model
         | being able to perform a task in a non-very-specific
         | environment. This is no different than something working in a
         | tech demo and not in the hand of the user.
         | 
         | > But in practice, we aren't just making up tasks to trip up
         | these models.
         | 
         | I see this sentiment quite often and it is baffling to me.
         | 
          | First off, these tasks are clearly not designed to trick these
         | models. A model failing at a task is not suddenly "designed to
         | trick a model." Its common with the river crossing puzzles
         | where they're rewritten to be like "all animals can fit in the
         | boat." If that is "designed to trick a model", then the model
         | must be a stochastic parrot and not a generalist. It is very
         | important that we test things where we do know the answer to
         | because, unfortunately, we're not clairvoyant and can't test
         | questions we don't know the answer to. Which is the common case
         | in the real world usage.
         | 
         | Second, so what if a test was designed to trick up a model?
         | Shouldn't we be determining when and where models fail? Is that
         | not a critical question in understanding how to use them
         | properly? This seems doubly important if they are tasks that
          | humans don't have challenges with.
         | 
         | > They can be very performant on some tasks and the authors
         | have not presented any real evidence about these two modes.
         | 
         | I don't think people are claiming that large models can't be
         | performant on some tasks. If they are, they're rejecting
          | trivially verifiable reality. But not every criticism has
         | to also contain positive points. There's plenty of papers and a
         | lot of hype already doing that. And if we're going to be
         | critical of anything, shouldn't it be that the companies
         | creating these models -- selling them, and even charging
          | researchers to perform these types of experiments that can be
          | and are used to improve their products -- should be much more
         | clear about the limitations of their models? If we need
         | balance, then I think there's bigger fish to fry than Auburn
         | and Alberta Universities.
        
           | orbital-decay wrote:
           | _> I think this is a communication issue and you 're being a
           | bit myopic in your interpretation. It is clearly an analogy
           | meant for communication and is not an actual hypothesis._
           | 
            | I don't know, words have meanings. If that's a communication
            | issue, it's on the part of the authors. To me, this wording in
            | what is supposed to be a research paper abstract clearly
            | suggests insufficient resolution as the cause. How else
           | should I interpret it?
           | 
           |  _> The shockingly poor performance of four state-of-the-art
           | VLMs suggests their vision is, at best, like that of a person
           | with myopia seeing fine details as blurry_
           | 
           | And indeed, increasing the resolution is expensive, and the
           | best VLMs have something like 1000x1000. But the low
           | resolution is clearly not the issue here, and the authors
           | don't actually talk about it in the paper.
           | 
           |  _> I find it hard to believe that there is no example you
           | can give._
           | 
           | I'm not the person you're answering to, but I actually lazily
            | tried two of the authors' examples in a less performant VLM
           | (CogVLM), and was surprised it passed those, making me wonder
           | whether I can trust their conclusions until I reproduce their
           | results. LLMs and VLMs have all kinds of weird failure modes,
           | it's not a secret they fail at some trivial tasks and their
           | behavior is still not well understood. But working with these
           | models and narrowing it down is notoriously like trying to
           | nail a jelly to the wall. If I was able to do this in a
           | cursory check, what else is there? More than one research
           | paper in this area is wrong from the start.
        
             | godelski wrote:
             | > I don't know, words have meanings.
             | 
             | That's quite true. Words mean exactly what people agree
             | upon them meaning. Which does not require everyone, or else
             | slang wouldn't exist. Nor the dictionary, which
             | significantly lags. Regardless, I do not think this is even
             | an unusual use of the word, though I agree the mention of
             | myopia is. The usage makes sense if you consider that both
             | myopic and resolution have more than a singular meaning.
              | 
              |     Myopic: lacking in foresight or _discernment_; narrow
              |     in perspective and without concern for broader
              |     implications.
              | 
              |     Resolution: the process or capability of making
              |     distinguishable the individual parts of an object,
              |     closely adjacent optical images, or sources of light.
             | 
             | I agree that there are far better ways to communicate. But
             | my main gripe is that they said it was "their hypothesis."
              | Reading the abstract as a whole, I find it an odd
              | conclusion to come to. It doesn't pair with the words about
              | blind guessing that follow (and I am not trying to defend
             | the abstract. It is a bad abstract). But if you read the
             | intro and look at the context of their landing page, I find
             | it quite difficult to come to this conclusion. It is poorly
             | written, but it is still not hard to decode the key
             | concepts the authors are trying to convey.
             | 
             | I feel the need to reiterate that language has 3 key
             | aspects to it: the concept attempted to be conveyed, the
              | words that concept is lossily encoded into, and the lossy
             | decoding of the person interpreting it. Communication
             | doesn't work by you reading/listening to words and looking
             | up those words in a dictionary. Communication is a problem
             | where you use words (context/body language/symbols/etc) to
              | decrease the noise and get the receiver to reasonably
             | decode your intended message. And unfortunately we're in a
             | global world and many different factors, such as culture,
             | greatly affect how one encodes and/or decodes language. It
             | only becomes more important to recognize the fuzziness
             | around language here. Being more strict and leaning into
             | the database view of language only leads to more errors.
             | 
             | > But the low resolution is clearly not the issue here, and
             | the authors don't actually talk about it in the paper.
             | 
             | Because they didn't claim that image size and sharpness was
             | an issue. They claimed the VLM cannot resolve the images
             | "as if" they were blurry. Determining what the VLM actually
             | "sees" is quite challenging. And I'll mention that arguably
             | they did test some factors that relate to blurriness. Which
             | is why I'm willing to overlook the poor analogy.
             | 
             | > I actually lazily tried two of authors' examples in a
             | less performant VLM (CogVLM), and was surprised it passed
             | those
             | 
             | I'm not. Depending on the examples you pulled, 2 random
             | ones passing isn't unlikely given the results.
             | 
             | Something I generally do not like about these types of
              | papers is that they often do not consider augmentations,
              | since these models tend to be quite sensitive to both the
              | text (prompt) inputs and the image inputs. This is quite
              | common in generators in general. Even the way you load in
              | and scale an image can have significant performance
              | differences. I've seen simple things like loading an image
              | via numpy, PIL, TensorFlow, or torch produce significantly
              | different results. But I have to hand it to
             | these authors, they looked at some of this. In the appendix
             | they go through with confusion matrices and look at the
             | factors that determine misses. They could have gone deeper
             | and tried other things, but it is a more than reasonable
             | amount of work for a paper.
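              | 
              | A quick way to see the scaling side of this (a sketch with
              | PIL and numpy; the file name is a placeholder):
              | 
              |     # Resample the same image with two different filters
              |     # and compare pixels; such preprocessing choices change
              |     # what the model actually "sees".
              |     import numpy as np
              |     from PIL import Image
              | 
              |     img = Image.open("chart.png").convert("RGB")
              |     a = np.asarray(img.resize((224, 224), Image.BILINEAR),
              |                    dtype=np.int16)
              |     b = np.asarray(img.resize((224, 224), Image.NEAREST),
              |                    dtype=np.int16)
              |     print("max abs pixel difference:", np.abs(a - b).max())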
        
           | ClumsyPilot wrote:
           | > Second, so what if a test was designed to trick up a model?
           | Shouldn't we be determining when and where models fail? Is
           | that not a critical question in understanding how to use them
           | properly?
           | 
           | People are rushing to build this AI into all kinds of
           | products, and they actively don't want to know where the
           | problems are.
           | 
           | The real world outside is designed to trip up the model.
           | Strange things happen all the time.
           | 
           | Because software developers have no governing body, no oaths
           | of ethics and no spine someone will end up dead in a ditch
           | from malfunctioning AI.
        
             | TeMPOraL wrote:
             | > _The real world outside is designed to trip up the model.
             | Strange things happen all the time._
             | 
             | Counterpoint: real world is heavily sanitized towards
             | things that don't trip human visual perception up too much,
             | or otherwise inconvenience us. ML models are trained on
             | that, and for that. They're _not_ trained for dealing with
              | synthetic images that couldn't possibly exist in reality
              | _and_ are designed to trip visual processing algorithms up.
             | 
             | Also:
             | 
             | > _People are rushing to build this AI into all kinds of
             | products, and they actively don't want to know where the
             | problems are._
             | 
             | Glass half-full (of gasoline) take: those products will
             | trip over real-world problems, identifying them in the
             | process, and the models will get better walking over the
             | corpses of failed AI-get-rich-quick companies. The people
             | involved may not want to know where the problems are, but
             | by deploying the models, they'll reveal those problems to
             | all.
             | 
             | > _Because software developers have no governing body, no
             | oaths of ethics and no spine someone will end up dead in a
             | ditch from malfunctioning AI._
             | 
             | That, unfortunately, I 100% agree with. Though AI isn't
             | special here - not giving a fuck kills people regardless of
             | the complexity of software involved.
        
               | godelski wrote:
               | > They're not trained for dealing with synthetic images,
               | that couldn't possibly exist in reality, and designed to
               | trip visual processing algorithms up
               | 
               | Neither of these claims are true. ML is highly trained on
               | synthetic images. In fact, synthetic data generation is
                | the way forward for the "scale is all you need" people. And
               | there are also loads of synthetic images out in the wild.
               | Everything from line art to abstract nonsense. Just take
               | a walk down town near the bars.
               | 
               | > not giving a fuck kills people regardless of the
               | complexity of software involved.
               | 
               | What has me the most frustrated is that this "move fast
               | break things and don't bother cleaning up" attitude is
               | not only common in industry but also in academia. But
               | these two are incredibly intertwined these days and it's
               | hard to publish without support from industry because
               | people only evaluate on benchmarks. And if you're going
               | to hack your benchmarks, you just throw a shit ton of
               | compute at it. Who cares where the metrics fail?
        
             | ben_w wrote:
             | > Because software developers have no governing body, no
             | oaths of ethics and no spine someone will end up dead in a
             | ditch from malfunctioning AI.
             | 
             | The conclusion and the premise are both true, but not the
             | causality. On AI, the Overton window is mostly filled with
             | people going "this could be very bad if we get it wrong".
             | 
             | Unfortunately, there's _enough_ people who think  "unless I
             | do it first" (Musk, IMO) or "it can't possibly be harmful"
             | (LeCun) that it will indeed kill more people than it
             | already has.
             | 
             | The number who are already (and literally) "dead in a
             | ditch" is definitely above zero if you include all the
             | things that used to be AI when I was a kid e.g. "route
             | finding": https://www.cbsnews.com/news/google-sued-
             | negligence-maps-dri...
        
         | itkovian_ wrote:
          | I think GPT-4o is probably doing some OCR as preprocessing. It's
          | not really controversial to say the VLMs today don't pick up
          | fine-grained details - we all know this. You can just look at
          | the output of a VAE to know this is true.
        
           | thomasahle wrote:
           | If so, it's better than any other ocr on the market.
           | 
           | I think they just train it on a bunch of text.
           | 
            | Maybe counting squares in a grid was probably not considered
           | important enough to train for.
        
           | _flux wrote:
           | Why do you think it's probable? The much smaller llava that I
           | can run in my consumer GPU can also do "OCR", yet I don't
           | believe anyone has hidden any OCR engine inside llama.cpp.
        
         | TeMPOraL wrote:
         | Entertaining is indeed the right word. Nice job identifying
         | corner cases of models' visual processing; curiously, they're
         | not far conceptually from some optical illusions that reliably
         | trip humans up. But to call the models "blind" or imply their
         | low performance in general? That's _trivially invalidated_ by
          | just _taking your phone out and feeding a photo to the ChatGPT
          | app_.
         | 
         | Like, seriously. One poster below whines about "AI apologists"
         | and BeMyEyes, but again, it's all trivially testable with your
         | phone and $20/month subscription. It works spectacularly well
         | on _real world tasks_. Not perfectly, sure, but good enough to
         | be useful _in practice_ and better than alternatives (which
          | often don't exist).
        
         | csomar wrote:
         | > these huge GenAI models are pretty good at things
         | 
          | Is this the sales pitch though? Because 15 years ago, I had a
          | scanner with an app that could scan a text document and produce
          | the text on Windows. The machine had something like 256 MB of
          | RAM.
          | 
          | Tech can be _extremely_ good at niches in isolation. You could
          | have an OCR system 10 years ago and it would be extremely
          | reliable at the single task it was configured to do.
         | 
         | AI is supposed to bring a new paradigm, where the tech is not
         | limited to the specific niche the developers have scoped it to.
         | However, if it _reliably_ fails to detect simple things a
         | regular person should not get wrong, then the whole value
         | proposition is kicked out of the window.
        
         | kalkin wrote:
         | There's definitely something interesting to be learned from the
         | examples here - it's valuable work in that sense - but "VLMs
         | are blind" isn't it. That's just clickbait.
        
       | diwank wrote:
       | I think while the findings are interesting, the title "Vision
       | language models are blind" is hyperbolic and misleading of the
        | findings. It is true that VLMs have a very different way of
        | sensing and working with image inputs: at low resolutions,
        | each image is split up into blocks and then mapped into discrete
        | tokens. They really don't have access to fine-grained
        | details because this mapping is quite lossy. So, in that sense,
        | these results completely make sense and were not surprising to
        | me, but "blind" is a strong implication and the work isn't really
        | able to back that up.
       | 
       | Take the first example of the four 2D line plots for instance. I
       | asked Sonnet 3.5 for an answer and retried it 5 times and 2 out
       | of 5 times, it said something like this:
       | 
       |  _Me_ :
       | 
       | here's an image with four 2-D line plots. Can you tell me A) how
       | many lines are in each plot? and B) how many times do they
       | intersect each other in each plot?
       | 
       | Write your answers as: Plot 1 (leftmost): ...
       | 
       |  _Sonnet_ :
       | 
       | I'll analyze each plot from left to right and provide the
       | requested information:
       | 
       | Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0
       | intersections (the lines do not cross)
       | 
       | Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the
       | lines cross once near the right edge)
       | 
       | Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the
       | lines cross once near the center)
       | 
       | Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2
       | intersections (the lines cross twice, once near the left side and
       | once near the right side)
       | 
       | Each plot consistently shows two lines, one red and one blue,
       | with varying intersection patterns across the four plots.
       | 
       | ---
       | 
       | Which isn't completely correct (the 3rd plot has 2 intersections)
       | but still pretty good; the other 3 times, it said plot 3 has 1
       | intersection.
        
       | Jack000 wrote:
       | This is kind of the visual equivalent of asking an LLM to count
       | letters. The failure is more related to the tokenization scheme
       | than the underlying quality of the model.
       | 
       | I'm not certain about the specific models tested, but some VLMs
       | just embed the image modality into a single vector, making these
       | tasks literally impossible to solve.
        
       | JeremyHerrman wrote:
       | VLMs so far have never been good at counting objects or spatial
       | relationships (e.g. the coffee is to the right of the microwave).
       | 
       | There are ways to help the VLM out - Set of Marks [0] from
       | Microsoft being the most prominent, which uses segmentation to
       | outline and label sections of the image before sending to the
       | VLM.
       | 
       | Providing "speakable" labels to regions helps ground the visual
       | abilities of VLMs and is why in this paper the performance is so
       | much better when words are present in the grid for "Task 6:
       | Counting the rows and columns of a grid"
       | 
       | 0: https://github.com/microsoft/SoM
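       | 
       | A rough sketch of the idea, with hand-written bounding boxes
       | standing in for the segmentation masks SoM actually derives (the
       | filename and coordinates below are made up):
       | 
       |     from PIL import Image, ImageDraw
       | 
       |     def mark_regions(path, boxes):
       |         # Overlay numbered labels on regions before VLM prompting.
       |         image = Image.open(path).convert("RGB")
       |         draw = ImageDraw.Draw(image)
       |         for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
       |             draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
       |             draw.text((x0 + 4, y0 + 4), str(i), fill="red")
       |         return image
       | 
       |     # In SoM the regions come from a segmenter (e.g. SAM), not
       |     # from hard-coded boxes like these.
       |     boxes = [(10, 10, 120, 200), (150, 40, 300, 220)]
       |     marked = mark_regions("kitchen.jpg", boxes)
       |     marked.save("kitchen_marked.jpg")
       |     # Then ask the VLM: "Is the coffee (region 2) to the right
       |     # of the microwave (region 1)?"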
        
         | jazzyjackson wrote:
         | I didn't know counting objects was a problem. That's pretty
         | ironic because the very first implementation of a neural net
         | (AFAIK) was the numa-rete artificial retina developed at the
         | Biological Computer Lab [0] circa 1960. It was a parallel
         | analog computer composed of "neurons" each with a photocell
         | that could be arranged in a grid and count "the number of
         | objects independent of their size, location and form, and
         | independent of strength of illumination" [1] - this paper may
         | be of interest to those in the field, "Perception of Form in
         | Biological and Man Made Systems", Heinz von Foerster, 1962
         | 
         | [0]
         | https://distributedmuseum.illinois.edu/exhibit/biological_co...
         | 
         | [1] https://sites.evergreen.edu/arunchandra/wp-
         | content/uploads/s...
        
           | empath75 wrote:
           | It really shouldn't be surprising that these models fail to
           | do anything that _they weren't trained to do_. It's trivially
           | easy to train a model to count stuff. The wild thing about
           | transformer based models is that their capabilities are _way_
           | beyond what you'd expect from token prediction. Figuring out
           | what their limitations actually are is interesting because
           | nobody fully knows what their limitations are.
        
             | jazzyjackson wrote:
             | I agree that these open ended transformers are way more
             | interesting and impressive than a purpose built count-the-
             | polygons model, but if the model doesn't generalize well
             | enough to figure out how to count the polygons, I can't be
             | convinced that they'll perform usefully on a more
             | sophisticated task.
             | 
             | I agree this research is really interesting, but I didn't
             | have an a priori expectation of what token prediction could
             | accomplish, so my reaction to a lot of the claims and
             | counterclaims of this new tech is that it's good at fooling
             | people and giving plausible but baseless results. It makes
              | for good research but is dangerous in the hands of a market
             | attempting to exploit it.
        
               | empath75 wrote:
               | > I agree that these open ended transformers are way more
               | interesting and impressive than a purpose built count-
               | the-polygons model, but if the model doesn't generalize
               | well enough to figure out how to count the polygons, I
               | can't be convinced that they'll perform usefully on a
               | more sophisticated task.
               | 
               | I think people get really wrapped into the idea that a
               | single model needs to be able to do all the things, and
               | LLMs can do a _lot_, but there doesn't actually need to
               | be a _one model to rule them all_. If VLMs are kind of
                | okay at image interpretation but not great at details, we
               | can supplement them with something that _can_ handle the
               | details.
        
         | Eisenstein wrote:
         | Vision models use CLiP or something similar, which has no
         | conception of anything specific in the image. It sees
         | embeddings which correlate similarly to text embeddings. Take
         | an image then describe it 'there are birds sitting on a power
         | line in front of a blue sky with some clouds', get the
         | embeddings from that and the embeddings from that picture and
         | line them up. If you ask if there are birds in it, it would
         | know, but not how many, unless it was common to describe the
         | number of birds sitting on things and it happened often enough
         | that the number counted was the number in the image
         | descriptions it trained on. If you want to count objects you
         | want something like YOLO.
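         | 
         | A minimal sketch of that image/text matching idea, assuming the
         | Hugging Face CLIP checkpoint (the filename and captions below
         | are made up):
         | 
         |     from PIL import Image
         |     import torch
         |     from transformers import CLIPModel, CLIPProcessor
         | 
         |     name = "openai/clip-vit-base-patch32"
         |     model = CLIPModel.from_pretrained(name)
         |     processor = CLIPProcessor.from_pretrained(name)
         | 
         |     image = Image.open("birds.jpg")
         |     captions = [
         |         "birds sitting on a power line under a blue sky",
         |         "three birds sitting on a power line",
         |         "seven birds sitting on a power line",
         |     ]
         |     inputs = processor(text=captions, images=image,
         |                        return_tensors="pt", padding=True)
         |     with torch.no_grad():
         |         out = model(**inputs)
         |     # Image-to-caption similarity; captions that differ only in
         |     # the count are hard to tell apart this way.
         |     print(out.logits_per_image.softmax(dim=-1))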
        
           | JeremyHerrman wrote:
           | VLMs like PaliGemma and Florence-2 support object detection
           | and segmentation, so it's becoming more common to have YOLO
           | like capabilities built into VLMs.
           | 
           | Another benefit of VLMs which support object detection is
           | that they are open vocabulary, meaning you don't have to
           | define the classes ahead of time. Additionally, fine-tuning
           | tends to keep the previous detection capabilities, instead of
           | erasing all previous classes the way fine-tuning YOLO does.
        
       | GaggiX wrote:
       | Well, all the models (especially Claude 3.5 Sonnet) seem to
       | perform much better than random, so they are clearly not blind.
       | The only task where Claude Sonnet 3.5 does not perform better
       | than random is the one where you have to follow many different
       | paths (the ones where the answer from A to C is 3), something
       | that would take me several seconds to solve.
       | 
       | I have the feeling that they first chose the title of the paper
       | and then ran the evaluation on the new Claude 3.5 Sonnet on these
       | abstract images.
       | 
       | >their vision is, at best, like that of a person with myopia
       | seeing fine details as blurry
       | 
       | This also makes no sense, since the images evaluate the abstract
       | capabilities of the models, not their eyesight.
        
         | randcraw wrote:
         | OK. They're _legally_ blind.
        
           | GaggiX wrote:
           | This really has nothing to do with vision impairment.
        
       | iamleppert wrote:
       | This could easily be fixed with training and fine-tuning. Simply
       | generate 100,000 examples or so, train on the ground truth for
       | however long you want, and it's a solved problem.
        
         | kristjansson wrote:
         | Solved for this benchmark... and at what cost to the rest of
         | the system?
         | 
         | These tasks are interesting because they're existence proofs of
         | generalization failure. Like the haystack problem, direct
         | solutions here are much less interesting than structural
         | improvements that address the class of failure.
        
         | imtringued wrote:
         | Ok, but most of the data is just captions for images. You're
         | going to have to invest some time into building this dataset at
         | your own expense.
        
       | _vaporwave_ wrote:
       | It's really interesting that there's a huge performance
       | discrepancy between these SOTA models. In the Olympic logo
       | example, GPT-4o is below the baseline accuracy of 20% (worse than
       | randomly guessing) while Sonnet-3.5 was correct ~76% of the time.
       | 
       | Does anyone have any technical insight or intuition as to why
       | this large variation exists?
        
         | ec109685 wrote:
         | The question wasn't "yes or no" but instead required an exact
         | number:
         | https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
         | 
         | Playing around with GPT-4o, it knows enough to make a copy of
         | an image that is reasonable but it still can't answer the
         | questions.
         | 
         | ChatGPT went down a rabbit hole of trying to write python code,
         | but it took lots of prompting for it to notice its mistake when
         | solving one of the intersecting line questions.
        
       | londons_explore wrote:
       | Could some of the "wrong" answers be the LLM attempting to give
       | an explanation rather than the answer? E.g. instead of answering
       | 'X', the LLM answers 'The letter is partially hidden by the oval,
       | so I cannot be certain, but it appears to be the English letter
       | X'.
       | 
       | The scoring criteria would score this answer as 'T', which is
       | wrong.
        
       | simonw wrote:
       | I've been generally frustrated at the lack of analysis of vision
       | LLMs generally.
       | 
       | They're clearly a very exciting category of technology, and a
       | pretty recent one - they only got good last October with GPT-4
       | Vision, but since then we've had more vision models from
       | Anthropic and Google Gemini.
       | 
       | There's so much more information out there about text prompting
       | compared to image prompting. I feel starved for useful
       | information about their capabilities: what are vision models good
       | and bad at, and what are the best ways to put them to work?
        
         | r2_pilot wrote:
         | Why not use them yourself if you have access? I have been using
         | Claude 3.5 Sonnet for gardening recently, and while it's not
         | perfect (and can be a little blind unless you tell it to focus
         | on a specific thing), it's helped me understand how to keep my
         | plants alive in some challenging conditions (for me; this is my
         | second or third attempt at gardening, so it's all challenging
         | lol). But just experiment with it and see where the
         | capabilities lie. I do agree that certain classes of visual
         | data are challenging for it.
        
           | simonw wrote:
           | I've used them a bunch. I want to learn from other people's
           | experiences as well.
           | 
           | Some of my notes so far:
           | 
           | - https://simonwillison.net/2024/Apr/17/ai-for-data-
           | journalism... - my datasette-extract plugin, for structured
           | data from both text and images
           | 
           | - https://simonwillison.net/2024/Apr/17/ai-for-data-
           | journalism... - where they failed to extract data from a
           | handwritten scanned document in various weird ways
           | 
           | - https://simonwillison.net/2024/Feb/21/gemini-pro-video/
           | talks about video inputs to Gemini Pro (which are actually
           | image inputs, it splits them up to one frame per second)
        
         | simonw wrote:
         | Anthropic have some interesting cookbook examples that provide
         | advice on using their multimodal models here:
         | https://github.com/anthropics/anthropic-cookbook/tree/main/m...
         | 
         | I've assembled a bunch more notes here:
         | https://simonwillison.net/tags/vision-llms/
        
       | mglz wrote:
       | I taught some computational geometry courses, and efficiently
       | computing the intersections of N line segments is not as
       | straightforward as you might initially think. Since somewhere
       | some computation must be done to recognize this, and LLMs are not
       | specifically trained for this task, it's not surprising they
       | struggle.
       | 
       | In general, basic geometry seems under-explored by learning-based
       | methods.
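       | 
       | Even the naive pairwise check needs explicit geometry. A toy
       | sketch using the standard orientation test (O(n^2); an efficient
       | approach would be a sweep line such as Bentley-Ottmann, and
       | collinear/touching cases are skipped here):
       | 
       |     def ccw(a, b, c):
       |         # >0 counter-clockwise, <0 clockwise, 0 collinear
       |         return ((b[0] - a[0]) * (c[1] - a[1])
       |                 - (b[1] - a[1]) * (c[0] - a[0]))
       | 
       |     def segments_intersect(p1, p2, p3, p4):
       |         d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
       |         d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
       |         # Proper crossing: each segment straddles the other's line.
       |         return (d1 > 0) != (d2 > 0) and (d3 > 0) != (d4 > 0)
       | 
       |     def count_intersections(segments):
       |         return sum(
       |             segments_intersect(*segments[i], *segments[j])
       |             for i in range(len(segments))
       |             for j in range(i + 1, len(segments)))
       | 
       |     segs = [((0, 0), (2, 2)), ((0, 2), (2, 0)), ((3, 0), (3, 2))]
       |     print(count_intersections(segs))  # 1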
        
         | jordan_bonecut wrote:
         | Yes, but so is telling if a photo contains a dog or
         | understanding sentiment in a paragraph of text. Complexity
         | isn't quite the issue; I think there is a distinction between
         | the type of reasoning these models have learned and the type
         | that is necessary for concrete mathematical reasoning.
        
           | slashdave wrote:
           | The models do not reason. They have learned associations,
           | because these associations have appeared in their training
           | sets.
        
         | samatman wrote:
         | > _Since somewhere some computation must be done to recognize
         | this_
         | 
         | Humans don't have a "compute intersections" ability (other than
         | a few who have learned it laboriously through algebra), we have
         | a "see things and count them" mechanism. We aren't visually
         | taking lines in a planar space and determining where they
         | cross. We know what an intersection looks like, we see one,
         | increment a counter, and find the next one. If it's less than
         | around five, we do this all at once. Otherwise we literally
         | count, sometimes in small groups, sometimes one at a time.
        
       | orbital-decay wrote:
       | That's not anything like "myopia", though.
       | 
       | FWIW I tried the line intersection and the circled letter test
       | from the article with CogVLM (which is far from reaching the
       | current SotA) and it correctly passed both. I haven't tried it
       | with Sonnet/4o but I suspect there might be something wrong with
       | how the author did their tests. Don't get me wrong, but too many
       | "the model can't do that" claims ended up with demonstrations of
       | the model doing exactly that...
        
       | nyxtom wrote:
       | I wonder how well Alpha Geometry would do on this
        
         | nybsjytm wrote:
         | AlphaGeometry is a hyper-specific system trained to add
         | auxiliary geometric objects, like extra lines, to existing
         | Euclidean geometry configurations. These prompts are not even
         | sensible inputs to AlphaGeometry.
        
       | hi_dang_ wrote:
       | I was hoping that someone in the comments talking the paper down
       | would have published a paper or have had relevant publications of
       | their own to point to. You know, meet the lads halfway sort of
       | thing.
       | 
       | So what I'm left with to judge instead is anonymous online
       | commenters vs. the publication of 2 prestigious universities.
       | Whose word do I take on this? Decisions, decisions.
       | 
       | You can swap LM with Web3, NFT, or Crypto in this case.
        
         | warkdarrior wrote:
         | > I'm left with [...] is anonymous online commenters vs. the
         | publication of 2 prestigious universities. Whose word do I take
         | on this?
         | 
         | Maybe you need to judge the contents of those online comments
         | and the contents of the publication, instead of relying on
         | argument from authority.
        
       | vessenes wrote:
       | A few comments below talk about how tokenizing images using stuff
       | like CLIP de-facto yields blurry image descriptions, and so these
       | are 'blind' by some definitions. Another angle of blurring not
       | much discussed is that the images are rescaled down; different
       | resolutions for different models. I wouldn't be surprised if
       | Sonnet 3.5 had a higher-res base image it feeds into the model.
       | 
       | Either way, I would guess that we'll need new model architectures
       | for multimodal to get really good at some of this, and even then
       | some of these tasks are adjacent to things that we know LLMs are
       | already bad at (numeric logic, for instance).
       | 
       | As context lengths get longer, devoting more tokens to the image
       | tokenization should help a bit here as well. Anyway, I'd
       | anticipate next year we'd see 80s and 90s for most of these
       | scores with next gen models.
        
         | imtringued wrote:
         | The problem with the current crop of projectors, such as the
         | one in LLaVA, is that as far as I know they do not take the
         | previous conversation into account. You only really get
         | zero-shot responses. This means that you cannot steer the model
         | towards paying attention to specific instruction-related
         | details. The projector simply creates a token representation of
         | the visuals (not necessarily human language tokens) and the LLM
         | just processes that as usual.
        
           | vessenes wrote:
           | The original GPT-4 did this too; it had almost no memory
           | before or after the image provided. I haven't tested GPT-4o
           | on this directly, but from casual usage my feeling is that
           | it's better.
           | 
           | I do think some of these thin line drawings are likely extra
           | hard to tokenize depending on the image scaling sizes for
           | tokenization. I'd wager thicker lines would help, although
           | obviously not all of this is just 'poor tokenization'.
        
         | ec109685 wrote:
         | At least for GPT-4o, it can create a facsimile of images that
         | it still can't analyze properly, so I think it's more than just
         | its "eyes" that are broken.
         | 
         | It clearly wasn't trained on this task and suffers accordingly.
         | 
         | However, with ChatGPT, it will write Python to do the analysis
         | and gets better results.
        
       | spullara wrote:
       | in other news, vision models are bad at things they aren't
       | trained to do
        
       | akavi wrote:
       | Speaking as someone with only a tenuous grasp of how VLMs work,
       | this naively feels like a place where the "embodiement" folks
       | might have a point: Humans have the ability to "refine" their
       | perception of an image iteratively, focusing in on areas of
       | interest, while VLMs have to process the entire image at the same
       | level of fidelity.
       | 
       | I'm curious if there'd be a way to emulate this (have the visual
       | tokens be low fidelity at first, but allow the VLM to emit tokens
       | that correspond to "focusing" on a region of the image with
       | greater resolution). I'm not sure if/how it's possible to
       | performantly train a model with "interactive" data like that,
       | though
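       | 
       | One crude way to emulate that focusing at inference time (rather
       | than in training) is a two-pass crop-and-requery loop. A toy
       | sketch; query_vlm is a hypothetical stand-in for whatever VLM API
       | you use, and the coordinate convention is made up:
       | 
       |     from PIL import Image
       | 
       |     def query_vlm(image, prompt):
       |         raise NotImplementedError  # stand-in for a real VLM call
       | 
       |     def answer_with_zoom(path, question):
       |         image = Image.open(path)
       |         # Pass 1: ask where to look, using a cheap low-res view.
       |         ask = "Give x0,y0,x1,y1 (fractions) for: " + question
       |         region = query_vlm(image.resize((256, 256)), ask)
       |         x0, y0, x1, y1 = (float(v) for v in region.split(","))
       |         w, h = image.size
       |         crop = image.crop((int(x0 * w), int(y0 * h),
       |                            int(x1 * w), int(y1 * h)))
       |         # Pass 2: re-ask the question on the high-res crop.
       |         return query_vlm(crop, question)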
        
         | efskap wrote:
         | Isn't this the attention mechanism, the reason we're using
         | transformers for these things? Maybe not greater resolution per
         | se, but focusing on a region with greater neural connectivity
        
           | akavi wrote:
           | Ah, good point!
           | 
           | But the model is downstream of the "patch" tokenization, so
           | the cut-down in resolution (compression) of the image has
           | already occurred _prior_ to the point where the model can
           | direct greater  "attention".
           | 
           | I think the synthesis is that I'm proposing a per-pixel
           | tokenization with a transformer block whose purpose is to
           | output information at a compression level "equivalent" to
           | that of the patch tokens (is this what an autoencoder is?),
           | but where the attention vector is a function of the full
           | state of the LLM (i.e., inclusive of the text surrounding the
           | image).
           | 
           | Naively, I'd think a layer like this that is agnostic to the
           | LLM state needn't be any more computationally costly than the
           | patching computation (both are big honks of linear algebra?),
           | but idk how expensive the "full context attention" feedback
           | is...
           | 
           | (I apologize to anyone who actually understands transformers
           | for my gratuitous (ab|mis)use of terminology)
        
         | Brechreiz wrote:
         | >Humans have the ability to "refine" their perception of an
         | image iteratively
         | 
         | That's not related to embodied cognition.
        
           | akavi wrote:
           | Is embodied cognition not at least in part about
           | interactivity? I perform action (emit tokens) and receive
           | feedback (non-self-generated tokens)
        
         | kromem wrote:
         | Lots and lots of eye tracking data paired with what was being
         | looked at in order to emulate human attention processing might
         | be one of the lower hanging fruits for improving it.
        
         | caddemon wrote:
         | Humans are actually born with blurry vision as the eye takes
         | time to develop, so human learning starts with low resolution
         | images. There is a theory that this is not really a limitation
         | but a benefit in developing our visual processing systems.
         | People in poorer countries that get cataracts removed when they
         | are a bit older and should at that point hardware-wise have
         | perfect vision do still seem to have some lifelong deficits.
         | 
         | It's not entirely known how much early learning in low
         | resolution makes a difference in humans, and obviously that
         | could also relate more to our specific neurobiology than a
         | general truth about learning in connectionist systems. But I
         | found it to be an interesting idea that maybe certain outcomes
         | with ANNs could be influenced a lot by training paradigms s.t.
         | not all shortcomings could be addressed with only updates to
         | the core architecture.
        
         | slashdave wrote:
         | These models have learned to focus on specific portions of an
         | image (after all, this is the stated purpose of a transformer).
        
       | tantalor wrote:
       | Are the "random-baseline accuracy" numbers correct?
       | 
       | In the "Two circles" test, do they really have 50% chance of
       | overlapping? I think this comes from "Distances between circle
       | perimeters: -0.15 to 0.5 times the diameter", but it doesn't say
       | what distribution they use.
        
         | jdlshore wrote:
         | They asked the AI a question with a yes/no response. If the AI
         | chose randomly, it would be correct 50% of the time. That's
         | what "random baseline accuracy" means.
        
       | jeromeparadis wrote:
       | One use-case I always try is to have an AI read a school
       | calendar image where days off or days of interest are
       | highlighted using a legend, i.e. days with a square, circle or
       | triangle, or a different color, etc.
       | 
       | When asking for specific days of interest for the school
       | year, AIs always struggle. They get some days right but forget
       | some or fabricate new days. They fare a bit better if you remove
       | some of the noise and give them only a picture of a month, but
       | even then it's unreliable.
        
       | verbalstoner wrote:
       | It's virtually impossible to take a paper seriously when the
       | title has an emoji.
        
       | axblount wrote:
       | Would you say they have _Blindsight_?
        
       | pjs_ wrote:
       | I don't like this paper for the following reasons:
       | 
       | - The language is unnecessarily scathing
       | 
       | - They repeatedly show data where the models are getting things
       | _right_ 70, 80, 90% of the time, and then show a list of what
       | they call  "qualitative samples" (what does "qualitative" mean?
       | "cherry-picked"?) which look very bad. But it got the answer
       | right 70/80/90% of the time! That's hardly "blind"...
       | 
       | - Various of the tasks hinge on the distinction between two
       | objects "exactly touching" vs. "very nearly touching" vs. "very
       | slightly overlapping", a problem which (i) is hard for humans and
       | (ii) is particularly (presumably deliberately) sensitive to
       | resolution/precision, where we should not be surprised that
       | models fail
       | 
       | - The main fish-shaped example given in task 1 seems genuinely
       | ambiguous to me - do the lines "intersect" once or twice? The
       | tail of the fish clearly has a crossing, but the nose of the fish
       | seems a bit fishy to me... is that really an intersection?
       | 
       | - AFAIC deranged skepticism is just as bad as deranged hype; the
       | framing here is at risk of appealing to the former
       | 
       | It's absolutely fair to make the point that these models are not
       | perfect, fail a bunch of the time, and to point out the edge
       | cases where they suck. That moves the field forwards. But the
       | hyperbole (as pointed out by another commenter) is very annoying.
        
         | neuronet wrote:
         | To be fair, the paper has an emoji in the _title_, so I
          | wouldn't read it as a particularly serious academic study so
          | much as the equivalent of the Gawker of AI
         | research. It is a "gotcha" paper that exploits some blind spots
         | (sorry) that will easily be patched up with a few batches of
         | training. I do think it highlights the lack of AGI in these
         | things, which some people lacking situational awareness might
         | need to see.
        
         | numeri wrote:
         | I'm also confused about some of the figures' captions, which
         | don't seem to match the results:
         | 
         | - "Only Sonnet-3.5 can count the squares in a majority of the
         | images", but Sonnet-3, Gemini-1.5 and Sonnet-3.5 all have
         | accuracy of >50%
         | 
         | - "Sonnet-3.5 tends to conservatively answer "No" regardless of
         | the actual distance between the two circles.", but it somehow
         | gets 91% accuracy? That doesn't sound like it tends to answer
         | "No" regardless of distance.
        
         | schneehertz wrote:
         | I am not sure where their experimental data came from. I tested
         | it on GPT-4o using the prompt and images they provided, and the
         | success rate was quite high, with significant differences from
         | the results they provided.
        
           | ec109685 wrote:
           | Their examples are here: https://huggingface.co/datasets/XAI/
           | vlmsareblind/viewer/defa...
           | 
           | ChatGPT whiffs completely on very obvious images.
        
       | cpill wrote:
       | I wonder how they would score if they used all 4 models and took
       | a majority vote...?
        
       | aaroninsf wrote:
       | The title for this page and argument should be qualified with the
       | specific generation of tools.
       | 
       | That's in the abstract, but it's bad not to be specific. In this
       | case, because current public-facing models are WIWEB: the worst
       | it will ever be.
       | 
       | And there are trillion-dollar prizes at stake, so, improvement is
       | happening as quickly as it possibly can.
        
       | make3 wrote:
       | Hugged to death from my perspective. Here is a backup:
       | https://archive.ph/kOE3Q
        
         | simonw wrote:
         | That's weird - GitHub Pages serves static content and rarely
         | (in my experience) fails to load.
        
       | jetrink wrote:
       | I had a remarkable experience with GPT-4o yesterday. Our garage
       | door started to fall down recently, so I inspected it and found
       | that our landlord had installed the wire rope clips incorrectly,
       | leading to the torsion cables losing tension. I didn't know what
       | that piece of hardware was called, so I asked ChatGPT and it
       | identified the part as I expected it to. As a test, I asked if
       | there was anything notable about the photo. ChatGPT correctly
       | identified that the cables were installed backwards, with the
       | side of the cable that was (previously) under tension on top of
       | the slack end, instead of sandwiched securely in the middle. To
       | diagnose that requires tracing the cable through space and
       | inferring which end is under tension from the geometry, though I
       | can't rule out an educated guess.
       | 
       | What was really remarkable though was that it failed to notice
       | that one of the two nuts was obviously missing, even after I told
       | it there was a second problem with the installation.
       | 
       | Screenshot: https://imgur.com/a/QqCNzOM
        
         | sfink wrote:
         | A _human_ would need to trace the cable. An LLM may just be
         | responding based on (1) the fact that you 're asking about the
         | clip in the first place, and that commonly happens when there's
         | something wrong; and (2) that this is a very common failure
         | mode. This is supported by it bringing up the "never saddle a
         | dead horse" mnemonic, which suggests the issue is common.
         | 
         | After you fix it, you should try asking the same questions!
        
         | fn-mote wrote:
         | As a human, I was unable to see enough in that picture to infer
         | which side was supposed to be under tension. I'm not trained,
         | but I know what I expected to see from your description.
         | 
         | Like my sister post, I'm skeptical that the LLM didn't just get
         | lucky.
        
       | nmca wrote:
       | please use this opportunity to reflect on whether ARC measures
       | reasoning skills :)
        
       | gnutrino wrote:
       | My guess is that the systems are running image recognition
       | models, and maybe OCR on images, and then just piping that data
       | as tokens into an LLM. So you are only ever going to get results
       | as good as existing image models with the results filtered
       | through an LLM.
       | 
       | To me, this is only interesting if compared with results of image
       | recognition models that can already answer these types of
       | questions (if they even exist, I haven't looked).
       | 
       | Maybe the service is smart enough to look at the question, and
       | then choose one or more models to process the image, but not sure
       | as I can't find anything on their sites about how it works.
        
         | Eisenstein wrote:
         | > My guess is that the systems are running image recognition
         | models
         | 
         | Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an
         | idea of how they work.
        
           | gnutrino wrote:
           | Will do, thank you.
        
         | simonw wrote:
         | That's not how they work. The original GPT-4 paper has some
         | detail: https://cdn.openai.com/papers/gpt-4.pdf
         | 
         | Or read up on PaliGemma: https://github.com/google-
         | research/big_vision/blob/main/big_...
        
           | gnutrino wrote:
           | Thanks, I'll read up on this.
        
       | nichohel wrote:
       | Vision language models are blind because they lack the Cartesian
       | Theater, which you and I have. Which you and I say we have.
        
         | codeulike wrote:
         | Does the part of you that 'looks at' your cartesian theatre
         | also have a cartesian theatre?
        
         | fleshmonad wrote:
         | [citation needed]
        
       | viraptor wrote:
       | I love some of the interpretations there. For example "Fig. 10:
       | Only Sonnet-3.5 can count the squares in a majority of the
       | images.", when that model simply returns "4" for every question
       | and happens to be right.
        
       | jackblemming wrote:
       | Ask it to draw any of those things and it can.
        
       | mkoubaa wrote:
       | They interact with pixel buffers as a mathematical array. To call
       | them blind is to confuse what they are doing with the experience
       | of sight...
        
         | codeulike wrote:
         | Humans 'see' by tightly packed rods and cones in the retina
         | sending signals up the optic nerve. Not as tidy as a
         | mathematical array but nonetheless not all that different.
         | Ultimately what comes to the brain from the retina can be
         | thought of as a data structure of sorts.
        
       | Rebuff5007 wrote:
       | In fairness, Mira Murati said GPT-4 is only high school level
       | [1]. Maybe it takes PhD level to understand basic shapes?
       | 
       | [1] https://www.ccn.com/news/technology/openais-gpt-5-phd-
       | level-...
        
       | jordan_bonecut wrote:
       | This is an interesting article and goes along with my
       | understanding of how such models interpret input data. I'm not
       | sure I would characterize the results as blurry vision, but maybe
       | an inability to process what they see in a concrete manner.
       | 
       | All the LLMs and multi-modal models I've seen lack concrete
       | reasoning. For instance, ask ChatGPT to perform two tasks: to
       | summarize a chunk of text and to count how many words are in that
       | chunk. ChatGPT will do a very good job summarizing the text and
       | an awful job at counting the words. ChatGPT and all the
       | transformer-based models I've seen fail at similar
       | concrete/mathematical reasoning tasks. This is the core problem
       | of creating AGI and it generally seems like no one has made any
       | progress towards synthesizing something with both a high and low
       | level of intelligence.
       | 
       | My (unproven and probably incorrect) theory is that under the
       | hood these networks lack information-processing loops, which
       | makes recursive tasks, like solving a math problem, very
       | difficult.
        
         | scarface_74 wrote:
         | Out of curiosity, I tried your test with ChatGPT 4o
         | 
         | https://chatgpt.com/share/79c5c6e1-e6a9-441b-acb3-54882303a8...
         | 
         | Of course as usual, LLMs are horrible with Math.
         | 
         | Funny enough, the next time it verified the word count by
         | counting it out until I specifically told it to use Python
         | 
         | https://chatgpt.com/share/79e7b922-9b0f-4df9-98d0-2cd72d7041...
        
           | infiar wrote:
           | This counting words task reminded me of a youtube video:
           | https://www.youtube.com/watch?v=-9XKiOXaHlI Maybe LLMs are
           | somehow more like monkeys.
        
         | empiricus wrote:
         | I hope you are aware of the fact that LLMs do not have direct
         | access to the stream of words/characters. It is one of the most
         | basic things to know about their implementation.
        
           | jordan_bonecut wrote:
           | Yes, but it could learn to associate tokens with word counts
           | as it could with meanings.
           | 
           | Even still, if you ask it for a token count it would still
           | fail. My point is that it can't count; the circuitry required
           | to do so seems absent in these models.
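           | 
           | For contrast, the counting itself is trivial outside the
           | model. A quick sketch (assumes the tiktoken package; the
           | sample sentence is made up):
           | 
           |     import tiktoken
           | 
           |     text = "Count how many words are in this sentence."
           |     enc = tiktoken.get_encoding("cl100k_base")
           | 
           |     print(len(text.split()))      # 8 words
           |     # Token count; tokens rarely map one-to-one to words.
           |     print(len(enc.encode(text)))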
        
       | randomtree wrote:
       | I guess I know what's coming to every captcha tomorrow.
        
       | michaelhoney wrote:
       | This says to me that there are huge opportunities for improvement
       | in providing vision modules for LLMs. Human minds aren't made of
       | just one kind of thing: we have all sorts of hacky modular
       | capabilities - there's no reason to think that a future AGI
       | wouldn't also.
        
       | joelburget wrote:
       | Vision Transformers do a shocking amount of compression in the
       | tokenizer. In the [Chameleon
       | paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer
       | "encodes a 512 x 512 image into 1024 discrete tokens from a
       | codebook of size 8192". That's 256 pixels per token (512 * 512 /
       | 1024). If we assume that a pixel is 24 bits (3x 8-bit channels),
       | this implies they've compressed 256 * 24 = 6144 bits into 13 bits
       | (= log2(8192)) per token. [An Image is Worth 32 Tokens for
       | Reconstruction
       | and Generation](https://yucornetto.github.io/projects/titok.html)
       | pushes this even further. If these models work similarly, it's no
       | wonder they struggle with some vision tasks.
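       | 
       | A quick back-of-the-envelope check of those numbers (assuming a
       | 512x512 RGB image at 8 bits per channel, 1024 tokens, and a
       | codebook of 8192 entries, as quoted):
       | 
       |     import math
       | 
       |     pixels = 512 * 512
       |     tokens = 1024
       |     bits_per_pixel = 3 * 8
       |     bits_per_token = math.log2(8192)  # 13.0
       | 
       |     pixels_per_token = pixels // tokens  # 256
       |     raw_bits = pixels_per_token * bits_per_pixel  # 6144
       |     print(raw_bits / bits_per_token)  # ~472x compression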
        
         | energy123 wrote:
         | GPT-4o is very good at some visual tasks like optical character
         | recognition. So the selective blindness might just be what you
         | say here -- all of its capacity is dedicated to minimizing loss
         | on a few narrow tasks that had the most training data (like
         | OCR). So it's not necessarily an inherent failure of the
          | architecture to generalize; it could just be a capacity issue
         | that will naturally be resolved with more scale.
        
           | sushid wrote:
           | Is that not just traditional OCR applied on top of LLM?
        
             | energy123 wrote:
             | It's possible they have a software layer that does that.
             | But I was assuming they don't, because the open source
             | multimodal models don't.
        
             | maxlamb wrote:
             | No it's not, it's a multimodal transformer model.
        
         | buryat wrote:
         | For some reason this got me thinking about trying to describe
         | the taste of a fruit to someone who hasn't tried it - something
         | similar to this, but for a non-visual sensory modality in
         | humans.
        
         | ec109685 wrote:
         | It's not as simple as that. If you ask GPT-4o to create a copy
         | of these images, it generally creates one faithfully (e.g. an
         | image with 5 squares will be produced), so it's "seeing" things
         | reasonably enough.
         | 
         | It doesn't seem to have the logic though to answer these
         | questions.
         | 
         | The complete data set is here to play around with it yourself:
         | https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...
        
       | kristianpaul wrote:
       | We see through thoughts and memories. We see when we desire; the
       | vision just adds on a world of thoughts and consciousness of
       | being conscious.
       | 
       | Vision links thoughts with reality
        
       | navaed01 wrote:
       | Is there a good primer on how these vision LLMs work?
        
       | yantrams wrote:
       | Tested these problems with llava-v1.6-mistral-7b and the results
       | aren't bad. Maybe I just got lucky with these samples
       | 
       | Intersecting Lines
       | https://replicate.com/p/s24aeawxasrgj0cgkzabtj53rc
       | 
       | Overlapping Circles
       | https://replicate.com/p/0w026pgbgxrgg0cgkzcv11k384
       | 
       | Touching Circles
       | https://replicate.com/p/105se4p2mnrgm0cgkzcvm83tdc
       | 
       | Circled Text https://replicate.com/p/3kdrb26nwdrgj0cgkzerez14wc
       | 
       | Nested Squares https://replicate.com/p/1ycah63hr1rgg0cgkzf99srpxm
        
         | simonw wrote:
         | These are really interesting examples, thanks for sharing.
        
           | yantrams wrote:
           | You're welcome. I recently noticed I get better performance
           | with VLMs when the queries are phrased this way - Descriptive
           | Keys instead of explaining the problem in sentences. Similar
           | to CoT reasoning, which many people claim gives better
           | results, I personally found that querying in this sequence -
           | existenceOfEntity, numberOfEntities, followed by
           | propertiesOfEntities, etc. - tends to give better results. I
           | haven't verified any of this rigorously so please do take it
           | with a pinch of salt :)
        
       | poikroequ wrote:
       | It's ironic: they fail these seemingly simple tests that are
       | trivial even for a child to solve. Yet, I used Gemini to read a
       | postcard containing handwritten Russian cursive text with lots of
       | visual noise (postmarks and whatnot). It was able to read the
       | text and translate it into English. I didn't even need to tell it
       | the text was Russian.
       | 
       | On the one hand, it's incredible what these LLMs are capable of.
       | On the other hand, they often fall flat on their face with
       | seemingly simple problems like this. We are seeing the same from
       | self-driving cars, getting into accidents in scenarios that
       | almost any human driver could have easily avoided.
        
         | slashdave wrote:
         | Simple for a child, yes. Because we have evolved our vision to
         | recognize patterns like this, because they are important for
         | survival. Reading Russian is not.
         | 
         | From an algorithmic point of view, these vision tasks are
         | actually quite difficult to explicitly program.
        
       | nothrowaways wrote:
       | The next version will solve all of it.
        
       | childintime wrote:
       | Claude 3.5 does remarkably well though on many tasks, compared to
       | the others, and on those it's not at all blind. It's getting
       | there.
        
       ___________________________________________________________________
       (page generated 2024-07-11 23:02 UTC)