[HN Gopher] Vision language models are blind
       ___________________________________________________________________
        
       Vision language models are blind
        
       Author : taesiri
       Score  : 245 points
       Date   : 2024-07-10 13:35 UTC (9 hours ago)
        
 (HTM) web link (vlmsareblind.github.io)
 (TXT) w3m dump (vlmsareblind.github.io)
        
       | taesiri wrote:
       | This paper examines the limitations of current vision-based
       | language models, such as GPT-4 and Sonnet 3.5, in performing low-
       | level vision tasks. Despite their high scores on numerous
       | multimodal benchmarks, these models often fail on very basic
       | cases. This raises a crucial question: are we evaluating these
       | models accurately?
        
       | rezaghanbari1 wrote:
       | Some of these samples are shocking. How do these models answer
       | chart-based questions, I mean when they can't even count the
       | intersections between two lines?
        
         | RodgerTheGreat wrote:
         | Same way they answer any question: piece together a
         | statistically probable sequence of words to follow the prompt.
         | All they know about an image is a handful of words a classifier
         | might choose to describe it. If those words have nothing to do
         | with the question being asked, they can't nudge the model in
          | the general direction of a correct answer, so it's a crapshoot,
          | even more so than usual.
        
         | imtringued wrote:
         | The dataset most likely contains chart descriptions that
         | describe the raw data, but not the visual interactions of the
         | individual pixels.
        
       | dheera wrote:
       | Current approaches of multi-modal models work on embeddings and
       | tokenizations of images, which is the fundamental problem: you
       | are feeding blurry, non-precise data into the model. Yes, they
       | are blind because of exactly this.
       | 
       | An embedding isn't conceptually that much different from feeding
       | a 1024-word description of an image instead of the actual image.
       | 
       | At the moment compute power isn't good enough to feed high-res
       | pixel data into these models, unless we discover a vastly
       | different architecture, which I am also convinced likely exists.
        
         | jayd16 wrote:
         | Doesn't Gemini have a 2 million token limit for exactly this?
        
           | diwank wrote:
            | The number of tokens _per image_ is actually fairly small,
            | ranging from 85 to ~500.
        
         | visarga wrote:
         | > An embedding isn't conceptually that much different from
         | feeding a 1024-word description of an image instead of the
         | actual image.
         | 
          | An embedding needs fewer words. You can embed individual words,
          | phrases, whole prompts, and longer paragraphs. You don't need
          | 1024 words for a text embedding. For example, a well-known
          | library for this is Sentence-BERT (sbert).
          | 
          | When you embed images, on the other hand, you cut them up into
          | little squares on the order of 32x32 px and embed each of them
          | separately. ChatGPT uses something like 250 tokens for smaller
          | images. So a smaller image costs about as much as 200 words if
          | represented graphically, and maybe far fewer words if you embed
          | a text description of it.
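          | 
          | A back-of-the-envelope sketch of that patch arithmetic (assuming
          | 32x32-px patches and no resizing; real models differ in patch
          | size and usually rescale the image first):
          | 
          |     # Rough patch-token count for a ViT-style image encoder.
          |     # Assumption: 32x32 patches, no resizing step.
          |     import math
          | 
          |     def patch_tokens(w, h, patch=32):
          |         return math.ceil(w / patch) * math.ceil(h / patch)
          | 
          |     print(patch_tokens(512, 512))   # 256 -- roughly the ~250 above
          |     print(patch_tokens(1024, 768))  # 768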
        
           | dheera wrote:
            | > needs fewer words
           | 
           | Yes I'm aware of this, and work in ML -- the thing is
           | embeddings are not designed for faithful image
           | reconstruction, and aren't even trained that way. You can
           | easily find two images that have substantially similar CLIP
           | (or whatever) embeddings that are visually very different. If
           | you query the LLM about that difference, the LLM wouldn't
           | even have the information to differentiate answers for the
           | two images if you only supply it with the embedding.
           | 
           | On the other hand, SDXL autoencoder latents passed into an
           | LLM alongside the embedding _might_ be a step up from just an
           | image embedding, since they are designed for image
            | reconstruction, but I don't have access to the compute or
           | data resources to attempt training this.
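            | 
            | A quick way to probe that claim (a minimal sketch using the
            | public openai/clip-vit-base-patch32 checkpoint via HuggingFace
            | transformers; the two file names are placeholders):
            | 
            |     import torch
            |     from PIL import Image
            |     from transformers import CLIPModel, CLIPProcessor
            | 
            |     model = CLIPModel.from_pretrained(
            |         "openai/clip-vit-base-patch32")
            |     proc = CLIPProcessor.from_pretrained(
            |         "openai/clip-vit-base-patch32")
            | 
            |     imgs = [Image.open("a.png"), Image.open("b.png")]
            |     inputs = proc(images=imgs, return_tensors="pt")
            |     with torch.no_grad():
            |         feats = model.get_image_features(**inputs)
            |     feats = feats / feats.norm(dim=-1, keepdim=True)
            |     # A high cosine similarity for visually different images
            |     # is exactly the lossiness described above.
            |     print((feats[0] @ feats[1]).item())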
        
       | cs702 wrote:
       | Wow, that is _embarrassingly bad performance_ for current SOTA
       | models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5), which are
       | advertised and sold as being able to understand images, e.g., for
       | guiding the blind or tutoring children in geometry!
       | 
       | The tasks at which they fail are ridiculously simple for human
       | beings, including, for example:
       | 
       | * counting the number of times two lines intersect;
       | 
       | * detecting whether two circles overlap;
       | 
       | * selecting which letter is being circled in a word;
       | 
       | * counting the number of circles in an Olympic-like logo.
       | 
       | This should be at the top of the front page.
        
         | tensor wrote:
         | I don't see how this is "embarrassing" in the slightest. These
         | models are not human brains, and the fact that people equate
         | them with human brains is an embarrassing failure of the humans
         | more than anything about the models.
         | 
         | It's entirely unsurprising that there are numerous cases that
         | these models can't handle that are "obvious to humans." Machine
         | learning has had this property since its invention and it's a
         | classic mistake humans make dealing with these systems.
         | 
         | Humans assume that because a machine learning model has above
         | human accuracy on task X that it implies that it must also have
         | that ability at all the other tasks. While a human with amazing
         | ability at X would indeed have amazing abilities at other
          | tasks, this is not true of machine learning models. The opposite
          | thinking is also wrong: that because the model can't do well on
          | task Y, it must be unreliable and its ability on task X is
          | somehow an illusion and not to be trusted.
        
           | cs702 wrote:
           | It is embarrassingly, shockingly bad, because these models
           | are _advertised_ and _sold_ as being capable of understanding
           | images.
           | 
           | Evidently, all these models still fall short.
        
             | kristjansson wrote:
             | It's surprising because these models are pretty ok at some
             | vision tasks. The existence of a clear failure mode is
             | interesting and informative, not embarrassing.
        
             | knowaveragejoe wrote:
             | Not only are they capable of understanding images(the kind
             | people might actually feed into such a system -
             | photographs), but they're pretty good at it.
             | 
             | A modern robot would struggle to fold socks and put them in
             | a drawer, but they're great at making cars.
        
               | pixl97 wrote:
               | I mean, with some of the recent demos, robots have got a
               | lot better at folding stuff and putting it up. Not saying
               | it's anywhere close to human level, but it has taken a
               | pretty massive leap from being a joke just a few years
               | ago.
        
             | startupsfail wrote:
             | Humans are also shockingly bad on these tasks. And guess
             | where the labeling was coming from...
        
             | simonw wrote:
             | I see this complaint about LLMs all the time - that they're
             | advertised as being infallible but fail the moment you give
             | them a simple logic puzzle or ask for a citation.
             | 
             | And yet... every interface to every LLM has a "ChatGPT can
             | make mistakes. Check important info." style disclaimer.
             | 
             | The hype around this stuff may be deafening, but it's often
             | not entirely the direct fault of the model vendors
             | themselves, who even put out lengthy papers describing
             | their many flaws.
        
               | jazzyjackson wrote:
               | There's evidently a large gap between what researchers
               | publish, the disclaimers a vendor makes, and what gets
               | broadcast on CNBC, no surprise there.
        
               | jampekka wrote:
               | A bit like how Tesla Full Self-Driving is not to be used
               | as self-driving. Or any other small print. Or ads in
               | general. Lying by deliberately giving the wrong
               | impression.
        
           | mrbungie wrote:
            | These models are marketed as being able to guide the blind or
            | tutor children using direct camera access.
           | 
           | Promoting those use cases and models failing in these ways is
            | irresponsible. So, yeah, maybe the models are not embarrassing,
           | but the hype definitely is.
        
             | cs702 wrote:
             | _> Promoting those use cases and models failing in these
             | ways is irresponsible._
             | 
             | Yes, _exactly_.
        
           | scotty79 wrote:
            | You'd expect them to be trained on simple geometry, since you
            | can create an arbitrarily large synthetic training set for that.
        
           | sfink wrote:
           | Well said.
           | 
           | It doesn't matter how they are marketed or described or held
           | up to some standard generated by wishful thinking. And it
           | especially doesn't matter what it would mean if a human were
           | to make the same error.
           | 
           | It matters what they are, what they're doing, and how they're
           | doing it. Feel free to be embarrassed if _you_ are claiming
            | they can do what they can't and are maybe even selling them
           | on that basis. But there's nothing embarrassing about their
           | current set of capabilities. They are very good at what they
           | are very good at. Expecting those capabilities to generalize
           | as they would if they were human is like getting embarrassed
           | that your screwdriver can't pound in a nail, when it is ever
           | so good at driving in screws.
        
           | insane_dreamer wrote:
           | > is an embarrassing failure of the humans more than anything
           | about the models
           | 
           | No, it's a failure of the companies who are advertising them
           | as capable of doing something which they are not (assisting
           | people with low vision)
        
             | simonw wrote:
             | But they CAN assist people with low vision. I've talked to
             | someone who's been using a product based on GPT-4o and
             | absolutely loves it.
             | 
             | Low vision users understand the limitations of
             | accessibility technology better than anyone else. They will
             | VERY quickly figure out what this tech can be used for
             | effectively and what it can't.
        
       | sweezyjeezy wrote:
       | Entertaining, but I think the conclusion is way off.
       | 
       | > their vision is, at best, like that of a person with myopia
       | seeing fine details as blurry
       | 
       | is a crazy thing to write in an abstract. Did they try to probe
       | that hypothesis at all? I could (well actually I can't) share
       | some examples from my job of GPT-4v doing some pretty difficult
       | fine-grained visual tasks that invalidate this.
       | 
        | Personally, I rate this paper [1], which makes the argument that
        | these huge GenAI models are pretty good at things - _assuming
        | they have seen a LOT of that type of data during training_
        | (which is true of a great many things). If you make up tasks
        | like this, then yes, they can be REALLY bad at them, and initial
        | impressions of AGI get harder to justify. But in practice, we
        | aren't just making up tasks to trip up these models. They can be
        | very performant on some tasks and the authors have not presented
        | any real evidence about these two modes.
       | 
       | [1] https://arxiv.org/abs/2404.04125
        
         | diwank wrote:
         | Yeah I think their findings are def interesting but the title
         | and the strong claims are a tad hyperbolic.
        
         | SomaticPirate wrote:
         | There are quite a few "ai apologists" in the comments but I
         | think the title is fair when these models are marketed towards
         | low vision people ("Be my eyes"
         | https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent
         | to human vision. These models are implied to be human level
         | equivalents when they are not.
         | 
         | This paper demonstrates that there are still some major gaps
         | where simple problems confound the models in unexpected ways.
          | This is important work to elevate; otherwise people may start
          | to believe that these models are suitable for general
          | application when they still need safeguards and copious
          | warnings.
        
           | sweezyjeezy wrote:
           | The paper I linked should hopefully mark me out as far from
           | an AI apologist, it's actually really bad news for GenAI if
           | correct. All I mean to say is the clickbait conclusion and
           | the evidence do not match up.
        
             | Melomololotolo wrote:
              | We have started the era of AI.
              | 
              | It really doesn't matter how good current LLMs are.
              | 
              | They have been good enough to start this era.
              | 
              | And no, it's not and never has been just LLMs. Look at what
              | Nvidia is doing with ML.
              | 
              | Whisper: a huge advance. Segment Anything: again huge.
              | AlphaFold 2: again huge.
              | 
              | All the robot announcements -> huge
              | 
              | I doubt we will reach AGI just through LLMs. We will reach
              | AGI through multimodal models, mixture of experts, some kind
              | of feedback loop, etc.
              | 
              | But the stone has started to roll.
              | 
              | And you know, I prefer to hear about AI advances for the
              | next 10-30 years. That's a lot better than the crypto shit
              | we had the last 5 years.
        
           | pixl97 wrote:
           | Yea, really if you look at human learning/seeing/acting there
           | is a feedback loop that LLM for example isn't able to
           | complete and train on.
           | 
           | You see an object. First you have to learn how to control all
           | your body functions to move toward it and grasp it. This
           | teaches you about the 3 dimensional world and things like
           | gravity. You may not know the terms, but it is baked in your
           | learning model. After you get an object you start building a
           | classification list "hot", "sharp", "soft and fuzzy",
           | "tasty", "slick". Your learning model builds up a list of
           | properties of objects and "expected" properties of objects.
           | 
           | Once you have this 'database' you create as a human, you can
           | apply the logic to achieve tasks. "Walk 10 feet forward, but
           | avoid the sharp glass just to the left". You have to have
           | spatial awareness, object awareness, and prediction ability.
           | 
            | Models 'kind of' have this, but it's seemingly haphazard, kind
           | of like a child that doesn't know how to put all the pieces
           | together yet. I think a lot of embodied robot testing where
           | the embodied model feeds back training to the LLM/vision
           | model will have to occur before this is even somewhat close
           | to reliable.
        
           | Melomololotolo wrote:
           | Ah yes the blind person who constantly needs to know if two
           | lines intersect.
           | 
           | Let's just ignore what a blind person normally needs to know.
           | 
            | You know what blind people ask? Sometimes their daily routine
            | is broken because there is some type of construction, and
            | models can tell you this.
           | 
           | Sometimes they need to read a basic sign and models can do
           | this.
           | 
           | Those models help people already and they will continue to
           | get better.
           | 
            | I'm not sure if I'm more frustrated by how condescending the
            | authors are or by your ignorance.
           | 
           | Valid criticism doesn't need to be shitty
        
             | shagie wrote:
              | As an aside... from 2016, this is what a valid use case for
              | a blind person with an app looked like.
             | 
             | Seeing AI 2016 Prototype - A Microsoft research project -
             | https://youtu.be/R2mC-NUAmMk
             | 
             | https://www.seeingai.com are the actual working apps.
             | 
             | The version from 2016 I recall showing (pun not intended)
             | to a coworker who had some significant vision impairments
             | and he was _really_ excited about what it could do back
             | then.
             | 
             | ---
             | 
              | I still remain quite impressed with its ability to parse the
              | picture and the likely reason behind it:
              | https://imgur.com/a/JZBTk2t
        
           | benreesman wrote:
           | If we're throwing "citation needed" tags on stuff, how about
           | the first sentence?
           | 
           | "Large language models with vision capabilities (VLMs), e.g.,
           | GPT-4o and Gemini-1.5 Pro are powering countless image-text
           | processing applications"
           | 
            | I don't know how many "countless" is, but I think we've
           | gotten really sloppy in terms of what counts for LLMs as a
           | demonstrated, durable win in a concrete task attached to
           | well-measured outcomes and holding up over even modest
           | periods of time.
           | 
           | This stuff is really promising and lots of builders are
           | making lots of nifty things, so if that counts as an
           | application then maybe we're at countless, but in the
           | enterprise and in government and in refereed academic
           | literature we seem to be at the proof-of-concept phase.
            | Impressive chat bots as a use case are pretty dialed in, and
            | enough people claim that they help with coding that I tend to
            | believe it's a real thing (I never seem to come out ahead of
            | going directly to the source, StackOverflow).
           | 
           | The amount of breathless press on this seems "countless", so
           | maybe I missed the totally rigorous case study on how X
           | company became Y percent more profitable by doing Z thing
           | with LLMs (or similar), and if so I'd be grateful for
           | citations, but neither Google nor any of the big models seem
           | to know about it.
        
           | Lerc wrote:
            | I disagree. I think the title, abstract, and conclusion not
            | only misrepresent the state of the models but also
            | misrepresent their own findings.
           | 
           | They have identified a class of problems that the models
           | perform poorly at and have given a good description of the
           | failure. They portray this as a representative example of the
           | behaviour in general. This has not been shown and is probably
           | not true.
           | 
           | I don't think that models have been portrayed as equivalent
            | to humans. Like most AI, it has been shown to be vastly
            | superior in some areas and profoundly ignorant in others.
            | Media can overblow things and enthusiasts can talk about
            | future advances as if they have already arrived, but I don't
            | think these are typical portrayals by the AI field in general.
        
         | FrenchDevRemote wrote:
         | > their vision is, at best, like that of a person with myopia
         | seeing fine details as blurry
         | 
          | It's not that far from reality: most models see images at very
          | low resolution and with limited colors, so it's not so far from
          | this description.
        
           | blackmesaind wrote:
           | My thoughts as well. I too would have trouble with the
           | overlapping lines tests if all the images underwent
           | convolution.
        
           | vikramkr wrote:
           | They didn't test that claim at all though. Vision isn't some
           | sort of 1D sliding scale with every vision condition lying
           | along one axis.
           | 
           | First of all myopia isn't 'seeing fine details as blurry' -
           | it's nearsightedness - and whatever else this post tested it
           | definitely didn't test depth perception.
           | 
           | And second - inability to see fine details is a
           | distinct/different thing from not being able to count
           | intersections and the other things tested here. That
           | hypothesis, if valid, would imply that improving the
           | resolution of the image that the model can process would
           | improve its performance on these tasks even if reasoning
           | abilities were the same. That - does not make sense. Plenty
           | of the details in these images that these models are tripping
           | up on are perfectly distinguishable at low resolutions.
           | Counting rows and columns of blank grids is not going to
           | improve with more resolution.
           | 
           | I mean, I'd argue that the phrasing of the hypothesis ("At
           | best, like that of a person with myopia") doesn't make sense
           | at all. I don't think a person with myopia would have any
           | trouble with these tasks if you zoomed into the relevant
           | area, or held the image close. I have a very strong feeling
           | that these models would continue to suffer on these tasks if
           | you zoomed in. Nearsighted != unable to count squares.
        
         | jrflowers wrote:
         | >I could (well actually I can't)
         | 
         | I like the idea that these models are so good at some sort of
         | specific and secret bit of visual processing that things like
         | "counting shapes" and "beating a coin toss for accuracy"
         | shouldn't be considered when evaluating them.
        
           | vikramkr wrote:
           | Those don't really have anything to do with fine
           | detail/nearsightedness. What they measured is
           | valid/interesting - what they concluded is unrelated.
        
           | valine wrote:
           | LLMs are bad at counting things just in general. It's hard to
           | say whether the failures here are vision based or just an
           | inherent weakness of the language model.
        
         | godelski wrote:
         | > Did they try to probe that hypothesis at all?
         | 
         | I think this is a communication issue and you're being a bit
         | myopic in your interpretation. It is clearly an analogy meant
         | for communication and is not an actual hypothesis. Sure, they
         | could have used a better analogy and they could have done other
         | tests, but the paper still counters quite common claims (from
         | researchers) about VLMs.
         | 
         | > I could (well actually I can't) share some examples from my
         | job of GPT-4v doing some pretty difficult fine-grained visual
         | tasks that invalidate this.
         | 
         | I find it hard to believe that there is no example you can
         | give. It surely doesn't have to be exactly your training data.
         | If it is this good, surely you can create an example no
         | problem. If you just don't want to, that's okay, but then don't
         | say it.
         | 
         | But I have further questions. Do you have complicated
         | prompting? Or any prompt engineering? It sure does matter how
         | robust these models are to prompting. There's a huge difference
         | between a model being able to accomplish a task and a model
         | being able to perform a task in a non-very-specific
         | environment. This is no different than something working in a
         | tech demo and not in the hand of the user.
         | 
         | > But in practice, we aren't just making up tasks to trip up
         | these models.
         | 
         | I see this sentiment quite often and it is baffling to me.
         | 
          | First off, these tasks are clearly not designed to trick these
          | models. A model failing at a task does not suddenly make the
          | task "designed to trick a model." It's common with the river
          | crossing puzzles where they're rewritten to be like "all
          | animals can fit in the boat." If that is "designed to trick a
          | model", then the model must be a stochastic parrot and not a
          | generalist. It is very important that we test things where we
          | do know the answer because, unfortunately, we're not
          | clairvoyant and can't test questions we don't know the answer
          | to, which is the common case in real-world usage.
         | 
          | Second, so what if a test was designed to trip up a model?
          | Shouldn't we be determining when and where models fail? Is that
          | not a critical question in understanding how to use them
          | properly? This seems doubly important if they are tasks that
          | humans have no trouble with.
         | 
         | > They can be very performant on some tasks and the authors
         | have not presented any real evidence about these two modes.
         | 
         | I don't think people are claiming that large models can't be
         | performant on some tasks. If they are, they're rejecting
          | trivially verifiable reality. But not every criticism has to
          | also contain positive points. There's plenty of papers and a
          | lot of hype already doing that. And if we're going to be
          | critical of anything, shouldn't it be that the companies
          | creating these models -- selling them, and even charging
          | researchers to perform these types of experiments that can be
          | and are used to improve their products -- should be much more
          | clear about the limitations of their models? If we need
         | balance, then I think there's bigger fish to fry than Auburn
         | and Alberta Universities.
        
           | orbital-decay wrote:
           | _> I think this is a communication issue and you 're being a
           | bit myopic in your interpretation. It is clearly an analogy
           | meant for communication and is not an actual hypothesis._
           | 
            | I don't know, words have meanings. If that's a communication
            | issue, it's on the part of the authors. To me, this wording in
            | what is supposed to be a research paper abstract clearly
            | suggests insufficient resolution as the cause. How else should
            | I interpret it?
           | 
           |  _> The shockingly poor performance of four state-of-the-art
           | VLMs suggests their vision is, at best, like that of a person
           | with myopia seeing fine details as blurry_
           | 
           | And indeed, increasing the resolution is expensive, and the
           | best VLMs have something like 1000x1000. But the low
           | resolution is clearly not the issue here, and the authors
           | don't actually talk about it in the paper.
           | 
           |  _> I find it hard to believe that there is no example you
           | can give._
           | 
           | I'm not the person you're answering to, but I actually lazily
           | tried two of authors' examples in a less performant VLM
           | (CogVLM), and was surprised it passed those, making me wonder
           | whether I can trust their conclusions until I reproduce their
           | results. LLMs and VLMs have all kinds of weird failure modes,
           | it's not a secret they fail at some trivial tasks and their
           | behavior is still not well understood. But working with these
           | models and narrowing it down is notoriously like trying to
           | nail a jelly to the wall. If I was able to do this in a
           | cursory check, what else is there? More than one research
           | paper in this area is wrong from the start.
        
       | diwank wrote:
        | I think that while the findings are interesting, the title
        | "Vision language models are blind" is hyperbolic and misleading
        | about the findings. It is true that VLMs have a very different
        | way of sensing and working with image inputs: at low resolutions
        | each image is split up into blocks and then mapped into discrete
        | tokens. They really don't have access to fine-grained details
        | because this mapping is quite lossy. So, in that sense, these
        | results completely make sense and were not surprising to me, but
        | "blind" is a strong implication and the work isn't really able to
        | back that up.
       | 
       | Take the first example of the four 2D line plots for instance. I
       | asked Sonnet 3.5 for an answer and retried it 5 times and 2 out
       | of 5 times, it said something like this:
       | 
       |  _Me_ :
       | 
       | here's an image with four 2-D line plots. Can you tell me A) how
       | many lines are in each plot? and B) how many times do they
       | intersect each other in each plot?
       | 
       | Write your answers as: Plot 1 (leftmost): ...
       | 
       |  _Sonnet_ :
       | 
       | I'll analyze each plot from left to right and provide the
       | requested information:
       | 
       | Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0
       | intersections (the lines do not cross)
       | 
       | Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the
       | lines cross once near the right edge)
       | 
       | Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the
       | lines cross once near the center)
       | 
       | Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2
       | intersections (the lines cross twice, once near the left side and
       | once near the right side)
       | 
       | Each plot consistently shows two lines, one red and one blue,
       | with varying intersection patterns across the four plots.
       | 
       | ---
       | 
       | Which isn't completely correct (the 3rd plot has 2 intersections)
       | but still pretty good and the other 3 times, it said plot 3 has 1
       | intersection.
        
       | Jack000 wrote:
       | This is kind of the visual equivalent of asking an LLM to count
       | letters. The failure is more related to the tokenization scheme
       | than the underlying quality of the model.
       | 
       | I'm not certain about the specific models tested, but some VLMs
       | just embed the image modality into a single vector, making these
       | tasks literally impossible to solve.
        
       | JeremyHerrman wrote:
       | VLMs so far have never been good at counting objects or spatial
       | relationships (e.g. the coffee is to the right of the microwave).
       | 
       | There are ways to help the VLM out - Set of Marks [0] from
       | Microsoft being the most prominent, which uses segmentation to
       | outline and label sections of the image before sending to the
       | VLM.
       | 
       | Providing "speakable" labels to regions helps ground the visual
       | abilities of VLMs and is why in this paper the performance is so
       | much better when words are present in the grid for "Task 6:
       | Counting the rows and columns of a grid"
       | 
       | 0: https://github.com/microsoft/SoM
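        | 
        | A toy version of the marking step (not the actual SoM pipeline,
        | which gets its regions from a segmentation model such as SAM;
        | here the boxes and file names are made up for illustration):
        | 
        |     # Stamp numbered, "speakable" labels onto regions before
        |     # the image is sent to the VLM.
        |     from PIL import Image, ImageDraw
        | 
        |     def add_marks(path, boxes):  # boxes: (x0, y0, x1, y1)
        |         img = Image.open(path).convert("RGB")
        |         draw = ImageDraw.Draw(img)
        |         for i, (x0, y0, x1, y1) in enumerate(boxes, 1):
        |             draw.rectangle((x0, y0, x1, y1),
        |                            outline="red", width=3)
        |             draw.text((x0 + 4, y0 + 4), str(i), fill="red")
        |         return img
        | 
        |     add_marks("kitchen.jpg",
        |               [(40, 60, 200, 220), (220, 80, 380, 240)]
        |               ).save("kitchen_marked.jpg")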
        
         | jazzyjackson wrote:
         | I didn't know counting objects was a problem. That's pretty
         | ironic because the very first implementation of a neural net
         | (AFAIK) is the numa-rete artificial retina developed at the
         | Biological Computer Lab [0] circa 1960. It was a parallel
          | analog computer composed of "neurons", each with a photocell,
         | that could be arranged in a grid and count "the number of
         | objects independent of their size, location and form, and
         | independent of strength of illumination" [1] - this paper may
         | be of interest to those in the field, "Perception of Form in
         | Biological and Man Made Systems" Heinz Von Foerster 1962
         | 
         | [0]
         | https://distributedmuseum.illinois.edu/exhibit/biological_co...
         | 
         | [1] https://sites.evergreen.edu/arunchandra/wp-
         | content/uploads/s...
        
           | empath75 wrote:
           | It really shouldn't be surprising that these models fail to
           | do anything that _they weren't trained to do_. It's trivially
           | easy to train a model to count stuff. The wild thing about
           | transformer based models is that their capabilities are _way_
           | beyond what you'd expect from token prediction. Figuring out
           | what their limitations actually are is interesting because
           | nobody fully knows what their limitations are.
        
             | jazzyjackson wrote:
             | I agree that these open ended transformers are way more
             | interesting and impressive than a purpose built count-the-
             | polygons model, but if the model doesn't generalize well
             | enough to figure out how to count the polygons, I can't be
             | convinced that they'll perform usefully on a more
             | sophisticated task.
             | 
             | I agree this research is really interesting, but I didn't
             | have an a priori expectation of what token prediction could
             | accomplish, so my reaction to a lot of the claims and
             | counterclaims of this new tech is that it's good at fooling
             | people and giving plausible but baseless results. It makes
             | for good research but dangerous in the hands of a market
             | attempting to exploit it.
        
         | Eisenstein wrote:
          | Vision models use CLIP or something similar, which has no
         | conception of anything specific in the image. It sees
         | embeddings which correlate similarly to text embeddings. Take
         | an image then describe it 'there are birds sitting on a power
         | line in front of a blue sky with some clouds', get the
         | embeddings from that and the embeddings from that picture and
         | line them up. If you ask if there are birds in it, it would
         | know, but not how many, unless it was common to describe the
         | number of birds sitting on things and it happened often enough
         | that the number counted was the number in the image
         | descriptions it trained on. If you want to count objects you
         | want something like YOLO.
        
           | JeremyHerrman wrote:
           | VLMs like PaliGemma and Florence-2 support object detection
           | and segmentation, so it's becoming more common to have YOLO
           | like capabilities built into VLMs.
           | 
           | Another benefit of VLMs which support object detection is
           | that they are open vocabulary, meaning you don't have to
           | define the classes ahead of time. Additionally fine tuning
           | tends to keep the previous detection capabilities instead of
           | erasing all previous classes like fine tuning YOLO.
        
       | GaggiX wrote:
       | Well, all the models (especially Claude 3.5 Sonnet) seem to
       | perform much better than random, so they are clearly not blind.
       | The only task where Claude Sonnet 3.5 does not perform better
       | than random is the one where you have to follow many different
       | paths (the ones where the answer from A to C is 3), something
       | that would take me several seconds to solve.
       | 
        | I have the feeling that they first chose the title of the paper
        | and then ran the evaluation on the new Claude 3.5 Sonnet on these
        | abstract images.
       | 
       | >their vision is, at best, like that of a person with myopia
       | seeing fine details as blurry
       | 
       | This also makes no sense, since the images evaluate the abstract
       | capabilities of the models, not their eyesight.
        
         | randcraw wrote:
         | OK. They're _legally_ blind.
        
           | GaggiX wrote:
           | This really has nothing to do with vision impairment.
        
       | iamleppert wrote:
        | This could easily be fixed with training and fine-tuning. Simply
        | generate 100,000 examples or so, and train with ground truth for
        | however long you want, and it's a solved problem.
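        | 
        | For the two-circle task, for instance, a minimal generator
        | (matplotlib; the label falls straight out of the geometry, and
        | the file names are illustrative) is only a few lines:
        | 
        |     # Labeled two-circle images: overlap iff the distance
        |     # between centers is less than the sum of the radii.
        |     import random
        |     import matplotlib.pyplot as plt
        | 
        |     def make_sample(path):
        |         r1 = random.uniform(0.5, 1.5)
        |         r2 = random.uniform(0.5, 1.5)
        |         x, y = random.uniform(-3, 3), random.uniform(-3, 3)
        |         overlap = (x * x + y * y) ** 0.5 < r1 + r2
        |         fig, ax = plt.subplots(figsize=(2, 2))
        |         ax.add_patch(plt.Circle((0, 0), r1, fill=False))
        |         ax.add_patch(plt.Circle((x, y), r2, fill=False))
        |         ax.set_xlim(-5, 5); ax.set_ylim(-5, 5)
        |         ax.set_aspect("equal"); ax.axis("off")
        |         fig.savefig(path); plt.close(fig)
        |         return overlap
        | 
        |     labels = {f"c{i}.png": make_sample(f"c{i}.png")
        |               for i in range(100_000)}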
        
         | kristjansson wrote:
         | Solved for this benchmark... and at what cost to the rest of
         | the system?
         | 
         | These tasks are interesting because they're existence proofs of
         | generalization failure. Like the haystack problem, direct
         | solutions here are much less interesting than structural
         | improvements that address the class of failure.
        
         | imtringued wrote:
         | Ok, but most of the data is just captions for images. You're
         | going to have to invest some time into building this dataset at
         | your own expense.
        
       | _vaporwave_ wrote:
       | It's really interesting that there's a huge performance
       | discrepancy between these SOTA models. In the Olympic logo
       | example, GPT-4o is below the baseline accuracy of 20% (worse than
       | randomly guessing) while Sonnet-3.5 was correct ~76% of the time.
       | 
       | Does anyone have any technical insight or intuition as to why
       | this large variation exists?
        
       | londons_explore wrote:
       | Could some of the "wrong" answers be the LLM attempting to give
        | an explanation rather than the answer, e.g. instead of answering
        | 'X', the LLM answers 'The letter is partially hidden by the oval,
        | so cannot be certain, but it appears to be the English letter X'.
       | 
       | The scoring criteria would rank this answer as 'T', which is
       | wrong.
        
       | simonw wrote:
       | I've been generally frustrated at the lack of analysis of vision
       | LLMs generally.
       | 
       | They're clearly a very exciting category of technology, and a
       | pretty recent one - they only got good last October with GPT-4
       | Vision, but since then we've had more vision models from
       | Anthropic and Google Gemini.
       | 
        | There's so much more information out there about text prompting
       | compared to image prompting. I feel starved for useful
       | information about their capabilities: what are vision models good
       | and bad at, and what are the best ways to put them to work?
        
         | r2_pilot wrote:
         | Why not use them yourself if you have access? I have been using
          | Claude 3.5 Sonnet for gardening recently, and while it's not
          | perfect (and can be a little blind unless you tell it to focus
          | on a specific thing), it's helped me understand how to keep my
          | plants alive in some challenging conditions (for me; this is my
          | second or third attempt at gardening, so it's all challenging
          | lol). But just experiment with it and see where the
          | capabilities lie. I do agree that certain classes of visual
         | data are challenging for it.
        
           | simonw wrote:
           | I've used them a bunch. I want to learn from other people's
           | experiences as well.
           | 
           | Some of my notes so far:
           | 
           | - https://simonwillison.net/2024/Apr/17/ai-for-data-
           | journalism... - my datasette-extract plugin, for structured
           | data from both text and images
           | 
           | - https://simonwillison.net/2024/Apr/17/ai-for-data-
           | journalism... - where they failed to extract data from a
           | handwritten scanned document in various weird ways
           | 
           | - https://simonwillison.net/2024/Feb/21/gemini-pro-video/
           | talks about video inputs to Gemini Pro (which are actually
           | image inputs, it splits them up to one frame per second)
        
         | simonw wrote:
         | Anthropic have some interesting cookbook examples that provide
         | advice on using their multimodal models here:
         | https://github.com/anthropics/anthropic-cookbook/tree/main/m...
         | 
         | I've assembled a bunch more notes here:
         | https://simonwillison.net/tags/vision-llms/
        
       | mglz wrote:
        | I taught some Computational Geometry courses, and efficiently
        | computing the intersections of N line segments is not as
        | straightforward as you might initially think. Since somewhere
        | some computation must be done to recognize this, and LLMs are not
        | specifically trained for this task, it's not surprising they
        | struggle.
       | 
       | In general, basic geometry seems under-explored by learning.
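        | 
        | For reference, the brute-force check is short (pairwise
        | orientation tests, O(n^2) in the number of segments; a sweep
        | line such as Bentley-Ottmann is what gets you to
        | O((n + k) log n)):
        | 
        |     # Count proper crossings among line segments.
        |     # Ignores touching/collinear cases for brevity.
        |     def ccw(a, b, c):
        |         return ((b[0] - a[0]) * (c[1] - a[1])
        |                 - (b[1] - a[1]) * (c[0] - a[0]))
        | 
        |     def cross(p1, p2, p3, p4):
        |         return (ccw(p3, p4, p1) * ccw(p3, p4, p2) < 0
        |                 and ccw(p1, p2, p3) * ccw(p1, p2, p4) < 0)
        | 
        |     def count_crossings(segs):
        |         return sum(cross(*segs[i], *segs[j])
        |                    for i in range(len(segs))
        |                    for j in range(i + 1, len(segs)))
        | 
        |     print(count_crossings([((0, 0), (2, 2)),
        |                            ((0, 2), (2, 0))]))  # 1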
        
       | orbital-decay wrote:
       | That's not anything like "myopia", though.
       | 
       | FWIW I tried the line intersection and the circled letter test
       | from the article with CogVLM (which is far from reaching the
       | current SotA) and it correctly passed both. I haven't tried it
       | with Sonnet/4o but I suspect there might be something wrong with
       | how the author did their tests. Don't get me wrong, but too many
       | "the model can't do that" claims ended up with demonstrations of
       | the model doing exactly that...
        
       | nyxtom wrote:
       | I wonder how well Alpha Geometry would do on this
        
         | nybsjytm wrote:
         | AlphaGeometry is a hyper-specific system trained to add
         | auxiliary geometric objects, like extra lines, to existing
         | Euclidean geometry configurations. These prompts are not even
         | sensible inputs to AlphaGeometry.
        
       | hi_dang_ wrote:
       | I was hoping that someone in the comments talking the paper down
       | would have published a paper or have had relevant publications of
       | their own to point to. You know, meet the lads halfway sort of
       | thing.
       | 
       | So what I'm left with to judge instead is anonymous online
       | commenters vs. the publication of 2 prestigious universities.
       | Whose word do I take on this? Decisions, decisions.
       | 
        | You can swap "LM" out for "Web3", "NFT", or "Crypto" in this
        | case.
        
         | warkdarrior wrote:
         | > I'm left with [...] is anonymous online commenters vs. the
         | publication of 2 prestigious universities. Whose word do I take
         | on this?
         | 
         | Maybe you need to judge the contents of those online comments
         | and the contents of the publication, instead of relying on
         | argument from authority.
        
       | vessenes wrote:
       | A few comments below talk about how tokenizing images using stuff
       | like CLIP de-facto yields blurry image descriptions, and so these
       | are 'blind' by some definitions. Another angle of blurring not
       | much discussed is that the images are rescaled down; different
       | resolutions for different models. I wouldn't be surprised if
        | Sonnet 3.5 had a higher-res base image it feeds into the model.
       | 
       | Either way, I would guess that we'll need new model architectures
       | for multimodal to get really good at some of this, and even then
       | some of these tasks are adjacent to things that we know LLMs are
       | already bad at (numeric logic, for instance).
       | 
       | As context lengths get longer, devoting more tokens to the image
       | tokenization should help a bit here as well. Anyway, I'd
       | anticipate next year we'd see 80s and 90s for most of these
       | scores with next gen models.
        
         | imtringued wrote:
          | The problem with the current crop of projectors such as LLaVA
          | is that, as far as I know, they do not take the previous
          | conversation into account. You only really get zero-shot
          | responses. This
         | means that you cannot steer the model towards paying attention
         | to specific instruction related details. The projector simply
         | creates a token representation of the visuals (not necessarily
         | human language tokens) and the LLM just processes that as
         | usual.
        
           | vessenes wrote:
            | The original GPT-4 did this too; it had almost no memory
            | before or after the provided image. I haven't tested GPT-4o on
            | this directly, but my feeling from casual usage is that it's
            | better.
           | 
           | I do think some of these thin line drawings are likely extra
           | hard to tokenize depending on the image scaling sizes for
           | tokenization. I'd wager thicker lines would help, although
           | obviously not all of this is just 'poor tokenization'.
        
       | spullara wrote:
       | in other news, vision models are bad at things they aren't
       | trained to do
        
       | akavi wrote:
       | Speaking as someone with only a tenuous grasp of how VLMs work,
        | this naively feels like a place where the "embodiment" folks
       | might have a point: Humans have the ability to "refine" their
       | perception of an image iteratively, focusing in on areas of
       | interest, while VLMs have to process the entire image at the same
       | level of fidelity.
       | 
       | I'm curious if there'd be a way to emulate this (have the visual
       | tokens be low fidelity at first, but allow the VLM to emit tokens
       | that correspond to "focusing" on a region of the image with
       | greater resolution). I'm not sure if/how it's possible to
       | performantly train a model with "interactive" data like that,
       | though
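        | 
        | One cheap approximation that doesn't need a new architecture is a
        | tool loop: send a downscaled overview, let the model name a
        | region of interest, then re-send that crop at full resolution (a
        | sketch only; ask_vlm stands in for whichever chat API is used):
        | 
        |     # Crude "foveation": overview first, then a full-resolution
        |     # crop of whatever region the model asks to inspect.
        |     from PIL import Image
        | 
        |     def crop_region(path, box, out_size=(512, 512)):
        |         # box = (x0, y0, x1, y1) in original-pixel coordinates
        |         return Image.open(path).crop(box).resize(out_size)
        | 
        |     # overview = Image.open("chart.png").resize((512, 512))
        |     # box = ask_vlm("Which region should I zoom into? "
        |     #               "Reply as x0,y0,x1,y1", overview)
        |     # detail = crop_region("chart.png", box)
        |     # answer = ask_vlm("How many crossings are in this crop?",
        |     #                  detail)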
        
         | efskap wrote:
         | Isn't this the attention mechanism, the reason we're using
         | transformers for these things? Maybe not greater resolution per
         | se, but focusing on a region with greater neural connectivity
        
           | akavi wrote:
           | Ah, good point!
           | 
           | But the model is downstream of the "patch" tokenization, so
           | the cut-down in resolution (compression) of the image has
           | already occurred _prior_ to the point where the model can
           | direct greater  "attention".
           | 
           | I think the synthesis is that I'm proposing a per-pixel
           | tokenization with a transformer block whose purpose is to
           | output information at a compression level "equivalent" to
           | that of the patch tokens (is this what an autoencoder is?),
           | but where the attention vector is a function of the full
           | state of the LLM (ie, inclusive of the text surrounding the
           | image)).
           | 
           | Naively, I'd think a layer like this that is agnostic to the
           | LLM state needn't be any more computationally costly than the
           | patching computation (both are big honks of linear algebra?),
           | but idk how expensive the "full context attention" feedback
           | is...
           | 
           | (I apologize to anyone who actually understands transformers
           | for my gratuitous (ab|mis)use of terminology)
        
         | Brechreiz wrote:
         | >Humans have the ability to "refine" their perception of an
         | image iteratively
         | 
         | That's not related to embodied cognition.
        
           | akavi wrote:
           | Is embodied cognition not at least in part about
           | interactivity? I perform action (emit tokens) and receive
           | feedback (non-self-generated tokens)
        
         | kromem wrote:
         | Lots and lots of eye tracking data paired with what was being
         | looked at in order to emulate human attention processing might
         | be one of the lower hanging fruits for improving it.
        
       | tantalor wrote:
       | Are the "random-baseline accuracy" numbers correct?
       | 
       | In the "Two circles" test, do they really have 50% chance of
       | overlapping? I think this comes from "Distances between circle
        | perimeters: -0.15 to 0.5 times the diameter", but it doesn't say
        | what distribution they use.
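        | 
        | If the gap were drawn uniformly from that range, overlapping
        | (negative gap) cases would be about 0.15 / 0.65 ~ 23%, not 50%.
        | A quick Monte Carlo under that assumed uniform distribution:
        | 
        |     # Overlap fraction if the perimeter gap is uniform on
        |     # [-0.15, 0.5] x diameter (assumed; the paper doesn't say).
        |     import random
        | 
        |     n = 1_000_000
        |     hits = sum(random.uniform(-0.15, 0.5) < 0 for _ in range(n))
        |     print(hits / n)  # ~0.23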
        
         | jdlshore wrote:
         | They asked the AI a question with a yes/no response. If the AI
         | chose randomly, it would be correct 50% of the time. That's
         | what "random baseline accuracy" means.
        
       | jeromeparadis wrote:
       | One use-case I always try is to have an AI try to read a school
        | calendar image where days off or days of interest are
        | highlighted using a legend, i.e. days with a square, circle, or
        | triangle, or a different color, etc.
        | 
        | When asked about specific days of interest for the school
        | year, AIs always struggle. They get some days right but forget
        | some or confabulate new days. They fare a bit better if you remove
       | some of the noise and give them only a picture of a month but
       | even then, it's unreliable.
        
       | verbalstoner wrote:
       | It's virtually impossible to take a paper seriously when the
       | title has an emoji.
        
       | axblount wrote:
       | Would you say they have _Blindsight_?
        
       | pjs_ wrote:
       | I don't like this paper for the following reasons:
       | 
       | - The language is unnecessarily scathing
       | 
       | - They repeatedly show data where the models are getting things
       | _right_ 70, 80, 90% of the time, and then show a list of what
       | they call  "qualitative samples" (what does "qualitative" mean?
       | "cherry-picked"?) which look very bad. But it got the answer
       | right 70/80/90% of the time! That's hardly "blind"...
       | 
       | - Various of the tasks hinge on the distinction between two
       | objects "exactly touching" vs. "very nearly touching" vs. "very
       | slightly overlapping", a problem which (i) is hard for humans and
       | (ii) is particularly (presumably deliberately) sensitive to
       | resolution/precision, where we should not be surprised that
       | models fail
       | 
       | - The main fish-shaped example given in task 1 seems genuinely
       | ambiguous to me - do the lines "intersect" once or twice? The
       | tail of the fish clearly has a crossing, but the nose of the fish
       | seems a bit fishy to me... is that really an intersection?
       | 
       | - AFAIC deranged skepticism is just as bad as deranged hype, the
       | framing here is at risk of appealing to the former
       | 
       | It's absolutely fair to make the point that these models are not
       | perfect, fail a bunch of the time, and to point out the edge
       | cases where they suck. That moves the field forwards. But the
       | hyperbole (as pointed out by another commenter) is very annoying.
        
         | neuronet wrote:
         | To be fair, the paper has an emoji in the _title_, so I
          | wouldn't read it as a particularly serious academic study as
          | much as the equivalent of the Gawker of AI
         | research. It is a "gotcha" paper that exploits some blind spots
         | (sorry) that will easily be patched up with a few batches of
         | training. I do think it highlights the lack of AGI in these
         | things, which some people lacking situational awareness might
         | need to see.
        
       | cpill wrote:
       | I wonder how they would score if they used all 4 models and took
       | a majority vote...?
        
       | aaroninsf wrote:
       | The title for this page and argument should be qualified with the
       | specific generation of tools.
       | 
        | That's in the abstract, but it's bad not to be specific, in this
        | case because current public-facing models are WIWEB: the worst
        | it will ever be.
       | 
       | And there are trillion-dollar prizes at stake, so, improvement is
       | happening as quickly as it possibly can.
        
       | make3 wrote:
       | Hugged to death from my perspective. Here is a backup:
       | https://archive.ph/kOE3Q
        
         | simonw wrote:
         | That's weird - GitHub Pages serves static content and rarely
         | (in my experience) fails to load.
        
       | jetrink wrote:
       | I had a remarkable experience with GPT-4o yesterday. Our garage
       | door started to fall down recently, so I inspected it and found
       | that our landlord had installed the wire rope clips incorrectly,
       | leading to the torsion cables losing tension. I didn't know what
       | that piece of hardware was called, so I asked ChatGPT and it
       | identified the part as I expected it to. As a test, I asked if
       | there was anything notable about the photo. ChatGPT correctly
       | identified that the cables were installed backwards, with the
       | side of the cable that was (previously) under tension on top of
       | the slack end, instead of sandwiched securely in the middle. To
       | diagnose that requires tracing the cable through space and
       | inferring which end is under tension from the geometry, though I
       | can't rule out an educated guess.
       | 
       | What was really remarkable though was that it failed to notice
       | that one of the two nuts was obviously missing, even after I told
       | it there was a second problem with the installation.
       | 
       | Screenshot: https://imgur.com/a/QqCNzOM
        
         | sfink wrote:
         | A _human_ would need to trace the cable. An LLM may just be
         | responding based on (1) the fact that you 're asking about the
         | clip in the first place, and that commonly happens when there's
         | something wrong; and (2) that this is a very common failure
         | mode. This is supported by it bringing up the "never saddle a
         | dead horse" mnemonic, which suggests the issue is common.
         | 
         | After you fix it, you should try asking the same questions!
        
       | nmca wrote:
       | please use this opportunity to reflect on whether ARC measures
       | reasoning skills :)
        
       | gnutrino wrote:
       | My guess is that the systems are running image recognition
       | models, and maybe OCR on images, and then just piping that data
       | as tokens into an LLM. So you are only ever going to get results
        | as good as existing image models with the results filtered
       | through an LLM.
       | 
       | To me, this is only interesting if compared with results of image
       | recognition models that can already answer these types of
       | questions (if they even exist, I haven't looked).
       | 
       | Maybe the service is smart enough to look at the question, and
       | then choose one or more models to process the image, but not sure
       | as I can't find anything on their sites about how it works.
        
         | Eisenstein wrote:
         | > My guess is that the systems are running image recognition
         | models
         | 
          | Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an
         | idea of how they work.
        
         | simonw wrote:
         | That's not how they work. The original GPT-4 paper has some
         | detail: https://cdn.openai.com/papers/gpt-4.pdf
         | 
         | Or read up on PaliGemma: https://github.com/google-
         | research/big_vision/blob/main/big_...
        
       | nichohel wrote:
       | Vision language models are blind because they lack the Cartesian
       | Theater, which you and I have. Which you and I say we have.
        
       | viraptor wrote:
       | I love some of the interpretations there. For example "Fig. 10:
       | Only Sonnet-3.5 can count the squares in a majority of the
       | images.", when that model simply returns "4" for every question
       | and happens to be right.
        
       | jackblemming wrote:
       | Ask it to draw any of those things and it can.
        
       | mkoubaa wrote:
       | They interact with pixel buffers as a mathematical array. To call
        | them blind is to confuse what they are doing with the
        | experience of sight...
        
       ___________________________________________________________________
       (page generated 2024-07-10 23:00 UTC)