[HN Gopher] Vision language models are blind
___________________________________________________________________
Vision language models are blind
Author : taesiri
Score : 245 points
Date : 2024-07-10 13:35 UTC (9 hours ago)
(HTM) web link (vlmsareblind.github.io)
(TXT) w3m dump (vlmsareblind.github.io)
| taesiri wrote:
| This paper examines the limitations of current vision-based
| language models, such as GPT-4 and Sonnet 3.5, in performing low-
| level vision tasks. Despite their high scores on numerous
| multimodal benchmarks, these models often fail on very basic
| cases. This raises a crucial question: are we evaluating these
| models accurately?
| rezaghanbari1 wrote:
| Some of these samples are shocking. How do these models answer
| chart-based questions when they can't even count the
| intersections between two lines?
| RodgerTheGreat wrote:
| Same way they answer any question: piece together a
| statistically probable sequence of words to follow the prompt.
| All they know about an image is a handful of words a classifier
| might choose to describe it. If those words have nothing to do
| with the question being asked, they can't nudge the model in
| the general direction of a correct answer, so it's a crapshoot -
| even more so than usual.
| imtringued wrote:
| The dataset most likely contains chart descriptions that
| describe the raw data, but not the visual interactions of the
| individual pixels.
| dheera wrote:
| Current approaches to multi-modal models work on embeddings and
| tokenizations of images, which is the fundamental problem: you
| are feeding blurry, non-precise data into the model. Yes, they
| are blind because of exactly this.
|
| An embedding isn't conceptually that much different from feeding
| a 1024-word description of an image instead of the actual image.
|
| At the moment compute power isn't good enough to feed high-res
| pixel data into these models, unless we discover a vastly
| different architecture, which I am also convinced likely exists.
| jayd16 wrote:
| Doesn't Gemini have a 2 million token limit for exactly this?
| diwank wrote:
| The number of tokens _per image_ is actually fairly small,
| ranging from 85 to ~500.
| visarga wrote:
| > An embedding isn't conceptually that much different from
| feeding a 1024-word description of an image instead of the
| actual image.
|
| An embedding needs fewer words. You can embed individual words,
| phrases like a whole prompt, and longer paragraphs. You don't
| need 1024 words for a text embed. For example, a famous library
| is called Sentence-BERT (SBERT).
|
| When you embed images, on the other hand, you cut them up into
| little squares on the order of 32x32 px and embed each of them
| separately. ChatGPT uses something like 250 tokens for smaller
| images. So a smaller image costs about as much as 200 words if
| represented graphically, and maybe far fewer words if you embed
| a text description of it.
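| A rough illustration of that arithmetic (this is generic ViT-style
| patch tokenization, not any particular vendor's scheme, and the
| sizes below are just common configurations, not ChatGPT's):
|
|     def patch_tokens(image_px: int, patch_px: int) -> int:
|         per_side = image_px // patch_px
|         return per_side * per_side
|
|     print(patch_tokens(336, 14))  # 576 image tokens (ViT-L/14 at 336 px)
|     print(patch_tokens(224, 32))  # 49 tokens with a coarser 32 px grid
|
| Either way, a whole image ends up as at most a few hundred tokens.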
| dheera wrote:
| > needs fewer words
|
| Yes I'm aware of this, and work in ML -- the thing is
| embeddings are not designed for faithful image
| reconstruction, and aren't even trained that way. You can
| easily find two images that have substantially similar CLIP
| (or whatever) embeddings that are visually very different. If
| you query the LLM about that difference, the LLM wouldn't
| even have the information to differentiate answers for the
| two images if you only supply it with the embedding.
|
| On the other hand, SDXL autoencoder latents passed into an
| LLM alongside the embedding _might_ be a step up from just an
| image embedding, since they are designed for image
| reconstruction, but I don't have access to the compute or
| data resources to attempt training this.
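| For anyone who wants to poke at this themselves, a minimal sketch
| using the public CLIP weights on Hugging Face (the two file names
| are placeholders; the point is that a high cosine similarity here
| does not mean the images look alike to a human):
|
|     from PIL import Image
|     import torch
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|     images = [Image.open("image_a.png"), Image.open("image_b.png")]
|     inputs = processor(images=images, return_tensors="pt")
|     with torch.no_grad():
|         feats = model.get_image_features(**inputs)
|     feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
|     print((feats[0] @ feats[1]).item())  # cosine similarity in [-1, 1]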
| cs702 wrote:
| Wow, that is _embarrassingly bad performance_ for current SOTA
| models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, Sonnet-3.5), which are
| advertised and sold as being able to understand images, e.g., for
| guiding the blind or tutoring children in geometry!
|
| The tasks at which they fail are ridiculously simple for human
| beings, including, for example:
|
| * counting the number of times two lines intersect;
|
| * detecting whether two circles overlap;
|
| * selecting which letter is being circled in a word;
|
| * counting the number of circles in an Olympic-like logo.
|
| This should be at the top of the front page.
| tensor wrote:
| I don't see how this is "embarrassing" in the slightest. These
| models are not human brains, and the fact that people equate
| them with human brains is an embarrassing failure of the humans
| more than anything about the models.
|
| It's entirely unsurprising that there are numerous cases that
| these models can't handle that are "obvious to humans." Machine
| learning has had this property since its invention and it's a
| classic mistake humans make dealing with these systems.
|
| Humans assume that because a machine learning model has above
| human accuracy on task X that it implies that it must also have
| that ability at all the other tasks. While a human with amazing
| ability at X would indeed have amazing abilities at other
| tasks, this is not true of machine learning models. The opposite
| thinking is also wrong: that because the model can't do well on
| task Y, it must be unreliable and its ability on task X is
| somehow an illusion and not to be trusted.
| cs702 wrote:
| It is embarrassingly, shockingly bad, because these models
| are _advertised_ and _sold_ as being capable of understanding
| images.
|
| Evidently, all these models still fall short.
| kristjansson wrote:
| It's surprising because these models are pretty ok at some
| vision tasks. The existence of a clear failure mode is
| interesting and informative, not embarrassing.
| knowaveragejoe wrote:
| Not only are they capable of understanding images (the kind
| people might actually feed into such a system -
| photographs), but they're pretty good at it.
|
| Modern robots would struggle to fold socks and put them in
| a drawer, but they're great at making cars.
| pixl97 wrote:
| I mean, with some of the recent demos, robots have got a
| lot better at folding stuff and putting it up. Not saying
| it's anywhere close to human level, but it has taken a
| pretty massive leap from being a joke just a few years
| ago.
| startupsfail wrote:
| Humans are also shockingly bad on these tasks. And guess
| where the labeling was coming from...
| simonw wrote:
| I see this complaint about LLMs all the time - that they're
| advertised as being infallible but fail the moment you give
| them a simple logic puzzle or ask for a citation.
|
| And yet... every interface to every LLM has a "ChatGPT can
| make mistakes. Check important info." style disclaimer.
|
| The hype around this stuff may be deafening, but it's often
| not entirely the direct fault of the model vendors
| themselves, who even put out lengthy papers describing
| their many flaws.
| jazzyjackson wrote:
| There's evidently a large gap between what researchers
| publish, the disclaimers a vendor makes, and what gets
| broadcast on CNBC, no surprise there.
| jampekka wrote:
| A bit like how Tesla Full Self-Driving is not to be used
| as self-driving. Or any other small print. Or ads in
| general. Lying by deliberately giving the wrong
| impression.
| mrbungie wrote:
| These models are marketed as being able to guide the blind or
| tutor children using direct camera access.
|
| Promoting those use cases and models failing in these ways is
| irresponsible. So, yeah, maybe the models are not embarrassing,
| but the hype definitely is.
| cs702 wrote:
| _> Promoting those use cases and models failing in these
| ways is irresponsible._
|
| Yes, _exactly_.
| scotty79 wrote:
| You'd expect them to be trained on simple geometry, since you
| can create an arbitrarily large synthetic training set for that.
| sfink wrote:
| Well said.
|
| It doesn't matter how they are marketed or described or held
| up to some standard generated by wishful thinking. And it
| especially doesn't matter what it would mean if a human were
| to make the same error.
|
| It matters what they are, what they're doing, and how they're
| doing it. Feel free to be embarrassed if _you_ are claiming
| they can do what they can't and are maybe even selling them
| on that basis. But there's nothing embarrassing about their
| current set of capabilities. They are very good at what they
| are very good at. Expecting those capabilities to generalize
| as they would if they were human is like getting embarrassed
| that your screwdriver can't pound in a nail, when it is ever
| so good at driving in screws.
| insane_dreamer wrote:
| > is an embarrassing failure of the humans more than anything
| about the models
|
| No, it's a failure of the companies who are advertising them
| as capable of doing something which they are not (assisting
| people with low vision)
| simonw wrote:
| But they CAN assist people with low vision. I've talked to
| someone who's been using a product based on GPT-4o and
| absolutely loves it.
|
| Low vision users understand the limitations of
| accessibility technology better than anyone else. They will
| VERY quickly figure out what this tech can be used for
| effectively and what it can't.
| sweezyjeezy wrote:
| Entertaining, but I think the conclusion is way off.
|
| > their vision is, at best, like that of a person with myopia
| seeing fine details as blurry
|
| is a crazy thing to write in an abstract. Did they try to probe
| that hypothesis at all? I could (well actually I can't) share
| some examples from my job of GPT-4v doing some pretty difficult
| fine-grained visual tasks that invalidate this.
|
| Personally, I rate this paper [1], which makes the argument that
| these huge GenAI models are pretty good at things - _assuming
| they have seen a LOT of that type of data during training_
| (which is true of a great many things). If you make up tasks like
| this, then yes, they can be REALLY bad at them, and initial
| impressions of AGI get harder to justify. But in practice, we aren't just
| making up tasks to trip up these models. They can be very
| performant on some tasks and the authors have not presented any
| real evidence about these two modes.
|
| [1] https://arxiv.org/abs/2404.04125
| diwank wrote:
| Yeah I think their findings are def interesting but the title
| and the strong claims are a tad hyperbolic.
| SomaticPirate wrote:
| There are quite a few "ai apologists" in the comments but I
| think the title is fair when these models are marketed towards
| low vision people ("Be my eyes"
| https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent
| to human vision. These models are implied to be human level
| equivalents when they are not.
|
| This paper demonstrates that there are still some major gaps
| where simple problems confound the models in unexpected ways.
| This is important work to elevate; otherwise people may start
| to believe that these models are suitable for general
| application when they still need safeguards and copious
| warnings.
| sweezyjeezy wrote:
| The paper I linked should hopefully mark me out as far from
| an AI apologist, it's actually really bad news for GenAI if
| correct. All I mean to say is the clickbait conclusion and
| the evidence do not match up.
| Melomololotolo wrote:
| We have started the era of AI.
|
| It really doesn't matter how good current LLMs are.
|
| They have been good enough to start this era.
|
| And no, it's not and never has been just LLMs. Look what
| Nvidia is doing with ML.
|
| Whisper: a huge advance. Segment Anything: again huge.
| AlphaFold 2: again huge.
|
| All the robot announcements -> huge.
|
| I doubt we will reach AGI just through LLMs. We will reach
| AGI through multi-modal models, mixture of experts, some kind
| of feedback loop, etc.
|
| But the stone has started to roll.
|
| And you know, I prefer to hear about AI advances for the
| next 10-30 years. That's a lot better than the crypto shit
| we had for the last 5 years.
| pixl97 wrote:
| Yea, really, if you look at human learning/seeing/acting there
| is a feedback loop that an LLM, for example, isn't able to
| complete and train on.
|
| You see an object. First you have to learn how to control all
| your body functions to move toward it and grasp it. This
| teaches you about the 3 dimensional world and things like
| gravity. You may not know the terms, but it is baked in your
| learning model. After you get an object you start building a
| classification list "hot", "sharp", "soft and fuzzy",
| "tasty", "slick". Your learning model builds up a list of
| properties of objects and "expected" properties of objects.
|
| Once you have this 'database' you create as a human, you can
| apply the logic to achieve tasks. "Walk 10 feet forward, but
| avoid the sharp glass just to the left". You have to have
| spatial awareness, object awareness, and prediction ability.
|
| Models 'kind of' have this, but it's seemingly haphazard, kind
| of like a child that doesn't know how to put all the pieces
| together yet. I think a lot of embodied robot testing where
| the embodied model feeds back training to the LLM/vision
| model will have to occur before this is even somewhat close
| to reliable.
| Melomololotolo wrote:
| Ah yes the blind person who constantly needs to know if two
| lines intersect.
|
| Let's just ignore what a blind person normally needs to know.
|
| You know what blind people ask? Sometimes their daily routine
| is broken because there is some type of construction and
| models can tell you this.
|
| Sometimes they need to read a basic sign and models can do
| this.
|
| Those models help people already and they will continue to
| get better.
|
| I'm not sure if I'm more frustrated how condescending the
| authors are or your ignorance.
|
| Valid criticism doesn't need to be shitty
| shagie wrote:
| As an aside... from 2016 this is what was a valid use case
| for a blind person with an app.
|
| Seeing AI 2016 Prototype - A Microsoft research project -
| https://youtu.be/R2mC-NUAmMk
|
| https://www.seeingai.com are the actual working apps.
|
| The version from 2016 I recall showing (pun not intended)
| to a coworker who had some significant vision impairments
| and he was _really_ excited about what it could do back
| then.
|
| ---
|
| I still remain quite impressed with its ability to parse
| the picture and the likely reason behind it:
| https://imgur.com/a/JZBTk2t
| benreesman wrote:
| If we're throwing "citation needed" tags on stuff, how about
| the first sentence?
|
| "Large language models with vision capabilities (VLMs), e.g.,
| GPT-4o and Gemini-1.5 Pro are powering countless image-text
| processing applications"
|
| I don't know how many a "countless" is, but I think we've
| gotten really sloppy in terms of what counts for LLMs as a
| demonstrated, durable win in a concrete task attached to
| well-measured outcomes and holding up over even modest
| periods of time.
|
| This stuff is really promising and lots of builders are
| making lots of nifty things, so if that counts as an
| application then maybe we're at countless, but in the
| enterprise and in government and in refereed academic
| literature we seem to be at the proof-of-concept phase.
| Impressive chat bots as a use case are pretty dialed in,
| enough people claim that they help with coding that I tend to
| believe it's a real thing (I never seem to come out ahead of
| going directly to the source, StackOverflow).
|
| The amount of breathless press on this seems "countless", so
| maybe I missed the totally rigorous case study on how X
| company became Y percent more profitable by doing Z thing
| with LLMs (or similar), and if so I'd be grateful for
| citations, but neither Google nor any of the big models seem
| to know about it.
| Lerc wrote:
| I disagree. I think the title, abstract, and conclusion not
| only misrepresent the state of the models but also
| misrepresent their own findings.
|
| They have identified a class of problems that the models
| perform poorly at and have given a good description of the
| failure. They portray this as a representative example of the
| behaviour in general. This has not been shown and is probably
| not true.
|
| I don't think that models have been portrayed as equivalent
| to humans. Like most AI in it has been shown as vastly
| superior in some areas and profoundly ignorant in others.
| Media can overblow things and enthusiasts can talk about
| future advances as if they have already arrived, but I don't
| think these are typical portrayals by the AI field in general.
| FrenchDevRemote wrote:
| > their vision is, at best, like that of a person with myopia
| seeing fine details as blurry
|
| It's not that far from reality: most models see images at very
| low resolution and with limited colors, so the description
| isn't far off.
| blackmesaind wrote:
| My thoughts as well. I too would have trouble with the
| overlapping lines tests if all the images underwent
| convolution.
| vikramkr wrote:
| They didn't test that claim at all though. Vision isn't some
| sort of 1D sliding scale with every vision condition lying
| along one axis.
|
| First of all myopia isn't 'seeing fine details as blurry' -
| it's nearsightedness - and whatever else this post tested it
| definitely didn't test depth perception.
|
| And second - inability to see fine details is a
| distinct/different thing from not being able to count
| intersections and the other things tested here. That
| hypothesis, if valid, would imply that improving the
| resolution of the image that the model can process would
| improve its performance on these tasks even if reasoning
| abilities were the same. That - does not make sense. Plenty
| of the details in these images that these models are tripping
| up on are perfectly distinguishable at low resolutions.
| Counting rows and columns of blank grids is not going to
| improve with more resolution.
|
| I mean, I'd argue that the phrasing of the hypothesis ("At
| best, like that of a person with myopia") doesn't make sense
| at all. I don't think a person with myopia would have any
| trouble with these tasks if you zoomed into the relevant
| area, or held the image close. I have a very strong feeling
| that these models would continue to suffer on these tasks if
| you zoomed in. Nearsighted != unable to count squares.
| jrflowers wrote:
| >I could (well actually I can't)
|
| I like the idea that these models are so good at some sort of
| specific and secret bit of visual processing that things like
| "counting shapes" and "beating a coin toss for accuracy"
| shouldn't be considered when evaluating them.
| vikramkr wrote:
| Those don't really have anything to do with fine
| detail/nearsightedness. What they measured is
| valid/interesting - what they concluded is unrelated.
| valine wrote:
| LLMs are bad at counting things just in general. It's hard to
| say whether the failures here are vision based or just an
| inherent weakness of the language model.
| godelski wrote:
| > Did they try to probe that hypothesis at all?
|
| I think this is a communication issue and you're being a bit
| myopic in your interpretation. It is clearly an analogy meant
| for communication and is not an actual hypothesis. Sure, they
| could have used a better analogy and they could have done other
| tests, but the paper still counters quite common claims (from
| researchers) about VLMs.
|
| > I could (well actually I can't) share some examples from my
| job of GPT-4v doing some pretty difficult fine-grained visual
| tasks that invalidate this.
|
| I find it hard to believe that there is no example you can
| give. It surely doesn't have to be exactly your training data.
| If it is this good, surely you can create an example no
| problem. If you just don't want to, that's okay, but then don't
| say it.
|
| But I have further questions. Do you have complicated
| prompting? Or any prompt engineering? It sure does matter how
| robust these models are to prompting. There's a huge difference
| between a model being able to accomplish a task and a model
| being able to perform a task in a non-very-specific
| environment. This is no different than something working in a
| tech demo and not in the hand of the user.
|
| > But in practice, we aren't just making up tasks to trip up
| these models.
|
| I see this sentiment quite often and it is baffling to me.
|
| First off, these tasks are not clearly designed to trick these
| models. A model failing at a task is not suddenly "designed to
| trick a model." It's common with the river crossing puzzles
| where they're rewritten to be like "all animals can fit in the
| boat." If that is "designed to trick a model", then the model
| must be a stochastic parrot and not a generalist. It is very
| important that we test things where we do know the answer to
| because, unfortunately, we're not clairvoyant and can't test
| questions we don't know the answer to. Which is the common case
| in the real world usage.
|
| Second, so what if a test was designed to trick up a model?
| Shouldn't we be determining when and where models fail? Is that
| not a critical question in understanding how to use them
| properly? This seems doubly important for tasks that
| humans have no trouble with.
|
| > They can be very performant on some tasks and the authors
| have not presented any real evidence about these two modes.
|
| I don't think people are claiming that large models can't be
| performant on some tasks. If they are, they're rejecting
| trivially verifiable reality. But not every criticism has
| to also contain positive points. There are plenty of papers and a
| lot of hype already doing that. And if we're going to be
| critical of anything, shouldn't it be that the companies
| creating these models -- selling them, and even charging
| researchers to perform these types of experiments that can be
| and are used to improve their products -- should be much more
| clear about the limitations of their models? If we need
| balance, then I think there's bigger fish to fry than Auburn
| and Alberta Universities.
| orbital-decay wrote:
| _> I think this is a communication issue and you're being a
| bit myopic in your interpretation. It is clearly an analogy
| meant for communication and is not an actual hypothesis._
|
| I don't know, words have meanings. If that's a communication
| issue, it's on the part of the authors. To me, this wording in
| what is supposed to be a research paper abstract clearly
| suggests insufficient resolution as the cause. How else
| should I interpret it?
|
| _> The shockingly poor performance of four state-of-the-art
| VLMs suggests their vision is, at best, like that of a person
| with myopia seeing fine details as blurry_
|
| And indeed, increasing the resolution is expensive, and the
| best VLMs have something like 1000x1000. But the low
| resolution is clearly not the issue here, and the authors
| don't actually talk about it in the paper.
|
| _> I find it hard to believe that there is no example you
| can give._
|
| I'm not the person you're replying to, but I actually lazily
| tried two of the authors' examples in a less performant VLM
| (CogVLM), and was surprised it passed those, making me wonder
| whether I can trust their conclusions until I reproduce their
| results. LLMs and VLMs have all kinds of weird failure modes,
| it's not a secret they fail at some trivial tasks and their
| behavior is still not well understood. But working with these
| models and narrowing it down is notoriously like trying to
| nail a jelly to the wall. If I was able to do this in a
| cursory check, what else is there? More than one research
| paper in this area is wrong from the start.
| diwank wrote:
| I think while the findings are interesting, the title "Vision
| language models are blind" is hyperbolic and misleading given the
| findings. It is true that VLMs have a very different way of
| sensing and working with image inputs, and for low resolutions
| each image is split up into blocks and then mapped into discrete
| tokens. They really don't have access to fine-grained
| details because this mapping is quite lossy. So, in that sense,
| these results completely make sense and were not surprising to
| me, but "blind" is a strong implication and the work isn't really
| able to back that up.
|
| Take the first example of the four 2D line plots for instance. I
| asked Sonnet 3.5 for an answer and retried it 5 times and 2 out
| of 5 times, it said something like this:
|
| _Me_ :
|
| here's an image with four 2-D line plots. Can you tell me A) how
| many lines are in each plot? and B) how many times do they
| intersect each other in each plot?
|
| Write your answers as: Plot 1 (leftmost): ...
|
| _Sonnet_ :
|
| I'll analyze each plot from left to right and provide the
| requested information:
|
| Plot 1 (leftmost): A) 2 lines (one red, one blue) B) 0
| intersections (the lines do not cross)
|
| Plot 2: A) 2 lines (one red, one blue) B) 1 intersection (the
| lines cross once near the right edge)
|
| Plot 3: A) 2 lines (one red, one blue) B) 1 intersection (the
| lines cross once near the center)
|
| Plot 4 (rightmost): A) 2 lines (one red, one blue) B) 2
| intersections (the lines cross twice, once near the left side and
| once near the right side)
|
| Each plot consistently shows two lines, one red and one blue,
| with varying intersection patterns across the four plots.
|
| ---
|
| Which isn't completely correct (the 3rd plot has 2 intersections)
| but still pretty good and the other 3 times, it said plot 3 has 1
| intersection.
| Jack000 wrote:
| This is kind of the visual equivalent of asking an LLM to count
| letters. The failure is more related to the tokenization scheme
| than the underlying quality of the model.
|
| I'm not certain about the specific models tested, but some VLMs
| just embed the image modality into a single vector, making these
| tasks literally impossible to solve.
| JeremyHerrman wrote:
| VLMs so far have never been good at counting objects or spatial
| relationships (e.g. the coffee is to the right of the microwave).
|
| There are ways to help the VLM out - Set of Marks [0] from
| Microsoft being the most prominent, which uses segmentation to
| outline and label sections of the image before sending to the
| VLM.
|
| Providing "speakable" labels to regions helps ground the visual
| abilities of VLMs and is why in this paper the performance is so
| much better when words are present in the grid for "Task 6:
| Counting the rows and columns of a grid"
|
| 0: https://github.com/microsoft/SoM
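| The actual Set-of-Marks pipeline gets its regions from a
| segmentation model (e.g. SAM), but the core idea is just "overlay
| numbered, speakable marks before sending the image to the VLM". A
| crude stand-in that uses plain OpenCV contours instead of a real
| segmenter (file names are hypothetical):
|
|     import cv2
|
|     img = cv2.imread("scene.png")
|     gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
|     _, mask = cv2.threshold(gray, 0, 255,
|                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)
|     contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
|                                    cv2.CHAIN_APPROX_SIMPLE)
|     for i, c in enumerate(contours, start=1):
|         x, y, w, h = cv2.boundingRect(c)
|         cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
|         cv2.putText(img, str(i), (x, max(y - 5, 15)),
|                     cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
|     cv2.imwrite("scene_marked.png", img)
|
| The prompt can then refer to "region 3" instead of asking the model
| to localize things purely from pixels.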
| jazzyjackson wrote:
| I didn't know counting objects was a problem. That's pretty
| ironic because the very first implementation of a neural net
| (AFAIK) is the numa-rete artificial retina developed at the
| Biological Computer Lab [0] circa 1960. It was a parallel
| analog computer composed of "neurons" each with a photocell
| that could be arranged in a grid and count "the number of
| objects independent of their size, location and form, and
| independent of strength of illumination" [1] - this paper may
| be of interest to those in the field, "Perception of Form in
| Biological and Man Made Systems" Heinz Von Foerster 1962
|
| [0]
| https://distributedmuseum.illinois.edu/exhibit/biological_co...
|
| [1] https://sites.evergreen.edu/arunchandra/wp-
| content/uploads/s...
| empath75 wrote:
| It really shouldn't be surprising that these models fail to
| do anything that _they weren't trained to do_. It's trivially
| easy to train a model to count stuff. The wild thing about
| transformer based models is that their capabilities are _way_
| beyond what you'd expect from token prediction. Figuring out
| what their limitations actually are is interesting because
| nobody fully knows what their limitations are.
| jazzyjackson wrote:
| I agree that these open ended transformers are way more
| interesting and impressive than a purpose built count-the-
| polygons model, but if the model doesn't generalize well
| enough to figure out how to count the polygons, I can't be
| convinced that they'll perform usefully on a more
| sophisticated task.
|
| I agree this research is really interesting, but I didn't
| have an a priori expectation of what token prediction could
| accomplish, so my reaction to a lot of the claims and
| counterclaims of this new tech is that it's good at fooling
| people and giving plausible but baseless results. It makes
| for good research but dangerous in the hands of a market
| attempting to exploit it.
| Eisenstein wrote:
| Vision models use CLIP or something similar, which has no
| conception of anything specific in the image. It sees
| embeddings which are trained to line up with text embeddings. Take
| an image then describe it 'there are birds sitting on a power
| line in front of a blue sky with some clouds', get the
| embeddings from that and the embeddings from that picture and
| line them up. If you ask if there are birds in it, it would
| know, but not how many, unless it was common to describe the
| number of birds sitting on things and it happened often enough
| that the number counted was the number in the image
| descriptions it trained on. If you want to count objects you
| want something like YOLO.
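| Concretely, that "lining up" is just a similarity score between one
| image embedding and a handful of caption embeddings. A small sketch
| with the public CLIP weights (the captions and file name are made
| up; note that nothing in this pipeline ever counts anything):
|
|     from PIL import Image
|     from transformers import CLIPModel, CLIPProcessor
|
|     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|     captions = ["birds on a power line", "an empty blue sky",
|                 "seven birds on a power line"]
|     inputs = processor(text=captions, images=Image.open("birds.jpg"),
|                        return_tensors="pt", padding=True)
|     probs = model(**inputs).logits_per_image.softmax(dim=-1)
|     print(dict(zip(captions, probs[0].tolist())))
|
| Whether "seven birds" scores above "birds" depends entirely on what
| co-occurred in the training captions.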
| JeremyHerrman wrote:
| VLMs like PaliGemma and Florence-2 support object detection
| and segmentation, so it's becoming more common to have YOLO
| like capabilities built into VLMs.
|
| Another benefit of VLMs which support object detection is
| that they are open vocabulary, meaning you don't have to
| define the classes ahead of time. Additionally fine tuning
| tends to keep the previous detection capabilities instead of
| erasing all previous classes like fine tuning YOLO.
| GaggiX wrote:
| Well, all the models (especially Claude 3.5 Sonnet) seem to
| perform much better than random, so they are clearly not blind.
| The only task where Claude Sonnet 3.5 does not perform better
| than random is the one where you have to follow many different
| paths (the ones where the answer from A to C is 3), something
| that would take me several seconds to solve.
|
| I have the feeling that they first chose the title of the paper
| and then ran the evaluation on the new Claude 3.5 Sonnet on these
| abstract images.
|
| >their vision is, at best, like that of a person with myopia
| seeing fine details as blurry
|
| This also makes no sense, since the images evaluate the abstract
| capabilities of the models, not their eyesight.
| randcraw wrote:
| OK. They're _legally_ blind.
| GaggiX wrote:
| This really has nothing to do with vision impairment.
| iamleppert wrote:
| This could easily be fixed with training and fine-tuning. Simply
| generate 100,000 examples or so, and train with ground truth
| for however long you want, and it's a solved problem.
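| A minimal sketch of what that synthetic data generation could look
| like for the line-intersection task (the sampling ranges, file
| layout, and crossing-count heuristic are all made up for
| illustration):
|
|     import json, os
|     import numpy as np
|     import matplotlib.pyplot as plt
|
|     rng = np.random.default_rng(0)
|     x = np.linspace(0.0, 1.0, 200)
|     os.makedirs("plots", exist_ok=True)
|     labels = {}
|     for i in range(100_000):
|         # Two smooth random curves; crossings = sign changes of the gap.
|         y1 = np.polyval(rng.uniform(-2, 2, 3), x)
|         y2 = np.polyval(rng.uniform(-2, 2, 3), x)
|         labels[f"plots/{i}.png"] = int(np.sum(np.diff(np.sign(y1 - y2)) != 0))
|         fig, ax = plt.subplots(figsize=(2, 2), dpi=100)
|         ax.plot(x, y1, "b"); ax.plot(x, y2, "r"); ax.axis("off")
|         fig.savefig(f"plots/{i}.png"); plt.close(fig)
|     with open("labels.json", "w") as f:
|         json.dump(labels, f)
|
| Whether a model fine-tuned on this generalizes past the benchmark,
| rather than just memorizing the task format, is a separate question.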
| kristjansson wrote:
| Solved for this benchmark... and at what cost to the rest of
| the system?
|
| These tasks are interesting because they're existence proofs of
| generalization failure. Like the haystack problem, direct
| solutions here are much less interesting than structural
| improvements that address the class of failure.
| imtringued wrote:
| Ok, but most of the data is just captions for images. You're
| going to have to invest some time into building this dataset at
| your own expense.
| _vaporwave_ wrote:
| It's really interesting that there's a huge performance
| discrepancy between these SOTA models. In the Olympic logo
| example, GPT-4o is below the baseline accuracy of 20% (worse than
| randomly guessing) while Sonnet-3.5 was correct ~76% of the time.
|
| Does anyone have any technical insight or intuition as to why
| this large variation exists?
| londons_explore wrote:
| Could some of the "wrong" answers be the LLM attempting to give
| an explanation rather than the answer, e.g. instead of answering
| 'X', the LLM answers 'The letter is partially hidden by the oval,
| so I cannot be certain, but it appears to be the English letter X'.
|
| The scoring criteria would rank this answer as 'T', which is
| wrong.
| simonw wrote:
| I've been generally frustrated at the lack of analysis of vision
| LLMs generally.
|
| They're clearly a very exciting category of technology, and a
| pretty recent one - they only got good last October with GPT-4
| Vision, but since then we've had more vision models from
| Anthropic and Google Gemini.
|
| There's so much more information out there about text prompting
| compared to image prompting. I feel starved for useful
| information about their capabilities: what are vision models good
| and bad at, and what are the best ways to put them to work?
| r2_pilot wrote:
| Why not use them yourself if you have access? I have been using
| Claude 3.5 Sonnet for gardening recently, and while it's not
| perfect (and can be a little blind unless you tell it to focus
| on a specific thing), it's helped me understand how to keep my
| plants alive in some challenging conditions (for me; this is my
| second or third attempt at gardening, so it's all challenging
| lol). But just experiment with it and see where the
| capabilities lie. I do agree that certain classes of visual
| data are challenging for it.
| simonw wrote:
| I've used them a bunch. I want to learn from other people's
| experiences as well.
|
| Some of my notes so far:
|
| - https://simonwillison.net/2024/Apr/17/ai-for-data-
| journalism... - my datasette-extract plugin, for structured
| data from both text and images
|
| - https://simonwillison.net/2024/Apr/17/ai-for-data-
| journalism... - where they failed to extract data from a
| handwritten scanned document in various weird ways
|
| - https://simonwillison.net/2024/Feb/21/gemini-pro-video/
| talks about video inputs to Gemini Pro (which are actually
| image inputs, it splits them up to one frame per second)
| simonw wrote:
| Anthropic have some interesting cookbook examples that provide
| advice on using their multimodal models here:
| https://github.com/anthropics/anthropic-cookbook/tree/main/m...
|
| I've assembled a bunch more notes here:
| https://simonwillison.net/tags/vision-llms/
| mglz wrote:
| I taught some computational geometry courses, and efficiently
| computing the intersections of N line segments is not as
| straightforward as you might initially think. Since somewhere
| some computation must be done to recognize this and LLMs are not
| specifically trained for this task, it's not surprising they
| struggle.
|
| In general, basic geometry seems under-explored by learning.
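| For what it's worth, the brute-force O(n^2) pairwise check is easy;
| it's the efficient sweep-line version (Bentley-Ottmann, roughly
| O((n + k) log n) for k intersections) that gets fiddly. A minimal
| sketch of the former, ignoring collinear-overlap edge cases:
|
|     from itertools import combinations
|
|     def orient(p, q, r):
|         # > 0 counter-clockwise, < 0 clockwise, 0 collinear
|         return (q[0]-p[0])*(r[1]-p[1]) - (q[1]-p[1])*(r[0]-p[0])
|
|     def cross(a, b, c, d):
|         # Proper intersection of segment ab with segment cd.
|         return (orient(a, b, c) * orient(a, b, d) < 0 and
|                 orient(c, d, a) * orient(c, d, b) < 0)
|
|     def count_intersections(segments):
|         return sum(cross(*s, *t) for s, t in combinations(segments, 2))
|
|     # An X plus one segment that touches nothing -> 1 intersection.
|     print(count_intersections([((0, 0), (1, 1)),
|                                ((0, 1), (1, 0)),
|                                ((2, 2), (3, 2))]))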
| orbital-decay wrote:
| That's not anything like "myopia", though.
|
| FWIW I tried the line intersection and the circled letter test
| from the article with CogVLM (which is far from reaching the
| current SotA) and it correctly passed both. I haven't tried it
| with Sonnet/4o but I suspect there might be something wrong with
| how the author did their tests. Don't get me wrong, but too many
| "the model can't do that" claims ended up with demonstrations of
| the model doing exactly that...
| nyxtom wrote:
| I wonder how well Alpha Geometry would do on this
| nybsjytm wrote:
| AlphaGeometry is a hyper-specific system trained to add
| auxiliary geometric objects, like extra lines, to existing
| Euclidean geometry configurations. These prompts are not even
| sensible inputs to AlphaGeometry.
| hi_dang_ wrote:
| I was hoping that someone in the comments talking the paper down
| would have published a paper or have had relevant publications of
| their own to point to. You know, meet the lads halfway sort of
| thing.
|
| So what I'm left with to judge instead is anonymous online
| commenters vs. the publication of 2 prestigious universities.
| Whose word do I take on this? Decisions, decisions.
|
| You can swap LM out with Web3 out with NFT out with Crypto in
| this case.
| warkdarrior wrote:
| > I'm left with [...] is anonymous online commenters vs. the
| publication of 2 prestigious universities. Whose word do I take
| on this?
|
| Maybe you need to judge the contents of those online comments
| and the contents of the publication, instead of relying on
| argument from authority.
| vessenes wrote:
| A few comments below talk about how tokenizing images using stuff
| like CLIP de-facto yields blurry image descriptions, and so these
| are 'blind' by some definitions. Another angle of blurring not
| much discussed is that the images are rescaled down; different
| resolutions for different models. I wouldn't be surprised if
| Sonnet 3.5 had a higher-res base image it feeds into the model.
|
| Either way, I would guess that we'll need new model architectures
| for multimodal to get really good at some of this, and even then
| some of these tasks are adjacent to things that we know LLMs are
| already bad at (numeric logic, for instance).
|
| As context lengths get longer, devoting more tokens to the image
| tokenization should help a bit here as well. Anyway, I'd
| anticipate next year we'd see 80s and 90s for most of these
| scores with next gen models.
| imtringued wrote:
| The problem with the current crop of projectors such as LLaVA
| is that, as far as I know, they do not take the previous
| conversation into account. You only really get zero-shot
| responses. This means that you cannot steer the model towards
| paying attention to specific instruction-related details. The
| projector simply
| creates a token representation of the visuals (not necessarily
| human language tokens) and the LLM just processes that as
| usual.
| vessenes wrote:
| The original GPT-4 did this too; it had almost no memory
| before or after the image provided. I haven't tested GPT-4o on
| this directly, but my feeling is that it's better from casual
| usage.
|
| I do think some of these thin line drawings are likely extra
| hard to tokenize depending on the image scaling sizes for
| tokenization. I'd wager thicker lines would help, although
| obviously not all of this is just 'poor tokenization'.
| spullara wrote:
| in other news, vision models are bad at things they aren't
| trained to do
| akavi wrote:
| Speaking as someone with only a tenuous grasp of how VLMs work,
| this naively feels like a place where the "embodiment" folks
| might have a point: Humans have the ability to "refine" their
| perception of an image iteratively, focusing in on areas of
| interest, while VLMs have to process the entire image at the same
| level of fidelity.
|
| I'm curious if there'd be a way to emulate this (have the visual
| tokens be low fidelity at first, but allow the VLM to emit tokens
| that correspond to "focusing" on a region of the image with
| greater resolution). I'm not sure if/how it's possible to
| performantly train a model with "interactive" data like that,
| though
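| A toy, inference-time version of that "focus" step: crop the region
| the model asks about from the full-resolution original and resend
| just that crop, upscaled (the box coordinates stand in for
| hypothetical model output):
|
|     from PIL import Image
|
|     def focus(path, box, target_px=1024):
|         crop = Image.open(path).crop(box)  # box = (left, top, right, bottom)
|         scale = target_px / max(crop.size)
|         return crop.resize((round(crop.width * scale),
|                             round(crop.height * scale)))
|
|     zoomed = focus("chart.png", (400, 250, 650, 500))
|
| Doing this via tool use at inference time seems straightforward; the
| harder part is the question above of training on interactive data
| like that.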
| efskap wrote:
| Isn't this the attention mechanism, the reason we're using
| transformers for these things? Maybe not greater resolution per
| se, but focusing on a region with greater neural connectivity
| akavi wrote:
| Ah, good point!
|
| But the model is downstream of the "patch" tokenization, so
| the cut-down in resolution (compression) of the image has
| already occurred _prior_ to the point where the model can
| direct greater "attention".
|
| I think the synthesis is that I'm proposing a per-pixel
| tokenization with a transformer block whose purpose is to
| output information at a compression level "equivalent" to
| that of the patch tokens (is this what an autoencoder is?),
| but where the attention vector is a function of the full
| state of the LLM (ie, inclusive of the text surrounding the
| image)).
|
| Naively, I'd think a layer like this that is agnostic to the
| LLM state needn't be any more computationally costly than the
| patching computation (both are big honks of linear algebra?),
| but idk how expensive the "full context attention" feedback
| is...
|
| (I apologize to anyone who actually understands transformers
| for my gratuitous (ab|mis)use of terminology)
| Brechreiz wrote:
| >Humans have the ability to "refine" their perception of an
| image iteratively
|
| That's not related to embodied cognition.
| akavi wrote:
| Is embodied cognition not at least in part about
| interactivity? I perform action (emit tokens) and receive
| feedback (non-self-generated tokens)
| kromem wrote:
| Lots and lots of eye tracking data paired with what was being
| looked at in order to emulate human attention processing might
| be one of the lower hanging fruits for improving it.
| tantalor wrote:
| Are the "random-baseline accuracy" numbers correct?
|
| In the "Two circles" test, do they really have 50% chance of
| overlapping? I think this comes from "Distances between circle
| perimeters: -0.15 to 0.5 times the diameter" but doesn't say the
| distribution they use.
| jdlshore wrote:
| They asked the AI a question with a yes/no response. If the AI
| chose randomly, it would be correct 50% of the time. That's
| what "random baseline accuracy" means.
| jeromeparadis wrote:
| One use-case I always try is to have an AI read a school
| calendar image where days off or days of interest are
| highlighted using a legend, i.e. days marked with a square,
| circle or triangle, or a different color, etc.
|
| When asking for specific days of interest for the school
| year, AIs always struggle. They get some days right but forget
| some or confabulate new days. They fare a bit better if you
| remove some of the noise and give them only a picture of a
| month, but even then, it's unreliable.
| verbalstoner wrote:
| It's virtually impossible to take a paper seriously when the
| title has an emoji.
| axblount wrote:
| Would you say they have _Blindsight_?
| pjs_ wrote:
| I don't like this paper for the following reasons:
|
| - The language is unnecessarily scathing
|
| - They repeatedly show data where the models are getting things
| _right_ 70, 80, 90% of the time, and then show a list of what
| they call "qualitative samples" (what does "qualitative" mean?
| "cherry-picked"?) which look very bad. But it got the answer
| right 70/80/90% of the time! That's hardly "blind"...
|
| - Various of the tasks hinge on the distinction between two
| objects "exactly touching" vs. "very nearly touching" vs. "very
| slightly overlapping", a problem which (i) is hard for humans and
| (ii) is particularly (presumably deliberately) sensitive to
| resolution/precision, where we should not be surprised that
| models fail
|
| - The main fish-shaped example given in task 1 seems genuinely
| ambiguous to me - do the lines "intersect" once or twice? The
| tail of the fish clearly has a crossing, but the nose of the fish
| seems a bit fishy to me... is that really an intersection?
|
| - AFAIC deranged skepticism is just as bad as deranged hype, the
| framing here is at risk of appealing to the former
|
| It's absolutely fair to make the point that these models are not
| perfect, fail a bunch of the time, and to point out the edge
| cases where they suck. That moves the field forwards. But the
| hyperbole (as pointed out by another commenter) is very annoying.
| neuronet wrote:
| To be fair, the paper has an emoji in the _title_, so I
| wouldn't read it as a particularly serious
| academic study as much as the equivalent of the Gawker of AI
| research. It is a "gotcha" paper that exploits some blind spots
| (sorry) that will easily be patched up with a few batches of
| training. I do think it highlights the lack of AGI in these
| things, which some people lacking situational awareness might
| need to see.
| cpill wrote:
| I wonder how they would score if they used all 4 models and took
| a majority vote...?
| aaroninsf wrote:
| The title for this page and argument should be qualified with the
| specific generation of tools.
|
| That's in the abstract, but it's bad not to be specific. In this
| case, because current public-facing models are WIWEB: the worst
| it will ever be.
|
| And there are trillion-dollar prizes at stake, so, improvement is
| happening as quickly as it possibly can.
| make3 wrote:
| Hugged to death from my perspective. Here is a backup:
| https://archive.ph/kOE3Q
| simonw wrote:
| That's weird - GitHub Pages serves static content and rarely
| (in my experience) fails to load.
| jetrink wrote:
| I had a remarkable experience with GPT-4o yesterday. Our garage
| door started to fall down recently, so I inspected it and found
| that our landlord had installed the wire rope clips incorrectly,
| leading to the torsion cables losing tension. I didn't know what
| that piece of hardware was called, so I asked ChatGPT and it
| identified the part as I expected it to. As a test, I asked if
| there was anything notable about the photo. ChatGPT correctly
| identified that the cables were installed backwards, with the
| side of the cable that was (previously) under tension on top of
| the slack end, instead of sandwiched securely in the middle. To
| diagnose that requires tracing the cable through space and
| inferring which end is under tension from the geometry, though I
| can't rule out an educated guess.
|
| What was really remarkable though was that it failed to notice
| that one of the two nuts was obviously missing, even after I told
| it there was a second problem with the installation.
|
| Screenshot: https://imgur.com/a/QqCNzOM
| sfink wrote:
| A _human_ would need to trace the cable. An LLM may just be
| responding based on (1) the fact that you 're asking about the
| clip in the first place, and that commonly happens when there's
| something wrong; and (2) that this is a very common failure
| mode. This is supported by it bringing up the "never saddle a
| dead horse" mnemonic, which suggests the issue is common.
|
| After you fix it, you should try asking the same questions!
| nmca wrote:
| please use this opportunity to reflect on whether ARC measures
| reasoning skills :)
| gnutrino wrote:
| My guess is that the systems are running image recognition
| models, and maybe OCR on images, and then just piping that data
| as tokens into an LLM. So you are only ever going to get results
| as good as existing images models with the results filtered
| through an LLM.
|
| To me, this is only interesting if compared with results of image
| recognition models that can already answer these types of
| questions (if they even exist, I haven't looked).
|
| Maybe the service is smart enough to look at the question, and
| then choose one or more models to process the image, but not sure
| as I can't find anything on their sites about how it works.
| Eisenstein wrote:
| > My guess is that the systems are running image recognition
| models
|
| Your guess is incorrect. Look up CLIP, BLIP, and SigLIP for an
| idea of how they work.
| simonw wrote:
| That's not how they work. The original GPT-4 paper has some
| detail: https://cdn.openai.com/papers/gpt-4.pdf
|
| Or read up on PaliGemma: https://github.com/google-
| research/big_vision/blob/main/big_...
| nichohel wrote:
| Vision language models are blind because they lack the Cartesian
| Theater, which you and I have. Which you and I say we have.
| viraptor wrote:
| I love some of the interpretations there. For example "Fig. 10:
| Only Sonnet-3.5 can count the squares in a majority of the
| images.", when that model simply returns "4" for every question
| and happens to be right.
| jackblemming wrote:
| Ask it to draw any of those things and it can.
| mkoubaa wrote:
| They interact with pixel buffers as a mathematical array. To call
| them blind is to confuse what they are doing with the experience of
| sight...
___________________________________________________________________
(page generated 2024-07-10 23:00 UTC)