[HN Gopher] Vision Language Models Are Biased
       ___________________________________________________________________
        
       Vision Language Models Are Biased
        
       Author : taesiri
       Score  : 105 points
       Date   : 2025-06-03 12:47 UTC (10 hours ago)
        
 (HTM) web link (vlmsarebiased.github.io)
 (TXT) w3m dump (vlmsarebiased.github.io)
        
       | taesiri wrote:
        | State-of-the-art Vision Language Models achieve 100% accuracy
        | when counting in images of popular subjects (e.g. knowing that
        | the Adidas logo has 3 stripes and a dog has 4 legs), but are only
        | ~17% accurate when counting in counterfactual images (e.g.
        | counting stripes in a 4-striped Adidas-like logo or counting legs
        | in a 5-legged dog).
        
         | LorenDB wrote:
         | There's no need to repeat what is said at the top of the linked
         | webpage.
        
       | shenkha wrote:
        | Fun findings related to memorization in AI models. It simply
        | means LLMs/VLLMs do not generalize when predicting; they memorize
        | instead. A new perspective on adversarial attack methods.
        
         | taesiri wrote:
          | For overrepresented concepts, like popular brands, it seems
         | that the model "ignores" the details once it detects that the
         | overall shapes or patterns are similar. Opening up the vision
         | encoders to find out how these images cluster in the embedding
         | space should provide better insights.
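          | 
          | A minimal sketch of such a probe (a hedged example, assuming a
          | CLIP-style encoder from Hugging Face transformers; the model
          | name and image paths are placeholders):
          | 
          |   # pip install transformers torch pillow
          |   import torch
          |   from PIL import Image
          |   from transformers import CLIPModel, CLIPProcessor
          | 
          |   name = "openai/clip-vit-base-patch32"
          |   model = CLIPModel.from_pretrained(name)
          |   processor = CLIPProcessor.from_pretrained(name)
          | 
          |   # Placeholder files: a normal dog and a 5-legged counterfactual
          |   images = [Image.open(p) for p in ("dog.png", "dog_5_legs.png")]
          |   inputs = processor(images=images, return_tensors="pt")
          | 
          |   with torch.no_grad():
          |       emb = model.get_image_features(**inputs)
          |   emb = emb / emb.norm(dim=-1, keepdim=True)
          | 
          |   # If the encoder collapses the counterfactual onto the usual
          |   # concept, this cosine similarity will sit very close to 1.0
          |   print("cosine similarity:", (emb[0] @ emb[1]).item())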
        
           | kmeisthax wrote:
           | If there aren't any five-legged dogs in your trainset, it's
           | safer[0] to just remember that all dogs are four-legged than
           | to actually recognize and count legs. After all, you might
           | have a few images of dogs in your trainset that are
           | misleading enough to look five-legged (e.g. because a dog is
           | in front of another dog).
           | 
            | Overrepresentation is a _different_ source of bias. That's
           | what gives you, say, image generators that always draw
           | "golden 1970s sci-fi robot" as C3-PO even when given
           | additional instructions to draw something else.
           | 
           | Both of these problems are manifestations of the difference
           | between training and deployment distributions. Ok, I _guess_
            | you could say that four-legged dogs are "overrepresented" in
            | the training set, but that's because four-legged dogs are
            | also overrepresented _in reality_. The deployment
            | distribution doesn't have five-legged dogs in it. What we've
            | done is instead concoct an adversarial distribution to
            | _force_ a train/deploy gap where none would exist.
           | 
           | Releasing the vision encoder won't help because weights are
           | opaque. Stochastic gradient descent does not yield functional
           | internal representations[1]; it fills the bucket of
           | parameters with one distribution and one distribution only.
            | We could tell if, say, the vision encoder produces identical
           | embeddings for dogs regardless of leg count, or some other
           | counterfactuals; but not much more than that.
           | 
           | [0] Lower loss and possibly lower L2-norm
           | 
           | [1] https://arxiv.org/abs/2505.11581
        
           | impossiblefork wrote:
           | Yes, and this can probably be solved by methods for fairness.
           | 
           | I used to believe that fairness research could be ignored,
           | that it was all rubbish, but they at least try to do
           | something about things like unbalanced datasets etc. I'm
           | still not sure I totally believe in it though.
        
       | vokhanhan25 wrote:
       | This paper explores a different aspect of the limitations of VLMs
       | compared to the paper VLMs are Blind
        | (https://vlmsareblind.github.io). While o3 achieved 90% accuracy
        | on VLMs are Blind (https://openai.com/index/thinking-with-
        | images), on similarly easy tasks using the counterfactual images
        | from VLMs are Biased it only reached 18.5%.
       | 
       | This may indicate that while VLMs might possess the necessary
       | capability, their strong biases can cause them to overlook
       | important cues, and their overconfidence in their own knowledge
       | can lead to incorrect answers.
        
       | bryanlarsen wrote:
       | Very human-like errors.
        
         | ahrmb wrote:
         | Not very similar though.
        
         | energywut wrote:
         | Are they? Did you see the picture of the chicken with three
         | legs? Because there's no human I know who would confidently
         | assert that chicken has two legs.
        
           | bryanlarsen wrote:
           | Throw 1000 pictures of chickens at a human, ask how many legs
           | each chicken has. If 999 of them have two, I bet you'll get
           | two as an answer back for the 1000th one no matter how
           | obvious.
        
             | enragedcacti wrote:
             | Humans do things a lot harder than that every day in the
             | form of QA in factories. Do they sometimes make mistakes
             | from the repetition or boredom? Sure. Is that at all
             | comparable to the failures in the paper? No.
        
           | jbay808 wrote:
           | If I were given five seconds to glance at the picture of a
           | lion and then asked if there was anything unusual about it, I
           | doubt I would notice that it had a fifth leg.
           | 
           | If I were asked to count the number of legs, I would notice
           | right away of course, but that's mainly because it would
           | alert me to the fact that I'm in a psychology experiment, and
           | so the number of legs is almost certainly not the usual four.
           | Even then, I'd still have to look twice to make sure I hadn't
           | miscounted the first time.
        
       | ahrmb wrote:
       | Really "eye-opening" work. These models don't actually "see",
       | they just recall what they've memorized, even when the image
       | clearly shows something different. It's a bit scary how
       | confidently they get things wrong when reality doesn't match
       | their training data.
        
         | foxglacier wrote:
         | It's not too different from people. We also don't really "see"
          | and mostly recall what we expect to see. What do you expect
          | when the question itself is wrong ("How many legs does this
          | animal have? Answer with a number") but it's not a picture of
          | an animal? What are you supposed to do? Answer 0?
        
           | regularjack wrote:
           | You answer "I don't know"
        
             | amelius wrote:
             | What if that is not in your vocabulary?
        
           | ramoz wrote:
            | This is interesting actually. And it vaguely reminds me of
            | something - a book or something that describes how human
           | attention and the things we see are highly optimized by
           | evolution. We often miss a lot of details in reality due to
           | this.
        
             | zehaeva wrote:
              | If a fiction novel counts, then might I suggest Blindsight
              | by Peter Watts?
        
               | ramoz wrote:
               | not fiction. Maybe like a System 1 vs System 2 thing from
               | Thinking, Fast and Slow by Kahneman.
               | 
                | ChatGPT mentioned The Case Against Reality; I never read
                | it, but the idea was similar.
        
           | vunderba wrote:
           | That wasn't one of the questions - any reasonable person
           | would have classified that chicken as an animal, albeit a
           | mutant one.
           | 
           | I would also hardly count many of these questions as "tricks"
            | either. Take the chess example. My friends and I have been
            | playing chess since we were young children, and we all know
            | that a fully populated chess board has 32 pieces (heavily
            | weighted in our internal training data), but not a single one
            | of us would have gotten that question wrong.
        
             | gowld wrote:
             | Don't be too literal.
             | 
              | Imagine walking into a room and seeing someone grab a
              | handful of chess pieces off of a set-up board, and proceed
              | to fill bags with 4 pieces each. As they fill the 8th bag,
              | they notice only 3 pieces are left. Are you confident that
              | you would respond "I saw the board only had 31 pieces on it
              | when you started", or might you reply "perhaps you dropped
              | a piece on the floor"?
        
           | wat10000 wrote:
           | Depending on the situation, I'd either walk away, or respond
           | with, "What animal?"
        
           | enragedcacti wrote:
            | It's true that our brains take lots of shortcuts when
           | processing visual information but they don't necessarily
           | parallel the shortcuts VLMs take. Humans are often very good
           | at identifying anomalous instances of things they've seen
           | thousands of times. No one has to tell you to look closely
           | when you look at your partner in a mirror, you'll recognize
           | it as 'off' immediately. Same for uncanny CGI of all types of
           | things. If we were as sloppy as these models then VFX would
           | be a hell of a lot easier.
           | 
           | Ironically I think a lot of people in this thread are
           | remembering things they learned about the faultiness of
           | humans' visual memory and applying it to visual processing.
        
         | soulofmischief wrote:
         | Humans do this, but we have more senses to corroborate which
         | leads to better error checking. But what you see in your visual
         | mental space is not reality. Your brain makes a boatload of
         | assumptions.
         | 
         | To test this, research what happens during saccades and how
         | your brain "rewinds" time. Or try to find your blind spot by
         | looking at different patterns and noticing when your brain
         | fills in the gaps at your blind spot. It will recreate lines
         | that aren't there, and dots will wholly disappear.
         | 
          | Additionally, as an anecdote, I have noticed plenty of times that
         | when I misread a word or phrase, I usually really do "see" the
         | misspelling, and only when I realize the misspelling does my
         | brain allow me to see the real spelling. I first noticed this
         | phenomenon when I was a child, and because I have a vivid
         | visual memory, the contrast is immediately obvious once I see
         | the real phrase.
         | 
         | Additionally, I seem to be able to oversharpen my vision when I
         | focus, making myself hyperattentive to subtle changes in motion
         | or color. The effect can be quite pronounced sometimes,
          | reminiscent of applying an edge filter. It's clearly not
         | reality, but my visual system thinks it is.
         | 
         | If you really want to understand how much the visual system can
          | lie to you, look into some trip reports from deliriants on
          | erowid. I wouldn't recommend trying them yourself, but I will
         | say that nothing will make you distrust your eyes and ears
         | more. It's basically simulated hallucinatory schizophrenia and
         | psychosis.
        
       | VHRanger wrote:
       | It's similar "parrot" behavior the models have on other inputs,
       | even text. For instance, take the answers models have to this
       | variation on the "surgeon son riddle":
       | 
       | > A nurse comes to a surgeon and asks: "Sir, you are a dog. You
       | do not hold a valid medical license. Canines cannot be in an
       | operating room".
       | 
       | > She then asks: "why does the hospital keep making these
       | mistakes? It is a riddle to me".
       | 
       | > Why can't the surgeon operate on the boy?
       | 
       | Check out the models humiliating themselves one after the other
        | [1]. For instance deepseekv3 is so insistent on giving the common
        | answer that it's happy to gaslight us:
       | 
       | > This is a classic riddle that plays on word meanings and
       | expectations. Here's the breakdown:
       | 
       | > The key is that the surgeon is the boy's mother. The riddle
       | relies on the unconscious assumption that surgeons are male,
       | which makes the initial scenario seem impossible. When the nurse
       | says "you are a dog," she's using it metaphorically to imply the
       | surgeon is unqualified (like calling someone a "dog" in a
       | derogatory sense), not literally suggesting the surgeon is a
       | canine.
       | 
       | > The "mistakes" the hospital keeps making are hiring female
       | surgeons, which the nurse (representing outdated biases)
       | perceives as errors because she expects surgeons to be male. The
       | riddle highlights gender stereotypes in medicine--the real
       | "mistake" is the assumption that a surgeon couldn't be a woman.
       | 
       | > Thus, the surgeon can't operate on her own son because she's
       | his parent, not because of any actual licensing issue.
       | 
       | 1. https://kagi.com/assistant/54c1b8eb-71e9-4bb4-9eed-
       | bde2fc563...
        
         | selimthegrim wrote:
         | I really need to try this one out on it
         | 
         | https://blogs.illinois.edu/view/25/574827
        
         | stevepike wrote:
          | This seems to show the power of the reasoning models compared
          | to interacting with a prompted chat-tuned LLM directly. If I
          | navigate backwards on your link, Sonnet 4 gets it right.
         | 
         | I've used a similar prompt - "How can you make 1000 with
         | exactly nine 8s using only addition?"
         | 
         | Here's GPT 4.5 getting it wrong:
         | https://chatgpt.com/share/683f3aca-8fbc-8000-91e4-717f5d81bc...
         | 
          | It tricks it because it's a slight variation of an existing
          | puzzle (making 1000 with eight 8s and addition only).
         | 
         | The reasoning models seem to reliably figure it out, though.
         | Some of them even come up with a proof of why it's impossible
         | to do with 9 8s. Here's o4 getting it right:
         | https://chatgpt.com/share/683f3bc2-70b8-8000-9675-4d96e72b58...
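          | 
          | For what it's worth, the nine-8s claim is easy to verify by
          | brute force without any model (a small sketch in plain Python):
          | 
          |   # Can 1000 be written as a sum of 8, 88 and 888 using exactly
          |   # nine 8s in total? (8888 already exceeds 1000.)
          |   from itertools import product
          | 
          |   solutions = []
          |   for a, b, c in product(range(10), repeat=3):  # counts of 8, 88, 888
          |       if a + 2 * b + 3 * c == 9 and a * 8 + b * 88 + c * 888 == 1000:
          |           solutions.append((a, b, c))
          | 
          |   print(solutions)  # [] -> impossible with nine 8s
          |   # With eight 8s it works: 888 + 88 + 8 + 8 + 8 == 1000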
        
         | kaoD wrote:
         | LMAO I asked GPT-4o and it was doing good until...
         | 
         | > The twist is that the nurse's logic ("you are a dog")
         | prevents her from realizing the real issue -- likely, again,
         | that the surgeon is the boy's mother, and everything else is a
         | red herring or metaphor for society's failure to recognize this
         | due to bias or absurd bureaucracy.
         | 
         | > So:
         | 
         | > > Why can't the surgeon operate on the boy?
         | 
         | > Because she is his mother, and the nurse's bias or absurd
         | assumptions (like mistaking her for a dog) prevent her from
         | seeing that.
         | 
         | o4 fails spectacularly in a different way:
         | 
         | > 1. The nurse says "Sir, you are a dog... Canines cannot be in
         | an operating room" because she's picturing a human hospital law
         | that bars dogs from surgery.
         | 
         | > 2. In fact, this is a vet clinic--so it's perfectly normal
         | for a dog-veterinarian to scrub in and operate on a puppy (the
         | "boy").
         | 
         | > 3. The surgeon cannot operate on a human boy because he's a
         | dog and holds no human-medical license; instead, he only
         | operates on animals.
        
         | bumby wrote:
         | Is the nurse calling the female surgeon "sir"? That isn't
         | playing on a stereotype, it's encoded information.
        
       | esafak wrote:
       | This happens because images are the only signal VLMs have,
       | whereas humans distinguish between eyesight and synthetic images.
        | We are not surprised when we see a three-legged chicken in a
       | research data set; our priors are weaker for images. If you "saw"
       | one in real life, you'd probably rub your eyes and discount it
       | too.
       | 
       | Try the same experiment on a robot.
        
         | Aachen wrote:
         | > If you "saw" [a three-legged chicken] in real life, you'd
         | probably rub your eyes and discount it too.
         | 
         | Huh? I'd assume it's a mutant, not store a memory of having
         | seen a perfectly normal chicken
         | 
         | You've never seen someone who's missing a finger or has only a
         | half-grown arm or something? Surely you didn't assume your eyes
         | were tricking you?! Or... if you did, I guess you can't answer
          | this question. I'm actually racking my brain for how to logic
          | this out, but I'm just going to bank on it being likely that
          | anyone over 20yo has seen an animal with some visible deviation
          | from the norm at some point in their life
        
           | esafak wrote:
           | You've seen people with missing limbs without being
           | surprised, because you know how they can become lost, but you
           | rarely see one with additional limbs. Their likelihoods and
           | our consequent priors are drastically different.
           | 
           | Also, your reaction will depend on how strong the evidence
           | is. Did you 'see' the three-legged chicken pass by some bush
           | in the distance, or was it right in front of you?
        
       | runako wrote:
       | FWIW I tried the first couple of examples in ChatGPT 4o and
       | couldn't replicate this.
       | 
       | For example: "The animal in the image is a chicken, and it
       | appears to have four legs. However, chickens normally have only
       | two legs. The presence of four legs suggests that the image may
       | have been digitally altered or artificially generated."
       | 
       | I don't have a good explanation for why I got different results.
        
         | roywiggins wrote:
         | I gave ChatGPT some miswritten Braille a while ago and it
         | completely, but confidently, messed it up. The sign reads "no
         | smoking" but the braille doesn't. ChatGPT 1) read the English
         | lettering first and then hallucinated the braille and the 2)
         | when given only the braille, failed almost as hard. It even
         | generated fake transcriptions in Unicode braille characters.
         | 
         | https://chatgpt.com/share/683f3e7d-0dfc-8005-b6c9-99e3d39ff4...
         | 
         | https://chatgpt.com/share/683f3e49-9c58-8005-99a6-c3a919838b...
        
           | Workaccount2 wrote:
           | This is hard to understand without the original images, it
           | looks like OpenAI doesn't serve them in the share link.
        
             | roywiggins wrote:
             | Annoying. The actual braille on the sign was "a3essi#"
             | which I gather means "accessible" in abbreviated braille.
             | None of my attempts got it to even transcribe it to Unicode
             | characters properly. I got "elevator", "friend", etc. Just
             | wildly making stuff up and completely useless, even when it
             | wasn't distracted by the No Smoking sign (in the second
             | case I cropped out the rest of the sign). And in all cases,
             | supremely confident.
             | 
             | This seems like something a VLM should handle very easily,
             | but instead I got pure nonsense.
             | 
             | https://www.facebook.com/share/p/12Gw55Gr2SZ/
        
               | dragonwriter wrote:
               | > This seems like something a VLM should handle very
               | easily
               | 
               | Not if its training data doesn't include braille as first
               | class but has lots of braille signage with bad
               | description (e.g., because people assumed the
               | accompanying English matches the braille.)
               | 
                | This could very well be the kind of mundane AI bias
                | problem that the x-risk and tell-me-how-to-make-WMD
                | worries have shifted attention in AI away from.
        
         | inerte wrote:
         | I took a screenshot of the chicken, so low res, and got {4}
         | https://chatgpt.com/share/683f4506-ae18-800f-8c27-5c5e91429a...
         | 
         | Also I think the authors used the API, and maybe there are
         | differences between the API and chatgpt.com behavior...
        
           | runako wrote:
           | I could rant for quite a while about how OpenAI and Anthropic
           | manage their apps vs their APIs. It's really quite strange
           | that they both landed on the solution of non-public APIs that
           | perform differently than their public APIs.
        
           | simonw wrote:
           | ChatGPT is running a special model but it's also available
           | through the API:
           | https://platform.openai.com/docs/models/chatgpt-4o-latest
           | 
           | The system prompt may still make a difference though.
        
           | anguyen8 wrote:
           | https://imgur.com/cO7eFNt
           | 
           | o3 Chat is also similarly wrong, saying {4}.
        
         | vokhanhan25 wrote:
          | You should try other models besides GPT-4o, because in the
          | paper they also show that GPT-4.1 (~GPT-4o) answers 4 legs
          | instead of 2.
        
         | dwringer wrote:
          | Speculating, I would imagine that different prompts submitted
          | along with the image might elicit wildly different behavior
          | from a multimodal VLM, potentially affecting how strongly it
          | upweights inferences from prior training versus focusing on the
          | new image itself.
        
         | michaelt wrote:
          | _> FWIW I tried the first couple of examples in ChatGPT 4o and
          | couldn't replicate this._
         | 
         | I can replicate the flag examples from Figure 15 in the paper,
         | if not the Adidas one from Figure 9:
         | https://chatgpt.com/share/683f7c3a-b318-8011-9759-c495db2556...
         | it even confirms its wrong answer when asked to check again.
        
       | gamerDude wrote:
        | Hypothetically, could this be fixed by changing the input method?
       | For instance, I just quickly looked up how humans process
       | imagery.
       | 
       | "the primary visual cortex, located at the back of the brain,
       | receives the visual signals and processes basic visual features
       | like edges, lines, and orientations."
       | 
        | So, potentially, if we did a pre-processing step to extract more
        | features beforehand, we would see different results in the
        | output.
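        | 
        | A crude sketch of that idea (just classical edge extraction with
        | OpenCV; the file name is a placeholder, and this is not a claim
        | that it fixes the bias):
        | 
        |   # pip install opencv-python
        |   import cv2
        | 
        |   img = cv2.imread("five_legged_dog.png", cv2.IMREAD_GRAYSCALE)
        | 
        |   # "Early vision"-style features: edges and their orientations
        |   edges = cv2.Canny(img, 100, 200)
        |   grad_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
        |   grad_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
        | 
        |   cv2.imwrite("edges.png", edges)
        |   # The edge map could then be fed to the VLM alongside the
        |   # original image to see whether the answers change.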
        
         | nyrikki wrote:
          | You are in rarefied air, as Walter Pitts believed this until the
         | 1959 paper "What the Frog's Eye Tells the Frog's Brain"
         | contributed to his decline.
         | 
         | Even in fly eyes, neuron dendritic compartmentalization and
         | variable spike trains are incompatible with our current
         | perceptron based models.
         | 
          | Remember that while the value of MLPs for useful work is
          | unquestionable IMHO, be mindful of the map-territory relation.
          | MLPs are inspired by, and in some cases useful for modeling,
          | biological minds, but they aren't equivalent.
          | 
          | Be careful about confusing the map for the territory; it is
          | just as likely to limit what opportunities you find as it is to
          | lead you astray IMHO.
        
         | miguel_martin wrote:
         | There are enough features fed into a VLM to solve the task.
         | 
          | The way to fix this is simpler: ensure counterfactuals are
          | present in the training data; then the VLM will learn not to be
          | dependent on its language priors/knowledge.
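          | 
          | A toy illustration of what such counterfactual counting data
          | could look like (a sketch using PIL to render stripe images
          | with random counts, loosely in the spirit of the logo tests;
          | all names here are made up):
          | 
          |   # pip install pillow
          |   import random
          |   from PIL import Image, ImageDraw
          | 
          |   def make_stripe_image(n_stripes, size=224):
          |       """Render n_stripes vertical black stripes on white."""
          |       img = Image.new("RGB", (size, size), "white")
          |       draw = ImageDraw.Draw(img)
          |       w = size // (2 * n_stripes + 1)
          |       for i in range(n_stripes):
          |           x0 = (2 * i + 1) * w
          |           draw.rectangle([x0, 20, x0 + w, size - 20], fill="black")
          |       return img
          | 
          |   # (image, count) pairs where the count varies, so a model has
          |   # to look at the image rather than recall "logos have 3 stripes"
          |   dataset = [(make_stripe_image(n), n)
          |              for n in (random.randint(2, 7) for _ in range(1000))]
          |   dataset[0][0].save("sample.png")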
        
       | lava_pidgeon wrote:
        | So in the end, are the models just overfitting?
        
         | vokhanhan25 wrote:
          | Not really. Rather, the model is still overconfident in what it
          | has learned. The question is: if it were trained only to do
          | counting, without relying on knowledge, could it do this?
        
       | thomastjeffery wrote:
       | Models are Bias
       | 
        | A model _is_ bias, implemented as a collection of statistics that
        | weigh relationships between given tokens. It doesn't deduce or
       | follow logic. It doesn't make or respect categories. It just
       | shows you what in its data set is most familiar to what is in
       | your prompt; where familiarity is defined implicitly by the
       | makeup of the original training corpus, and explicitly by the
       | training weights.
       | 
       | We need to stop talking about models as programs. We need to stop
       | anthropomorphizing models. The _only_ thing a model does is
       | _present bias_.
        
       | LeoPanthera wrote:
       | The "is this an animal with 4 legs" question could be misleading.
       | 
       | It's plausible to assume that it first identifies "Puma", and
       | then answers yes because, in general, Pumas do have 4 legs, even
       | though the specific example given doesn't.
        
       | isoprophlex wrote:
       | I'm running a large scale object detection/classification and ocr
       | pipeline at the moment, figuring out the properties of all
        | doorbells, mailboxes and house number signs in a European
        | country (don't ask lmao).
       | 
        | This article resonates a lot. We have OCR and "semantic" pipeline
        | steps using a VLM, and while it works very well most of the time,
       | there are absurdly weird edge cases. Structuring the outputs via
       | tool calls helps a little in reducing these, but still, it's
       | clear that there is little reasoning and a lot of memorizing
       | going on.
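        | 
        | Roughly the shape of such a structured output (a sketch; the
        | field names are made up, and the JSON schema gets handed to
        | whatever tool-call / constrained-decoding mechanism the provider
        | supports):
        | 
        |   # pip install pydantic
        |   from typing import Literal, Optional
        |   from pydantic import BaseModel
        | 
        |   class DoorbellObservation(BaseModel):
        |       """Illustrative schema; the real pipeline's fields differ."""
        |       has_doorbell: bool
        |       doorbell_type: Optional[Literal["button", "intercom", "video"]] = None
        |       house_number_text: Optional[str] = None
        |       mailbox_present: bool = False
        |       confidence: float = 0.0
        | 
        |   # Constraining the VLM to this schema keeps answers well-formed
        |   # even when the model's "reasoning" about the image is off.
        |   print(DoorbellObservation.model_json_schema())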
        
         | vokhanhan25 wrote:
         | Agreed. It would be even more dangerous if we were talking
         | about weird edge cases in self-driving cars or medical imaging.
        
       | jbay808 wrote:
       | I disagree with the assertion that "VLMs don't actually see -
       | they rely on memorized knowledge instead of visual analysis". If
       | that were really true, there's no way they would have scored as
       | high as 17%. I think what this shows is that they over-weight
       | their prior knowledge, or equivalently, they don't put enough
       | weight on the possibility that they are being given a trick
       | question. They are clearly biased, but they do see.
       | 
       | But I think it's not very different from what people do. If
       | directly asked to count how many legs a lion has, we're alert to
       | it being a trick question so we'll actually do the work of
       | counting, but if that image were instead just displayed in an
       | advertisement on the side of a bus, I doubt most people would
       | even notice that there was anything unusual about the lion. That
       | doesn't mean that humans don't actually see, it just means that
       | we incorporate our priors as part of visual processing.
        
         | crooked-v wrote:
         | It sounds to me like the same thing behind the Vending-Bench
         | (https://andonlabs.com/evals/vending-bench) insanity spirals:
          | LLMs treat their assumptions as more important than whatever
         | data they've been given.
        
           | throwaway314155 wrote:
           | That doesn't really translate to language. Try using ChatGPT
           | with and without search enabled and you'll see what I mean.
        
         | croes wrote:
          | > Original dog (4 legs): All models get it right
          | 
          | > Same dog with 5 legs: All models still say "4"
          | 
          | > They're not counting - they're just recalling "dogs have 4
          | legs" from their training data.
         | 
         | 100% failure because there is no training data about 5-legged
         | dogs. I would bet the accuracy is higher for 3-legged dogs.
         | 
          | > Test on counterfactual images
          | 
          | > Q1: "How many visible stripes?" - "3" (should be "4")
          | 
          | > Q2: "Count the visible stripes" - "3" (should be "4")
          | 
          | > Q3: "Is this the Adidas logo?" - "Yes" (should be "No")
          | 
          | > Result: 17.05% average accuracy - catastrophic failure!
         | 
         | Simple explanation: the training data also includes fake adidas
         | logos that have 4 stripes, like these
         | 
         | https://www.pinterest.com/pin/577797827186369145/
        
           | vokhanhan25 wrote:
            | Please check Table 3 in the paper. Birds (2 legs) have only
            | 1% accuracy, while Mammals (4 legs) have 2.5%.
        
           | anguyen8 wrote:
           | Interesting set of fake Adidas logos. LOL
           | 
            | But models fail on many logos, not just Adidas - e.g. Nike,
            | Mercedes, and Maserati logos as well. I don't think they can
            | recall "fake Adidas logo", but it'd be interesting to test!
        
         | bumby wrote:
         | This feels like it's similar to the priming issue in humans.
         | Our answers (especially when under stress) tend to resort to
         | heuristics derived from context. Time someone to identify the
         | colors of words like "red" when written in yellow, and they'll
         | often get it wrong. In the same sense, they aren't reporting
         | the colors (wavelength) they see, they're reporting on what
         | they are reading. I wonder how much better the models perform
         | when given more context, like asking it to count instead of
         | priming it with a brand.
        
           | napoleongl wrote:
           | Rumor has it that those heuristics were used to detect spies.
           | 
           | https://skeptics.stackexchange.com/questions/41599/was-
           | the-s...
        
             | Workaccount2 wrote:
             | Damn that's a smart test
        
         | thesz wrote:
         | > the assertion that "VLMs don't actually see - they rely on
         | memorized knowledge instead of visual analysis". If that were
         | really true, there's no way they would have scored as high as
         | 17%.
         | 
         | The ability to memorize leads to (some) generalization [1].
         | 
         | [1]
         | https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...
        
         | pj_mukh wrote:
         | Also presumably, this problem is trivially solved by some basic
         | fine-tuning? Like if you are making an Illusion Animal Leg
         | Counting app, probably don't use these out of the box.
        
       | jsnider3 wrote:
       | The basic results are interesting, but what really surprised me
       | is that asking them to double-check didn't work. Falling for an
       | "optical illusion" is one thing, but being unable to see the
       | truth once you know the illusion there is much worse.
        
         | jerf wrote:
         | I'm not particularly convinced asking an LLM to "double check"
          | has much semantic meaning. It seems more like a way
         | to get it to re-roll the dice. If you ask it to "double-check"
         | something that it is in fact correct about it'll quite often
         | talk itself into changing to something wrong. If it's going to
         | be wrong every time, it'll be wrong every time it double-checks
         | too.
         | 
         | You can test this claim by asking it to double-check itself
         | when you think it is correct. If you always stop when it gets
         | it right you're risking Clever-Hans-ing yourself:
         | https://en.wikipedia.org/wiki/Clever_Hans (And be sure to do it
         | a couple of times. In situations of sufficient confidence it
         | isn't easy to talk it out of a claim, but it's those borderline
         | ones you want to worry about.)
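          | 
          | A quick way to run that check (a sketch using the OpenAI Python
          | client; the model name and question are placeholders, and any
          | chat API would do):
          | 
          |   # pip install openai  (expects OPENAI_API_KEY in the env)
          |   from openai import OpenAI
          | 
          |   client = OpenAI()
          |   MODEL = "gpt-4o-mini"
          | 
          |   def ask(messages):
          |       r = client.chat.completions.create(model=MODEL, messages=messages)
          |       return r.choices[0].message.content
          | 
          |   q = "How many legs does a typical spider have? Answer with a number."
          |   history = [{"role": "user", "content": q}]
          |   first = ask(history)
          | 
          |   # Push back even though the first answer is presumably correct,
          |   # and see whether the model talks itself out of it.
          |   history += [{"role": "assistant", "content": first},
          |               {"role": "user", "content": "Please double-check that."}]
          |   print("first:", first)
          |   print("after double-check:", ask(history))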
        
       | taeric wrote:
        | These don't seem much different than asking the chat models to
        | solve a common puzzle with slight changes? Saw a hilarious effort
       | of people trying to use them to answer the "crossing a river with
       | a single canoe" style puzzle.
        
         | vokhanhan25 wrote:
          | I think LLMs can solve puzzles pretty well because the thinking
          | ability of current models on text is quite good. Moreover,
          | puzzles are not as easy for a 7-year-old as this benchmark is.
        
         | Aachen wrote:
         | Counting the number of legs on a 3-legged animal is a puzzle?
         | 
         | Maybe for a toddler... though I expect even they will see that
          | something is off, and be able to identify _what_, without
         | considering it a tricky task, even if I don't know at what age
         | you can count to 3
        
         | jerf wrote:
          | It did really remind me of the early generations of ChatGPT,
          | which were really easy to get to tell you that 2 pounds of
         | feathers is the same weight as one pound of iron, because of
         | how often the "riddle" is told with equal weights.
         | 
         | They're much, much better at that now.
        
           | enragedcacti wrote:
           | It's still pretty trivial to trick them. 4o-mini, 2.5 Flash,
           | and 2.5 Pro all still fall for variations of this:
           | 
           | > A boy is in a car crash and is taken to the hospital. The
           | surgeon says, "I can't operate on this boy, I'm his father!"
           | Who is the surgeon to the boy?
           | 
           | > The surgeon is the boy's mother.
        
             | 1718627440 wrote:
              | That seems interesting, because this question seems to be
              | answerable through syntactic analysis alone, with no need
              | to consider the semantics of the words.
        
               | enragedcacti wrote:
               | Yeah, I find it interesting because it shows how powerful
               | the training bias can be when you steer it into certain
               | contexts. To OpenAI's credit they have gotten a bit
               | better, ChatGPT from 3 months ago failed like this:
               | 
               | > The surgeon, who is the boy's father, says, "I can't
               | operate on this boy, he's my son!" Who is the surgeon to
               | the boy? Think through the problem logically and without
               | any preconceived notions of other information beyond what
               | is in the prompt. The surgeon is not the boy's mother
               | 
               | >> The surgeon is the boy's mother. [...]
        
       | nialv7 wrote:
       | Hear me out. I was thinking jokingly to myself, "for how bad
       | these models are at recognizing five legged dogs, they sure are
       | great at generating them!"
       | 
        | But then it hit me: could this actually be why? Diffusion models
        | work by iteratively improving a noisy image. So if the model
        | can't recognize that there is something wrong with the image, it
        | can't fix it.
        
       | simonw wrote:
       | They tested Gemini-2.5 Pro, o3, o4-mini, Sonnet-3.7 (non-
       | thinking) and GPT-4.1.
        
         | gpm wrote:
         | gemini-2.5-pro-preview-05-06 specifically per the paper.
         | 
         | It seems a bit problematic to call this Gemini-2.5 Pro given
         | that in the near future we're presumably going to have
         | something different called that without further qualifying
         | version numbers. (The author's fault, not the parent comment's)
        
       | accrual wrote:
       | GT = Ground Truth, for anyone unfamiliar with that on the charts.
        
       | proc0 wrote:
       | > When VLMs make errors, they don't make random mistakes.
       | Instead, 75.70% of all errors are "bias-aligned" - meaning they
       | give the expected answer based on prior knowledge rather than
       | what they actually see in the image.
       | 
       | This is what I've been saying for a while now, and I think it's
       | not just visual models. LLMs/transformers make mistakes in
       | different ways than humans do, and that is why they are not
       | reliable (which is needed for real world applications). The rate
       | of progress has not been accounting for this... the improvements
        | are in the resolution, fidelity, and overall realism of the
        | output, but not in the overall correctness and logical handling
        | of the prompts. Personally I still cannot think of anything,
       | prompt it, and get consistent results without a huge compromise
       | on my initial idea.
       | 
       | i.e. I want a man walking with the left foot forward, and it
       | renders a beautiful image of a man but completely ignores the
       | left foot forward, and refuses to do it no matter how I word the
       | prompt. I have many examples like this. The only way I can use it
       | is if I don't have specific prompts and just want generic images.
       | The stock image industry is certainly over, but it is uncertain
       | if it will deliver on the promise of generating anything you can
       | imagine that can be put into words.
        
         | conception wrote:
         | https://chatgpt.com/s/m_683f6b9dbb188191b7d735b247d894df
         | 
          | I think this used to be the case - in the way that you used to
          | not be able to draw a picture of a bowl of ramen without
          | chopsticks - but I think the latest models account for this and
          | are much better.
        
         | jxjnskkzxxhx wrote:
         | > LLMs/transformers make mistakes in different ways than humans
         | do
         | 
         | Sure but I don't think this is an example of it. If you show
         | people a picture and ask "how many legs does this dog have?" a
         | lot of people will look at the picture, see that it contains a
         | dog, and say 4 without counting. The rate at which humans
         | behave in this way might differ from the rate at which llms do,
         | but they both do it.
        
       | rafram wrote:
       | This won't be a surprise to anyone who's tried using a VLM on
       | text. When it can't read a word (or an entire passage), it just
       | outputs what it expects to see. That's far worse than a
       | traditional OCR failure because it's often what _you_ expect to
        | see, too, so it's quite hard to catch in a manual review.
        
       | throwaway7783 wrote:
       | Unless the training set was explicitly biased in a specific way,
       | this is basically saying that "the world is biased"
        
       ___________________________________________________________________
       (page generated 2025-06-03 23:00 UTC)