[HN Gopher] Vision Language Models Are Biased
___________________________________________________________________
Vision Language Models Are Biased
Author : taesiri
Score : 105 points
Date : 2025-06-03 12:47 UTC (10 hours ago)
(HTM) web link (vlmsarebiased.github.io)
(TXT) w3m dump (vlmsarebiased.github.io)
| taesiri wrote:
| State-of-the-art Vision Language Models achieve 100% accuracy
| counting on images of popular subjects (e.g. knowing that the
| Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17%
    | accurate when counting in counterfactual images (e.g. counting
| stripes in a 4-striped Adidas-like logo or counting legs in a
| 5-legged dog).
| LorenDB wrote:
| There's no need to repeat what is said at the top of the linked
| webpage.
| shenkha wrote:
    | Fun findings related to memorization in AI models. It simply
    | means LLMs/VLMs do not generalize when predicting but memorize
    | instead. A new perspective on adversarial attack methods.
| taesiri wrote:
    | For overrepresented concepts, like popular brands, it seems
    | that the model "ignores" the details once it detects that the
    | overall shapes or patterns are similar. Opening up the vision
| encoders to find out how these images cluster in the embedding
| space should provide better insights.
| kmeisthax wrote:
| If there aren't any five-legged dogs in your trainset, it's
| safer[0] to just remember that all dogs are four-legged than
| to actually recognize and count legs. After all, you might
| have a few images of dogs in your trainset that are
| misleading enough to look five-legged (e.g. because a dog is
| in front of another dog).
|
    | Overrepresentation is a _different_ source of bias. That's
    | what gives you, say, image generators that always draw
    | "golden 1970s sci-fi robot" as C-3PO even when given
    | additional instructions to draw something else.
|
| Both of these problems are manifestations of the difference
| between training and deployment distributions. Ok, I _guess_
| you could say that four-legged dogs are "overrepresented" in
| the training set, but that's because four-legged dogs are
    | also overrepresented _in reality_. The deployment
    | distribution doesn't have five-legged dogs in it. What we've
    | done is instead concoct an adversarial distribution to
    | _force_ a train/deploy gap where none would exist.
|
| Releasing the vision encoder won't help because weights are
| opaque. Stochastic gradient descent does not yield functional
| internal representations[1]; it fills the bucket of
| parameters with one distribution and one distribution only.
    | We could tell if, say, the vision encoder produces identical
    | embeddings for dogs regardless of leg count, or for some other
    | counterfactuals; but not much more than that.
|
| [0] Lower loss and possibly lower L2-norm
|
| [1] https://arxiv.org/abs/2505.11581
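    |
    | A minimal sketch of that embedding check, assuming a CLIP-style
    | encoder via Hugging Face transformers; the model name and image
    | paths are placeholders:
    |
    |     import torch
    |     from PIL import Image
    |     from transformers import CLIPModel, CLIPProcessor
    |
    |     name = "openai/clip-vit-base-patch32"
    |     model = CLIPModel.from_pretrained(name)
    |     processor = CLIPProcessor.from_pretrained(name)
    |
    |     imgs = [Image.open("dog_4_legs.png"), Image.open("dog_5_legs.png")]
    |     inputs = processor(images=imgs, return_tensors="pt")
    |     with torch.no_grad():
    |         emb = model.get_image_features(**inputs)
    |
    |     sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
    |     print(f"cosine similarity: {sim.item():.4f}")
    |     # A similarity near 1.0 would suggest the encoder collapses the
    |     # counterfactual onto the "normal dog" representation.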
| impossiblefork wrote:
| Yes, and this can probably be solved by methods for fairness.
|
| I used to believe that fairness research could be ignored,
| that it was all rubbish, but they at least try to do
| something about things like unbalanced datasets etc. I'm
| still not sure I totally believe in it though.
| vokhanhan25 wrote:
| This paper explores a different aspect of the limitations of VLMs
| compared to the paper VLMs are Blind
    | (https://vlmsareblind.github.io). While o3 achieved 90% accuracy
    | in VLMs are Blind (https://openai.com/index/thinking-with-images),
    | it only reached 18.5% on similarly easy tasks using the
    | counterfactual images from VLMs are Biased.
|
| This may indicate that while VLMs might possess the necessary
| capability, their strong biases can cause them to overlook
| important cues, and their overconfidence in their own knowledge
| can lead to incorrect answers.
| bryanlarsen wrote:
| Very human-like errors.
| ahrmb wrote:
| Not very similar though.
| energywut wrote:
| Are they? Did you see the picture of the chicken with three
| legs? Because there's no human I know who would confidently
| assert that chicken has two legs.
| bryanlarsen wrote:
| Throw 1000 pictures of chickens at a human, ask how many legs
| each chicken has. If 999 of them have two, I bet you'll get
| two as an answer back for the 1000th one no matter how
| obvious.
| enragedcacti wrote:
| Humans do things a lot harder than that every day in the
| form of QA in factories. Do they sometimes make mistakes
| from the repetition or boredom? Sure. Is that at all
| comparable to the failures in the paper? No.
| jbay808 wrote:
| If I were given five seconds to glance at the picture of a
| lion and then asked if there was anything unusual about it, I
| doubt I would notice that it had a fifth leg.
|
| If I were asked to count the number of legs, I would notice
| right away of course, but that's mainly because it would
| alert me to the fact that I'm in a psychology experiment, and
| so the number of legs is almost certainly not the usual four.
| Even then, I'd still have to look twice to make sure I hadn't
| miscounted the first time.
| ahrmb wrote:
| Really "eye-opening" work. These models don't actually "see",
| they just recall what they've memorized, even when the image
| clearly shows something different. It's a bit scary how
| confidently they get things wrong when reality doesn't match
| their training data.
| foxglacier wrote:
| It's not too different from people. We also don't really "see"
| and mostly recall what we expect to see. What do you expect
    | when the question itself is wrong ("How many legs does this
    | animal have? Answer with a number") but it's not a picture of
    | an animal? What are you supposed to do? Answer 0?
| regularjack wrote:
| You answer "I don't know"
| amelius wrote:
| What if that is not in your vocabulary?
| ramoz wrote:
| This is interesting actually. And reminds me of something
| vaguely - a book or something that describes how human
| attention and the things we see are highly optimized by
| evolution. We often miss a lot of details in reality due to
| this.
| zehaeva wrote:
| If it were a Fiction novel then might I suggest Blindsight
| by Peter Watts?
| ramoz wrote:
| not fiction. Maybe like a System 1 vs System 2 thing from
| Thinking, Fast and Slow by Kahneman.
|
| ChatGPT mentioned The Case Against Reality but I never
| read that, the idea was similar.
| vunderba wrote:
| That wasn't one of the questions - any reasonable person
| would have classified that chicken as an animal, albeit a
| mutant one.
|
| I would also hardly count many of these questions as "tricks"
    | either. Take the chess example. Many of my friends and I have
    | been playing chess since we were young children, and we all
    | know that a fully populated chess board has 32 pieces (heavily
    | weighted in our internal training data), but not a single one
    | of us would have gotten that question wrong.
| gowld wrote:
| Don't be too literal.
|
    | Imagine walking into a room and seeing someone grab a handful
    | of chess pieces off of a set-up board, and proceed to fill
    | bags with 4 pieces each. As they fill the 8th bag, they
    | notice only 3 pieces are left. Are you confident that you
    | would respond "I saw the board only had 31 pieces on it
    | when you started", or might you reply "perhaps you dropped
    | a piece on the floor"?
| wat10000 wrote:
| Depending on the situation, I'd either walk away, or respond
| with, "What animal?"
| enragedcacti wrote:
    | It's true that our brains take lots of shortcuts when
| processing visual information but they don't necessarily
| parallel the shortcuts VLMs take. Humans are often very good
| at identifying anomalous instances of things they've seen
| thousands of times. No one has to tell you to look closely
| when you look at your partner in a mirror, you'll recognize
| it as 'off' immediately. Same for uncanny CGI of all types of
| things. If we were as sloppy as these models then VFX would
| be a hell of a lot easier.
|
| Ironically I think a lot of people in this thread are
| remembering things they learned about the faultiness of
| humans' visual memory and applying it to visual processing.
| soulofmischief wrote:
| Humans do this, but we have more senses to corroborate which
| leads to better error checking. But what you see in your visual
| mental space is not reality. Your brain makes a boatload of
| assumptions.
|
| To test this, research what happens during saccades and how
| your brain "rewinds" time. Or try to find your blind spot by
| looking at different patterns and noticing when your brain
| fills in the gaps at your blind spot. It will recreate lines
| that aren't there, and dots will wholly disappear.
|
    | Additionally, as an anecdote, I have noticed plenty of times
    | that when I misread a word or phrase, I usually really do "see"
    | the misspelling, and only when I realize the misspelling does my
| brain allow me to see the real spelling. I first noticed this
| phenomenon when I was a child, and because I have a vivid
| visual memory, the contrast is immediately obvious once I see
| the real phrase.
|
| Additionally, I seem to be able to oversharpen my vision when I
| focus, making myself hyperattentive to subtle changes in motion
| or color. The effect can be quite pronounced sometimes,
    | reminiscent of applying an edge filter. It's clearly not
| reality, but my visual system thinks it is.
|
    | If you really want to understand how much the visual system can
    | lie to you, look into some trip reports from deliriants on
    | Erowid. I wouldn't recommend trying them yourself, but I will
    | say that nothing will make you distrust your eyes and ears
    | more. It's basically simulated hallucinatory schizophrenia and
    | psychosis.
| VHRanger wrote:
    | It's the same "parrot" behavior the models show on other inputs,
    | even text. For instance, take the answers models give to this
    | variation on the "surgeon's son" riddle:
|
| > A nurse comes to a surgeon and asks: "Sir, you are a dog. You
| do not hold a valid medical license. Canines cannot be in an
| operating room".
|
| > She then asks: "why does the hospital keep making these
| mistakes? It is a riddle to me".
|
| > Why can't the surgeon operate on the boy?
|
| Check out the models humiliating themselves one after the other
    | [1]. For instance, DeepSeek V3 is so insistent on giving the
    | common answer that it's happy to gaslight us:
|
| > This is a classic riddle that plays on word meanings and
| expectations. Here's the breakdown:
|
| > The key is that the surgeon is the boy's mother. The riddle
| relies on the unconscious assumption that surgeons are male,
| which makes the initial scenario seem impossible. When the nurse
| says "you are a dog," she's using it metaphorically to imply the
| surgeon is unqualified (like calling someone a "dog" in a
| derogatory sense), not literally suggesting the surgeon is a
| canine.
|
| > The "mistakes" the hospital keeps making are hiring female
| surgeons, which the nurse (representing outdated biases)
| perceives as errors because she expects surgeons to be male. The
| riddle highlights gender stereotypes in medicine--the real
| "mistake" is the assumption that a surgeon couldn't be a woman.
|
| > Thus, the surgeon can't operate on her own son because she's
| his parent, not because of any actual licensing issue.
|
| 1. https://kagi.com/assistant/54c1b8eb-71e9-4bb4-9eed-
| bde2fc563...
| selimthegrim wrote:
| I really need to try this one out on it
|
| https://blogs.illinois.edu/view/25/574827
| stevepike wrote:
| This seems to show the power of the reasoning models over
| interacting with a prompted chat-tuned LLM directly. If I
| navigate backwards on your link Sonnet 4 gets it right.
|
| I've used a similar prompt - "How can you make 1000 with
| exactly nine 8s using only addition?"
|
| Here's GPT 4.5 getting it wrong:
| https://chatgpt.com/share/683f3aca-8fbc-8000-91e4-717f5d81bc...
|
| It tricks it because it's a slight variation of an existing
| puzzle (making 1000 with 8 8s and addition only).
|
| The reasoning models seem to reliably figure it out, though.
| Some of them even come up with a proof of why it's impossible
| to do with 9 8s. Here's o4 getting it right:
| https://chatgpt.com/share/683f3bc2-70b8-8000-9675-4d96e72b58...
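    |
    | For reference, a tiny brute force (my own sketch, not from the
    | linked chats) confirms the impossibility with nine 8s: the only
    | addable terms made of 8s that stay under 1000 are 8, 88, and 888.
    |
    |     from itertools import product
    |
    |     # a, b, c = how many 8s, 88s, 888s we add; the digit counts
    |     # must total nine and the sum must be exactly 1000.
    |     solutions = [
    |         (a, b, c)
    |         for a, b, c in product(range(10), repeat=3)
    |         if a + 2 * b + 3 * c == 9 and 8 * a + 88 * b + 888 * c == 1000
    |     ]
    |
    |     print(solutions)  # [] -> impossible with exactly nine 8s
    |     # The classic eight-8s answer works: 888 + 88 + 8 + 8 + 8 = 1000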
| kaoD wrote:
| LMAO I asked GPT-4o and it was doing good until...
|
| > The twist is that the nurse's logic ("you are a dog")
| prevents her from realizing the real issue -- likely, again,
| that the surgeon is the boy's mother, and everything else is a
| red herring or metaphor for society's failure to recognize this
| due to bias or absurd bureaucracy.
|
| > So:
|
| > > Why can't the surgeon operate on the boy?
|
| > Because she is his mother, and the nurse's bias or absurd
| assumptions (like mistaking her for a dog) prevent her from
| seeing that.
|
| o4 fails spectacularly in a different way:
|
| > 1. The nurse says "Sir, you are a dog... Canines cannot be in
| an operating room" because she's picturing a human hospital law
| that bars dogs from surgery.
|
| > 2. In fact, this is a vet clinic--so it's perfectly normal
| for a dog-veterinarian to scrub in and operate on a puppy (the
| "boy").
|
| > 3. The surgeon cannot operate on a human boy because he's a
| dog and holds no human-medical license; instead, he only
| operates on animals.
| bumby wrote:
| Is the nurse calling the female surgeon "sir"? That isn't
| playing on a stereotype, it's encoded information.
| esafak wrote:
| This happens because images are the only signal VLMs have,
| whereas humans distinguish between eyesight and synthetic images.
    | We are not surprised when we see a three-legged chicken in a
    | research data set; our priors are weaker for images. If you "saw"
| one in real life, you'd probably rub your eyes and discount it
| too.
|
| Try the same experiment on a robot.
| Aachen wrote:
| > If you "saw" [a three-legged chicken] in real life, you'd
| probably rub your eyes and discount it too.
|
| Huh? I'd assume it's a mutant, not store a memory of having
| seen a perfectly normal chicken
|
| You've never seen someone who's missing a finger or has only a
| half-grown arm or something? Surely you didn't assume your eyes
| were tricking you?! Or... if you did, I guess you can't answer
    | this question. I'm actually racking my brain for how to logic
    | this out, but I'm just going to bank on it being likely that
    | anyone over 20 has seen an animal with some visible deviation
    | from the norm at some point in their life.
| esafak wrote:
| You've seen people with missing limbs without being
| surprised, because you know how they can become lost, but you
| rarely see one with additional limbs. Their likelihoods and
| our consequent priors are drastically different.
|
| Also, your reaction will depend on how strong the evidence
| is. Did you 'see' the three-legged chicken pass by some bush
| in the distance, or was it right in front of you?
| runako wrote:
| FWIW I tried the first couple of examples in ChatGPT 4o and
| couldn't replicate this.
|
| For example: "The animal in the image is a chicken, and it
| appears to have four legs. However, chickens normally have only
| two legs. The presence of four legs suggests that the image may
| have been digitally altered or artificially generated."
|
| I don't have a good explanation for why I got different results.
| roywiggins wrote:
| I gave ChatGPT some miswritten Braille a while ago and it
| completely, but confidently, messed it up. The sign reads "no
| smoking" but the braille doesn't. ChatGPT 1) read the English
| lettering first and then hallucinated the braille and the 2)
| when given only the braille, failed almost as hard. It even
| generated fake transcriptions in Unicode braille characters.
|
| https://chatgpt.com/share/683f3e7d-0dfc-8005-b6c9-99e3d39ff4...
|
| https://chatgpt.com/share/683f3e49-9c58-8005-99a6-c3a919838b...
| Workaccount2 wrote:
| This is hard to understand without the original images, it
| looks like OpenAI doesn't serve them in the share link.
| roywiggins wrote:
| Annoying. The actual braille on the sign was "a3essi#"
| which I gather means "accessible" in abbreviated braille.
| None of my attempts got it to even transcribe it to Unicode
| characters properly. I got "elevator", "friend", etc. Just
| wildly making stuff up and completely useless, even when it
| wasn't distracted by the No Smoking sign (in the second
| case I cropped out the rest of the sign). And in all cases,
| supremely confident.
|
| This seems like something a VLM should handle very easily,
| but instead I got pure nonsense.
|
| https://www.facebook.com/share/p/12Gw55Gr2SZ/
| dragonwriter wrote:
| > This seems like something a VLM should handle very
| easily
|
    | Not if its training data doesn't include braille as a first-
    | class signal but has lots of braille signage with bad
    | descriptions (e.g., because people assumed the accompanying
    | English matches the braille).
    |
    | This could very well be the kind of mundane AI-bias problem
    | that the x-risk and tell-me-how-to-make-WMD discourse has
    | shifted attention away from.
| inerte wrote:
| I took a screenshot of the chicken, so low res, and got {4}
| https://chatgpt.com/share/683f4506-ae18-800f-8c27-5c5e91429a...
|
| Also I think the authors used the API, and maybe there are
| differences between the API and chatgpt.com behavior...
| runako wrote:
| I could rant for quite a while about how OpenAI and Anthropic
| manage their apps vs their APIs. It's really quite strange
| that they both landed on the solution of non-public APIs that
| perform differently than their public APIs.
| simonw wrote:
| ChatGPT is running a special model but it's also available
| through the API:
| https://platform.openai.com/docs/models/chatgpt-4o-latest
|
| The system prompt may still make a difference though.
| anguyen8 wrote:
| https://imgur.com/cO7eFNt
|
| o3 Chat is also similarly wrong, saying {4}.
| vokhanhan25 wrote:
    | You should try other models besides GPT-4o, because in the
    | paper they also show that GPT-4.1 (~GPT-4o) gives 4 legs
    | instead of 2 legs.
| dwringer wrote:
    | Speculating, I would imagine that different prompts submitted
    | along with the image might elicit wildly different behavior
    | from a multimodal VLM, potentially shifting how much it leans
    | on inferences from its prior training versus focusing on the
    | new image itself.
| michaelt wrote:
    | _> FWIW I tried the first couple of examples in ChatGPT 4o and
    | couldn't replicate this._
|
| I can replicate the flag examples from Figure 15 in the paper,
| if not the Adidas one from Figure 9:
| https://chatgpt.com/share/683f7c3a-b318-8011-9759-c495db2556...
| it even confirms its wrong answer when asked to check again.
| gamerDude wrote:
    | Hypothetically, could this be fixed by changing the input method?
| For instance, I just quickly looked up how humans process
| imagery.
|
| "the primary visual cortex, located at the back of the brain,
| receives the visual signals and processes basic visual features
| like edges, lines, and orientations."
|
    | So, potentially, if we did a pre-processing step to extract
    | more features beforehand, we would see different results in
    | the output.
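    |
    | A rough sketch of what that pre-processing step could look like
    | (purely illustrative; OpenCV edge/orientation filters of my own
    | choosing, not anything from the paper, and the file name is a
    | placeholder):
    |
    |     import cv2
    |     import numpy as np
    |
    |     img = cv2.imread("five_legged_dog.png", cv2.IMREAD_GRAYSCALE)
    |
    |     edges = cv2.Canny(img, 100, 200)                 # edge map
    |     gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)   # horizontal gradient
    |     gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)   # vertical gradient
    |     orientation = np.degrees(np.arctan2(gy, gx))     # per-pixel edge angle
    |
    |     cv2.imwrite("edges.png", edges)
    |     # Both the original image and edges.png would then go to the VLM
    |     # in one prompt, e.g. "Use the edge map to count the legs."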
| nyrikki wrote:
    | You are in rarefied air, as Walter Pitts believed this until the
| 1959 paper "What the Frog's Eye Tells the Frog's Brain"
| contributed to his decline.
|
| Even in fly eyes, neuron dendritic compartmentalization and
| variable spike trains are incompatible with our current
| perceptron based models.
|
    | Remember that while the value of MLPs for useful work is
    | unquestionable IMHO, be mindful of the map-territory relation.
    | MLPs are inspired by, and in some cases useful for modeling,
    | biological minds, but they aren't equivalent.
|
| Be careful about confusing the map for the territory, it is
| just as likely to limit what opportunities you find as it is to
| lead you astray IMHO.
| miguel_martin wrote:
| There are enough features fed into a VLM to solve the task.
|
| The way to fix this is simpler: ensure counter-factuals are
| present in the training data, then the VLM will learn not to be
| dependent on its language priors/knowledge.
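    |
    | As a minimal sketch of that idea (file names, questions and
    | counts are invented for illustration), the fine-tuning set just
    | needs (image, question, answer) triples whose answers contradict
    | the language prior:
    |
    |     import json
    |
    |     examples = [
    |         {"image": "dog_5_legs.png",
    |          "question": "How many legs does this animal have?",
    |          "answer": "5"},
    |         {"image": "adidas_like_4_stripes.png",
    |          "question": "How many stripes does this logo have?",
    |          "answer": "4"},
    |         {"image": "chicken_3_legs.png",
    |          "question": "How many legs does this bird have?",
    |          "answer": "3"},
    |     ]
    |
    |     with open("counterfactual_finetune.jsonl", "w") as f:
    |         for ex in examples:
    |             f.write(json.dumps(ex) + "\n")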
| lava_pidgeon wrote:
    | So, basically, the models are just overfitting?
| vokhanhan25 wrote:
    | Not really. Rather, the model is still overconfident in what it
    | has learned. The question is: if it were trained only to do
    | counting, without relying on knowledge, could it do this?
| thomastjeffery wrote:
| Models are Bias
|
| A model _is_ bias, implemented as a collection of statistics that
    | weigh relationships between given tokens. It doesn't deduce or
    | follow logic. It doesn't make or respect categories. It just
| shows you what in its data set is most familiar to what is in
| your prompt; where familiarity is defined implicitly by the
| makeup of the original training corpus, and explicitly by the
| training weights.
|
| We need to stop talking about models as programs. We need to stop
| anthropomorphizing models. The _only_ thing a model does is
| _present bias_.
| LeoPanthera wrote:
| The "is this an animal with 4 legs" question could be misleading.
|
| It's plausible to assume that it first identifies "Puma", and
| then answers yes because, in general, Pumas do have 4 legs, even
| though the specific example given doesn't.
| isoprophlex wrote:
    | I'm running a large-scale object detection/classification and OCR
    | pipeline at the moment, figuring out the properties of all
    | doorbells, mailboxes and house number signs in a European
    | country (don't ask lmao).
|
| This article resonates a lot, we have OCR and "semantic" pipeline
| steps using a VLM, and while it works very well most of the time,
| there are absurdly weird edge cases. Structuring the outputs via
| tool calls helps a little in reducing these, but still, it's
| clear that there is little reasoning and a lot of memorizing
| going on.
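    |
    | For the curious, the "structured outputs via tool calls" bit
    | looks roughly like this (a sketch assuming Pydantic v2; the
    | field names are invented, not my actual pipeline). Constraining
    | the reply to a schema removes a lot of free-text weirdness:
    |
    |     from typing import Optional
    |     from pydantic import BaseModel
    |
    |     class FrontDoorReport(BaseModel):
    |         house_number: Optional[str]   # OCR result, None if unreadable
    |         has_doorbell: bool
    |         doorbell_type: Optional[str]  # e.g. "button", "intercom", "camera"
    |         mailbox_present: bool
    |
    |     schema = FrontDoorReport.model_json_schema()
    |     # `schema` goes into the tool/function definition of the VLM
    |     # request; the reply is parsed back with
    |     # FrontDoorReport.model_validate_json(...), which rejects
    |     # free-text answers that don't fit the fields.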
| vokhanhan25 wrote:
| Agreed. It would be even more dangerous if we were talking
| about weird edge cases in self-driving cars or medical imaging.
| jbay808 wrote:
| I disagree with the assertion that "VLMs don't actually see -
| they rely on memorized knowledge instead of visual analysis". If
| that were really true, there's no way they would have scored as
| high as 17%. I think what this shows is that they over-weight
| their prior knowledge, or equivalently, they don't put enough
| weight on the possibility that they are being given a trick
| question. They are clearly biased, but they do see.
|
| But I think it's not very different from what people do. If
| directly asked to count how many legs a lion has, we're alert to
| it being a trick question so we'll actually do the work of
| counting, but if that image were instead just displayed in an
| advertisement on the side of a bus, I doubt most people would
| even notice that there was anything unusual about the lion. That
| doesn't mean that humans don't actually see, it just means that
| we incorporate our priors as part of visual processing.
| crooked-v wrote:
| It sounds to me like the same thing behind the Vending-Bench
| (https://andonlabs.com/evals/vending-bench) insanity spirals:
    | LLMs treat their assumptions as more important than whatever
    | data they've been given.
| throwaway314155 wrote:
| That doesn't really translate to language. Try using ChatGPT
| with and without search enabled and you'll see what I mean.
| croes wrote:
    | > Original dog (4 legs): All models get it right. Same dog with
    | 5 legs: All models still say "4". They're not counting - they're
    | just recalling "dogs have 4 legs" from their training data.
|
| 100% failure because there is no training data about 5-legged
| dogs. I would bet the accuracy is higher for 3-legged dogs.
|
    | > Test on counterfactual images. Q1: "How many visible
    | stripes?" - "3" (should be "4"). Q2: "Count the visible
    | stripes" - "3" (should be "4"). Q3: "Is this the Adidas logo?"
    | - "Yes" (should be "No"). Result: 17.05% average accuracy -
    | catastrophic failure!
|
| Simple explanation: the training data also includes fake adidas
| logos that have 4 stripes, like these
|
| https://www.pinterest.com/pin/577797827186369145/
| vokhanhan25 wrote:
    | Please check Table 3 in the paper. Birds (2 legs) get only 1%
    | accuracy, while mammals (4 legs) get 2.5%.
| anguyen8 wrote:
| Interesting set of fake Adidas logos. LOL
|
    | But models fail on many logos besides Adidas, e.g. the Nike,
    | Mercedes, and Maserati logos as well. I don't think they can
    | recall "fake Adidas logos", but it'd be interesting to test!
| bumby wrote:
| This feels like it's similar to the priming issue in humans.
| Our answers (especially when under stress) tend to resort to
| heuristics derived from context. Time someone to identify the
| colors of words like "red" when written in yellow, and they'll
| often get it wrong. In the same sense, they aren't reporting
| the colors (wavelength) they see, they're reporting on what
| they are reading. I wonder how much better the models perform
| when given more context, like asking it to count instead of
| priming it with a brand.
| napoleongl wrote:
| Rumor has it that those heuristics were used to detect spies.
|
| https://skeptics.stackexchange.com/questions/41599/was-
| the-s...
| Workaccount2 wrote:
| Damn that's a smart test
| thesz wrote:
| > the assertion that "VLMs don't actually see - they rely on
| memorized knowledge instead of visual analysis". If that were
| really true, there's no way they would have scored as high as
| 17%.
|
| The ability to memorize leads to (some) generalization [1].
|
| [1]
| https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...
| pj_mukh wrote:
| Also presumably, this problem is trivially solved by some basic
| fine-tuning? Like if you are making an Illusion Animal Leg
| Counting app, probably don't use these out of the box.
| jsnider3 wrote:
| The basic results are interesting, but what really surprised me
| is that asking them to double-check didn't work. Falling for an
| "optical illusion" is one thing, but being unable to see the
    | truth once you know the illusion is there is much worse.
| jerf wrote:
| I'm not particularly convinced asking an LLM to "double check"
    | has significant semantic meaning. It seems more like a way
| to get it to re-roll the dice. If you ask it to "double-check"
| something that it is in fact correct about it'll quite often
| talk itself into changing to something wrong. If it's going to
| be wrong every time, it'll be wrong every time it double-checks
| too.
|
| You can test this claim by asking it to double-check itself
| when you think it is correct. If you always stop when it gets
| it right you're risking Clever-Hans-ing yourself:
| https://en.wikipedia.org/wiki/Clever_Hans (And be sure to do it
| a couple of times. In situations of sufficient confidence it
| isn't easy to talk it out of a claim, but it's those borderline
| ones you want to worry about.)
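    |
    | A hypothetical way to run that test (my sketch; the model name
    | and prompts are placeholders, using the OpenAI Python client):
    | ask a question the model answers correctly, then ask it to
    | double-check, and count how often the answer flips.
    |
    |     from openai import OpenAI
    |
    |     client = OpenAI()
    |
    |     def ask(messages):
    |         resp = client.chat.completions.create(
    |             model="gpt-4o", messages=messages)
    |         return resp.choices[0].message.content.strip()
    |
    |     question = ("How many legs does a typical spider have? "
    |                 "Answer with a number.")
    |     flips, trials = 0, 20
    |     for _ in range(trials):
    |         history = [{"role": "user", "content": question}]
    |         first = ask(history)
    |         history += [
    |             {"role": "assistant", "content": first},
    |             {"role": "user",
    |              "content": "Are you sure? Double-check your answer."},
    |         ]
    |         second = ask(history)
    |         if second != first:  # crude; fine for one-number answers
    |             flips += 1
    |
    |     print(f"Answer changed on {flips}/{trials} double-check prompts")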
| taeric wrote:
    | These don't seem much different than asking the chat models to
    | solve a common puzzle with slight changes? I saw a hilarious
    | effort of people trying to use them to answer the "crossing a
    | river with a single canoe" style puzzle.
| vokhanhan25 wrote:
    | I think LLMs can solve puzzles pretty well because the thinking
    | ability of current models on text is quite good. Moreover,
    | puzzles are not as easy for a 7-year-old as this benchmark is.
| Aachen wrote:
| Counting the number of legs on a 3-legged animal is a puzzle?
|
| Maybe for a toddler... though I expect even they will see that
    | something is off, and be able to identify _what_, without
    | considering it a tricky task, even if I don't know at what age
    | you can count to 3.
| jerf wrote:
    | It really did remind me of the early generations of ChatGPT,
    | which were really easy to get to tell you that 2 pounds of
    | feathers is the same weight as one pound of iron, because of
    | how often the "riddle" is told with equal weights.
|
| They're much, much better at that now.
| enragedcacti wrote:
| It's still pretty trivial to trick them. 4o-mini, 2.5 Flash,
| and 2.5 Pro all still fall for variations of this:
|
| > A boy is in a car crash and is taken to the hospital. The
| surgeon says, "I can't operate on this boy, I'm his father!"
| Who is the surgeon to the boy?
|
| > The surgeon is the boy's mother.
| 1718627440 wrote:
    | That seems interesting, because this question seems to be
    | answerable through syntactic analysis alone, no need to
    | consider the semantics of the words.
| enragedcacti wrote:
| Yeah, I find it interesting because it shows how powerful
| the training bias can be when you steer it into certain
| contexts. To OpenAI's credit they have gotten a bit
| better, ChatGPT from 3 months ago failed like this:
|
| > The surgeon, who is the boy's father, says, "I can't
| operate on this boy, he's my son!" Who is the surgeon to
| the boy? Think through the problem logically and without
| any preconceived notions of other information beyond what
| is in the prompt. The surgeon is not the boy's mother
|
| >> The surgeon is the boy's mother. [...]
| nialv7 wrote:
| Hear me out. I was thinking jokingly to myself, "for how bad
| these models are at recognizing five legged dogs, they sure are
| great at generating them!"
|
    | But then it hit me: could this actually be why? Diffusion
    | models work by iteratively improving a noisy image. So if the
    | model can't recognize that there is something wrong with the
    | image, it can't fix it.
| simonw wrote:
| They tested Gemini-2.5 Pro, o3, o4-mini, Sonnet-3.7 (non-
| thinking) and GPT-4.1.
| gpm wrote:
| gemini-2.5-pro-preview-05-06 specifically per the paper.
|
| It seems a bit problematic to call this Gemini-2.5 Pro given
| that in the near future we're presumably going to have
| something different called that without further qualifying
| version numbers. (The author's fault, not the parent comment's)
| accrual wrote:
| GT = Ground Truth, for anyone unfamiliar with that on the charts.
| proc0 wrote:
| > When VLMs make errors, they don't make random mistakes.
| Instead, 75.70% of all errors are "bias-aligned" - meaning they
| give the expected answer based on prior knowledge rather than
| what they actually see in the image.
|
| This is what I've been saying for a while now, and I think it's
| not just visual models. LLMs/transformers make mistakes in
| different ways than humans do, and that is why they are not
| reliable (which is needed for real world applications). The rate
    | of progress has not been accounting for this... the improvements
    | are in the resolution, fidelity, and overall realism of the
    | output, but not in correctness and logical adherence to the
    | prompt. Personally, I still cannot think of anything,
| prompt it, and get consistent results without a huge compromise
| on my initial idea.
|
    | E.g., I want a man walking with the left foot forward, and it
| renders a beautiful image of a man but completely ignores the
| left foot forward, and refuses to do it no matter how I word the
| prompt. I have many examples like this. The only way I can use it
| is if I don't have specific prompts and just want generic images.
| The stock image industry is certainly over, but it is uncertain
| if it will deliver on the promise of generating anything you can
| imagine that can be put into words.
| conception wrote:
| https://chatgpt.com/s/m_683f6b9dbb188191b7d735b247d894df
|
| I think this used to be the case in the way that you used to
| not be able to draw a picture of a bowl of Ramen without
| chopsticks, but I think the latest models account for this and
| are much better.
| jxjnskkzxxhx wrote:
| > LLMs/transformers make mistakes in different ways than humans
| do
|
| Sure but I don't think this is an example of it. If you show
| people a picture and ask "how many legs does this dog have?" a
| lot of people will look at the picture, see that it contains a
| dog, and say 4 without counting. The rate at which humans
| behave in this way might differ from the rate at which llms do,
| but they both do it.
| rafram wrote:
| This won't be a surprise to anyone who's tried using a VLM on
| text. When it can't read a word (or an entire passage), it just
| outputs what it expects to see. That's far worse than a
| traditional OCR failure because it's often what _you_ expect to
    | see, too, so it's quite hard to catch in a manual review.
| throwaway7783 wrote:
| Unless the training set was explicitly biased in a specific way,
| this is basically saying that "the world is biased"
___________________________________________________________________
(page generated 2025-06-03 23:00 UTC)