[HN Gopher] Are LLMs able to notice the "gorilla in the data"?
       ___________________________________________________________________
        
       Are LLMs able to notice the "gorilla in the data"?
        
       Author : finding_theta
       Score  : 263 points
       Date   : 2025-02-05 16:33 UTC (3 days ago)
        
 (HTM) web link (chiraaggohel.com)
 (TXT) w3m dump (chiraaggohel.com)
        
       | zozbot234 wrote:
       | That's done on purpose. The AI can't easily tell whether the
       | drawing might be intended to be one of a human or a gorilla, so
       | when in doubt it doesn't want to commit either way and just
       | ignores the topic altogether. It's just another example of AI
       | ethics influencing its behavior for alignment purposes.
        
         | 4ndrewl wrote:
         | You mean that whole post could have been written:
         | 
         | "AI can't see gorilla because wokification"?
         | 
         | /s
         | 
          | Edit: Adding /s. Thought "wokification" already signalled that.
        
           | badgersnake wrote:
           | You didn't read it either.
        
             | 4ndrewl wrote:
             | I was /s commenting on the comment :) I did read it, could
             | relate to the constant reframing of the questioner to just
             | "look" at the graph. Like talking to a child.
        
         | shermantanktop wrote:
         | What specifically do you base this on? It sounds like
         | conjecture.
        
         | badgersnake wrote:
         | You didn't read the article eh?
         | 
         | You just assumed it's similar to the Google situation a few
         | years back where they banned their classifier from classifying
         | images as gorillas. It isn't.
        
         | deeviant wrote:
          | Um, no. That isn't even close to the truth. The AI doesn't
         | "want" anything. It's a statistical prediction process, and it
         | certainly has nothing like the self-reflection you are
         | attributing to it. And the heuristic layers on top of LLMs are
         | even less capable of doing what you are claiming.
        
         | luluthefirst wrote:
         | It's not that at all... Similar drawings of non-humanoid shapes
         | like an ostrich or the map of Europe would have resulted in the
         | exact same 'blindness'.
        
           | mapt wrote:
            | This is not obvious to me, nor should it be to anyone who
           | didn't program these AIs (and it probably shouldn't be
           | obvious even to the people who did). I think you both should
           | try testing the hypothesis and present your results.
        
         | washadjeffmad wrote:
         | Context (2018): https://www.wired.com/story/when-it-comes-to-
         | gorillas-google...
         | 
         | "Google promised a fix after its photo-categorization software
         | labeled black people as gorillas in 2015. More than two years
         | later, it hasn't found one."
         | 
         | Companies do seem to have developed greater sensitivity to
         | blind spots with diversity in their datasets, so Parent might
         | not be totally out of line to bring it up.
         | 
         | IBM offloaded their domestic surveillance and facial
         | recognition services following the BLM protests when interest
         | by law enforcement sparked concerns of racial profiling and
         | abuse due in part to low accuracy in higher-melanin subjects,
         | and Apple face unlock famously couldn't tell Asians apart.
         | 
         | It's not outlandish to assume that there's been some special
         | effort made to ensure that datasets and evaluation in newer
         | models don't ignite any more PR threads. That's not claiming
         | Google's classification models have anything to do with
         | OpenAI's multimodal models, just that we know that until
         | relatively recently, models from more than one major US company
         | struggled to correctly identify some individuals as
         | individuals.
        
       | johnfn wrote:
       | GPT can't "see" the results of the scatterplot (unless prompted
       | with an image), it only sees the code it wrote. If a human had
       | the same constraints I doubt they'd identify there was a gorilla
       | there.
       | 
       | Take a screenshot of the scatterplot and feed it into multimodal
       | GPT and it does a fine job at identifying it.
       | 
       | EDIT:
       | 
       | Sorry, as a few people pointed out, I missed the part where the
       | author did feed a PNG into GPT. I kind of jumped to conclusions
       | when it worked fine for me. I still maintain that the article's
       | conclusion ("Your AI Can't See Gorillas") is overly broad, given
       | that I had no trouble getting it to see one.
       | 
       | But I wonder why the author had trouble? My suspicion is that AI
       | got stuck on summary statistics because the previous messages in
       | the chat were all about summary statistics.
        
         | KeplerBoy wrote:
         | Does ChatGPT even have access to the raw data points or does it
         | just know the path to some CSV?
         | 
         | The contents of the CSV might be entirely unknown at inference
         | time.
        
         | dartos wrote:
         | Before seeing Claude's response, did you see where the author
         | said
         | 
         | > I asked the model to closely look at the plot, and also
         | uploaded a png of the plot it had generated.
        
           | Kuinox wrote:
           | I sent the plot to ChatGPT 4o. Here is the conversation:
           | 
           | what do you see
           | 
           | ChatGPT said:
           | 
           | This is a scatter plot with the variables "steps" on the
           | x-axis and "bmi" on the y-axis. The data points are colored
           | by "gender" (red for female and blue for male).
           | Interestingly, the arrangement of the points appears to form
           | a drawing resembling a cartoonish figure or character, likely
           | added for artistic or humorous effect.
           | 
           | If you'd like, I can help analyze or discuss the data
           | further!
           | 
           | You said:
           | 
           | Wha the cartoonish figure looks like ?
           | 
           | ChatGPT said:
           | 
           | The cartoonish figure in the plot resembles a stylized
           | gorilla or monkey-like character. It appears to be waving
           | with one hand raised, while the other arm is resting
           | downward. The face is expressive, with distinct eyes, a nose,
           | and a slightly frowning mouth. The overall pose gives it a
           | somewhat playful or cheeky vibe.
        
             | dartos wrote:
             | Sure whatever.
             | 
              | OC seemed to think that Claude did that with just the data
              | and not the image of the scatter plot itself.
        
               | smusamashah wrote:
                | LLM responses are random. One's failure is another's
                | success. When evaluating, we all should do reruns and see
                | how many times it fails or succeeds.
                | 
                | Without the number of reruns, the result is as good as
                | random.
        
               | dartos wrote:
               | Okay?
               | 
               | OC was saying that the article said that Claude
               | recognized the "artistic" lines of the image from just
               | the scatter plot data.
               | 
               | That isn't what happened.
               | 
               | The author added a png of the plot to the conversation.
               | 
               | Idk why I need to explain that twice.
        
           | johnfn wrote:
           | Hm, interesting. The way I tried it was by pasting an image
           | into Claude directly as the start of the conversation, plus a
           | simple prompt ("What do you see here?"). It got the specific
           | image wrong (it thought it was baby yoda, lol), but it did
           | understand that it was an image.
           | 
           | I wonder if the author got different results because they had
           | been talking a lot about a data set before showing the image,
           | which possibly predisposed AI to think that it was a normal
            | data set. In any case, I think that "Your AI Can't See
            | Gorillas" isn't really a valid conclusion.
        
             | vunderba wrote:
              | Please read TFA. The conclusion of the article isn't nearly
              | so simplistic; they're just suggesting that you have to be
              | aware of the natural strengths and weaknesses of LLMs, even
              | multimodal ones, particularly around visual pattern
              | recognition vs. quantitative pattern recognition.
             | 
             | And yes, the idea that the initial context can sometimes
             | predispose the LLM to consider things in a more narrow
             | manner than a user might otherwise want is definitely well
             | known.
        
               | johnfn wrote:
               | The title of the article is "Your AI Can't See Gorillas".
               | That seems demonstrably false.
               | 
               | The article says:
               | 
               | > Furthermore, their data analysis capabilities seem to
               | focus much more on quantitative metrics and summary
               | statistics, and less on the visual structure of the data
               | 
               | Again, this seems false - or, at best, misleading. I had
               | no problem getting AI to focus on visual structure of the
               | data without any tricks. A more fair statement would be
               | "If you ask an AI a bunch of questions about summary
               | statistics and then show it a scatterplot with an image,
               | then it might continue to focus on summary statistics".
               | But that's not what the concluding paragraph states, and
               | it's not what the title states, either.
        
               | 8note wrote:
                | You knew that there was a visual gag in there before
                | asking it to look.
                | 
                | If you didn't know it was there, and took a look at only
                | the text output, the LLM would not have found it to tell
                | you it's there.
        
         | SequoiaHope wrote:
         | What you refer to as the article's conclusion is in fact the
         | article's title. The article's conclusion (under "Thoughts" at
         | the end) may be well summarized by its first sentence: "As the
         | idea of using LLMs/agents to perform different scientific and
         | technical tasks becomes more mainstream, it will be important
         | to understand their strengths and weaknesses."
         | 
         | The conclusion is quite reasonable and the article was IMO well
         | written. It shares details of an experiment and then provides a
         | thoughtful analysis. I don't believe the analysis is overly
         | broad.
        
       | Retr0id wrote:
       | I'm not sure I'd be able to tell it was supposed to be a gorilla
       | specifically, without context.
        
         | catlifeonmars wrote:
          | I think that's part of the author's point. The article starts
          | out by explaining a human phenomenon and then extending it to
          | LLMs.
        
           | Retr0id wrote:
           | Humans only failed to spot it when prompted in a way that was
           | misdirective, though.
        
             | catlifeonmars wrote:
             | True. What I think is missing (and probably the more
             | interesting question) is an analysis on _why_ LLMs failed
             | to spot it. I imagine it has something to do with the model
             | architecture.
        
       | amelius wrote:
       | Can it draw the unicorn yet?
       | 
       | https://gpt-unicorn.adamkdean.co.uk/
        
         | GaggiX wrote:
         | Claude 3.5 Sonnet is much better at it:
         | https://claude.site/artifacts/ad1b544f-4d1b-4fc2-9862-d6438e...
         | 
         | But I guess GPT-4o results are more funny to look at.
        
         | hwillis wrote:
          | I wondered if o1 would do better - it seems reasonable that
          | step-by-step trying to produce legs/torso/head/horn would do
          | better than the very weird legless things 4o is making. Looks
          | like someone
         | has done it: https://openaiwatch.com/?model=o1-preview
         | 
         | They do seem to generally have legs and head, which is an
         | improvement over 4o. Still pretty unimpressive.
        
           | throwaway314155 wrote:
           | Why not o3-mini?
        
       | GaggiX wrote:
        | If you give the graph as an image to the model, it will easily
        | see the monkey: "I see a drawing of a monkey outlined with red
        | and blue dots." If you give it coordinates, then it will struggle
        | with it much more, like a human would.
        
         | badgersnake wrote:
         | Nope, any human when asked to plot that data would pretty
         | quickly give up and (correctly) assume it was a wind-up.
        
           | GaggiX wrote:
           | They still don't see the gorilla anyway ahah
        
       | sw1sh wrote:
       | I got "The scatter plot appears to be arranged to resemble the
       | character "Pepe the Frog," a popular internet meme ... " lol
       | 
        | Not sure whether multimodal embeddings have such good pattern
        | recognition accuracy in this case; probably most of the
        | information goes into attending to plot-related features, like
        | its labels and ticks.
        
       | mkoubaa wrote:
       | Asimov forgot to warn us of Artificial Stupidity
        
         | mitthrowaway2 wrote:
         | I see you haven't read enough Asimov yet.
        
       | runjake wrote:
       | Only tangentially related to this story, I've been trying for
       | months to train the YOLO models to recognize my Prussian blue
       | cat, with its assorted white spots, as a cat rather than a dog or
       | a person.
       | 
       | However, it refuses to cooperate. It's maddening.
       | 
       | As a result, I receive "There is a person at your front door"
       | notifications at all hours of the night.
        
         | GaggiX wrote:
          | Something is very wrong if the model cannot tell the difference
          | between a Prussian blue cat and a person. I imagine you included
          | in the training data images of the cat from the camera, and a
          | similar quantity of images of a person from the same camera.
        
         | duxup wrote:
          | Even beyond training, I asked an LLM to generate me an image of
          | a corndog. It would only give me hot dogs until I described how
          | a corndog is made.
         | 
         | Not the end of the world but it does seem like AI gets fixated,
         | like people, and can't see anything else.
        
         | nozzlegear wrote:
         | I've had a Nest camera in my living room for years just to keep
         | an eye on our dogs while we're away from home. One of the dogs,
         | a basset hound/border collie mix, often howls and makes
         | "squeeing" noises while we're away. Nest (or Google now, I
         | suppose) without fail thinks that this is actually a person
         | talking in my living room and sends us notifications alerting
         | us to this fact. If he moves around, Nest thinks it's a person
         | moving in my living room.
         | 
         | It has no problem identifying our other two dogs as actual dogs
         | who bark and move like dogs.
        
         | stevenpetryk wrote:
         | Obviously you need to place a fiducial mark on your cat.
        
         | polygot wrote:
         | I'd like to see a photo of said cat, for science of course :)
        
         | nopakos wrote:
          | I have a Tapo camera, and of all the cats parading through my
          | yard, only one black cat is recognized as a person. Not other
          | cats. Not even other black cats. It makes you think.
        
           | nenaoki wrote:
           | Maybe A Whisker Away is actually a documentary...
        
         | coder543 wrote:
         | How exactly are you trying to train and deploy this YOLO model?
         | What kind of accuracy are you seeing against the validation set
         | at the end of the training process?
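          | 
          | (For reference, a minimal sketch of what that train/validate
          | loop usually looks like - this assumes the Ultralytics YOLO
          | package and a hypothetical cat_dataset.yaml, not your actual
          | setup:)
          | 
          |     from ultralytics import YOLO  # assumes the Ultralytics package
          | 
          |     # Hypothetical dataset config with cat / dog / person classes.
          |     model = YOLO("yolov8n.pt")  # start from a pretrained checkpoint
          |     model.train(data="cat_dataset.yaml", epochs=50, imgsz=640)
          | 
          |     metrics = model.val()     # accuracy on the validation split
          |     print(metrics.box.map50)  # mAP@0.5 - the number worth checking
          | 
          |     # Spot-check a single frame from the camera:
          |     print(model.predict("front_door.jpg")[0].boxes.cls)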
        
       | ultimoo wrote:
       | reminded me of classic attention test
       | https://m.youtube.com/watch?v=vJG698U2Mvo
        
       | talles wrote:
       | Is "seeing the gorilla" a reference borrowed from this work?
       | https://www.youtube.com/watch?v=UtKt8YF7dgQ
        
       | lxe wrote:
       | If you give a blind researcher this task, they might have trouble
       | seeing the gorillas as well.
       | 
       | Also the prompt matters. To a human, literally everything they
       | see and experience is "the prompt", so to speak. A constant
       | barrage of inputs.
       | 
       | To the AI, it's just the prompt and the text it generates.
        
       | wodenokoto wrote:
       | I love that gorilla test. Happens in my team all the time, that
       | people start with the assumption that the data is "good" and then
       | deep dive.
       | 
        | Is there a blog post that just focuses on the gorilla test that
        | I can share with my team? I'm not even interested in the LLM
        | part.
        
         | hammock wrote:
         | Same here. Can't count the number of times I've had to come in
         | and say "hold on, you built an entire report with conclusions
         | and recommendations but didn't stop to say hmm this data looks
         | weird and dig into validation?" "We assumed the data was right
         | and that it must be xyz..."
         | 
          | A corollary of this, and a personal pet peeve of mine, is
          | attributing everything you can't explain to "seasonality";
          | that is such a crutch. If you can't explain it then just say
          | that. There is a better-than-not chance it is noise anyway.
        
           | ben_w wrote:
            | > A corollary of this, and a personal pet peeve of mine, is
            | attributing everything you can't explain to "seasonality";
            | that is such a crutch. If you can't explain it then just say
            | that. There is a better-than-not chance it is noise anyway.
           | 
           | Very early in my career, I discovered python's FFT libraries,
           | and thought I was being clever when plugging in satellite
           | data and getting a strong signal.
           | 
           | Until I realised I'd found "years".
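            | 
            | (For anyone curious, a minimal sketch of that mistake with
            | synthetic daily data - the real satellite series is
            | hypothetical here - where the yearly cycle shows up as the
            | dominant FFT peak:)
            | 
            |     import numpy as np
            | 
            |     # Synthetic "satellite-like" daily series: yearly cycle + noise.
            |     days = np.arange(6 * 365)
            |     signal = np.sin(2 * np.pi * days / 365) \
            |         + 0.3 * np.random.randn(days.size)
            | 
            |     # FFT of the de-meaned series and matching frequencies
            |     # in cycles per day.
            |     spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
            |     freqs = np.fft.rfftfreq(days.size, d=1.0)
            | 
            |     peak = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
            |     print(f"dominant period ~ {1 / peak:.0f} days")  # ~365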
        
           | 8n4vidtmkvmk wrote:
           | > attributing everything you can't explain to "seasonality"
           | 
           | Is this a literal thing or figurative thing? Because it
           | should be very easy to see the seasons if you have a few
           | years of data.
           | 
           | I just attribute all the data I don't like to noise :-)
        
             | xboxnolifes wrote:
             | Just because something happens on a yearly cadence doesn't
             | mean that "seasonality" is a good reasoning. It's just
             | restating that it happens on a yearly cadence, it doesn't
             | actually explain why it happens.
        
             | hammock wrote:
              | Presented as a literal thing, but it's really figurative.
              | What I mean is, you often don't have the data to actually
              | plot/PCA/whatever-technique the seasonality; maybe you only
              | have 2 years of data, for example. But it dips in October
              | and you have no idea why, so you just say "Q4 tends to be a
              | low season for this product" or something equally dubious,
              | with no further analysis or hope of same.
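              | 
              | (As an illustration, a rough sketch with made-up monthly
              | numbers - the series and its values are hypothetical - of
              | the quick decomposition check that is still possible with
              | only two or three years of data:)
              | 
              |     import numpy as np
              |     import pandas as pd
              |     from statsmodels.tsa.seasonal import seasonal_decompose
              | 
              |     # Hypothetical monthly series: trend + Q4 dip + noise.
              |     idx = pd.date_range("2022-01-01", periods=36, freq="MS")
              |     rng = np.random.default_rng(0)
              |     sales = pd.Series(
              |         100 + 0.5 * np.arange(36)
              |         - 10 * idx.month.isin([10, 11, 12])
              |         + rng.normal(0, 3, 36),
              |         index=idx,
              |     )
              | 
              |     # period=12 needs at least two full years of observations.
              |     result = seasonal_decompose(sales, model="additive", period=12)
              |     monthly = result.seasonal.groupby(result.seasonal.index.month)
              |     print(monthly.mean().round(1))
              |     # If the seasonal component is tiny next to the residual,
              |     # "seasonality" is probably just a label for noise.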
        
           | s1mplicissimus wrote:
           | I share this experience of people often just performing the
           | steps without thinking about the meaning behind them. In data
           | analysis and software development.
           | 
           | My conclusion so far has been "well they are not doing their
           | job properly".
           | 
            | I assume those are the kinds of jobs LLMs can replace: people
            | you don't want on your payroll anyway.
        
       | albert_e wrote:
        | Recently we read about how DeepSeek reasoning models exhibited
        | an "Aha!" moment when analyzing a complex problem, where they
        | find a deeper pattern/insight that provides a breakthrough.
        | 
        | I feel we also need models to be able to have a "Wait, what?"
        | moment.
        
       | forgotusername6 wrote:
       | I had a recent similar experience with chat gpt and a gorilla. I
       | was designing a rather complicated algorithm so I wrote out all
       | the steps in words. I then asked chatgpt to verify that it made
       | sense. It said it was well thought out, logical etc. My colleague
       | didn't believe that it was really reading it properly so I
       | inserted a step in the middle "and then a gorilla appears" and
       | asked it again. Sure enough, it again came back saying it was
       | well thought out etc. When I questioned it on the gorilla, it
       | merely replied saying that it thought it was meant to be there,
       | that it was a technical term or a codename for something...
        
         | sigmoid10 wrote:
         | >it thought it was meant to be there, that it was a technical
         | term or a codename for something
         | 
          | That's such classic human behaviour in technical discussions,
          | I wouldn't even be mad. I'm more surprised that it picked up
          | on that behaviour from human-generated datasets. But I suppose
          | that's what you get from scraping places like Stack Overflow
          | and HN.
        
           | ben_w wrote:
           | I'm reminded of one of the earlier anecdotes from OpenAI
           | about fine-tuning -- to paraphrase:
           | 
           | > This writer fine tuned on all their slack messages, then
           | asked it to write a blog post. It replied "Sure, I'll do it
           | tomorrow"
           | 
           | > Then he said "No, do it now", and it replied "OK, sure
           | thing" and did nothing else.
        
         | CamperBob2 wrote:
         | This is _literally_ how human brains work:
         | https://www.npr.org/2010/05/19/126977945/bet-you-didnt-notic...
        
           | ben_w wrote:
           | Given I knew what the test was before seeing one of these
           | videos (yes, there is more than one), I find it extra weird
           | that I still didn't see the gorilla the first time.
           | 
           | Now, I see it every time.
        
         | izackp wrote:
          | I typically tell it that there are 5 problems in the logic.
         | Summarize the steps, why it's necessary, and what typically
         | comes after that step. Then please list and explain all five
         | errors.
        
         | WesolyKubeczek wrote:
         | "The ball bearings are to be made of wood, because no one is
         | going to read this work this far anyway."
        
           | Karellen wrote:
           | And a bowl of M&Ms, with all the brown ones taken out - to
           | make sure they did read this far.
        
             | mech422 wrote:
             | I still think that is such a simple, easy litmus test...
             | genius :-)
        
         | jancsika wrote:
         | Just imagining an episode of Star Trek where the inhabitants of
         | a planet have been failing to progress in warp drive tech for
         | several generations. The team beams down to discover that
         | society's tech stopped progressing when they became addicted to
         | pentesting their LLM for intelligence, only to then immediately
         | patch the LLM in order to pass each particular pentest that it
         | failed.
         | 
         | Now the society's time and energy has shifted from general
         | scientific progress to gaining expertise in the growing
         | patchset used to rationalize the theory that the LLM possesses
         | intelligence.
         | 
          | The plot would turn when Picard tries to wrest a phaser from a
          | rogue non-believer trying to assassinate the Queen, and the
          | phaser accidentally fires and ends up frying the entire LLM
          | patchset.
         | 
         | Mr. Data tries to reassure the planet's forlorn inhabitants, as
         | they are convinced they'll _never_ be able to build the warp
         | drive now that the LLM patchset is gone. But when he asks them
         | why their prototypes never worked in the first place, one by
         | one the inhabitants begin to speculate and argue about the
          | problems with their warp drive's design and build.
         | 
         | The episode ends with Data apologizing to Picard since he seems
         | to have started a conflict among the inhabitants. However,
         | Picard points Mr. Data to one of the engineers drawing out a
         | rocket test on a whiteboard. He then thanks him for potentially
         | spurring on the planet's next scientific revolution.
         | 
         | Fin
        
           | sillysaurusx wrote:
           | There actually is an episode of TNG similar to that. The
           | society stopped being able to think for themselves, because
           | the AI did all their thinking for them. Anything the AI
           | didn't know how to do, they didn't know how to do. It was in
           | season 1 or season 2.
        
             | marcosdumay wrote:
             | The difference is that on that episode, the AI was actually
             | capable of thinking.
             | 
              | Asimov has a story like that too.
        
               | fossuser wrote:
               | The Asimov story it reminded me of was The Profession,
               | though that one is not really about AI - but it is about
               | original ideas and the kinds of people that have them.
               | 
                | I find the LLM dismissals somewhat tedious; for most of
                | the people making them, half of humanity wouldn't meet
                | their standards.
        
             | fossuser wrote:
             | The Machine Stops also touches on a lot of these ideas and
             | was written in 1909!
             | 
             | --
             | 
             | "The story describes a world in which most of the human
             | population has lost the ability to live on the surface of
             | the Earth. Each individual now lives in isolation below
             | ground in a standard room, with all bodily and spiritual
             | needs met by the omnipotent, global Machine. Travel is
             | permitted but is unpopular and rarely necessary.
             | Communication is made via a kind of instant messaging/video
             | conferencing machine with which people conduct their only
             | activity: the sharing of ideas and what passes for
             | knowledge.
             | 
             | The two main characters, Vashti and her son Kuno, live on
             | opposite sides of the world. Vashti is content with her
             | life, which, like most inhabitants of the world, she spends
             | producing and endlessly discussing second-hand 'ideas'. Her
             | son Kuno, however, is a sensualist and a rebel. He
             | persuades a reluctant Vashti to endure the journey (and the
             | resultant unwelcome personal interaction) to his room.
             | There, he tells her of his disenchantment with the
             | sanitised, mechanical world. He confides to her that he has
             | visited the surface of the Earth without permission and
             | that he saw other humans living outside the world of the
             | Machine. However, the Machine recaptures him, and he is
             | threatened with 'Homelessness': expulsion from the
             | underground environment and presumed death. Vashti,
             | however, dismisses her son's concerns as dangerous madness
             | and returns to her part of the world.
             | 
             | As time passes, and Vashti continues the routine of her
             | daily life, there are two important developments. First,
             | individuals are no longer permitted use of the respirators
             | which are needed to visit the Earth's surface. Most welcome
             | this development, as they are sceptical and fearful of
             | first-hand experience and of those who desire it. Secondly,
             | "Mechanism", a kind of religion, is established in which
             | the Machine is the object of worship. People forget that
             | humans created the Machine and treat it as a mystical
             | entity whose needs supersede their own.
             | 
             | Those who do not accept the deity of the Machine are viewed
             | as 'unmechanical' and threatened with Homelessness. The
             | Mending Apparatus--the system charged with repairing
             | defects that appear in the Machine proper--has also failed
             | by this time, but concerns about this are dismissed in the
             | context of the supposed omnipotence of the Machine itself.
             | 
             | During this time, Kuno is transferred to a room near
             | Vashti's. He comes to believe that the Machine is breaking
             | down and tells her cryptically "The Machine stops." Vashti
             | continues with her life, but eventually defects begin to
             | appear in the Machine. At first, humans accept the
             | deteriorations as the whim of the Machine, to which they
             | are now wholly subservient, but the situation continues to
             | deteriorate as the knowledge of how to repair the Machine
             | has been lost.
             | 
             | Finally, the Machine collapses, bringing 'civilization'
             | down with it. Kuno comes to Vashti's ruined room. Before
             | they both perish, they realise that humanity and its
             | connection to the natural world are what truly matters, and
             | that it will fall to the surface-dwellers who still exist
             | to rebuild the human race and to prevent the mistake of the
             | Machine from being repeated."
             | 
             | https://en.wikipedia.org/wiki/The_Machine_Stops
        
             | goatlover wrote:
             | Isn't that somewhat the background of Dune? That there was
             | a revolt against thinking machines because humans had
             | become too dependent on them for thinking. So humans ended
             | up becoming addicted to The Spice instead.
        
               | duskwuff wrote:
               | > That there was a revolt against thinking machines...
               | 
               | Yes...
               | 
               | > ...because humans had become too dependent on them for
               | thinking.
               | 
               | ... but no. The causes of the Butlerian Jihad are
               | forgotten (or, at least, never mentioned) in any of Frank
               | Herbert's novels; all that's remembered is the outcome.
        
           | jimkoen wrote:
           | That's actually a moderately decent pitch for an episode.
        
             | teaearlgraycold wrote:
             | Strange New Worlds might still accept fan scripts. Star
             | Trek used to be famous for their willingness to produce
             | episodes based on mailed in scripts.
        
           | LtWorf wrote:
           | This has more plot than all the seasons of star trek picard
           | together :D
        
             | Ygg2 wrote:
             | Season 3 wasn't that bad.
        
         | cruffle_duffle wrote:
         | Not to troubleshoot but unless you visually inspected the
         | context that was provided to the model it is quite possible it
         | never even had your change pulled in.
         | 
         | Lots of front ends will do tricks like partially loading the
         | file or using a cached version or some other behavior. Plus if
         | you presented the file to the same "thread" it is possible it
         | got confused about which to look at.
         | 
         | These front ends do a pretty lousy job of communicating to you,
         | the end user, precisely what they are pulling into the models
         | context window at any given time. And what the model sees as
         | its full context window might change during the conversation as
         | the "front end" makes edits to part portions of the same
         | session (like dropping large files it pulled in earlier that it
         | determines aren't relevant somehow).
         | 
         | In short what you see might not be what the model is seeing at
         | all, thus it not returning the results you expect. Every front
         | end plays games with the context it provides to the model in
         | order to reduce token counts and improve model performance
         | (however "performance gets defined and measured by the
         | designers)
         | 
         | That all being said it's also completely possible it missed the
         | gorilla in the middle... so who really knows eh?
        
         | phyzome wrote:
         | << saying that it thought >>
         | 
         | This is, quite literally, not something that ChatGPT has the
         | capability to do -- reporting on its thought process, that is.
         | This is a hallucination.
        
       | cjbgkagh wrote:
        | Seems like the specific goal post of a gorilla was chosen in
        | order to obtain the outcome needed to write the paper they
        | wanted, and it's rather uninteresting compared to determining at
        | what point the AI starts to see shapes in the data. Could the AI
        | see a line, curve, square, or an umbrella? If AI can't see a
        | square, why would we expect it to see a gorilla?
        
       | tonetegeatinst wrote:
       | "The core value of EDA..."
       | 
       | Another subtle joke about chip design and layout strikes again.
        
       | svilen_dobrev wrote:
        | Is this the opposite of people seeing/searching for dicks here
        | or there?
        
       | shortrounddev2 wrote:
       | Do this in reverse and ask it to generate ascii art for you
        
       | mrbonner wrote:
       | Is it just me thinking that we are officially in the new
       | territory of trolling LLMs?
        
       | mariofilho wrote:
       | I uploaded the image to Gemini 2.0 Flash Thinking 01 21 and
       | asked:
       | 
       | " Here is a steps vs bmi plot. What do you notice?"
       | 
       | Part of the answer:
       | 
       | "Monkey Shape: The most striking feature of this plot is that the
       | data points are arranged to form the shape of a monkey. This is
       | not a typical scatter plot where you'd expect to see trends or
       | correlations between variables in a statistical sense. Instead,
       | it appears to be a creative visualization where data points are
       | placed to create an image."
       | 
       | Gemini 2.0 Pro without thinking didn't see the monkey
        
         | martinsnow wrote:
          | It thought my bald colleague was a plant in the background. So
          | I don't have high hopes for it. He did wear a headset, so that
          | is apparently very plant-like.
        
           | ffsm8 wrote:
           | Maybe he's actually a spy
        
           | jagged-chisel wrote:
           | Wrong kind of plant. See sibling comment.
        
           | memhole wrote:
           | Favorite thing recently has been using the vision models to
           | make jokes. Sometimes non sequiturs get old, but occasionally
           | you hit the right one that's just hilarious. It's like
           | monster rancher for jokes.
           | 
           | https://en.wikipedia.org/wiki/Monster_Rancher
        
         | wyldfire wrote:
         | That doesn't seem like an appropriate comparison to the task
         | the blogger did. The blogger gave their AI thing the raw data -
         | and a different prompt from the one you gave. If you gave it a
         | raster image, that's "cheating" - these models were trained to
         | recognize things in images.
        
           | PunchTornado wrote:
            | In the article they also give it the PNG.
        
             | wyldfire wrote:
             | > When a png is directly uploaded, the model is better able
             | to notice that some strange pattern is present in the data.
             | However, it still does not recognize the pattern as a
             | gorilla.
             | 
             | I wonder if the conversation context unfairly weighed the
             | new impression towards the previous interpretation.
        
       | notnmeyer wrote:
        | Maybe I don't get it, but can we conclusively say that the
        | gorilla wasn't "seen" vs. deemed to be irrelevant to the
        | questions being asked?
        | 
        | "Look at the scatter plot again" is anthropomorphizing the LLM
        | and expecting it to infer a fairly odd intent.
        | 
        | A query like "does the scatter plot visualization look like any
        | real-world objects?" might have produced the result the author
        | was fishing for.
        | 
        | If it were the opposite situation, where you were trying to
        | answer "real" questions and the LLM was suggesting "the data as
        | visualized looks like Notorious B.I.G.", we'd all be here
        | laughing at a different post about the dumb LLM.
        
         | comex wrote:
         | If you were trying to answer real questions, you'd want to know
         | if there were clear signs of the data being fake, flawed, or
         | just different-looking than expected, potentially leading to
         | new hypotheses.
         | 
         | The gorilla is just an extreme example of that.
         | 
         | Albeit perhaps an unfair example when applied to AI.
         | 
         | In the original experiment with humans, the assumption seemed
         | to be that the gorilla is fundamentally easy to see. Therefore
         | if you look at the graph to try to find patterns in it, you
         | ought to notice the gorilla. If you don't notice it, you might
         | also fail to notice other obvious patterns that would be more
         | likely to occur in real data.
         | 
         | Even for humans, that assumption might be incorrect. To some
         | extent, failing to notice the gorilla might just be
         | demonstrating a quirk in our brains' visual processing. If we
         | expect data, we see data, no matter how obvious the gorilla
         | might be. Failing to notice the gorilla doesn't necessarily
         | mean that we'd also fail to notice the sorts of patterns or
         | flaws that appear in real data. But on the other hand, people
         | _do_ often fail to notice 'obvious' patterns in real data. To
         | distinguish the two effects, you'd want a larger experiment
         | with more types of 'obvious' flaws than just gorillas.
         | 
         | For AI, those concerns are the same but magnified. On one hand,
         | vision models are so alien that it's entirely plausible they
         | can notice patterns reliably despite not seeing the gorilla. On
         | the other hand, vision models are so unreliable that it's also
         | plausible they can't notice patterns in graphs well at all.
         | 
         | In any case, for both humans and AI, it's interesting what
         | these examples reveal about their visual processing, which is
         | in both cases something of a black box. That makes the gorilla
         | experiment worth talking about regardless of what lessons it
         | does or doesn't hold for real data analysis.
        
       | zmgsabst wrote:
       | But both models did see the gorilla when prompted with it...?
       | 
       | ChatGPT:
       | 
       | > It looks like the scatter plot unintentionally formed an
       | artistic pattern rather than a meaningful representation of the
       | data.
       | 
       | Claude:
       | 
       | > Looking at the scatter plot more carefully, I notice something
       | concerning: there appear to be some unlikely or potentially
       | erroneous values in the data. Let me analyze this in more detail.
       | 
       | > Ah, now I see something very striking that I missed in my
       | previous analysis - there appears to be a clear pattern in the
       | data points that looks artificial. The data points form distinct
       | curves and lines across the plot, which is highly unusual for
       | what should be natural, continuous biological measurements.
       | 
       | Given the context of asking for quantitative analysis and their
       | general beaten-into-submission attitude where they defer to you,
       | eg, your assertion this is a real dataset... I'm not sure what
       | conclusion we're supposed to draw.
       | 
       | That if you lie to the AI, it'll believe you...?
       | 
       | Neither was prompted that this is potentially adversarial data --
       | and AI don't generally infer social context very well. (A similar
       | effect occurs with math tests.)
        
       | hinkley wrote:
       | Boring.
       | 
       | I don't even like AI and I still will tell you this whole premise
       | is bullshit.
       | 
       | ChatGPT got
       | 
       | > It looks like the scatter plot unintentionally formed an
       | artistic pattern rather than a meaningful representation of the
       | data.
       | 
       | Claude drew a scatter plot with points that are so fat that it
       | doesn't look like a gorilla. It looks like two graffiti artists
       | fighting over drawing space.
       | 
       | It's a resolution problem.
       | 
       | What happens if you give Claude the picture ChatGPT generated?
        
       | wyldfire wrote:
       | If you give it stronger hints could it figure it out? "imagine
       | the data plot is a raster image. what is pictured?"
        
       | silverkiwi wrote:
        | The evolution from LLM to reasoning model is simply multi-pass
        | or recursive questioning.
        | 
        | What's missing in the terminology is the modality - most often
        | TEXT.
        | 
        | So really we only have text LLMs or text reasoning models at the
        | moment.
        | 
        | Your example illustrates the benefits of multi-modal reasoning
        | (using multiple modalities with multiple passes).
        | 
        | Good news - this is coming (I'm working on it). Bad news - this
        | massively increases the compute, as each pass now has to interact
        | with each modality. Unless the LLM is fully multi-modal (some
        | are), this forces the multi-pass questions to accommodate. The
        | number of extra possible paths massively increases. Hopefully we
        | stumble across a nice solution. But the level of complexity
        | massively increases with each additional modality (text, audio,
        | images, video, etc.)
        
       | ben_w wrote:
       | On the one hand, this is very human behaviour, both literally and
       | in general.
       | 
       | Literally, because this is why the Datasaurus dozen was created:
       | https://en.wikipedia.org/wiki/Datasaurus_dozen
       | 
       | Metaphorically, because of all the times (including here, on this
       | very article :P) where people comment on the basis of the
       | headline rather than reading a story.
       | 
       | On the other hand, this isn't the bit of human cognition we
       | should be trying to automate, it's the bit we should be using AI
       | to overcome.
        
       | meltyness wrote:
       | Does anyone know if tokenizers are pruned? That is, if a token
       | doesn't appear in the corpus is it removed from the model? That
       | would imply a process that leaks information about the dataset.
        
       | appleorchard46 wrote:
        | These posts about X task that LLMs fail at when you give them Y
        | prompt are getting more and more silly.
        | 
        | If you ask an AI to analyze some data, should the _default
        | behavior_ be to use that data to make various types of graphs,
        | export said graphs, feed them back into itself, then analyze the
        | shapes of those graphs to see if they resemble an animal?
       | 
       | Personally I would be very annoyed if I actually wanted a
       | statistical analysis, and it spent a bajillion tokens following
       | the process above in order to tell me my data looks like a
       | chicken when you tip it sideways.
       | 
       | > However, this same trait makes them potentially problematic for
       | exploratory data analysis. The core value of EDA lies in its
       | ability to generate novel hypotheses through pattern recognition.
       | The fact that both Sonnet and 4o required explicit prompting to
       | notice even dramatic visual patterns suggests they may miss
       | crucial insights during open-ended exploration.
       | 
       | It requires prompting for x if you want it to do x... That's a
       | feature, not a bug. Note that no mention of open-ended
       | exploration or approaching the data from alternate perspectives
       | was made in the original prompt.
        
         | debeloo wrote:
         | I have to agree with this.
         | 
          | Try sending this graph to an actual human analyst. His
          | response, after you pay him, will probably be to cut off any
          | further business relationship with you.
        
         | amarshall wrote:
         | I think it depends if one is using "AI" as a tool or as a
         | replacement for an intelligent expert? The former, sure, it's
         | maybe not expected, because the prompter _is_ already an
         | intelligent expert. If the latter, then yes, I think, because
          | if you gave the task to an expert and they did not notice this,
          | I would consider them not good at their job. See also
          | Anscombe's quartet[1] and the Datasaurus dozen[2] (mentioned in
          | another comment as well).
         | 
         | [1]: https://en.wikipedia.org/wiki/Anscombe's_quartet [2]:
         | https://en.wikipedia.org/wiki/Datasaurus_dozen
        
           | appleorchard46 wrote:
           | This is true, but I would replace 'intelligent expert' with
            | 'intelligent _human_ expert'.
           | 
           | Graphing data to analyze it - and then seeing shapes and
           | creatures in said graph - is a distinctly human practice, and
           | not an inherently necessary part of most data analysis (the
           | obvious exception being when said data draws a picture).
           | 
           | I think it's because the interface uses human language that
           | people expect AI to make the same assumptions and follow the
           | same processes as humans. In some ways it does, in other ways
           | it doesn't.
        
         | Dylan16807 wrote:
         | You don't think "Examine the data" and "Which other conclusions
         | can you draw from the data?" are open-ended?
         | 
         | And even when explicitly prompted to look at the plot, they
         | only brush up against the data anomalies rather than properly
         | analyzing the plot.
        
         | isaacremuant wrote:
          | I benchmark many of these things as "what would I want a human
          | assistant to do" if they had insta speed, and noticing the
          | pattern would definitely be warranted to determine if the data
          | could have been falsely generated.
          | 
          | It's not silly at all.
        
       | hollownobody wrote:
       | AFAIK, these models can't "look" at the plots they build. So it
       | is necessary to send the screenshots, otherwise they will never
       | notice the gorilla.
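        | 
        | (For what it's worth, a minimal sketch of sending such a
        | screenshot by hand - the file name and prompt here are
        | hypothetical, using the OpenAI Python SDK's image input:)
        | 
        |     import base64
        |     from openai import OpenAI  # assumes the openai v1 SDK
        | 
        |     # Hypothetical: "scatter.png" is an export of the generated plot.
        |     with open("scatter.png", "rb") as f:
        |         b64 = base64.b64encode(f.read()).decode()
        | 
        |     client = OpenAI()
        |     resp = client.chat.completions.create(
        |         model="gpt-4o",
        |         messages=[{
        |             "role": "user",
        |             "content": [
        |                 {"type": "text",
        |                  "text": "What do you see in this scatter plot?"},
        |                 {"type": "image_url",
        |                  "image_url": {"url": f"data:image/png;base64,{b64}"}},
        |             ],
        |         }],
        |     )
        |     print(resp.choices[0].message.content)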
        
       | mmanfrin wrote:
       | This is akin to giving it a photo of the stars and asking it what
        | it sees. If you want to bake pareidolia into LLMs, prepare to pay
       | 100x for your requests.
        
       | lovasoa wrote:
       | I tried passing just the plot to several models:
        | 
        |     ChatGPT (4o):               Noticed "a pattern"
        |     Le Chat (Mistral):          Noticed a "cartoonish figure"
        |     DeepSeek (R1):              Completely missed it
        |     Claude:                     Completely missed it
        |     Gemini 2.0 Flash:           Completely missed it
        |     Gemini 2.0 Flash Thinking:  Noticed "a monkey"
        
       | neom wrote:
       | I asked chatgpt (pro) why it thought that it missed it sometimes
       | and not others, and it said when it's presented with a user input
       | it takes time to decide it's approach, sometimes more "teacherly"
       | sometimes more "investigative", if it took the investigative
       | approach, it would read the code line by line, if it took a
       | teacherly approach, it would treat it as a statistical
       | interpretation exercise.
        
       | _zamorano_ wrote:
       | Someday, an LLM will send every human an 'obvious' pattern (maybe
       | a weird protein or something like that) and we'll all fail to
       | notice and that day Skynet decides it no longer has a use for us
        
         | s1mplicissimus wrote:
         | Maybe. But before I start worrying about that I'll have to see
         | them count correctly or not be fooled by variations of simple
         | riddles
        
       | dragoncrab wrote:
       | How do LLMs calculate statistic metrics like average or standard
       | deviation accurately in such experiments?
        
       | orbital-decay wrote:
       | This doesn't seem to make sense. Can a human spot a gorilla in a
       | sequence of numbers? Try it. Later on, he gives it a picture and
       | it correctly spots the mistake.
       | 
       |  _> but does not specifically understand the pattern as a
       | gorilla_
       | 
       | Maybe it does, how could you tell? Do you really expect an
       | assistant to say "Holy shit, there's a gorilla in your plot!"?
       | The only thing relevant to the request is that the data seems
       | fishy, and it outputs exactly this. Maybe something trained for
       | creative writing, agency, character, and witty remarks (like
       | Claude 3 Opus) would be inclined to do that, and that would be
       | amusing, but that's pretty optional for the presented task.
        
       | PunchTornado wrote:
       | Why do these articles test only 2 models, not even the best ones
       | there are, and generalise to all AIs?
        
       | rsanek wrote:
       | Humans seem to me to be just as likely to make these kinds of
       | errors in the general case. See the classic
       | https://www.youtube.com/watch?v=vJG698U2Mvo, which has an
       | interesting parallel with this paper.
        
       | Nihilartikel wrote:
        | Why aren't we training LLMs to load the data into
        | R/pandas/polars/duckdb and then interrogate it iteratively that
        | way?
        | 
        | It's how I do it. Why not our pet LLM?
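        | 
        | (A rough sketch of what that loop could look like in pandas - the
        | file name is hypothetical and the column names are just the ones
        | from the article's steps/bmi example:)
        | 
        |     import pandas as pd
        |     import matplotlib.pyplot as plt
        | 
        |     # Hypothetical file mirroring the article's steps-vs-bmi data.
        |     df = pd.read_csv("health_data.csv")
        | 
        |     # Pass 1: the summary statistics the models fixate on.
        |     print(df.describe())
        |     print(df.isna().sum())
        | 
        |     # Pass 2: actually look at the shape of the data, not just
        |     # the numbers, then feed the image back to the model.
        |     fig, ax = plt.subplots()
        |     for gender, grp in df.groupby("gender"):
        |         ax.scatter(grp["steps"], grp["bmi"], s=4, label=gender)
        |     ax.set_xlabel("steps")
        |     ax.set_ylabel("bmi")
        |     ax.legend()
        |     fig.savefig("scatter.png")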
        
       | 2-3-7-43-1807 wrote:
        | If we look for AGI, then not noticing the gorilla might be a
        | good thing. I'm referring to the gorilla counting experiment.
        
       ___________________________________________________________________
       (page generated 2025-02-08 23:00 UTC)