[HN Gopher] Are LLMs able to notice the "gorilla in the data"?
___________________________________________________________________
Are LLMs able to notice the "gorilla in the data"?
Author : finding_theta
Score : 263 points
Date : 2025-02-05 16:33 UTC (3 days ago)
(HTM) web link (chiraaggohel.com)
(TXT) w3m dump (chiraaggohel.com)
| zozbot234 wrote:
| That's done on purpose. The AI can't easily tell whether the
| drawing might be intended to be one of a human or a gorilla, so
| when in doubt it doesn't want to commit either way and just
| ignores the topic altogether. It's just another example of AI
| ethics influencing its behavior for alignment purposes.
| 4ndrewl wrote:
| You mean that whole post could have been written:
|
| "AI can't see gorilla because wokification"?
|
| /s
|
| Edit: Adding /s. Thought "wokification" already signalled that.
| badgersnake wrote:
| You didn't read it either.
| 4ndrewl wrote:
| I was /s commenting on the comment :) I did read it, and could
| relate to the questioner's constant reframing to just
| "look" at the graph. Like talking to a child.
| shermantanktop wrote:
| What specifically do you base this on? It sounds like
| conjecture.
| badgersnake wrote:
| You didn't read the article eh?
|
| You just assumed it's similar to the Google situation a few
| years back where they banned their classifier from classifying
| images as gorillas. It isn't.
| deeviant wrote:
| Um, no. That isn't even close to the truth. The AI doesn't
| "want" anything. It's a statistical prediction process, and it
| certainly has nothing like the self-reflection you are
| attributing to it. And the heuristic layers on top of LLMs are
| even less capable of doing what you are claiming.
| luluthefirst wrote:
| It's not that at all... Similar drawings of non-humanoid shapes
| like an ostrich or the map of Europe would have resulted in the
| exact same 'blindness'.
| mapt wrote:
| This is not obvious to me, nor should it be to anyone who
| didn't program these AIs (and it probably shouldn't be
| obvious even to the people who did). I think you both should
| try testing the hypothesis and present your results.
| washadjeffmad wrote:
| Context (2018): https://www.wired.com/story/when-it-comes-to-
| gorillas-google...
|
| "Google promised a fix after its photo-categorization software
| labeled black people as gorillas in 2015. More than two years
| later, it hasn't found one."
|
| Companies do seem to have developed greater sensitivity to
| blind spots with diversity in their datasets, so Parent might
| not be totally out of line to bring it up.
|
| IBM offloaded their domestic surveillance and facial
| recognition services following the BLM protests when interest
| by law enforcement sparked concerns of racial profiling and
| abuse due in part to low accuracy in higher-melanin subjects,
| and Apple face unlock famously couldn't tell Asians apart.
|
| It's not outlandish to assume that there's been some special
| effort made to ensure that datasets and evaluation in newer
| models don't ignite any more PR threads. That's not claiming
| Google's classification models have anything to do with
| OpenAI's multimodal models, just that we know that until
| relatively recently, models from more than one major US company
| struggled to correctly identify some individuals as
| individuals.
| johnfn wrote:
| GPT can't "see" the results of the scatterplot (unless prompted
| with an image), it only sees the code it wrote. If a human had
| the same constraints I doubt they'd identify there was a gorilla
| there.
|
| Take a screenshot of the scatterplot and feed it into multimodal
| GPT and it does a fine job at identifying it.
|
| EDIT:
|
| Sorry, as a few people pointed out, I missed the part where the
| author did feed a PNG into GPT. I kind of jumped to conclusions
| when it worked fine for me. I still maintain that the article's
| conclusion ("Your AI Can't See Gorillas") is overly broad, given
| that I had no trouble getting it to see one.
|
| But I wonder why the author had trouble? My suspicion is that AI
| got stuck on summary statistics because the previous messages in
| the chat were all about summary statistics.
| KeplerBoy wrote:
| Does ChatGPT even have access to the raw data points or does it
| just know the path to some CSV?
|
| The contents of the CSV might be entirely unknown at inference
| time.
| dartos wrote:
| Before seeing Claude's response, did you see where the author
| said
|
| > I asked the model to closely look at the plot, and also
| uploaded a png of the plot it had generated.
| Kuinox wrote:
| I sent the plot to ChatGPT 4o. Here is the conversation:
|
| what do you see
|
| ChatGPT said:
|
| This is a scatter plot with the variables "steps" on the
| x-axis and "bmi" on the y-axis. The data points are colored
| by "gender" (red for female and blue for male).
| Interestingly, the arrangement of the points appears to form
| a drawing resembling a cartoonish figure or character, likely
| added for artistic or humorous effect.
|
| If you'd like, I can help analyze or discuss the data
| further!
|
| You said:
|
| Wha the cartoonish figure looks like ?
|
| ChatGPT said:
|
| The cartoonish figure in the plot resembles a stylized
| gorilla or monkey-like character. It appears to be waving
| with one hand raised, while the other arm is resting
| downward. The face is expressive, with distinct eyes, a nose,
| and a slightly frowning mouth. The overall pose gives it a
| somewhat playful or cheeky vibe.
| dartos wrote:
| Sure whatever.
|
| OC seemed to think that Claude did that with just the data
| and not the image of the scatterplot itself.
| smusamashah wrote:
| LLM responses are random. One's failure is another's
| success. When evaluating, we all should do reruns and see
| how many times it fails or succeeds.
|
| Without the number of reruns, the result is as good as
| random.
| dartos wrote:
| Okay?
|
| OC was saying that the article said that Claude
| recognized the "artistic" lines of the image from just
| the scatter plot data.
|
| That isn't what happened.
|
| The author added a png of the plot to the conversation.
|
| Idk why I need to explain that twice.
| johnfn wrote:
| Hm, interesting. The way I tried it was by pasting an image
| into Claude directly as the start of the conversation, plus a
| simple prompt ("What do you see here?"). It got the specific
| image wrong (it thought it was baby yoda, lol), but it did
| understand that it was an image.
|
| I wonder if the author got different results because they had
| been talking a lot about a data set before showing the image,
| which possibly predisposed AI to think that it was a normal
| data set. In any case, I think that "Your AI Can't See
| Gorillas" isn't really a valid conclusion.
| vunderba wrote:
| Please read TFA. The conclusion of the article isn't nearly
| so simplistic; they're just suggesting that you have to be
| aware of the natural strengths and weaknesses of LLMs, even
| multimodal ones, particularly around visual pattern
| recognition vs. quantitative pattern recognition.
|
| And yes, the idea that the initial context can sometimes
| predispose the LLM to consider things in a more narrow
| manner than a user might otherwise want is definitely well
| known.
| johnfn wrote:
| The title of the article is "Your AI Can't See Gorillas".
| That seems demonstrably false.
|
| The article says:
|
| > Furthermore, their data analysis capabilities seem to
| focus much more on quantitative metrics and summary
| statistics, and less on the visual structure of the data
|
| Again, this seems false - or, at best, misleading. I had
| no problem getting AI to focus on visual structure of the
| data without any tricks. A more fair statement would be
| "If you ask an AI a bunch of questions about summary
| statistics and then show it a scatterplot with an image,
| then it might continue to focus on summary statistics".
| But that's not what the concluding paragraph states, and
| it's not what the title states, either.
| 8note wrote:
| You knew that there was a visual gag in there before
| asking it to look.
|
| If you didn't know it was there, and looked at only
| the text output, the LLM would not have found it to tell
| you it's there.
| SequoiaHope wrote:
| What you refer to as the article's conclusion is in fact the
| article's title. The article's conclusion (under "Thoughts" at
| the end) may be well summarized by its first sentence: "As the
| idea of using LLMs/agents to perform different scientific and
| technical tasks becomes more mainstream, it will be important
| to understand their strengths and weaknesses."
|
| The conclusion is quite reasonable and the article was IMO well
| written. It shares details of an experiment and then provides a
| thoughtful analysis. I don't believe the analysis is overly
| broad.
| Retr0id wrote:
| I'm not sure I'd be able to tell it was supposed to be a gorilla
| specifically, without context.
| catlifeonmars wrote:
| I think that's part of the author's point. The article starts
| out by explaining a human phenomenon and then extends it to
| LLMs.
| Retr0id wrote:
| Humans only failed to spot it when prompted in a way that was
| misdirective, though.
| catlifeonmars wrote:
| True. What I think is missing (and probably the more
| interesting question) is an analysis on _why_ LLMs failed
| to spot it. I imagine it has something to do with the model
| architecture.
| amelius wrote:
| Can it draw the unicorn yet?
|
| https://gpt-unicorn.adamkdean.co.uk/
| GaggiX wrote:
| Claude 3.5 Sonnet is much better at it:
| https://claude.site/artifacts/ad1b544f-4d1b-4fc2-9862-d6438e...
|
| But I guess GPT-4o's results are funnier to look at.
| hwillis wrote:
| I wondered if o1 would do better - it seems reasonable that step-
| by-step trying to produce legs/torso/head/horn would do better
| than the very weird legless things 4o is making. Looks like someone
| has done it: https://openaiwatch.com/?model=o1-preview
|
| They do seem to generally have legs and head, which is an
| improvement over 4o. Still pretty unimpressive.
| throwaway314155 wrote:
| Why not o3-mini?
| GaggiX wrote:
| If you give the graph as an image to the model, it will easily see
| the monkey: "I see a drawing of a monkey outlined with red and
| blue dots." If you give it coordinates, then it will struggle with
| it much more, like a human would.
| badgersnake wrote:
| Nope, any human when asked to plot that data would pretty
| quickly give up and (correctly) assume it was a wind-up.
| GaggiX wrote:
| They still don't see the gorilla anyway ahah
| sw1sh wrote:
| I got "The scatter plot appears to be arranged to resemble the
| character "Pepe the Frog," a popular internet meme ... " lol
|
| Not sure whether multimodal embeddings have such good pattern
| recognition accuracy in this case; probably most of the information
| goes into attending to plot-related features, like its labels and
| ticks.
| mkoubaa wrote:
| Asimov forgot to warn us of Artificial Stupidity
| mitthrowaway2 wrote:
| I see you haven't read enough Asimov yet.
| runjake wrote:
| Only tangentially related to this story, I've been trying for
| months to train the YOLO models to recognize my Prussian blue
| cat, with its assorted white spots, as a cat rather than a dog or
| a person.
|
| However, it refuses to cooperate. It's maddening.
|
| As a result, I receive "There is a person at your front door"
| notifications at all hours of the night.
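|
| Roughly, the training loop looks like this - a minimal sketch only,
| assuming the ultralytics package and a hypothetical cat_door.yaml
| dataset config, not my exact code:
|
|     # Minimal sketch: fine-tune a pretrained YOLO checkpoint on
|     # labelled doorbell-camera frames (cat_door.yaml is hypothetical).
|     from ultralytics import YOLO
|
|     model = YOLO("yolov8n.pt")  # small pretrained checkpoint
|     model.train(data="cat_door.yaml", epochs=100, imgsz=640)
|
|     # Sanity-check accuracy on the held-out validation split before
|     # trusting it for notifications.
|     metrics = model.val()
|     print(metrics.box.map50)
|
|     # Run inference on a new frame; each result carries boxes,
|     # class ids, and confidences.
|     for r in model("front_door_frame.jpg"):
|         print(r.boxes.cls, r.boxes.conf)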
| GaggiX wrote:
| Something is very wrong if the model cannot tell the difference
| between a Prussian blue cat and a person. I imagine you have
| included in the training data images of the cat from the camera,
| and a similar quantity of images of a person from the same camera.
| duxup wrote:
| Even beyond training: I asked an LLM to generate me an image of
| a corndog. It would only give me hot dogs until I described how
| a corndog is made.
|
| Not the end of the world but it does seem like AI gets fixated,
| like people, and can't see anything else.
| nozzlegear wrote:
| I've had a Nest camera in my living room for years just to keep
| an eye on our dogs while we're away from home. One of the dogs,
| a basset hound/border collie mix, often howls and makes
| "squeeing" noises while we're away. Nest (or Google now, I
| suppose) without fail thinks that this is actually a person
| talking in my living room and sends us notifications alerting
| us to this fact. If he moves around, Nest thinks it's a person
| moving in my living room.
|
| It has no problem identifying our other two dogs as actual dogs
| who bark and move like dogs.
| stevenpetryk wrote:
| Obviously you need to place a fiducial mark on your cat.
| polygot wrote:
| I'd like to see a photo of said cat, for science of course :)
| nopakos wrote:
| I have a Tapo camera, and of all the cats parading through my
| yard, only one black cat is recognized as a person. Not other
| cats. Not even other black cats. It makes you think.
| nenaoki wrote:
| Maybe A Whisker Away is actually a documentary...
| coder543 wrote:
| How exactly are you trying to train and deploy this YOLO model?
| What kind of accuracy are you seeing against the validation set
| at the end of the training process?
| ultimoo wrote:
| Reminded me of the classic attention test:
| https://m.youtube.com/watch?v=vJG698U2Mvo
| talles wrote:
| Is "seeing the gorilla" a reference borrowed from this work?
| https://www.youtube.com/watch?v=UtKt8YF7dgQ
| lxe wrote:
| If you give a blind researcher this task, they might have trouble
| seeing the gorillas as well.
|
| Also the prompt matters. To a human, literally everything they
| see and experience is "the prompt", so to speak. A constant
| barrage of inputs.
|
| To the AI, it's just the prompt and the text it generates.
| wodenokoto wrote:
| I love that gorilla test. Happens in my team all the time, that
| people start with the assumption that the data is "good" and then
| deep dive.
|
| Is there a blog post that just focuses on the gorilla test that I
| can share with my team? I'm not even interested in the LLM part.
| hammock wrote:
| Same here. Can't count the number of times I've had to come in
| and say "hold on, you built an entire report with conclusions
| and recommendations but didn't stop to say hmm this data looks
| weird and dig into validation?" "We assumed the data was right
| and that it must be xyz..."
|
| A corollary of this, and my personal pet peeve, is
| attributing everything you can't explain to "seasonality";
| that is such a crutch. If you can't explain it then just say
| that. There is a better-than-not chance it is noise anyway.
| ben_w wrote:
| > A corollary of this, and my personal pet peeve, is
| attributing everything you can't explain to "seasonality";
| that is such a crutch. If you can't explain it then just say
| that. There is a better-than-not chance it is noise anyway.
|
| Very early in my career, I discovered python's FFT libraries,
| and thought I was being clever when plugging in satellite
| data and getting a strong signal.
|
| Until I realised I'd found "years".
| 8n4vidtmkvmk wrote:
| > attributing everything you can't explain to "seasonality"
|
| Is this a literal thing or figurative thing? Because it
| should be very easy to see the seasons if you have a few
| years of data.
|
| I just attribute all the data I don't like to noise :-)
| xboxnolifes wrote:
| Just because something happens on a yearly cadence doesn't
| mean that "seasonality" is a good reasoning. It's just
| restating that it happens on a yearly cadence, it doesn't
| actually explain why it happens.
| hammock wrote:
| Presented as a literal thing, but it's really figurative.
| What I mean is, often you don't have the data to actually
| plot/PCA/whatever-technique the seasonality; maybe you only
| have 2 years of data, for example. But it dips in October
| and you have no idea why, so you just say "Q4 tends to be a
| low season for this product" or something equally dubious,
| with no further analysis or hope of any.
| s1mplicissimus wrote:
| I share this experience of people often just performing the
| steps without thinking about the meaning behind them. In data
| analysis and software development.
|
| My conclusion so far has been "well they are not doing their
| job properly".
|
| I assume those are the kinds of jobs LLMs can replace: people
| you don't want on your payroll anyway.
| albert_e wrote:
| Recently we read about how DeepSeek reasoning models exhibited an
| "Aha! moment" when analyzing a complex problem, where they find a
| deeper pattern/insight that provides a breakthrough.
|
| I feel we also need models to be able to have a "Wait, what?"
| moment.
| forgotusername6 wrote:
| I had a recent similar experience with chat gpt and a gorilla. I
| was designing a rather complicated algorithm so I wrote out all
| the steps in words. I then asked chatgpt to verify that it made
| sense. It said it was well thought out, logical etc. My colleague
| didn't believe that it was really reading it properly so I
| inserted a step in the middle "and then a gorilla appears" and
| asked it again. Sure enough, it again came back saying it was
| well thought out etc. When I questioned it on the gorilla, it
| merely replied saying that it thought it was meant to be there,
| that it was a technical term or a codename for something...
| sigmoid10 wrote:
| >it thought it was meant to be there, that it was a technical
| term or a codename for something
|
| That's such a classic human behaviour in technical
| discussions, I wouldn't even be mad. I'm more surprised that it
| picked up on that behaviour from human-generated datasets. But
| I suppose that's what you get from scraping places like
| Stack Overflow and HN.
| ben_w wrote:
| I'm reminded of one of the earlier anecdotes from OpenAI
| about fine-tuning -- to paraphrase:
|
| > This writer fine tuned on all their slack messages, then
| asked it to write a blog post. It replied "Sure, I'll do it
| tomorrow"
|
| > Then he said "No, do it now", and it replied "OK, sure
| thing" and did nothing else.
| CamperBob2 wrote:
| This is _literally_ how human brains work:
| https://www.npr.org/2010/05/19/126977945/bet-you-didnt-notic...
| ben_w wrote:
| Given I knew what the test was before seeing one of these
| videos (yes, there is more than one), I find it extra weird
| that I still didn't see the gorilla the first time.
|
| Now, I see it every time.
| izackp wrote:
| I typically tell it that there are 5 problems in the logic.
| Summarize the steps, why it's necessary, and what typically
| comes after that step. Then please list and explain all five
| errors.
| WesolyKubeczek wrote:
| "The ball bearings are to be made of wood, because no one is
| going to read this work this far anyway."
| Karellen wrote:
| And a bowl of M&Ms, with all the brown ones taken out - to
| make sure they did read this far.
| mech422 wrote:
| I still think that is such a simple, easy litmus test...
| genius :-)
| jancsika wrote:
| Just imagining an episode of Star Trek where the inhabitants of
| a planet have been failing to progress in warp drive tech for
| several generations. The team beams down to discover that
| society's tech stopped progressing when they became addicted to
| pentesting their LLM for intelligence, only to then immediately
| patch the LLM in order to pass each particular pentest that it
| failed.
|
| Now the society's time and energy has shifted from general
| scientific progress to gaining expertise in the growing
| patchset used to rationalize the theory that the LLM possesses
| intelligence.
|
| The plot would turn when Picard tries to wrest a phaser from a
| rogue non-believer trying to assassinate the Queen, and the
| phaser accidentally fires and ends up frying the entire LLM
| patchset.
|
| Mr. Data tries to reassure the planet's forlorn inhabitants, as
| they are convinced they'll _never_ be able to build the warp
| drive now that the LLM patchset is gone. But when he asks them
| why their prototypes never worked in the first place, one by
| one the inhabitants begin to speculate and argue about the
| problems with their warp drive's design and build.
|
| The episode ends with Data apologizing to Picard since he seems
| to have started a conflict among the inhabitants. However,
| Picard points Mr. Data to one of the engineers drawing out a
| rocket test on a whiteboard. He then thanks him for potentially
| spurring on the planet's next scientific revolution.
|
| Fin
| sillysaurusx wrote:
| There actually is an episode of TNG similar to that. The
| society stopped being able to think for themselves, because
| the AI did all their thinking for them. Anything the AI
| didn't know how to do, they didn't know how to do. It was in
| season 1 or season 2.
| marcosdumay wrote:
| The difference is that on that episode, the AI was actually
| capable of thinking.
|
| Asimov has a story like that too.
| fossuser wrote:
| The Asimov story it reminded me of was "Profession",
| though that one is not really about AI - but it is about
| original ideas and the kinds of people that have them.
|
| I find the LLM dismissals somewhat tedious; for most of
| the people making them, half of humanity wouldn't meet
| their standards.
| fossuser wrote:
| The Machine Stops also touches on a lot of these ideas and
| was written in 1909!
|
| --
|
| "The story describes a world in which most of the human
| population has lost the ability to live on the surface of
| the Earth. Each individual now lives in isolation below
| ground in a standard room, with all bodily and spiritual
| needs met by the omnipotent, global Machine. Travel is
| permitted but is unpopular and rarely necessary.
| Communication is made via a kind of instant messaging/video
| conferencing machine with which people conduct their only
| activity: the sharing of ideas and what passes for
| knowledge.
|
| The two main characters, Vashti and her son Kuno, live on
| opposite sides of the world. Vashti is content with her
| life, which, like most inhabitants of the world, she spends
| producing and endlessly discussing second-hand 'ideas'. Her
| son Kuno, however, is a sensualist and a rebel. He
| persuades a reluctant Vashti to endure the journey (and the
| resultant unwelcome personal interaction) to his room.
| There, he tells her of his disenchantment with the
| sanitised, mechanical world. He confides to her that he has
| visited the surface of the Earth without permission and
| that he saw other humans living outside the world of the
| Machine. However, the Machine recaptures him, and he is
| threatened with 'Homelessness': expulsion from the
| underground environment and presumed death. Vashti,
| however, dismisses her son's concerns as dangerous madness
| and returns to her part of the world.
|
| As time passes, and Vashti continues the routine of her
| daily life, there are two important developments. First,
| individuals are no longer permitted use of the respirators
| which are needed to visit the Earth's surface. Most welcome
| this development, as they are sceptical and fearful of
| first-hand experience and of those who desire it. Secondly,
| "Mechanism", a kind of religion, is established in which
| the Machine is the object of worship. People forget that
| humans created the Machine and treat it as a mystical
| entity whose needs supersede their own.
|
| Those who do not accept the deity of the Machine are viewed
| as 'unmechanical' and threatened with Homelessness. The
| Mending Apparatus--the system charged with repairing
| defects that appear in the Machine proper--has also failed
| by this time, but concerns about this are dismissed in the
| context of the supposed omnipotence of the Machine itself.
|
| During this time, Kuno is transferred to a room near
| Vashti's. He comes to believe that the Machine is breaking
| down and tells her cryptically "The Machine stops." Vashti
| continues with her life, but eventually defects begin to
| appear in the Machine. At first, humans accept the
| deteriorations as the whim of the Machine, to which they
| are now wholly subservient, but the situation continues to
| deteriorate as the knowledge of how to repair the Machine
| has been lost.
|
| Finally, the Machine collapses, bringing 'civilization'
| down with it. Kuno comes to Vashti's ruined room. Before
| they both perish, they realise that humanity and its
| connection to the natural world are what truly matters, and
| that it will fall to the surface-dwellers who still exist
| to rebuild the human race and to prevent the mistake of the
| Machine from being repeated."
|
| https://en.wikipedia.org/wiki/The_Machine_Stops
| goatlover wrote:
| Isn't that somewhat the background of Dune? That there was
| a revolt against thinking machines because humans had
| become too dependent on them for thinking. So humans ended
| up becoming addicted to The Spice instead.
| duskwuff wrote:
| > That there was a revolt against thinking machines...
|
| Yes...
|
| > ...because humans had become too dependent on them for
| thinking.
|
| ... but no. The causes of the Butlerian Jihad are
| forgotten (or, at least, never mentioned) in any of Frank
| Herbert's novels; all that's remembered is the outcome.
| jimkoen wrote:
| That's actually a moderately decent pitch for an episode.
| teaearlgraycold wrote:
| Strange New Worlds might still accept fan scripts. Star
| Trek used to be famous for their willingness to produce
| episodes based on mailed in scripts.
| LtWorf wrote:
| This has more plot than all the seasons of star trek picard
| together :D
| Ygg2 wrote:
| Season 3 wasn't that bad.
| cruffle_duffle wrote:
| Not to troubleshoot but unless you visually inspected the
| context that was provided to the model it is quite possible it
| never even had your change pulled in.
|
| Lots of front ends will do tricks like partially loading the
| file or using a cached version or some other behavior. Plus if
| you presented the file to the same "thread" it is possible it
| got confused about which to look at.
|
| These front ends do a pretty lousy job of communicating to you,
| the end user, precisely what they are pulling into the model's
| context window at any given time. And what the model sees as
| its full context window might change during the conversation as
| the "front end" makes edits to part portions of the same
| session (like dropping large files it pulled in earlier that it
| determines aren't relevant somehow).
|
| In short what you see might not be what the model is seeing at
| all, thus it not returning the results you expect. Every front
| end plays games with the context it provides to the model in
| order to reduce token counts and improve model performance
| (however "performance gets defined and measured by the
| designers)
|
| That all being said it's also completely possible it missed the
| gorilla in the middle... so who really knows eh?
| phyzome wrote:
| << saying that it thought >>
|
| This is, quite literally, not something that ChatGPT has the
| capability to do -- reporting on its thought process, that is.
| This is a hallucination.
| cjbgkagh wrote:
| Seems like the specific goalpost of a gorilla was chosen in order
| to obtain the outcome needed to write the paper they wanted, and
| it's rather uninteresting compared to determining at what point
| the AI starts to see shapes in the data. Could the AI see a line,
| curve, square, or an umbrella? If AI can't see a square, why would
| we expect it to see a gorilla?
| tonetegeatinst wrote:
| "The core value of EDA..."
|
| Another subtle joke about chip design and layout strikes again.
| svilen_dobrev wrote:
| is this the opposite of people seeing/searching for dicks here or
| there ?
| shortrounddev2 wrote:
| Do this in reverse and ask it to generate ascii art for you
| mrbonner wrote:
| Is it just me thinking that we are officially in the new
| territory of trolling LLMs?
| mariofilho wrote:
| I uploaded the image to Gemini 2.0 Flash Thinking 01 21 and
| asked:
|
| " Here is a steps vs bmi plot. What do you notice?"
|
| Part of the answer:
|
| "Monkey Shape: The most striking feature of this plot is that the
| data points are arranged to form the shape of a monkey. This is
| not a typical scatter plot where you'd expect to see trends or
| correlations between variables in a statistical sense. Instead,
| it appears to be a creative visualization where data points are
| placed to create an image."
|
| Gemini 2.0 Pro without thinking didn't see the monkey
| martinsnow wrote:
| It thought my bald colleague was a plant in the background, so
| don't have high hopes for it. He did wear a headset, so that is
| apparently very plant-like.
| ffsm8 wrote:
| Maybe he's actually a spy
| jagged-chisel wrote:
| Wrong kind of plant. See sibling comment.
| memhole wrote:
| Favorite thing recently has been using the vision models to
| make jokes. Sometimes non sequiturs get old, but occasionally
| you hit the right one that's just hilarious. It's like
| monster rancher for jokes.
|
| https://en.wikipedia.org/wiki/Monster_Rancher
| wyldfire wrote:
| That doesn't seem like an appropriate comparison to the task
| the blogger did. The blogger gave their AI thing the raw data -
| and a different prompt from the one you gave. If you gave it a
| raster image, that's "cheating" - these models were trained to
| recognize things in images.
| PunchTornado wrote:
| In the article they also give it the PNG.
| wyldfire wrote:
| > When a png is directly uploaded, the model is better able
| to notice that some strange pattern is present in the data.
| However, it still does not recognize the pattern as a
| gorilla.
|
| I wonder if the conversation context unfairly weighted the
| new impression towards the previous interpretation.
| notnmeyer wrote:
| Maybe I don't get it, but can we conclusively say that the gorilla
| wasn't "seen" vs. deemed to be irrelevant to the questions being
| asked?
|
| "Look at the scatter plot again" is anthropomorphizing the LLM
| and expecting it to infer a fairly odd intent.
|
| Queries like "does the scatter plot visualization look
| like any real-world objects?" might have produced the result the
| author was fishing for.
|
| If it were the opposite situation and you were trying to answer
| "real" questions and the LLM was suggesting "the data as
| visualized looks like Notorious B.I.G.", we'd all be here laughing
| at a different post about the dumb LLM.
| comex wrote:
| If you were trying to answer real questions, you'd want to know
| if there were clear signs of the data being fake, flawed, or
| just different-looking than expected, potentially leading to
| new hypotheses.
|
| The gorilla is just an extreme example of that.
|
| Albeit perhaps an unfair example when applied to AI.
|
| In the original experiment with humans, the assumption seemed
| to be that the gorilla is fundamentally easy to see. Therefore
| if you look at the graph to try to find patterns in it, you
| ought to notice the gorilla. If you don't notice it, you might
| also fail to notice other obvious patterns that would be more
| likely to occur in real data.
|
| Even for humans, that assumption might be incorrect. To some
| extent, failing to notice the gorilla might just be
| demonstrating a quirk in our brains' visual processing. If we
| expect data, we see data, no matter how obvious the gorilla
| might be. Failing to notice the gorilla doesn't necessarily
| mean that we'd also fail to notice the sorts of patterns or
| flaws that appear in real data. But on the other hand, people
| _do_ often fail to notice 'obvious' patterns in real data. To
| distinguish the two effects, you'd want a larger experiment
| with more types of 'obvious' flaws than just gorillas.
|
| For AI, those concerns are the same but magnified. On one hand,
| vision models are so alien that it's entirely plausible they
| can notice patterns reliably despite not seeing the gorilla. On
| the other hand, vision models are so unreliable that it's also
| plausible they can't notice patterns in graphs well at all.
|
| In any case, for both humans and AI, it's interesting what
| these examples reveal about their visual processing, which is
| in both cases something of a black box. That makes the gorilla
| experiment worth talking about regardless of what lessons it
| does or doesn't hold for real data analysis.
| zmgsabst wrote:
| But both models did see the gorilla when prompted with it...?
|
| ChatGPT:
|
| > It looks like the scatter plot unintentionally formed an
| artistic pattern rather than a meaningful representation of the
| data.
|
| Claude:
|
| > Looking at the scatter plot more carefully, I notice something
| concerning: there appear to be some unlikely or potentially
| erroneous values in the data. Let me analyze this in more detail.
|
| > Ah, now I see something very striking that I missed in my
| previous analysis - there appears to be a clear pattern in the
| data points that looks artificial. The data points form distinct
| curves and lines across the plot, which is highly unusual for
| what should be natural, continuous biological measurements.
|
| Given the context of asking for quantitative analysis and their
| general beaten-into-submission attitude where they defer to you,
| eg, your assertion this is a real dataset... I'm not sure what
| conclusion we're supposed to draw.
|
| That if you lie to the AI, it'll believe you...?
|
| Neither was prompted that this is potentially adversarial data --
| and AI don't generally infer social context very well. (A similar
| effect occurs with math tests.)
| hinkley wrote:
| Boring.
|
| I don't even like AI and I still will tell you this whole premise
| is bullshit.
|
| ChatGPT got
|
| > It looks like the scatter plot unintentionally formed an
| artistic pattern rather than a meaningful representation of the
| data.
|
| Claude drew a scatter plot with points that are so fat that it
| doesn't look like a gorilla. It looks like two graffiti artists
| fighting over drawing space.
|
| It's a resolution problem.
|
| What happens if you give Claude the picture ChatGPT generated?
| wyldfire wrote:
| If you give it stronger hints could it figure it out? "imagine
| the data plot is a raster image. what is pictured?"
| silverkiwi wrote:
| The evolution from LLM to reasoning model is simply multi-pass or
| recursive questioning.
|
| What's missing in the terminology is the modality - most often
| TEXT.
|
| So really we only have text LLM or text reasoning models at the
| moment.
|
| Your example illustrates the benefits of multimodal reasoning
| (using multiple modalities with multiple passes).
|
| Good news - this is coming (I'm working on it). Bad news - this
| massively increases the compute, as each pass now has to interact
| with each modality. Unless the LLM is fully multimodal (some
| are), the multi-pass questions now have to accommodate each
| modality. The number of extra possible paths massively increases.
| Hopefully we stumble across a nice solution. But the level of
| complexity massively increases with each additional modality
| (text, audio, images, video, etc.).
| ben_w wrote:
| On the one hand, this is very human behaviour, both literally and
| in general.
|
| Literally, because this is why the Datasaurus dozen was created:
| https://en.wikipedia.org/wiki/Datasaurus_dozen
|
| Metaphorically, because of all the times (including here, on this
| very article :P) where people comment on the basis of the
| headline rather than reading a story.
|
| On the other hand, this isn't the bit of human cognition we
| should be trying to automate, it's the bit we should be using AI
| to overcome.
| meltyness wrote:
| Does anyone know if tokenizers are pruned? That is, if a token
| doesn't appear in the corpus is it removed from the model? That
| would imply a process that leaks information about the dataset.
| appleorchard46 wrote:
| These posts about X task LLMs fail at when you give them Y prompt
| are getting more and more silly.
|
| If you ask an AI to analyze some data, should the _default
| behavior_ be to use that data to make various types of graphs,
| export said graphs, feed them back in to itself, then analyze the
| shapes of those graphs to see if they resemble an animal?
|
| Personally I would be very annoyed if I actually wanted a
| statistical analysis, and it spent a bajillion tokens following
| the process above in order to tell me my data looks like a
| chicken when you tip it sideways.
|
| > However, this same trait makes them potentially problematic for
| exploratory data analysis. The core value of EDA lies in its
| ability to generate novel hypotheses through pattern recognition.
| The fact that both Sonnet and 4o required explicit prompting to
| notice even dramatic visual patterns suggests they may miss
| crucial insights during open-ended exploration.
|
| It requires prompting for x if you want it to do x... That's a
| feature, not a bug. Note that no mention of open-ended
| exploration or approaching the data from alternate perspectives
| was made in the original prompt.
| debeloo wrote:
| I have to agree with this.
|
| Try sending this graph to an actual human analyst. His
| response, after you pay him, will probably be to cut off any
| further business relationship with you.
| amarshall wrote:
| I think it depends on whether one is using "AI" as a tool or as a
| replacement for an intelligent expert. If the former, sure, it's
| maybe not expected, because the prompter _is_ already an
| intelligent expert. If the latter, then yes, I think, because
| if you gave the task to an expert and they did not notice this,
| I would consider them not good at their job. See also Anscombe's
| quartet[1] and the Datasaurus dozen[2] (mentioned in another
| comment as well).
|
| [1]: https://en.wikipedia.org/wiki/Anscombe's_quartet [2]:
| https://en.wikipedia.org/wiki/Datasaurus_dozen
| appleorchard46 wrote:
| This is true, but I would replace 'intelligent expert' with
| 'intelligent _human_ expert'.
|
| Graphing data to analyze it - and then seeing shapes and
| creatures in said graph - is a distinctly human practice, and
| not an inherently necessary part of most data analysis (the
| obvious exception being when said data draws a picture).
|
| I think it's because the interface uses human language that
| people expect AI to make the same assumptions and follow the
| same processes as humans. In some ways it does, in other ways
| it doesn't.
| Dylan16807 wrote:
| You don't think "Examine the data" and "Which other conclusions
| can you draw from the data?" are open-ended?
|
| And even when explicitly prompted to look at the plot, they
| only brush up against the data anomalies rather than properly
| analyzing the plot.
| isaacremuant wrote:
| I benchmark many of these things as "what would I want a human
| assistant to do" if they had insta speed, and noticing the
| pattern would definitely be warranted to determine if the data
| could have been falsely generated.
|
| It's not silly at all.
| hollownobody wrote:
| AFAIK, these models can't "look" at the plots they build. So it
| is necessary to send the screenshots, otherwise they will never
| notice the gorilla.
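|
| The round trip looks roughly like this - a minimal sketch, assuming
| a pandas DataFrame df with the article's steps/bmi columns and the
| OpenAI Python client (other providers take image inputs in a
| similar way):
|
|     import base64, io
|     import matplotlib.pyplot as plt
|     from openai import OpenAI
|
|     # Render the scatter plot to an in-memory PNG instead of only
|     # displaying it inside the analysis sandbox.
|     fig, ax = plt.subplots()
|     ax.scatter(df["steps"], df["bmi"], s=4)
|     buf = io.BytesIO()
|     fig.savefig(buf, format="png")
|     b64 = base64.b64encode(buf.getvalue()).decode()
|
|     # Feed the rendered image back in as an image part, so the model
|     # "sees" the plot rather than just the code that produced it.
|     resp = OpenAI().chat.completions.create(
|         model="gpt-4o",
|         messages=[{
|             "role": "user",
|             "content": [
|                 {"type": "text",
|                  "text": "Describe anything unusual in this plot."},
|                 {"type": "image_url",
|                  "image_url": {"url": "data:image/png;base64," + b64}},
|             ],
|         }],
|     )
|     print(resp.choices[0].message.content)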
| mmanfrin wrote:
| This is akin to giving it a photo of the stars and asking it what
| it sees. If you want to bake pareidolia into LLMs, prepare to pay
| 100x for your requests.
| lovasoa wrote:
| I tried passing just the plot to several models:
|
|     ChatGPT (4o): Noticed "a pattern"
|     Le Chat (Mistral): Noticed a "cartoonish figure"
|     DeepSeek (R1): Completely missed it
|     Claude: Completely missed it
|     Gemini 2.0 Flash: Completely missed it
|     Gemini 2.0 Flash Thinking: Noticed "a monkey"
| neom wrote:
| I asked ChatGPT (Pro) why it thought that it missed it sometimes
| and not others, and it said that when it's presented with a user
| input it takes time to decide its approach, sometimes more
| "teacherly", sometimes more "investigative". If it took the
| investigative approach, it would read the code line by line; if it
| took a teacherly approach, it would treat it as a statistical
| interpretation exercise.
| _zamorano_ wrote:
| Someday, an LLM will send every human an 'obvious' pattern (maybe
| a weird protein or something like that) and we'll all fail to
| notice it, and that day Skynet will decide it no longer has a use
| for us.
| s1mplicissimus wrote:
| Maybe. But before I start worrying about that I'll have to see
| them count correctly or not be fooled by variations of simple
| riddles
| dragoncrab wrote:
| How do LLMs calculate statistic metrics like average or standard
| deviation accurately in such experiments?
| orbital-decay wrote:
| This doesn't seem to make sense. Can a human spot a gorilla in a
| sequence of numbers? Try it. Later on, he gives it a picture and
| it correctly spots the mistake.
|
| _> but does not specifically understand the pattern as a
| gorilla_
|
| Maybe it does, how could you tell? Do you really expect an
| assistant to say "Holy shit, there's a gorilla in your plot!"?
| The only thing relevant to the request is that the data seems
| fishy, and it outputs exactly this. Maybe something trained for
| creative writing, agency, character, and witty remarks (like
| Claude 3 Opus) would be inclined to do that, and that would be
| amusing, but that's pretty optional for the presented task.
| PunchTornado wrote:
| Why do these articles test only 2 models, not even the best ones
| there are, and generalise to all AIs?
| rsanek wrote:
| Humans seem to me to be just as likely to make these kinds of
| errors in the general case. See the classic
| https://www.youtube.com/watch?v=vJG698U2Mvo, which has an
| interesting parallel with this paper.
| Nihilartikel wrote:
| Why aren't we training LLMs to load the data into
| R/pandas/polars/duckdb and then interrogate it iteratively that
| way?
|
| It's how I do it. Why not our pet LLM?
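|
| Something like this, roughly - a minimal sketch over a hypothetical
| health.csv with the article's steps/bmi/gender columns, where the
| LLM would be the one emitting each query and reading the raw
| results back into its context:
|
|     import duckdb
|
|     con = duckdb.connect()
|
|     # Each query is one "turn": the model writes SQL, a tool runs it,
|     # and the raw result goes back into the conversation.
|     print(con.execute(
|         "SELECT count(*), avg(bmi), stddev(bmi) FROM 'health.csv'"
|     ).fetchall())
|
|     print(con.execute("""
|         SELECT gender, avg(steps) AS mean_steps, avg(bmi) AS mean_bmi
|         FROM 'health.csv'
|         GROUP BY gender
|     """).fetchall())
|
|     # A coarse 2D histogram exposes the shape of the point cloud even
|     # as text; enough to notice that something odd is going on.
|     print(con.execute("""
|         SELECT floor(steps / 1000) AS steps_bin,
|                floor(bmi / 2) AS bmi_bin,
|                count(*) AS n
|         FROM 'health.csv'
|         GROUP BY 1, 2 ORDER BY n DESC LIMIT 20
|     """).fetchall())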
| 2-3-7-43-1807 wrote:
| If we're looking for AGI, then not noticing the gorilla might be a
| good thing. I'm referring to the gorilla counting experiment.
___________________________________________________________________
(page generated 2025-02-08 23:00 UTC)