[HN Gopher] Seven replies to the viral Apple reasoning paper and...
       ___________________________________________________________________
        
       Seven replies to the viral Apple reasoning paper and why they fall
       short
        
       Author : spwestwood
       Score  : 145 points
       Date   : 2025-06-14 19:52 UTC (3 hours ago)
        
 (HTM) web link (garymarcus.substack.com)
 (TXT) w3m dump (garymarcus.substack.com)
        
       | bluefirebrand wrote:
       | I'm glad to read articles like this one, because I think it is
       | important that we pour some water on the hype cycle
       | 
       | If we want to get serious about using these new AI tools then we
       | need to come out of the clouds and get real about their
       | capabilities
       | 
       | Are they impressive? Sure. Useful? Yes probably in a lot of cases
       | 
       | But we cannot continue the hype this way, it doesn't serve anyone
       | except the people who are financially invested in these tools.
        
         | fhd2 wrote:
         | Even of the people invested in these tools, hype only benefits
         | those attempting a pump and dump scheme, or those selling
         | training, consulting or similar services around AI.
         | 
         | People who try to make genuine progress, while there's more
         | money in it now, might just have to deal with another AI winter
         | soon at this rate.
        
           | bluefirebrand wrote:
           | > hype only benefits those attempting a pump and dump scheme
           | 
           | I read some posts the other day saying Sam Altman sold off a
           | ton of his OpenAI shares. Not sure if it's true and I can't
           | find a good source, but if it is true then "pump and dump"
           | does look close to the mark
        
             | aeronaut80 wrote:
             | You probably can't find a good source because sources say
             | he has a negligible stake in OpenAI.
             | https://www.cnbc.com/amp/2024/12/10/billionaire-sam-
             | altman-d...
        
               | bluefirebrand wrote:
               | Interesting
               | 
               | When I did a cursory search, this information didn't turn
               | up either
               | 
               | Thanks for correcting me. I suppose the stuff I saw the
               | other day was just BS then
        
               | aeronaut80 wrote:
               | To be fair I struggle to believe he's doing it out of the
               | goodness of his heart.
        
           | spookie wrote:
           | Think the same thing, we need more breakthroughs. Until then,
           | it is still risky to rely on AI for most applications.
           | 
           | The sad thing is that most would take this comment the wrong
           | way. Assuming it is just another doomer take. No, there is
            | still a lot to do, and promising the world too soon will
           | only lead to disappointment.
        
             | Zigurd wrote:
             | This is the thing of it: _" for most applications."_
             | 
              | LLMs are not thinking. The way they fail, which is
             | confidently and articulately, is one way they reveal there
             | is no mind behind the bland but well-structured text.
             | 
             | But if I was tasked with finding 500 patents with weak
             | claims or claims that have been litigated and knocked down,
              | I would turn to LLMs to help automate that. One or two
             | "nines" of reliability is fine, and LLMs would turn this
             | previously impossible task into something plausible to take
             | on.
        
         | mountainriver wrote:
         | I'll take critiques from someone who knows what a test train
         | split is.
         | 
         | The idea that a guy so removed from machine learning has
         | something relevant to say about its capabilities really speaks
         | to the state of AI fear
        
           | devwastaken wrote:
            | experts are often too blinded by their paychecks to see how
            | nonsensical their expertise is
        
             | soulofmischief wrote:
             | [citation needed]
        
               | Spooky23 wrote:
               | Remember Web 3.0? Lol
        
               | Zigurd wrote:
               | It's unfortunate that a discussion about LLM weaknesses
               | is giving crypto bro. But telling. There are a lot of
               | bubble valuations out there.
        
             | mountainriver wrote:
             | Not knowing the most basic things about the subject you are
             | critiquing is utter nonsense. Defending someone who does
             | this is even worse
        
           | Spooky23 wrote:
           | The idea that practitioners would try to discredit research
           | to protect the golden goose from critique speaks to human
           | nature.
        
             | mountainriver wrote:
              | No one is discrediting research from valid places. This is
              | the alt-right-style victim narrative that seems to follow
              | Gary Marcus around: somehow the mainstream is "suppressing"
              | the real knowledge.
        
         | senko wrote:
          | Gary Marcus isn't about "getting real"; he's making a name for
          | himself as a contrarian to the popular AI narrative.
         | 
         | This article may seem reasonable, but here he's defending a
         | paper that in his previous article he called "A knockout blow
         | for LLMs".
         | 
         | Many of his articles seem reasonable (if a bit off) until you
          | read a couple dozen and spot a trend.
        
           | adamgordonbell wrote:
           | This!
           | 
           | For all his complaints about llms, his writing could be
           | generated by an llm with a prompt saying: 'write an article
           | responding to this news with an essay saying that you are
           | once again right that this AI stuff is overblown and will
           | never amount to anything.'
        
           | steamrolled wrote:
           | > Gary Marcus isn't about "getting real", it's making a name
           | for himself as a contrarian to the popular AI narrative.
           | 
           | That's an odd standard. Not wanting to be wrong is a
           | universal human instinct. By that logic, every person who
           | ever took any position on LLMs is automatically
           | untrustworthy. After all, they made a name for themselves by
           | being pro- or con-. Or maybe a centrist - that's a position
           | too.
           | 
           | Either he makes good points or he doesn't. Unless he has a
           | track record of distorting facts, his ideological leanings
           | should be irrelevant.
        
             | sinenomine wrote:
             | Marcus' points routinely fail to pass scrutiny, nobody in
             | the field takes him seriously. If you seek real
             | scientifically interesting LLM criticism, read Francois
             | Chollet and his Arc AGI series of evals.
        
             | senko wrote:
             | He makes many very good points:
             | 
              | For example, he continually calls out AGI hype for what it
             | is, and also showcases dangers of naive use of LLMs (eg.
             | lawyers copy-pasting hallucinated cases into their
             | documents, etc). For this, he has plenty of material!
             | 
              | He also makes some very bad points and worse inferences:
              | that LLMs as a technology are useless because they can't
              | lead to AGI, that hallucination makes LLMs useless (but then
              | he contradicts himself in another article conceding they
              | "may have some use"), that because they can't follow an
              | algorithm they're useless, that scaling laws are over and
              | therefore LLMs won't advance (he's been making that claim
              | for a couple of years), that the AI bubble will collapse in
              | a few months (also a few years of that), etc.
             | 
              | Read any of his articles (I've read too many, sadly) and
             | you'll never come to the conclusion that LLMs might be a
             | useful technology, or be "a good thing" even in some
             | limited way. This just doesn't fit with reality I can
             | observe with my own eyes.
             | 
             | To me, this shows he's incredibly biased. That's okay if he
             | wants to be a pundit - I couldn't blame Gruber for being
             | biased about Apple! But Marcus presents himself as the
             | authority on AI, a scientist, showing a real and unbiased
             | view on the field. In fact, he's as full of hype as Sam
             | Altman is, just in another direction.
             | 
              | Imagine he was talking about aviation, not AI. A 787
              | Dreamliner crashes? "I've been saying for 10 years that
              | airplanes are unsafe, they can fall from the sky!" Boeing
              | the company does stupid shit? "Blown door shows why
              | airplane makers can't be trusted." An airline goes bankrupt?
              | "Air travel winter is here."
             | 
             | I've spoken to too many intelligent people who read Marcus,
             | take him at his words and have incredibly warped views on
             | the actual potential and dangers of AI (and send me links
             | to his latest piece with "so this sounds pretty damning,
             | what's your take?"). He does real damage.
             | 
             | Compare him with Simon Willison, who also writes about AI a
             | lot, and is vocal about its shortcomings and dangers.
             | Reading Simon, I never get the feeling I'm being sold on a
             | story (either positive or negative), but that I learned
             | something.
             | 
              | Perhaps a Marcus is inevitable, a symptom of the Internet's
              | immune response to the huge amount of AI hype and
              | bullshit being thrown around. Perhaps Gary is just fed up
             | with everything and comes out guns blazing, science be
             | damned. I don't know.
             | 
              | But in my mind, he's as much a BSer as the AGI singularity
             | hypers.
        
           | 2muchcoffeeman wrote:
           | What's the argument here that he's not considering all the
           | information regarding GenAI?
           | 
           | That there's a trend to his opinion?
           | 
           | If I consider all the evidence regarding gravity, all my
           | papers will be "gravity is real".
           | 
           | In what ways is he only choosing what he wants to hear?
        
             | senko wrote:
             | Replied elsewhere in the thread:
             | https://news.ycombinator.com/item?id=44279283
             | 
             | To your example about gravity, I argue that he goes from
             | "gravity is real" to "therefore we can't fly", and "yeah
             | maybe some people can but that's not really solving gravity
             | and they need to go down eventually!"
        
           | bobxmax wrote:
           | Hacker news eats his shtick up because the average HN
           | commenter is the same thing - needlessly contrarian towards
           | AI because it threatens their own ego.
        
             | g-b-r wrote:
             | I see the opposite, the wide majority of people commenting
             | on Hacker News seem now very favorable to LLMs.
        
           | newswasboring wrote:
           | What exactly is your objection here? That the guy has an
           | opinion and is writing about it?
        
             | senko wrote:
              | Replied elsewhere in the thread:
             | https://news.ycombinator.com/item?id=44279283
        
         | bigyabai wrote:
         | There's something innately funny about "HN's undying optimism"
         | and "bad-news paper from Apple" reaching a head like this. An
         | unstoppable object is careening towards an impervious wall,
         | anything could happen.
        
         | DiogenesKynikos wrote:
         | I don't understand what people mean when they say that AI is
         | being hyped.
         | 
         | AI is at the point where you can have a conversation with it
         | about almost anything, and it will answer more intelligently
         | than 90% of people. That's incredibly impressive, and normal
         | people don't need to be sold on it. They're just naturally
         | impressed by it.
        
           | FranzFerdiNaN wrote:
           | I don't need a tool that's right maybe 70% of the time (and
           | that's me being optimistic). It needs to be right all the
           | time or at least tell you when it doesn't know for sure,
           | instead of just making up something. Comparing it to going
           | out in the streets and asking random people random questions
           | is not a good comparison.
        
             | newswasboring wrote:
             | > I don't need a tool that's right maybe 70% of the time
             | (and that's me being optimistic).
             | 
             | Where are you getting this from? 70%?
        
           | hellohello2 wrote:
            | It's quite simple: people upvote content that makes them feel
            | good. Most of us here are programmers, and the idea that many
            | of our skills are becoming replaceable feels quite bad.
           | Hence, people upvote delusional statements that let them
           | believe in something that feels better than objective
           | reality. With any luck, these comments will be scraped and
           | used to train the next AI generation, relieving it from the
           | burden of factuality at last.
        
           | travisgriggs wrote:
           | I get even better results talking to myself.
        
           | georgemcbay wrote:
           | AI, in the form of LLMs, can be a useful tool.
           | 
           | It is still being vastly overhyped, though, by people
           | attempting to sell the idea that we are actually close to an
           | AGI "singularity".
           | 
            | Such overhype is usually easy to handwave away as, like, not
            | my problem. Like, if investors get fooled into thinking this is
           | anything like AGI, well, a fool and his money and all that.
           | But investors aside this AI hype is likely to have some very
           | bad real world consequences based on the same hype-men
           | selling people on the idea that we need to generate 2-4 times
           | more power than we currently do to power this godlike AI they
           | are claiming is imminent.
           | 
           | And even right now there's massive real world impact in the
           | form of say, how much grok is polluting Georgia.
        
           | woopsn wrote:
           | If the claims about AI were that it is a great or even
           | incredible chat app, there would be no mismatch.
           | 
           | I think normal people understand curing all disease,
           | replacing all value, generating 100x stock market returns,
           | uploading our minds etc to be hype.
           | 
            | I said a few days ago that LLMs are an amazing product. Sad
            | that these people ruin their credibility immediately upon
            | success.
        
         | bandrami wrote:
         | How actually useful are they though? We've had more than a year
         | now of saying these things 10X knowledge workers and creatives,
         | so.... where is the output? Is there a new office suite I can
         | try? 10 times as many mobile apps? A huge new library of
         | ebooks? Is this actually in practice producing things beyond
         | Ghibli memes and RETVRN nostalgia slop?
        
           | 2muchcoffeeman wrote:
           | I think it largely depends on what you're writing. I've had
            | it reply to corporate emails, which is good since I need to
            | sound professional, not human.
           | 
            | If I'm coding, it still needs a lot of babysitting, and
            | sometimes I'm much faster than it.
        
             | Gigachad wrote:
              | And then the person on the other end is using AI to
              | summarise the email back to normal English. To what end?
        
               | js8 wrote:
               | But look the GDP has increased!
        
               | bandrami wrote:
               | But that's what I don't get: it hasn't in that scenario
               | because that doesn't lead to a greater circulation of
               | money at any point. And that's the big thing I'm looking
               | for: something AI has created that consumers are willing
               | to pay for. Because if that doesn't end up happening no
               | amount of sunk investment is going to save the ecosystem.
        
             | bandrami wrote:
             | So this would be an interesting output to measure but I
             | have no idea how we would do that: has the volume of
             | corporate email gone up? Or the time spent creating it gone
             | down?
        
         | landl0rd wrote:
         | I am by nature strongly suspicious of LLMs. Most of the code
         | they write for me is crap. I don't like them much or use them
         | much, though I do think they'll advance enough to be highly
         | useful for me with time.
         | 
         | With that said, Marcus is an idiot who has no place in the
         | discourse. His presence just drowns out substantive or useful
         | remarks. Everything he writes is just hyperbole-infested red
         | meat for anyone who is to any extent anti-AI. It's
         | "respectability laundering": they point to him as a source,
         | thus holding him up as a valid or quality source.
        
       | hiddencost wrote:
       | Why do we keep posting stuff from Gary? He's been wrong for
       | decades but he keeps writing this stuff.
       | 
       | As far as I can tell he's the person that people reach for when
        | they want to justify their beliefs. But surely being this wrong
        | for this long should eventually lead to losing one's status as an
        | expert.
        
         | NoahZuniga wrote:
         | None of the arguments presented in this piece depend on his
         | authority as an expert, so this is largely irrelevant.
        
         | jakewins wrote:
         | I thought this article seemed like well articulated criticism
         | of the hype cycle - can you be more specific what you mean? Are
         | the results in the Apple paper incorrect?
        
           | astrange wrote:
           | Gary Marcus always, always says AI doesn't actually work -
           | it's his whole thing. If he's posted a correct argument it's
           | a coincidence. I remember seeing him claim real long-time AI
           | researchers like David Chapman (who's a critic himself) were
            | wrong any time they said anything positive.
           | 
           | (em-dash avoided to look less AI)
           | 
           | Of course, the main issue with the field is the critics
           | /should/ be correct. Like, LLMs shouldn't work and nobody
           | knows why they work. But they do anyway.
           | 
           | So you end up with critics complaining it's "just a parrot"
           | and then patting themselves on the back, as if inventing a
           | parrot isn't supposed to be impressive somehow.
        
             | foldr wrote:
             | I don't read GM as saying that LLMs "don't work" in a
             | practical sense. He acknowledges that they have useful
             | applications. Indeed, if they didn't work at all, why would
             | he be advocating for regulating their use? He just doesn't
             | think they're close to AGI.
        
               | kadushka wrote:
                | The funny thing is, if you had asked "what is AGI" 5
                | years ago, most people would have described something
                | like o3.
        
               | foldr wrote:
               | Even Sam Altman thinks we're not at AGI yet (although of
               | course it's coming "soon").
        
           | barrkel wrote:
           | You need to read everything that Gary writes with the
           | particular axe to grind he has in mind: neurosymbolic AI.
            | That's his specialism, and he essentially has a chip on his
            | shoulder about the attention probabilistic approaches like
           | LLMs are getting, and their relative success.
           | 
           | You can see this in this article too.
           | 
           | The real question you should be asking is if there is a
           | practical limitation in LLMs and LRMs revealed by the Hanoi
           | Towers problem or not, given that any SOTA model can write
           | code to solve the problem and thereby solve it with tool use.
           | Gary frames this as neurosymbolic, but I think it's a bit of
           | a fudge.
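            | 
            | For illustration, the kind of program meant here is tiny; a
            | minimal Python sketch of the textbook recursive solution,
            | which is the sort of thing any SOTA model will produce on
            | request:
            | 
            |       def hanoi(n, src, aux, dst, moves):
            |           # Move n disks from src to dst, using aux as the spare peg.
            |           if n == 0:
            |               return
            |           hanoi(n - 1, src, dst, aux, moves)
            |           moves.append((src, dst))   # move the largest remaining disk
            |           hanoi(n - 1, aux, src, dst, moves)
            | 
            |       moves = []
            |       hanoi(8, "A", "B", "C", moves)
            |       print(len(moves))              # 255 moves, i.e. 2^8 - 1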
        
             | krackers wrote:
             | Hasn't the symbolic vs statistical split in AI existed for
             | a long time? With things like Cyc growing out of the
             | former. I'm not too familiar with linguistics but maybe
             | this extends there too, since I think Chomsky was heavy on
             | formal grammars over probabilistic models [1].
             | 
              | Must be some sort of cognitive sunk cost fallacy: after
              | dedicating your life to one sect, it must be emotionally
             | hard to see the other "keep winning". Of course you'd root
             | for them to fall.
             | 
             | [1] https://norvig.com/chomsky.html
        
         | mountainriver wrote:
         | It's insane, he doesn't know what a test train split is but
         | he's an AI expert? Is this where we are?
        
         | marvinborner wrote:
         | Is this supposed to be a joke reflecting point (3)?
        
       | hrldcpr wrote:
       | In case anyone else missed the original paper (and discussion):
       | 
       | https://news.ycombinator.com/item?id=44203562
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _The Illusion of Thinking: Strengths and limitations of
         | reasoning models [pdf]_ -
         | https://news.ycombinator.com/item?id=44203562 - June 2025 (269
         | comments)
         | 
         | Also this: _A Knockout Blow for LLMs?_ -
         | https://news.ycombinator.com/item?id=44215131 - June 2025 (48
         | comments)
         | 
         | Were there others?
        
       | avsteele wrote:
       | This doesn't rebut anything from the best critique of the Apple
       | paper.
       | 
       | https://arxiv.org/abs/2506.09250
        
         | Jabbles wrote:
         | Those are points (2) and (5).
        
         | foldr wrote:
         | It does rebut point (1) of the abstract. Perhaps not
          | convincingly, in your view, but it does directly address this
         | kind of response.
        
           | avsteele wrote:
           | Papers make specific conclusions based on specific data. The
           | paper I linked specifically rebuts the conclusions of the
           | paper. Gary makes vague statements that could be interpreted
           | as being related.
           | 
           | It is scientific malpractice to write a post supposedly
           | rebutting responses to a paper and not directly address the
           | most salient one.
        
             | foldr wrote:
             | This sort of omission would not be considered scientific
             | malpractice even in a journal article, let alone a blog
             | post. A rebuttal of a position that fails to address the
             | strongest arguments for it is a bad rebuttal, but it's not
             | scientific malpractice to write a bad paper -- let alone a
             | bad blog post.
             | 
             | I don't think I agree with you that GM isn't addressing the
             | points in the paper you link. But in any case, you're not
             | doing your argument any favors by throwing in wild
             | accusations of malpractice.
        
               | avsteele wrote:
                | Malpractice is slightly hyperbolic.
                | 
                | But anybody relying on Gary's posts in order to be
                | informed on this subject is being misled. This
                | isn't an isolated incident either.
                | 
                | People need to be made aware that when you read him it is
                | mere punditry, not substantive engagement with the
                | literature.
        
         | spookie wrote:
         | A paper citing arxiv papers and x.com doesn't pass my smell
         | test tbh
        
       | skywhopper wrote:
       | The quote from the Salesforce paper is important: "agents
       | displayed near-zero confidentiality awareness".
        
       | bowsamic wrote:
        | This doesn't address the primary issue: they had no methodology
        | for choosing puzzles that weren't in the training set. They
        | claimed to have chosen such puzzles, but they didn't explain why
        | they believe that. The whole point of the paper was to test LLM
        | reasoning on untrained cases, but there's no reason to expect
        | these puzzles not to be in the training set, and if you have no
        | way of telling whether they are, your paper is not going to work
        | out.
        
         | roywiggins wrote:
         | Isn't it worse for LLMs if an LLM that has been trained on the
         | Towers of Hanoi still can't solve it reliably?
        
         | anonthrowawy wrote:
         | how could you prove that?
        
       | mentalgear wrote:
        | AI hype-bros like to complain that real AI experts are more
        | concerned with debunking current AI than improving it - but the
       | truth is that debunking bad AI IS improving AI. Science is a
       | process of trial and error which only works by continuously
       | questioning the current state.
        
         | neepi wrote:
         | Indeed. I completely agree with this.
         | 
          | My objection to the whole thing is that the AI hype-bro stuff,
          | which is really the funding-solicitation facade over everything
          | rather than the truth, only has one outcome: it cannot be
          | sustained. At that point all investor confidence disappears, the
          | money is gone, and everyone loses access to the tools they
          | suddenly built all their dependencies on, because it's all based
          | on proprietary service models.
         | 
         | Which is why I am not poking it with a 10 foot long shitty
         | stick any time in the near future. The failure mode scares me,
         | not the technology which arguably does have some use in non-
         | idiot hands.
        
           | wongarsu wrote:
           | A lot of the best internet services came around in the decade
           | after the dot-com crash. There is a chance Anthropic or
           | OpenAI may not survive when funding suddenly dries up, but
           | existing open weight models won't be majorly impacted. There
           | will always be someone willing to host DeepSeek for you if
           | you're willing to pay.
           | 
           | And while it will be sad to see model improvements slow down
           | when the bubble bursts there is a lot of untapped potential
           | in the models we already have. Especially as they become
           | cheaper and easier to run
        
             | neepi wrote:
             | Someone might host DeepSeek for you but you'll pay through
             | the nose for it and it'll be frozen in time because the
             | training cost doesn't have the revenue to keep the ball
             | rolling.
             | 
             | I'm not sure the GPU market won't collapse with it either.
             | Possibly taking out a chunk of TSMC in the process, which
             | will then have knock on effects across the whole industry.
        
               | wongarsu wrote:
               | There are already inference providers like DeepInfra or
               | inference.net whose entire business model is hosted
               | inference of open-source models. They promise not to keep
               | or use any of the data and their business model has no
               | scaling effects, so I assume they are already charging a
               | fair market rate where the price covers the costs and
               | returns a profit.
               | 
               | The GPU market will probably take a hit. But the flip
               | side of that is that the market will be flooded with
               | second-hand enterprise-grade GPUs. And if Nvidia needs
               | sales from consumer GPUs again we might see more
               | attractive prices and configurations there too. In the
               | short term a market shock might be great for hobby-scale
               | inference, and maybe even training (at the 7B scale). In
               | the long term it will hurt, but if all else fails we
               | still have AMD who are somehow barely invested in this AI
               | boom
        
         | xoac wrote:
          | Yeah this is history repeating. See for example the less-known
          | "Dreyfus affair" at MIT and the brilliantly titled books:
         | "What Computers Can't Do" and its sequel "What Computers Still
         | Can't Do".
        
         | bobxmax wrote:
         | > AI hype-bros like to complain that real AI experts are too
         | much concerned about debunking current AI then improving it
         | 
          | You're acting like this is a common occurrence lol
        
         | 3abiton wrote:
          | To hammer one point though, you have to understand that
          | researchers are desensitized to minor novel improvements that
          | translate into high-value products. While obviously studying and
          | assessing the limitations of AI is crucial, to the general
          | public its capabilities are just so amazing that they can't
          | fathom why we should think about limitations. Optimizing what we
          | have is better than rethinking the whole process.
        
         | dang wrote:
         | Can you please make your substantive points without name-
         | calling or swipes? This is in the site guidelines:
         | https://news.ycombinator.com/newsguidelines.html.
        
       | labrador wrote:
       | The key insight is that LLMs can 'reason' when they've seen
       | similar solutions in training data, but this breaks down on truly
       | novel problems. This isn't reasoning exactly, but close enough to
       | be useful in many circumstances. Repeating solutions on demand
       | can be handy, just like repeating facts on demand is handy.
       | Marcus gets this right technically but focuses too much on
       | emotional arguments rather than clear explanation.
        
         | Jabrov wrote:
         | I'm so tired of hearing this be repeated, like the whole "LLMs
         | are _just_ parrots" thing.
         | 
         | It's patently obvious to me that LLMs can reason and solve
         | novel problems not in their training data. You can test this
         | out in so many ways, and there's so many examples out there.
         | 
         | ______________
         | 
         | Edit for responders, instead of replying to each:
         | 
         | We obviously have to define what we mean by "reasoning" and
         | "solving novel problems". From my point of view, reasoning !=
         | general intelligence. I also consider reasoning to be a
         | spectrum. Just because it cannot solve the hardest problem you
         | can think of does not mean it cannot reason at all. Do note, I
         | think LLMs are generally pretty bad at reasoning. But I
         | disagree with the point that LLMs cannot reason at all or never
         | solve any novel problems.
         | 
         | In terms of some backing points/examples:
         | 
         | 1) Next token prediction can itself be argued to be a task that
         | requires reasoning
         | 
         | 2) You can construct a variety of language translation tasks,
         | with completely made up languages, that LLMs can complete
         | successfully. There's tons of research about in-context
         | learning and zero-shot performance.
         | 
         | 3) Tons of people have created all kinds of
         | challenges/games/puzzles to prove that LLMs can't reason. One
         | by one, they invariably get solved (eg. https://gist.github.com
         | /VictorTaelin/8ec1d8a0a3c87af31c25224...,
         | https://ahmorse.medium.com/llms-and-reasoning-part-i-the-
         | mon...) -- sometimes even when the cutoff date for the LLM is
         | before the puzzle was published.
         | 
         | 4) Lots of examples of research about out-of-context reasoning
         | (eg. https://arxiv.org/abs/2406.14546)
         | 
         | In terms of specific rebuttals to the post:
         | 
         | 1) Even though they start to fail at some complexity threshold,
         | it's incredibly impressive that LLMs can solve any of these
         | difficult puzzles at all! GPT3.5 couldn't do that. We're making
         | incremental progress in terms of reasoning. Bigger, smarter
         | models get better at zero-shot tasks, and I think that
         | correlates with reasoning.
         | 
          | 2) Regarding point 4 ("Bigger models might do better"): I
         | think this is very dismissive. The paper itself shows a huge
         | variance in the performance of different models. For example,
         | in figure 8, we see Claude 3.7 significantly outperforming
         | DeepSeek and maintaining stable solutions for a much longer
         | sequence length. Figure 5 also shows that better models and
         | more tokens improve performance at "medium" difficulty
         | problems. Just because it cannot solve the "hard" problems does
         | not mean it cannot reason at all, nor does it necessarily mean
         | it will never get there. Many people were saying we'd never be
         | able to solve problems like the medium ones a few years ago,
         | but now the goal posts have just shifted.
        
           | labrador wrote:
            | I've done this exercise dozens of times because people keep
           | saying it, but I can't find an example where this is true. I
           | wish it was. I'd be solving world problems with novel
           | solutions right now.
           | 
           | People make a common mistake by conflating "solving problems
           | with novel surface features" with "reasoning outside training
           | data." This is exactly the kind of binary thinking I
           | mentioned earlier.
        
           | lossolo wrote:
           | They can't create anything novel and it's patently obvious if
           | you understand how they're implemented. But I'm just some
           | anonymous guy on HN, so maybe this time I will just cite the
           | opinion of the DeepMind CEO, who said in a recent interview
           | with The Verge (available on YouTube) that LLMs based on
           | transformers can't create anything truly novel.
        
             | labrador wrote:
             | "I don't think today's systems can invent, you know, do
             | true invention, true creativity, hypothesize new scientific
             | theories. They're extremely useful, they're impressive, but
             | they have holes."
             | 
             | Demis Hassabis On The Future of Work in the Age of AI (@
             | 2:30 mark)
             | 
             | https://www.youtube.com/watch?v=CRraHg4Ks_g
        
               | lossolo wrote:
               | Yes, this one. Thanks
        
           | bfung wrote:
           | Any links or examples available? Curious to try it out
        
           | multjoy wrote:
           | Lol, no.
        
           | aucisson_masque wrote:
           | > It's patently obvious that LLMs can reason and solve novel
           | problems not in their training data.
           | 
           | Would you care to tell us more ?
           | 
            | "It's patently obvious" is not really an argument; I could
            | say just as well that everyone knows LLMs can't reason or
            | think (in the way we living beings do).
        
           | andrewmcwatters wrote:
           | It's definitely not true in any meaningful sense. There are
           | plenty of us practitioners in software engineering wishing it
           | was true, because if it was, we'd all have genius interns
           | working for us on Mac Studios at home.
           | 
           | It's not true. It's plainly not true. Go have any of these
           | models, paid, or local try to build you novel solutions to
           | hard, existing problems despite being, in some cases, trained
           | on literally the entire compendium of open knowledge in not
           | just one, but multiple adjacent fields. Not to mention the
           | fact that being able to abstract general knowledge would mean
           | it _would_ be able to reason.
           | 
           | They. Cannot. Do it.
           | 
           | I have no idea what you people are talking about because you
           | cannot be working on anything with real substance that hasn't
           | been perfectly line fit to your abundantly worked on
           | problems, but no, these models are obviously not reasoning.
           | 
           | I built a digital employee and gave it menial tasks that
           | compare to current cloud solutions who also claim to be able
           | to provide you paid cloud AI employees and these things are
           | stupider than fresh college grads.
        
           | goalieca wrote:
           | So far they cannot even answer questions which are straight
           | up fact checking and search engine like queries. Reasoning
           | means they would be able to work through a problem and
            | generate a proof the way a student might.
        
         | swat535 wrote:
          | If that were the case, it would have been great already, but
          | these tools can't even do that. They frequently make mistakes
          | repeating the same solutions available everywhere during their
          | "reasoning" process and fabricate plausible hallucinations,
          | which you then have to inspect carefully to catch.
        
         | aucisson_masque wrote:
          | That's the opposite of reasoning tho. AI bros want to make
          | people believe LLMs are smart but they're not capable of
          | intelligence and reasoning.
         | 
          | Reasoning means you can take on a problem you've never seen
         | before and think of innovative ways to solve it.
         | 
          | An LLM can only replicate what is in its data; it can in no way
          | think or guess or estimate what will likely be the best
          | solution. It can only output a solution based on a probability
          | calculation over how frequently it has seen this solution
          | linked to this problem.
        
           | labrador wrote:
           | You're assuming we're saying LLMs can't reason. That's not
           | what we're saying. They can execute reasoning-like processes
           | when they've seen similar patterns, but this breaks down when
           | true novel reasoning is required. Most people do the same
            | thing. Some people can come up with novel solutions to new
           | problems, but LLMs will choke. Here's an example:
           | 
           | Prompt: "Let's try a reasoning test. Estimate how many pianos
           | there are at the bottom of the sea."
           | 
           | I tried this on three advanced AIs* and they all choked on it
            | without further hints from me. Claude then said:
            | 
            |       Roughly 3 million shipwrecks on ocean floors globally
            |       Maybe 1 in 1000 ships historically carried a piano
            |       (passenger ships, luxury vessels)
            |       So ~3,000 ships with pianos sunk
            |       Average maybe 0.5 pianos per ship (not all passenger
            |       areas had them)
            |       Estimate: ~1,500 pianos
           | 
           | *Claude Sonnet 4, Google Gemini 2.5 and GPT 4o
        
             | Jabrov wrote:
             | That seems like a totally reasonable response ... ?
        
       | ummonk wrote:
       | Most of the objections and their counterarguments seem like
       | either poor objections (e.g. ad hominem against the first listed
       | author) or seem to be subsumed under point 5. It's annoying that
       | most of this post focuses so much effort on discussing most of
       | the other objections when the important discussion is the one to
       | be had in point 5:
       | 
       | I.e. to what extent are LLMs able to reliably make use of writing
       | code or using logic systems, and to what extent does
       | hallucinating / providing faulty answers in the absence of such
       | tool access demonstrate an inability to truly reason (I'd expect
       | a smart human to just say "that's too much" or "that's beyond my
       | abilities" rather than do a best effort faulty answer)?
        
         | thomasahle wrote:
         | > I'd expect a smart human to just say "that's too much" or
         | "that's beyond my abilities" rather than do a best effort
         | faulty answer)?
         | 
         | That's what the models did. They gave the first 100 steps, then
         | explained how it was too much to output all of it, and gave the
         | steps one would follow to complete it.
         | 
         | They were graded as "wrong answer" for this.
         | 
         | ---
         | 
         | Source:
         | https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...
         | 
         | > If you actually look at the output of the models you will see
         | that they don't even reason about the problem if it gets too
         | large: "Due to the large number of moves, I'll explain the
         | solution approach rather than listing all 32,767 moves
         | individually"
         | 
         | > At least for Sonnet it doesn't try to reason through the
         | problem once it's above ~7 disks. It will state what the
         | problem and the algorithm to solve it and then output its
         | solution without even thinking about individual steps.
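          | 
          | (For scale, sketching the arithmetic: an n-disk Tower of Hanoi
          | takes 2^n - 1 moves at minimum, so the instance quoted above is
          | 
          |       2^15 - 1 = 32,767
          | 
          | moves, which is why the model summarizes the algorithm instead
          | of listing them.)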
        
         | FINDarkside wrote:
          | I don't think most of the objections are poor at all, apart
          | from 3; it's this article that seems to make lots of straw men.
          | The first objection especially is often heard because people
          | claim "this paper proves LLMs don't reason". The author moves
          | the goalposts and argues instead about whether LLMs lead to
          | AGI, which is already a straw man against those arguments. In
          | addition, he even seems to misunderstand AGI, thinking it's
          | some sort of superintelligence ("We have every right to expect
          | machines to do things we can't"). AI that can do everything at
          | least as well as an average human is AGI by definition.
         | 
          | It's an especially weird argument considering that LLMs are
          | already ahead of humans at Tower of Hanoi. I bet the average
          | person will not be able to "one-shot" the moves to an 8-disk
          | Tower of Hanoi without writing anything down or tracking the
          | state with the actual disks. LLMs have far bigger obstacles to
          | reaching AGI, though.
         | 
          | 5 is also a massive straw man, with the "not see how well it
          | could use preexisting code retrieved from the web" bit, given
          | that these models will write code to solve these kinds of
          | problems even if you come up with some new problem that
          | wouldn't exist in their training data.
        
       | wohoef wrote:
        | Good article offering some critique of Apple's paper and Gary
       | Marcus specifically.
       | 
       | https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...
        
         | hintymad wrote:
         | Honest question: does the opinion of Gary Marcus still count?
         | His criticism seems more philosophical than scientific. It's
          | hard for me to see how he builds or reasons his way to his
          | conclusions.
        
       | brcmthrowaway wrote:
        | In classic ML, you never evaluate against data that was in the
       | training set. In LLMs, everything is the training set. Doesn't
       | this seem wrong?
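        | 
        | For reference, this is roughly what the held-out evaluation looks
        | like in classic ML (a minimal scikit-learn sketch on a toy
        | dataset, just to make the contrast concrete):
        | 
        |       from sklearn.datasets import load_iris
        |       from sklearn.linear_model import LogisticRegression
        |       from sklearn.model_selection import train_test_split
        | 
        |       X, y = load_iris(return_X_y=True)
        |       # Hold out 20% of the data; the model never sees it while
        |       # fitting, so the test score measures generalization.
        |       X_train, X_test, y_train, y_test = train_test_split(
        |           X, y, test_size=0.2, random_state=0)
        | 
        |       clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        |       print(clf.score(X_test, y_test))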
        
       | thomasahle wrote:
       | > 1. Humans have trouble with complex problems and memory
       | demands. True! But incomplete. We have every right to expect
       | machines to do things we can't. [...] If we want to get to AGI,
        | we will have to do better.
       | 
       | I don't get this argument. The paper is about "whether RLLMs can
       | think". If we grant "humans make these mistakes too", but also
       | "we still require this ability in our definition of thinking",
        | aren't we saying "thinking in humans is an illusion" too?
        
       | thomasahle wrote:
       | > 5. A student might complain about a math exam requiring
       | integration or differentiation by hand, even though math software
       | can produce the correct answer instantly. The teacher's goal in
       | assigning the problem, though, isn't finding the answer to that
        | question (presumably the teacher already knows the answer), but to
       | assess the student's conceptual understanding. Do LLM's
       | conceptually understand Hanoi? That's what the Apple team was
       | getting at. (Can LLMs download the right code? Sure. But
       | downloading code without conceptual understanding is of less help
       | in the case of new problems, dynamically changing environments,
       | and so on.)
       | 
       | Why is he talking about "downloading" code? The LLMs can easily
       | "write" out out the code themselves.
       | 
       | If the student wrote a software program for general
       | differentiation during the exam, they obviously would have a
       | great conceptual understanding.
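        | 
        | For what it's worth, the kind of program that would demonstrate
        | that understanding is small; a toy sketch of the sum and product
        | rules written from scratch, rather than by calling a library:
        | 
        |       # Expressions are nested tuples:
        |       #   ("x",), ("const", c), ("add", f, g), ("mul", f, g)
        |       def ddx(e):
        |           kind = e[0]
        |           if kind == "x":        # d/dx x = 1
        |               return ("const", 1)
        |           if kind == "const":    # d/dx c = 0
        |               return ("const", 0)
        |           if kind == "add":      # sum rule
        |               return ("add", ddx(e[1]), ddx(e[2]))
        |           if kind == "mul":      # product rule: f'g + f g'
        |               return ("add", ("mul", ddx(e[1]), e[2]),
        |                              ("mul", e[1], ddx(e[2])))
        |           raise ValueError(kind)
        | 
        |       # d/dx (x * x)  ->  1*x + x*1
        |       print(ddx(("mul", ("x",), ("x",))))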
        
       ___________________________________________________________________
       (page generated 2025-06-14 23:00 UTC)