[HN Gopher] Seven replies to the viral Apple reasoning paper and...
___________________________________________________________________
Seven replies to the viral Apple reasoning paper and why they fall
short
Author : spwestwood
Score : 145 points
Date : 2025-06-14 19:52 UTC (3 hours ago)
(HTM) web link (garymarcus.substack.com)
(TXT) w3m dump (garymarcus.substack.com)
| bluefirebrand wrote:
| I'm glad to read articles like this one, because I think it is
| important that we pour some cold water on the hype cycle
|
| If we want to get serious about using these new AI tools then we
| need to come out of the clouds and get real about their
| capabilities
|
| Are they impressive? Sure. Useful? Yes, probably in a lot of cases
|
| But we cannot continue the hype this way, it doesn't serve anyone
| except the people who are financially invested in these tools.
| fhd2 wrote:
| Even among the people invested in these tools, hype only benefits
| those attempting a pump and dump scheme, or those selling
| training, consulting or similar services around AI.
|
| People who try to make genuine progress, while there's more
| money in it now, might just have to deal with another AI winter
| soon at this rate.
| bluefirebrand wrote:
| > hype only benefits those attempting a pump and dump scheme
|
| I read some posts the other day saying Sam Altman sold off a
| ton of his OpenAI shares. Not sure if it's true and I can't
| find a good source, but if it is true then "pump and dump"
| does look close to the mark
| aeronaut80 wrote:
| You probably can't find a good source because sources say
| he has a negligible stake in OpenAI.
| https://www.cnbc.com/amp/2024/12/10/billionaire-sam-
| altman-d...
| bluefirebrand wrote:
| Interesting
|
| When I did a cursory search, this information didn't turn
| up either
|
| Thanks for correcting me. I suppose the stuff I saw the
| other day was just BS then
| aeronaut80 wrote:
| To be fair I struggle to believe he's doing it out of the
| goodness of his heart.
| spookie wrote:
| I think the same thing: we need more breakthroughs. Until
| then, it is still risky to rely on AI for most applications.
|
| The sad thing is that most would take this comment the wrong
| way, assuming it is just another doomer take. No, there is
| still a lot to do, and promising the world too soon will only
| lead to disappointment.
| Zigurd wrote:
| This is the thing of it: _" for most applications."_
|
| LLMs are not thinking. The way they fail, which is
| confidently and articulately, is one way they reveal there
| is no mind behind the bland but well-structured text.
|
| But if I was tasked with finding 500 patents with weak
| claims or claims that have been litigated and knocked down,
| I would turn to LLMs to help automate that. One or two
| "nines" of reliability is fine, and LLMs would turn this
| previously impossible task into something plausible to take
| on.
| mountainriver wrote:
| I'll take critiques from someone who knows what a train/test
| split is.
|
| The idea that a guy so removed from machine learning has
| something relevant to say about its capabilities really speaks
| to the state of AI fear
| devwastaken wrote:
| experts are often too blinded by their paychecks to see how
| nonsensical their expertise is
| soulofmischief wrote:
| [citation needed]
| Spooky23 wrote:
| Remember Web 3.0? Lol
| Zigurd wrote:
| It's unfortunate that a discussion about LLM weaknesses
| is giving crypto-bro vibes. But it's telling. There are a
| lot of bubble valuations out there.
| mountainriver wrote:
| Not knowing the most basic things about the subject you are
| critiquing is utter nonsense. Defending someone who does
| this is even worse
| Spooky23 wrote:
| The idea that practitioners would try to discredit research
| to protect the golden goose from critique speaks to human
| nature.
| mountainriver wrote:
| No one is discrediting research from valid places; this is
| the alt-right-style victim narrative that seems to follow
| Gary Marcus around. Somehow the mainstream is "suppressing"
| the real knowledge.
| senko wrote:
| Gary Marcus isn't about "getting real", he's about making a
| name for himself as a contrarian to the popular AI narrative.
|
| This article may seem reasonable, but here he's defending a
| paper that in his previous article he called "A knockout blow
| for LLMs".
|
| Many of his articles seem reasonable (if a bit off) until you
| read a couple dozen and spot a trend.
| adamgordonbell wrote:
| This!
|
| For all his complaints about llms, his writing could be
| generated by an llm with a prompt saying: 'write an article
| responding to this news with an essay saying that you are
| once again right that this AI stuff is overblown and will
| never amount to anything.'
| steamrolled wrote:
| > Gary Marcus isn't about "getting real", it's making a name
| for himself as a contrarian to the popular AI narrative.
|
| That's an odd standard. Not wanting to be wrong is a
| universal human instinct. By that logic, every person who
| ever took any position on LLMs is automatically
| untrustworthy. After all, they made a name for themselves by
| being pro- or con-. Or maybe a centrist - that's a position
| too.
|
| Either he makes good points or he doesn't. Unless he has a
| track record of distorting facts, his ideological leanings
| should be irrelevant.
| sinenomine wrote:
| Marcus' points routinely fail to pass scrutiny; nobody in
| the field takes him seriously. If you seek real,
| scientifically interesting LLM criticism, read Francois
| Chollet and his ARC-AGI series of evals.
| senko wrote:
| He makes many very good points:
|
| For example he continuously calls out AGI hype for what it
| is, and also showcases dangers of naive use of LLMs (eg.
| lawyers copy-pasting hallucinated cases into their
| documents, etc). For this, he has plenty of material!
|
| He also makes some very bad points and worse inferences:
| that LLMs as a technology are useless because they can't
| lead to AGI, that hallucination makes LLMs useless (but then
| he contradicts himself in another article conceding they
| "may have some use"), that because they can't follow an
| algorithm they're useless, that scaling laws are over and
| therefore LLMs won't advance (he's been making that claim
| for a couple of years), that the AI bubble will collapse in
| a few months (also a few years of that), etc.
|
| Read any of his articles (I've read too many, sadly) and
| you'll never come to the conclusion that LLMs might be a
| useful technology, or be "a good thing" even in some
| limited way. This just doesn't fit with reality I can
| observe with my own eyes.
|
| To me, this shows he's incredibly biased. That's okay if he
| wants to be a pundit - I couldn't blame Gruber for being
| biased about Apple! But Marcus presents himself as the
| authority on AI, a scientist, showing a real and unbiased
| view on the field. In fact, he's as full of hype as Sam
| Altman is, just in another direction.
|
| Imagine he was talking about aviation, not AI. 787
| Dreamliner crashes? "I've been saying for 10 years that
| airplanes are unsafe, they can fall from the sky!" Boeing
| the company does stupid shit? "Blown door shows why
| airplane makers can't be trusted" Airline goes bankrupt?
| "Air travel winter is here"
|
| I've spoken to too many intelligent people who read Marcus,
| take him at his words and have incredibly warped views on
| the actual potential and dangers of AI (and send me links
| to his latest piece with "so this sounds pretty damning,
| what's your take?"). He does real damage.
|
| Compare him with Simon Willison, who also writes about AI a
| lot, and is vocal about its shortcomings and dangers.
| Reading Simon, I never get the feeling I'm being sold on a
| story (either positive or negative), but that I learned
| something.
|
| Perhaps a Marcus is inevitable as a symptom of the
| Internet's immune system to the huge amount of AI hype and
| bullshit being thrown around. Perhaps Gary is just fed up
| with everything and comes out guns blazing, science be
| damned. I don't know.
|
| But in my mind, he's as much a BSer as the AGI singularity
| hypers.
| 2muchcoffeeman wrote:
| What's the argument here that he's not considering all the
| information regarding GenAI?
|
| That there's a trend to his opinion?
|
| If I consider all the evidence regarding gravity, all my
| papers will be "gravity is real".
|
| In what ways is he only choosing what he wants to hear?
| senko wrote:
| Replied elsewhere in the thread:
| https://news.ycombinator.com/item?id=44279283
|
| To your example about gravity, I argue that he goes from
| "gravity is real" to "therefore we can't fly", and "yeah
| maybe some people can but that's not really solving gravity
| and they need to go down eventually!"
| bobxmax wrote:
| Hacker news eats his shtick up because the average HN
| commenter is the same thing - needlessly contrarian towards
| AI because it threatens their own ego.
| g-b-r wrote:
| I see the opposite: the vast majority of people commenting
| on Hacker News now seem very favorable to LLMs.
| newswasboring wrote:
| What exactly is your objection here? That the guy has an
| opinion and is writing about it?
| senko wrote:
| Replied elsewhere in the thread:
| https://news.ycombinator.com/item?id=44279283
| bigyabai wrote:
| There's something innately funny about "HN's undying optimism"
| and "bad-news paper from Apple" reaching a head like this. An
| unstoppable object is careening towards an impervious wall,
| anything could happen.
| DiogenesKynikos wrote:
| I don't understand what people mean when they say that AI is
| being hyped.
|
| AI is at the point where you can have a conversation with it
| about almost anything, and it will answer more intelligently
| than 90% of people. That's incredibly impressive, and normal
| people don't need to be sold on it. They're just naturally
| impressed by it.
| FranzFerdiNaN wrote:
| I don't need a tool that's right maybe 70% of the time (and
| that's me being optimistic). It needs to be right all the
| time or at least tell you when it doesn't know for sure,
| instead of just making up something. Comparing it to going
| out in the streets and asking random people random questions
| is not a good comparison.
| newswasboring wrote:
| > I don't need a tool that's right maybe 70% of the time
| (and that's me being optimistic).
|
| Where are you getting this from? 70%?
| hellohello2 wrote:
| It's quite simple: people upvote content that makes them
| feel good. Most of us here are programmers, and the idea
| that many of our skills are becoming replaceable feels
| quite bad.
| Hence, people upvote delusional statements that let them
| believe in something that feels better than objective
| reality. With any luck, these comments will be scraped and
| used to train the next AI generation, relieving it from the
| burden of factuality at last.
| travisgriggs wrote:
| I get even better results talking to myself.
| georgemcbay wrote:
| AI, in the form of LLMs, can be a useful tool.
|
| It is still being vastly overhyped, though, by people
| attempting to sell the idea that we are actually close to an
| AGI "singularity".
|
| Such overhype is usually easy to handwave away as not my
| problem: if investors get fooled into thinking this is
| anything like AGI, well, a fool and his money and all that.
| But investors aside this AI hype is likely to have some very
| bad real world consequences based on the same hype-men
| selling people on the idea that we need to generate 2-4 times
| more power than we currently do to power this godlike AI they
| are claiming is imminent.
|
| And even right now there's massive real world impact in the
| form of, say, how much Grok is polluting Georgia.
| woopsn wrote:
| If the claims about AI were that it is a great or even
| incredible chat app, there would be no mismatch.
|
| I think normal people understand curing all disease,
| replacing all value, generating 100x stock market returns,
| uploading our minds etc to be hype.
|
| I said a few days ago, the LLM is an amazing product. Sad that
| these people ruin their credibility immediately upon success.
| bandrami wrote:
| How actually useful are they though? We've had more than a year
| now of being told these things 10X knowledge workers and
| creatives, so... where is the output? Is there a new office
| suite I can try? 10 times as many mobile apps? A huge new
| library of ebooks? Is this actually in practice producing things
| beyond Ghibli memes and RETVRN nostalgia slop?
| 2muchcoffeeman wrote:
| I think it largely depends on what you're writing. I've had
| it reply to corporate emails which is good since I need to
| sound professional not human.
|
| If I'm coding it still needs a lot of babysitting and
| sometimes I'm much faster than it.
| Gigachad wrote:
| And then the person on the end is using AI to summarise the
| email back to normal English. To what end?
| js8 wrote:
| But look, the GDP has increased!
| bandrami wrote:
| But that's what I don't get: it hasn't in that scenario
| because that doesn't lead to a greater circulation of
| money at any point. And that's the big thing I'm looking
| for: something AI has created that consumers are willing
| to pay for. Because if that doesn't end up happening no
| amount of sunk investment is going to save the ecosystem.
| bandrami wrote:
| So this would be an interesting output to measure but I
| have no idea how we would do that: has the volume of
| corporate email gone up? Or the time spent creating it gone
| down?
| landl0rd wrote:
| I am by nature strongly suspicious of LLMs. Most of the code
| they write for me is crap. I don't like them much or use them
| much, though I do think they'll advance enough to be highly
| useful for me with time.
|
| With that said, Marcus is an idiot who has no place in the
| discourse. His presence just drowns out substantive or useful
| remarks. Everything he writes is just hyperbole-infested red
| meat for anyone who is to any extent anti-AI. It's
| "respectability laundering": they point to him as a source,
| thus holding him up as a valid or quality source.
| hiddencost wrote:
| Why do we keep posting stuff from Gary? He's been wrong for
| decades but he keeps writing this stuff.
|
| As far as I can tell he's the person that people reach for when
| they want to justify their beliefs. But surely being this
| wrong for this long should eventually lead to losing one's
| status as an expert.
| NoahZuniga wrote:
| None of the arguments presented in this piece depend on his
| authority as an expert, so this is largely irrelevant.
| jakewins wrote:
| I thought this article seemed like well articulated criticism
| of the hype cycle - can you be more specific what you mean? Are
| the results in the Apple paper incorrect?
| astrange wrote:
| Gary Marcus always, always says AI doesn't actually work -
| it's his whole thing. If he's posted a correct argument it's
| a coincidence. I remember seeing him claim real long-time AI
| researchers like David Chapman (who's a critic himself) were
| wrong any time they said anything positive.
|
| (em-dash avoided to look less AI)
|
| Of course, the main issue with the field is the critics
| /should/ be correct. Like, LLMs shouldn't work and nobody
| knows why they work. But they do anyway.
|
| So you end up with critics complaining it's "just a parrot"
| and then patting themselves on the back, as if inventing a
| parrot isn't supposed to be impressive somehow.
| foldr wrote:
| I don't read GM as saying that LLMs "don't work" in a
| practical sense. He acknowledges that they have useful
| applications. Indeed, if they didn't work at all, why would
| he be advocating for regulating their use? He just doesn't
| think they're close to AGI.
| kadushka wrote:
| The funny thing is, if you asked "what is AGI" 5 years
| ago, most people would describe something like o3.
| foldr wrote:
| Even Sam Altman thinks we're not at AGI yet (although of
| course it's coming "soon").
| barrkel wrote:
| You need to read everything that Gary writes with the
| particular axe to grind he has in mind: neurosymbolic AI.
| That's his specialism, and he essentially has a chip on his
| shoulder about the attention probabilistic approaches like
| LLMs are getting, and their relative success.
|
| You can see this in this article too.
|
| The real question you should be asking is whether there is a
| practical limitation in LLMs and LRMs revealed by the Towers
| of Hanoi problem, given that any SOTA model can write code to
| solve the problem and thereby solve it with tool use; see the
| sketch below.
| Gary frames this as neurosymbolic, but I think it's a bit of
| a fudge.
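|
| For reference, the program in question is tiny. A minimal
| recursive solver (a sketch in Python; neither paper
| prescribes any particular code) looks like:
|
|   def hanoi(n, src, dst, aux, moves):
|       # Move n disks from src to dst, using aux as the spare peg.
|       if n == 0:
|           return
|       hanoi(n - 1, src, aux, dst, moves)
|       moves.append((src, dst))  # move the largest remaining disk
|       hanoi(n - 1, src, dst, aux, moves)
|
|   moves = []
|   hanoi(10, 'A', 'C', 'B', moves)
|   print(len(moves))  # 1023 moves, i.e. 2**10 - 1
|
| Whether producing this program counts as "conceptual
| understanding" is exactly the disputed question.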
| krackers wrote:
| Hasn't the symbolic vs statistical split in AI existed for
| a long time? With things like Cyc growing out of the
| former. I'm not too familiar with linguistics but maybe
| this extends there too, since I think Chomsky was heavy on
| formal grammars over probabilistic models [1].
|
| Must be some sort of cognitive sunk-cost fallacy: after
| dedicating your life to one sect, it must be emotionally
| hard to see the other "keep winning". Of course you'd root
| for them to fall.
|
| [1] https://norvig.com/chomsky.html
| mountainriver wrote:
| It's insane, he doesn't know what a train/test split is but
| he's an AI expert? Is this where we are?
| marvinborner wrote:
| Is this supposed to be a joke reflecting point (3)?
| hrldcpr wrote:
| In case anyone else missed the original paper (and discussion):
|
| https://news.ycombinator.com/item?id=44203562
| dang wrote:
| Thanks! Macroexpanded:
|
| _The Illusion of Thinking: Strengths and limitations of
| reasoning models [pdf]_ -
| https://news.ycombinator.com/item?id=44203562 - June 2025 (269
| comments)
|
| Also this: _A Knockout Blow for LLMs?_ -
| https://news.ycombinator.com/item?id=44215131 - June 2025 (48
| comments)
|
| Were there others?
| avsteele wrote:
| This doesn't rebut anything from the best critique of the Apple
| paper.
|
| https://arxiv.org/abs/2506.09250
| Jabbles wrote:
| Those are points (2) and (5).
| foldr wrote:
| It does rebut point (1) of the abstract. Perhaps not
| convincingly, in your view, but it does directly address this
| kind of response.
| avsteele wrote:
| Papers make specific conclusions based on specific data. The
| paper I linked specifically rebuts the conclusions of the
| Apple paper. Gary makes vague statements that could be
| interpreted as being related.
|
| It is scientific malpractice to write a post supposedly
| rebutting responses to a paper and not directly address the
| most salient one.
| foldr wrote:
| This sort of omission would not be considered scientific
| malpractice even in a journal article, let alone a blog
| post. A rebuttal of a position that fails to address the
| strongest arguments for it is a bad rebuttal, but it's not
| scientific malpractice to write a bad paper -- let alone a
| bad blog post.
|
| I don't think I agree with you that GM isn't addressing the
| points in the paper you link. But in any case, you're not
| doing your argument any favors by throwing in wild
| accusations of malpractice.
| avsteele wrote:
| "Malpractice" is slightly hyperbolic.
|
| But anybody relying on Gary's posts in order to be
| informed on this subject is being misled. This isn't an
| isolated incident either.
|
| People need to be made aware that when you read him it is
| mere punditry, not substantive engagement with the
| literature.
| spookie wrote:
| A paper citing arxiv papers and x.com doesn't pass my smell
| test tbh
| skywhopper wrote:
| The quote from the Salesforce paper is important: "agents
| displayed near-zero confidentiality awareness".
| bowsamic wrote:
| This doesn't address the primary issue: they had no
| methodology for choosing puzzles that weren't in the training
| set, and while they claimed to have chosen such puzzles, they
| didn't explain why they believe that. The whole point of the
| paper was to test LLM reasoning in untrained cases, but there
| is no reason to expect such puzzles not to be part of the
| training set, and if you have no way of telling whether they
| are, your paper is not going to work out.
| roywiggins wrote:
| Isn't it worse for LLMs if an LLM that has been trained on the
| Towers of Hanoi still can't solve it reliably?
| anonthrowawy wrote:
| how could you prove that?
| mentalgear wrote:
| AI hype-bros like to complain that real AI experts are too
| concerned with debunking current AI rather than improving it -
| but the truth is that debunking bad AI IS improving AI.
| Science is a process of trial and error which only works by
| continuously questioning the current state.
| neepi wrote:
| Indeed. I completely agree with this.
|
| My objection to the whole thing is that the AI hype, which is
| really a funding-solicitation facade over everything rather
| than the truth, only has one outcome: it cannot be sustained.
| At that point all investor confidence disappears, the money is
| gone, and everyone loses access to the tools they suddenly
| built all their dependencies on, because it's all proprietary,
| service-model based.
|
| Which is why I am not poking it with a 10 foot long shitty
| stick any time in the near future. The failure mode scares me,
| not the technology which arguably does have some use in non-
| idiot hands.
| wongarsu wrote:
| A lot of the best internet services came around in the decade
| after the dot-com crash. There is a chance Anthropic or
| OpenAI may not survive when funding suddenly dries up, but
| existing open weight models won't be majorly impacted. There
| will always be someone willing to host DeepSeek for you if
| you're willing to pay.
|
| And while it will be sad to see model improvements slow down
| when the bubble bursts, there is a lot of untapped potential
| in the models we already have, especially as they become
| cheaper and easier to run.
| neepi wrote:
| Someone might host DeepSeek for you but you'll pay through
| the nose for it and it'll be frozen in time because the
| training cost doesn't have the revenue to keep the ball
| rolling.
|
| I'm not sure the GPU market won't collapse with it either.
| Possibly taking out a chunk of TSMC in the process, which
| will then have knock on effects across the whole industry.
| wongarsu wrote:
| There are already inference providers like DeepInfra or
| inference.net whose entire business model is hosted
| inference of open-source models. They promise not to keep
| or use any of the data and their business model has no
| scaling effects, so I assume they are already charging a
| fair market rate where the price covers the costs and
| returns a profit.
|
| The GPU market will probably take a hit. But the flip
| side of that is that the market will be flooded with
| second-hand enterprise-grade GPUs. And if Nvidia needs
| sales from consumer GPUs again we might see more
| attractive prices and configurations there too. In the
| short term a market shock might be great for hobby-scale
| inference, and maybe even training (at the 7B scale). In
| the long term it will hurt, but if all else fails we
| still have AMD who are somehow barely invested in this AI
| boom
| xoac wrote:
| Yeah this is history repeating. See for example the less-known
| "Dreyfus affair" at MIT and the brilliantly titled books:
| "What Computers Can't Do" and its sequel "What Computers Still
| Can't Do".
| bobxmax wrote:
| > AI hype-bros like to complain that real AI experts are too
| concerned with debunking current AI rather than improving it
|
| You're acting like this is a common occurrence lol
| 3abiton wrote:
| To hammer one point though, you have to understand that
| researchers are desensitized to minor novel improvements that
| translate into great-value products. While obviously studying
| and assessing the limitations of AI is crucial, to the general
| public its capabilities are just so amazing that they can't
| fathom why we should think about limitations. Optimizing what
| we have is better than rethinking the whole process.
| dang wrote:
| Can you please make your substantive points without name-
| calling or swipes? This is in the site guidelines:
| https://news.ycombinator.com/newsguidelines.html.
| labrador wrote:
| The key insight is that LLMs can 'reason' when they've seen
| similar solutions in training data, but this breaks down on truly
| novel problems. This isn't reasoning exactly, but close enough to
| be useful in many circumstances. Repeating solutions on demand
| can be handy, just like repeating facts on demand is handy.
| Marcus gets this right technically but focuses too much on
| emotional arguments rather than clear explanation.
| Jabrov wrote:
| I'm so tired of hearing this be repeated, like the whole "LLMs
| are _just_ parrots" thing.
|
| It's patently obvious to me that LLMs can reason and solve
| novel problems not in their training data. You can test this
| out in so many ways, and there's so many examples out there.
|
| ______________
|
| Edit for responders, instead of replying to each:
|
| We obviously have to define what we mean by "reasoning" and
| "solving novel problems". From my point of view, reasoning !=
| general intelligence. I also consider reasoning to be a
| spectrum. Just because it cannot solve the hardest problem you
| can think of does not mean it cannot reason at all. Do note, I
| think LLMs are generally pretty bad at reasoning. But I
| disagree with the point that LLMs cannot reason at all or never
| solve any novel problems.
|
| In terms of some backing points/examples:
|
| 1) Next token prediction can itself be argued to be a task that
| requires reasoning
|
| 2) You can construct a variety of language translation tasks,
| with completely made up languages, that LLMs can complete
| successfully. There's tons of research about in-context
| learning and zero-shot performance.
|
| 3) Tons of people have created all kinds of
| challenges/games/puzzles to prove that LLMs can't reason. One
| by one, they invariably get solved (eg. https://gist.github.com
| /VictorTaelin/8ec1d8a0a3c87af31c25224...,
| https://ahmorse.medium.com/llms-and-reasoning-part-i-the-
| mon...) -- sometimes even when the cutoff date for the LLM is
| before the puzzle was published.
|
| 4) Lots of examples of research about out-of-context reasoning
| (eg. https://arxiv.org/abs/2406.14546)
|
| In terms of specific rebuttals to the post:
|
| 1) Even though they start to fail at some complexity threshold,
| it's incredibly impressive that LLMs can solve any of these
| difficult puzzles at all! GPT3.5 couldn't do that. We're making
| incremental progress in terms of reasoning. Bigger, smarter
| models get better at zero-shot tasks, and I think that
| correlates with reasoning.
|
| 2) Regarding point 4 ("Bigger models might do better"): I
| think this is very dismissive. The paper itself shows a huge
| variance in the performance of different models. For example,
| in figure 8, we see Claude 3.7 significantly outperforming
| DeepSeek and maintaining stable solutions for a much longer
| sequence length. Figure 5 also shows that better models and
| more tokens improve performance at "medium" difficulty
| problems. Just because it cannot solve the "hard" problems does
| not mean it cannot reason at all, nor does it necessarily mean
| it will never get there. Many people were saying we'd never be
| able to solve problems like the medium ones a few years ago,
| but now the goal posts have just shifted.
| labrador wrote:
| I've done this exercise dozens of times because people keep
| saying it, but I can't find an example where this is true. I
| wish it was. I'd be solving world problems with novel
| solutions right now.
|
| People make a common mistake by conflating "solving problems
| with novel surface features" with "reasoning outside training
| data." This is exactly the kind of binary thinking I
| mentioned earlier.
| lossolo wrote:
| They can't create anything novel and it's patently obvious if
| you understand how they're implemented. But I'm just some
| anonymous guy on HN, so maybe this time I will just cite the
| opinion of the DeepMind CEO, who said in a recent interview
| with The Verge (available on YouTube) that LLMs based on
| transformers can't create anything truly novel.
| labrador wrote:
| "I don't think today's systems can invent, you know, do
| true invention, true creativity, hypothesize new scientific
| theories. They're extremely useful, they're impressive, but
| they have holes."
|
| Demis Hassabis On The Future of Work in the Age of AI (@
| 2:30 mark)
|
| https://www.youtube.com/watch?v=CRraHg4Ks_g
| lossolo wrote:
| Yes, this one. Thanks
| bfung wrote:
| Any links or examples available? Curious to try it out
| multjoy wrote:
| Lol, no.
| aucisson_masque wrote:
| > It's patently obvious that LLMs can reason and solve novel
| problems not in their training data.
|
| Would you care to tell us more ?
|
| << It's patently obvious >> is not really an argument; I
| could just as well say that everyone knows LLMs can't reason
| or think (in the way we living beings do).
| andrewmcwatters wrote:
| It's definitely not true in any meaningful sense. There are
| plenty of us practitioners in software engineering wishing it
| was true, because if it was, we'd all have genius interns
| working for us on Mac Studios at home.
|
| It's not true. It's plainly not true. Go have any of these
| models, paid, or local try to build you novel solutions to
| hard, existing problems despite being, in some cases, trained
| on literally the entire compendium of open knowledge in not
| just one, but multiple adjacent fields. Not to mention the
| fact that being able to abstract general knowledge would mean
| it _would_ be able to reason.
|
| They. Cannot. Do it.
|
| I have no idea what you people are talking about because you
| cannot be working on anything with real substance that hasn't
| been perfectly line fit to your abundantly worked on
| problems, but no, these models are obviously not reasoning.
|
| I built a digital employee and gave it menial tasks that
| compare to current cloud solutions who also claim to be able
| to provide you paid cloud AI employees and these things are
| stupider than fresh college grads.
| goalieca wrote:
| So far they cannot even answer questions which are straight
| up fact checking and search engine like queries. Reasoning
| means they would be able to work through a problem and
| generate a proof the way a student might.
| swat535 wrote:
| If that were the case, it would have been great already, but
| these tools can't even do that. They frequently make mistakes,
| repeating the same solutions available everywhere during their
| "reasoning" process, and fabricate plausible hallucinations
| which you then have to inspect carefully to catch.
| aucisson_masque wrote:
| That's the opposite of reasoning tho. Ai bros want to make
| people believe LLM are smart but they're not capable of
| intelligence and reasoning.
|
| Reasoning means you can take on a problem you've never seen
| before and think of innovative ways to solve it.
|
| An LLM can only replicate what is in its data; it can in no
| way think or guess or estimate what will likely be the best
| solution. It can only output a solution based on a probability
| calculation made on how frequently it has seen this solution
| linked to this problem.
| labrador wrote:
| You're assuming we're saying LLMs can't reason. That's not
| what we're saying. They can execute reasoning-like processes
| when they've seen similar patterns, but this breaks down when
| true novel reasoning is required. Most people do the same
| thing. Some people can come up with novel solutions to new
| problems, but LLMs will choke. Here's an example:
|
| Prompt: "Let's try a reasoning test. Estimate how many pianos
| there are at the bottom of the sea."
|
| I tried this on three advanced AIs* and they all choked on it
| without further hints from me. Claude then said:
|   - Roughly 3 million shipwrecks on ocean floors globally
|   - Maybe 1 in 1,000 ships historically carried a piano
|     (passenger ships, luxury vessels), so ~3,000 ships with
|     pianos sunk
|   - Average maybe 0.5 pianos per ship (not all passenger
|     areas had them)
|   - Estimate: ~1,500 pianos
|
| *Claude Sonnet 4, Google Gemini 2.5 and GPT 4o
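|
| (Claude's answer is a chained Fermi estimate; as a sketch in
| Python, where every number is Claude's guess rather than
| data:
|
|   shipwrecks = 3_000_000    # rough global shipwreck count
|   piano_rate = 1 / 1000     # fraction of ships carrying a piano
|   pianos_per_ship = 0.5     # average per piano-carrying ship
|   print(shipwrecks * piano_rate * pianos_per_ship)  # 1500.0
|
| The arithmetic is three multiplications; the hard part is
| choosing plausible inputs.)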
| Jabrov wrote:
| That seems like a totally reasonable response ... ?
| ummonk wrote:
| Most of the objections and their counterarguments seem like
| either poor objections (e.g. ad hominem against the first listed
| author) or seem to be subsumed under point 5. It's annoying that
| most of this post focuses so much effort on discussing most of
| the other objections when the important discussion is the one to
| be had in point 5:
|
| I.e. to what extent are LLMs able to reliably make use of writing
| code or using logic systems, and to what extent does
| hallucinating / providing faulty answers in the absence of such
| tool access demonstrate an inability to truly reason (I'd expect
| a smart human to just say "that's too much" or "that's beyond my
| abilities" rather than do a best effort faulty answer)?
| thomasahle wrote:
| > I'd expect a smart human to just say "that's too much" or
| "that's beyond my abilities" rather than do a best effort
| faulty answer)?
|
| That's what the models did. They gave the first 100 steps, then
| explained how it was too much to output all of it, and gave the
| steps one would follow to complete it.
|
| They were graded as "wrong answer" for this.
|
| ---
|
| Source:
| https://x.com/scaling01/status/1931783050511126954?t=ZfmpSxH...
|
| > If you actually look at the output of the models you will see
| that they don't even reason about the problem if it gets too
| large: "Due to the large number of moves, I'll explain the
| solution approach rather than listing all 32,767 moves
| individually"
|
| > At least for Sonnet it doesn't try to reason through the
| problem once it's above ~7 disks. It will state what the
| problem and the algorithm to solve it and then output its
| solution without even thinking about individual steps.
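|
| For scale: the optimal Tower of Hanoi solution takes 2**n - 1
| moves, so the 32,767 moves quoted above correspond to 15
| disks:
|
|   for n in (7, 10, 15):
|       print(n, 2**n - 1)  # 7 -> 127, 10 -> 1023, 15 -> 32767
|
| Past a handful of disks, listing every move is an output-
| length problem more than a reasoning problem.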
| FINDarkside wrote:
| I don't think most of the objections are poor at all, apart
| from 3; it's this article that seems to make lots of straw
| men. The first objection in particular is often heard because
| people claim "this paper proves LLMs don't reason". The author
| moves the goalposts and argues about whether LLMs lead to
| AGI, which is already a straw man against those arguments. In
| addition, he even seems to misunderstand AGI, thinking it's
| some sort of superintelligence ("We have every right to expect
| machines to do things we can't"). AI that can do everything at
| least as well as an average human is AGI by definition.
|
| It's an especially weird argument considering that LLMs are
| already ahead of humans at the Tower of Hanoi. I bet the
| average person will not be able to "one-shot" you the moves to
| an 8-disk Tower of Hanoi without writing anything down or
| tracking the state with the actual disks. LLMs have far bigger
| obstacles to reaching AGI, though.
|
| 5 is also a massive straw man with the "not see how well it
| could use preexisting code retrieved from the web" framing,
| given that these models will write code to solve these kinds
| of problems even if you come up with some new problem that
| wouldn't exist in their training data.
| wohoef wrote:
| Good article offering some critique of Apple's paper and Gary
| Marcus specifically.
|
| https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...
| hintymad wrote:
| Honest question: does the opinion of Gary Marcus still count?
| His criticism seems more philosophical than scientific. It's
| hard for me to see what he builds, or how he reasons, to get
| to his conclusions.
| brcmthrowaway wrote:
| In classic ML, you never evaluate against data that was in the
| training set. In LLMs, everything is in the training set.
| Doesn't this seem wrong?
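|
| (For anyone unfamiliar, the classic-ML convention referenced
| here is a held-out test set; a minimal sklearn sketch:
|
|   from sklearn.datasets import load_iris
|   from sklearn.linear_model import LogisticRegression
|   from sklearn.model_selection import train_test_split
|
|   X, y = load_iris(return_X_y=True)
|   # Hold out 20% of the data; the model never sees it during
|   # training, so the score below measures generalization.
|   X_train, X_test, y_train, y_test = train_test_split(
|       X, y, test_size=0.2, random_state=42)
|   model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
|   print(model.score(X_test, y_test))
|
| With web-scale pretraining there is no comparable guarantee
| that a benchmark puzzle sits outside the training data.)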
| thomasahle wrote:
| > 1. Humans have trouble with complex problems and memory
| demands. True! But incomplete. We have every right to expect
| machines to do things we can't. [...] If we want to get to AGI,
| we will have to do better.
|
| I don't get this argument. The paper is about "whether RLLMs can
| think". If we grant "humans make these mistakes too", but also
| "we still require this ability in our definition of thinking",
| aren't we saying "thinking in humans is an illusion" too?
| thomasahle wrote:
| > 5. A student might complain about a math exam requiring
| integration or differentiation by hand, even though math software
| can produce the correct answer instantly. The teacher's goal in
| assigning the problem, though, isn't finding the answer to that
| question (presumably the teacher already know the answer), but to
| assess the student's conceptual understanding. Do LLM's
| conceptually understand Hanoi? That's what the Apple team was
| getting at. (Can LLMs download the right code? Sure. But
| downloading code without conceptual understanding is of less help
| in the case of new problems, dynamically changing environments,
| and so on.)
|
| Why is he talking about "downloading" code? The LLMs can easily
| "write" out out the code themselves.
|
| If the student wrote a software program for general
| differentiation during the exam, they obviously would have a
| great conceptual understanding.
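|
| (A "software program for general differentiation" can be a
| from-scratch symbolic differentiator; a toy Python sketch over
| (op, left, right) expression trees:
|
|   def diff(e, x):
|       if isinstance(e, (int, float)):  # constants
|           return 0
|       if isinstance(e, str):           # variables
|           return 1 if e == x else 0
|       op, a, b = e
|       if op == '+':  # sum rule
|           return ('+', diff(a, x), diff(b, x))
|       if op == '*':  # product rule: (ab)' = a'b + ab'
|           return ('+', ('*', diff(a, x), b),
|                   ('*', a, diff(b, x)))
|       raise ValueError(op)
|
|   # d/dx (x * x) = 1*x + x*1
|   print(diff(('*', 'x', 'x'), 'x'))
|
| Writing it requires exactly the sum and product rules the exam
| is probing, which is the point about conceptual understanding.)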
___________________________________________________________________
(page generated 2025-06-14 23:00 UTC)