[HN Gopher] Is chain-of-thought AI reasoning a mirage?
___________________________________________________________________
Is chain-of-thought AI reasoning a mirage?
Author : ingve
Score : 117 points
Date : 2025-08-14 13:48 UTC (9 hours ago)
(HTM) web link (www.seangoedecke.com)
(TXT) w3m dump (www.seangoedecke.com)
| NitpickLawyer wrote:
| Finally! A good take on that paper. I saw that arstechnica
| article posted everywhere, and most of the comments are full of
| confirmation bias, and almost all of them miss the fine print - it
| was tested on a 4 layer deep toy model. It's nice to read a post
| that actually digs deeper and offers perspective on what might
| be a good finding vs. what just warrants more research.
| stonemetal12 wrote:
| > it was tested on a 4 layer deep toy model
|
| How do you see that impacting the results? It is the same
| algorithm just on a smaller scale. I would assume a 4 layer
| model would not be very good, but does reasoning improve it? Is
| there a reason scale would impact the use of reasoning?
| okasaki wrote:
| Human babies are the same algorithm as adults.
| azrazalea_debt wrote:
| A lot of current LLM work is basically emergent behavior.
| They use a really simple core algorithm and scale it up, and
| interesting things happen. You can read some of Anthropic's
| recent papers to see some of this. For example, they didn't
| expect LLMs could "look ahead" when writing poetry. However,
| when they actually went in and watched what was happening
| (there are details on how this "watching" works on their
| blog/in their studies), they found the LLM actually was
| planning ahead! That's emergent behavior: they didn't design
| it to do that, it just started doing it due to the complexity
| of the model.
|
| If (BIG if) we ever do see actual AGI, it is likely to work
| like this. It's unlikely we're going to make AGI by designing
| some grand Cathedral of perfect software, it is more likely
| we are going to find the right simple principles to scale big
| enough to have AGI emerge. This is similar.
| mrspuratic wrote:
| On that topic, it seems backwards to me: intelligence is
| not emergent behaviour of language, rather the opposite.
| NitpickLawyer wrote:
| There's prior research that finds a connection between model
| depth and "reasoning" ability -
| https://arxiv.org/abs/2503.03961
|
| A depth of 4 is very small. It is very much a toy model. It's
| _ok_ to research this, and maybe someone will try it out on
| larger models, but it's totally _not_ ok to lead with the
| conclusion, based on this toy model, IMO.
| sempron64 wrote:
| Betteridge's Law of Headlines.
|
| https://en.m.wikipedia.org/wiki/Betteridge's_law_of_headline...
| mwkaufma wrote:
| Betteridge's law applies to editors adding question marks to
| cover the asses of articles with weak claims, not bloggers
| begging questions.
| robviren wrote:
| I feel it is interesting but not what would be ideal. I really
| think if the models could be less linear and process over time in
| latent space you'd get something much more akin to thought. I've
| messed around with attaching reservoirs at each layer using hooks
| with interesting results (mainly overfitting), but it feels like
| such a limitation to have all model context/memory stuck as
| tokens when latent space is where the richer interaction lives.
| Would love to see more done where thought over time mattered and
| the model could almost mull over the question a bit before being
| obligated to crank out tokens. Not an easy problem, but
| interesting.
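|
| Roughly the kind of thing I mean by hooks, as a sketch (PyTorch-
| style; the layer list name and the reservoir update are
| placeholders, not a recipe that worked for me as-is):
|
|     import torch
|
|     class Reservoir:
|         def __init__(self, hidden_dim, res_dim=512):
|             # fixed random projection + recurrent weights
|             # (echo-state style)
|             self.W_in = torch.randn(hidden_dim, res_dim) * 0.1
|             self.W_res = torch.randn(res_dim, res_dim) * 0.05
|             self.state = torch.zeros(res_dim)
|
|         def update(self, hidden):
|             # hidden: (batch, seq, hidden_dim); crude pooling
|             pooled = hidden.mean(dim=(0, 1))
|             self.state = torch.tanh(
|                 pooled @ self.W_in + self.state @ self.W_res)
|             return self.state
|
|     def attach_reservoirs(model, hidden_dim):
|         reservoirs = []
|         for layer in model.transformer.h:  # name varies by model
|             res = Reservoir(hidden_dim)
|             layer.register_forward_hook(
|                 lambda m, i, out, res=res: res.update(out[0].detach()))
|             reservoirs.append(res)
|         return reservoirs  # read these out over time as latent "memory"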
| dkersten wrote:
| Agree! I'm not an AI engineer or researcher, but it always
| struck me as odd that we would serialise the 100B or whatever
| parameters of latent space down to at most 1M tokens and back
| for every step.
| vonneumannstan wrote:
| >I feel it is interesting but not what would be ideal. I really
| think if the models could be less linear and process over time
| in latent space you'd get something much more akin to thought.
|
| Please stop, this is how you get AI takeovers.
| adastra22 wrote:
| Citation seriously needed.
| CuriouslyC wrote:
| They're already implementing branching thought and taking the
| best one, eventually the entire response will be branched, with
| branches being spawned and culled by some metric over the
| lifetime of the completion. It's just not feasible now for
| performance reasons.
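|
| As a rough sketch of the branch-and-cull idea (best-of-N with
| some scoring metric; "generate" and "score" here stand in for
| whatever sampler and verifier/reward you have, not a specific
| API):
|
|     import heapq
|
|     def branched_completion(prompt, generate, score,
|                             n_branches=8, n_rounds=4, keep=2):
|         """Spawn continuations, cull to the best-scoring, repeat."""
|         branches = [prompt]
|         for _ in range(n_rounds):
|             candidates = []
|             for b in branches:
|                 for _ in range(max(1, n_branches // len(branches))):
|                     candidates.append(b + generate(b, max_tokens=64))
|             # cull: keep only the top-scoring partial completions
|             branches = heapq.nlargest(keep, candidates, key=score)
|         return max(branches, key=score)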
| mentalgear wrote:
| > Whether AI reasoning is "real" reasoning or just a mirage can
| be an interesting question, but it is primarily a philosophical
| question. It depends on having a clear definition of what "real"
| reasoning is, exactly.
|
| It's pretty easy: causal reasoning. Causal, not just statistical
| correlation as LLMs do, with or without "CoT".
| naasking wrote:
| Define causal reasoning?
| glial wrote:
| Correct me if I'm wrong, I'm not sure it's so simple. LLMs are
| called causal models in the sense that earlier tokens "cause"
| later tokens, that is, later tokens are causally dependent on
| what the earlier tokens are.
|
| If you mean deterministic rather than probabilistic, even
| Pearl-style causal models are probabilistic.
|
| I think the author is circling around the idea that their idea
| of reasoning is to produce statements in a formal system: to
| have a set of axioms, a set of production rules, and to
| generate new strings/sentences/theorems using those rules. This
| approach is how math is formalized. It allows us to extrapolate
| - make new "theorems" or constructions that weren't in the
| "training set".
| jayd16 wrote:
| By this definition a bag of answers is causal reasoning
| because we previously filled the bag, which caused what we
| pulled. State causing a result is not causal reasoning.
|
| You need to actually have something that deduces a result
| from a set of principles that form a logical conclusion or
| the understanding that more data is needed to make a
| conclusion. That is clearly different from finding a likely
| next token on statistics alone, despite the fact that the
| statistical answer can be correct.
| apples_oranges wrote:
| But let's say you change your mathematical expression by
| reducing or expanding it somehow. Then, unless it's trivial,
| there are infinitely many ways to do it, and the "cause" here is
| the answer to the question "why did you do that and not
| something else?" Brute force excluded, the cause is probably
| some idea, some model of the problem, or a gut feeling (or
| desperation..)
| stonemetal12 wrote:
| Smoking increases the risk of getting cancer significantly.
| We say Smoking causes Cancer. Causal reasoning can be
| probabilistic.
|
| LLMs are not doing causal reasoning because there are no facts,
| only tokens. For the most part you can't ask LLMs how they
| came to an answer, because they don't know.
| lordnacho wrote:
| What's stopping us from building an LLM that can build causal
| trees, rejecting some trees and accepting others based on
| whatever evidence it is fed?
|
| Or even a causal tool for an LLM agent, working the way it
| does when you ask it about math and it forwards the request to
| Wolfram.
| suddenlybananas wrote:
| >What's stopping us from building an LLM that can build
| causal trees, rejecting some trees and accepting others based
| on whatever evidence it is fed?
|
| Exponential time complexity.
| mdp2021 wrote:
| > _causal reasoning_
|
| You have missed the foundation: before dynamics, being. Before
| causal reasoning you have deep definition of concepts.
| Causality is "below" that.
| empath75 wrote:
| One thing that LLMs have exposed is how much of a house of cards
| all of our definitions of "human mind"-adjacent concepts are. We
| have a single example in all of reality of a being that thinks
| like we do, and so all of our definitions of thinking are
| inextricably tied with "how humans think", and now we have an
| entity that does things which seem to be very like how we think,
| but not _exactly like it_, and a lot of our definitions don't
| seem to work any more:
|
| Reasoning, thinking, knowing, feeling, understanding, etc.
|
| Or at the very least, our rubrics and heuristics for determining
| if someone (thing) thinks, feels, knows, etc, no longer work. And
| in particular, people create tests for those things thinking that
| they understand what they are testing for, when _most human
| beings_ would also fail those tests.
|
| I think a _lot_ of really foundational work needs to be done on
| clearly defining a lot of these terms and putting them on a
| sounder basis before we can really move forward on saying whether
| machines can do those things.
| gdbsjjdn wrote:
| Congratulations, you've invented philosophy.
| empath75 wrote:
| This is an obnoxious response. Of course I recognize that
| philosophy is the solution to this. What I am pointing out is
| that philosophy has not as of yet resolved these relatively
| new problems. The idea that non-human intelligences might
| exist is of course an old one, but that is different from
| having an actual (potentially) existing one to reckon with.
| adastra22 wrote:
| These are not new problems though.
| deadbabe wrote:
| Non-human intelligences have always existed in the form of
| animals.
|
| Animals do not have spoken language the way humans do, so
| their thoughts aren't really composed of sentences. Yet,
| they have intelligence and can reason about their world.
|
| How could we build an AGI that doesn't use language to
| think at all? We have no fucking clue and won't for a while
| because everyone is chasing the mirage created by LLMs. AI
| winter will come and we'll sit around waiting for the next
| big innovation. Probably some universal GOAP with deeply
| recurrent neural nets.
| gdbsjjdn wrote:
| > Writings on metacognition date back at least as far as
| two works by the Greek philosopher Aristotle (384-322 BC):
| On the Soul and the Parva Naturalia
|
| We built a box that spits out natural language and tricks
| humans into believing it's conscious. The box itself
| actually isn't that interesting, but the human side of the
| equation is.
| mdp2021 wrote:
| > _the human side of the equation is_
|
| You have only proven the urgency of Intelligence, the
| need to produce it in inflationary amounts.
| meindnoch wrote:
| We need to reinvent philosophy. With JSON this time.
| mdp2021 wrote:
| > _which seem to be very like how we think_
|
| I would like to reassure you that we - we here - see that LLMs
| are very much unlike us.
| empath75 wrote:
| Yes I very much understand that most people do not think that
| LLMs think or understand like we do, but it is _very
| difficult_ to prove that that is the case, using any test
| which does not also exclude a great deal of people. And that
| is because "thinking like we do" is not at all a well-defined
| concept.
| mdp2021 wrote:
| > _exclude a great deal of people_
|
| And why should you not exclude them? Where does this idea
| come from, taking random elements as models? Where do you
| see pedestals of free access? Is the Nobel Prize a raffle
| now?
| gilbetron wrote:
| I agree 100% with you. I'm most excited about LLMs because they
| seem to capture at least some aspect of intelligence, and
| that's amazing given how long it took to get here. It's
| exciting that we just don't understand it.
|
| I see people say, "LLMs aren't human intelligence", but
| instead, I really feel that it shows that many people, and much
| of what we do, probably is like an LLM. Most people just
| hallucinate their way through a conversation, they certainly
| don't reason. Reasoning is incredibly rare.
| naasking wrote:
| > Because reasoning tasks require choosing between several
| different options. "A B C D [M1] -> B C D E" isn't reasoning,
| it's computation, because it has no mechanism for thinking "oh, I
| went down the wrong track, let me try something else". That's why
| the most important token in AI reasoning models is "Wait". In
| fact, you can control how long a reasoning model thinks by
| arbitrarily appending "Wait" to the chain-of-thought. Actual
| reasoning models change direction all the time, but this paper's
| toy example is structurally incapable of it.
|
| I think this is the most important critique that undercuts the
| paper's claims. I'm less convinced by the other point. I think
| backtracking and/or parallel search is something future papers
| should definitely look at in smaller models.
|
| The article is definitely also correct on the overreaching, broad
| philosophical claims that seem common when discussing AI and
| reasoning.
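|
| The "Wait" trick is easy to picture as a decode loop that swaps
| the end-of-thinking marker for "Wait" until a token budget is
| spent (a rough sketch; the token names and the generate_one call
| are placeholders, not any particular vendor's API):
|
|     def think_with_budget(model, prompt, min_thinking_tokens=1024):
|         """Force extra reflection by replacing end-of-thinking
|         with 'Wait'."""
|         context = prompt + "<think>"
|         n = 0
|         while True:
|             tok = model.generate_one(context)  # next token, given context
|             if tok == "</think>" and n < min_thinking_tokens:
|                 tok = "Wait"          # keep the chain-of-thought going
|             context += tok
|             n += 1
|             if tok == "</think>" or n > 4 * min_thinking_tokens:
|                 break
|         return context                # the answer follows the </think>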
| mucho_mojo wrote:
| This paper, which I read about here, has an interesting
| mathematical model for reasoning based on cognitive science.
| https://arxiv.org/abs/2506.21734 (there is also code here
| https://github.com/sapientinc/HRM) I think we will see dramatic
| performance increases on "reasoning" problems when this is worked
| into existing AI architectures.
| stonemetal12 wrote:
| When Using AI they say "Context is King". "Reasoning" models are
| using the AI to generate context. They are not reasoning in the
| sense of logic, or philosophy. Mirage, whatever you want to call
| it, it is rather unlike what people mean when they use the term
| reasoning. Calling it reasoning is up there with calling
| generating out put people don't like hallucinations.
| adastra22 wrote:
| You are making the same mistake OP is calling out. As far as I
| can tell "generating context" is exactly what human reasoning
| is too. Consider the phrase "let's reason this out" where you
| then explore all options in detail, before pronouncing your
| judgement. Feels exactly like what the AI reasoner is doing.
| stonemetal12 wrote:
| "let's reason this out" is about gathering all the facts you
| need, not just noting down random words that are related. The
| map is not the terrain, words are not facts.
| energy123 wrote:
| Performance is proportional to the number of reasoning
| tokens. How to reconcile that with your opinion that they
| are "random words"?
| kelipso wrote:
| Technically, "random" can have probabilities associated with
| it. In casual speech, random means equal probabilities, or
| that we don't know the probabilities. But for LLM token
| output, the model does estimate the probabilities.
| blargey wrote:
| s/random/statistically-likely/g
|
| Reducing the distance of each statistical leap improves
| "performance" since you would avoid failure modes that
| are specific to the largest statistical leaps, but it
| doesn't change the underlying mechanism. Reasoning models
| still "hallucinate" spectacularly even with "shorter"
| gaps.
| ikari_pl wrote:
| What's wrong with statistically likely?
|
| If I ask you what's 2+2, there's a single answer I
| consider much more likely than others.
|
| Sometimes, words are likely because they are grounded in
| ideas and facts they represent.
| blargey wrote:
| > Sometimes, words are likely because they are grounded
| in ideas and facts they represent.
|
| Yes, and other times they are not. I think the failure
| modes of a statistical model of a communicative model of
| thought are unintuitive enough without any added layers
| of anthropomorphization, so there remains some value in
| pointing it out.
| CooCooCaCha wrote:
| Reasoning is also about _processing_ facts.
| ThrowawayTestr wrote:
| Have you read the chain of thought output from reasoning
| models? That's not what it does.
| mdp2021 wrote:
| But a big point here becomes whether the generated "context"
| then receives proper processing.
| slashdave wrote:
| Perhaps we can find some objective means to decide, rather
| than go with what "feels" correct
| phailhaus wrote:
| Feels like, but isn't. When you are reasoning things out,
| there is a brain with state that is actively modeling the
| problem. AI does no such thing, it produces text and then
| uses that text to condition the next text. If it isn't
| written, it does not exist.
|
| Put another way, LLMs are good at talking like they are
| thinking. That can get you pretty far, but it is not
| reasoning.
| double0jimb0 wrote:
| So exactly what language/paradigm is this brain modeling
| the problem within?
| phailhaus wrote:
| We literally don't know. We don't understand how the
| brain stores concepts. It's not necessarily language:
| there are people that do not have an internal monologue,
| and yet they are still capable of higher level thinking.
| chrisweekly wrote:
| Rilke: "There is a depth of thought untouched by words,
| and deeper still a depth of formless feeling untouched by
| thought."
| Enginerrrd wrote:
| The transformer architecture absolutely keeps state
| information "in its head" so to speak as it produces the
| next word prediction, and uses that information in its
| compute.
|
| It's true that if it's not producing text, there is no
| thinking involved, but it is absolutely NOT clear that the
| attention block isn't holding state and modeling something
| as it works to produce text predictions. In fact, I can't
| think of a way to define it that would make that untrue...
| unless you mean that there isn't a system wherein something
| like attention is updating/computing and the model itself
| _chooses_ when to make text predictions. That's by design,
| but what you're arguing doesn't really follow.
|
| Now, whether what the model is thinking about inside that
| attention block matches up exactly or completely with the
| text it's producing as generated context is probably at
| least a little dubious, and it's unlikely to be a complete
| representation regardless.
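|
| Concretely, the "state" I mean is the per-layer key/value vectors
| cached for every previous token; a stripped-down sketch of one
| cached decode step (single head, toy shapes, not any real
| library's API):
|
|     import numpy as np
|
|     def attend(q, K, V):
|         """One new query attending over all cached keys/values."""
|         scores = K @ q / np.sqrt(len(q))
|         w = np.exp(scores - scores.max()); w /= w.sum()
|         return w @ V
|
|     def decode_step(x_new, cache, Wq, Wk, Wv):
|         """The model never re-reads text; it reuses cached K/V."""
|         q, k, v = Wq @ x_new, Wk @ x_new, Wv @ x_new
|         cache["K"] = np.vstack([cache["K"], k])  # state grows per token
|         cache["V"] = np.vstack([cache["V"], v])
|         return attend(q, cache["K"], cache["V"])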
| dmacfour wrote:
| > The transformer architecture absolutely keeps state
| information "in its head" so to speak as it produces the
| next word prediction, and uses that information in its
| compute.
|
| How so? Transformers are state space models.
| kelipso wrote:
| No, people make logical connections, make inferences, make
| sure all of it fits together without logical errors, etc.
| pixl97 wrote:
| These people you're talking about must be rare online, as
| human communication is pretty rife with logical errors.
| mdp2021 wrote:
| Since that November in which this technology boomed we
| have been much too often reading "people also drink from
| puddles", as if it were standard practice.
|
| That we implement skills, not deficiencies, is a basic
| concept that is getting to such a level of needed
| visibility it should probably be inserted in the
| guidelines.
|
| _We implement skills, not deficiencies._
| kelipso wrote:
| You shouldn't be basing your entire worldview around the
| lowest common denominator. All kinds of writers like blog
| writers, novelists, scriptwriters, technical writers,
| academics, poets, lawyers, philosophers, mathematicians,
| and even teenage fan fiction writers do what I said above
| routinely.
| viccis wrote:
| >As far as I can tell "generating context" is exactly what
| human reasoning is too.
|
| This was the view of Hume (humans as bundles of experience
| who just collect information and make educated guesses for
| everything). Unfortunately, it leads to philosophical
| skepticism, in which you can't ground any knowledge
| absolutely, as it's all just justified by some knowledge you
| got from someone else, which also came from someone else,
| etc., and eventually you can't actually justify any knowledge
| that isn't directly a result of experience (the concept of
| "every effect has a cause" is a classic example).
|
| There have been plenty of epistemological responses to this
| viewpoint, with Kant's view, of humans doing a mix of
| "gathering context" (using our senses) but also applying
| universal categorical reasoning to schematize and understand
| / reason from the objects we sense, being the most well
| known.
|
| I feel like anyone talking about the epistemology of AI
| should spend some time reading the basics of all of the
| thought from the greatest thinkers on the subject in
| history...
| js8 wrote:
| > I feel like anyone talking about the epistemology of AI
| should spend some time reading the basics
|
| I agree. I think the problem with AI is that we don't know, or
| haven't formalized well enough, what epistemology AGI systems
| should have. Instead, people are looking for shortcuts,
| feeding huge amounts of data into the models, hoping it will
| self-organize into something that humans actually want.
| bongodongobob wrote:
| And yet it improves their problem solving ability.
| ofjcihen wrote:
| It's incredible to me that so many seem to have fallen for
| "humans are just LLMs bruh" argument but I think I'm beginning
| to understand the root of the issue.
|
| People who only "deeply" study technology only have that frame
| of reference to view the world so they make the mistake of
| assuming everything must work that way, including humans.
|
| If they had a wider frame of reference that included, for
| example, Early Childhood Development, they might have enough
| knowledge to think outside of this box and know just how
| ridiculous that argument is.
| gond wrote:
| That is an issue prevalent in the western world for the last
| 200 years, beginning possibly with the Industrial Revolution,
| probably earlier. That problem is reductionism, consequently
| applied down to the last level: discover the smallest element
| of every field of science, develop an understanding of all
| the parts from the smallest part upwards and develop, from
| the understanding of the parts, an understanding of the
| whole.
|
| Unfortunately, this approach does not yield understanding, it
| yields know-how.
| Kim_Bruning wrote:
| Taking things apart to see how they tick is called
| reduction, but (re)assembling the parts is emergence.
|
| When you reduce something to its components, you lose
| information on how the components work together. Emergence
| 'finds' that information back.
|
| Compare differentiation and integration, which lose and
| gain terms respectively.
|
| In some cases, I can imagine differentiating and
| integrating certain functions actually would even be a
| direct demonstration of reduction and emergence.
| dmacfour wrote:
| I have a background in ML and work in software development,
| but studied experimental psych in a past life. It's actually
| kind of painful watching people slap phrases related to
| cognition onto things that aren't even functionally
| equivalent to their namesakes, then parade them around like
| some kind of revelation. It's also a little surprising that
| there's no interest (at least publicly) in using cognitive
| architectures in the development of AI systems.
| cyanydeez wrote:
| They should call them Fuzzing models. They're just running
| through various iterations of the context until they hit a
| token that trips them out.
| benreesman wrote:
| People will go to extremely great lengths to debate the
| appropriate analogy for how these things work, which is fun I
| guess but in a "get high with a buddy" sense at least to my
| taste.
|
| Some of how they work is well understood (a lot now, actually),
| some of the outcomes are still surprising.
|
| But we debate both the well understood parts and the surprising
| parts _both_ with the wrong terminology borrowed from pretty
| dubious corners of pop cognitive science, and not with
| terminology appropriate to the new and different thing! It's
| nothing like a brain, it's a new different thing. Does it think
| or reason? Who knows pass the blunt.
|
| They do X performance on Y task according to Z eval, that's how
| you discuss ML model capability if you're pursuing
| understanding rather than fundraising or clicks.
| Vegenoid wrote:
| While I largely agree with you, more abstract judgements must
| be made as the capabilities (and therefore tasks being
| completed) become increasingly general. Attempts to boil
| human intellectual capability down to "X performance on Y
| task according to Z eval" can be useful, but are famously
| incomplete and insufficient on their own for making good
| decisions about which humans (a.k.a. which general
| intelligences) are useful and how to utilize and improve
| them. Boiling down highly complex behavior into a small
| number of metrics loses a lot of detail.
|
| There is also the desire to discover _why_ a model that
| outperforms others does so, so that the successful technique
| can be refined and applied elsewhere. This too usually
| requires more approaches than metric comparison.
| moc_was_wronged wrote:
| Mostly. It gives language models a way to dynamically allocate
| computation time, but the models are still fundamentally
| imitative.
| modeless wrote:
| "The question [whether computers can think] is just as relevant
| and just as meaningful as the question whether submarines can
| swim." -- Edsger W. Dijkstra, 24 November 1983
| mdp2021 wrote:
| But the topic here is whether some techniques are progressive
| or not
|
| (with a curious parallel about whether some paths in thought
| are dead-ends - the unproductive focus mentioned in the
| article).
| griffzhowl wrote:
| I don't agree with the parallel. Submarines can move through
| water - whether you call that swimming or not isn't an
| interesting question, and doesn't illuminate the function of a
| submarine.
|
| With thinking or reasoning, there's not really a precise
| definition of what it is, but we nevertheless know that
| currently LLMs and machines more generally can't reproduce many
| of the human behaviours that we refer to as thinking.
|
| The question of what tasks machines can currently accomplish is
| certainly meaningful, if not urgent, and the reason LLMs are
| getting so much attention now is that they're accomplishing
| tasks that machines previously couldn't do.
|
| To some extent there might always remain a question about
| whether we call what the machine is doing "thinking" - but
| that's the uninteresting verbal question. To get at the
| meaningful questions we might need a more precise or higher
| resolution map of what we mean by thinking, but the crucial
| element is what functions a machine can perform, what tasks it
| can accomplish, and whether we call that "thinking" or not
| doesn't seem important.
|
| Maybe that was even Dijkstra's point, but it's hard to tell
| without context...
| wizzwizz4 wrote:
| https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD898...
| provides the context. I haven't re-read it in the last
| month, but I'm pretty sure you've correctly identified
| Dijkstra's point.
| modeless wrote:
| It is strange that you started your comment with "I don't
| agree". The rest of the comment demonstrates that you do
| agree.
| griffzhowl wrote:
| To be more clear about why I disagree the cases are
| parallel:
|
| We know how a submarine moves through water, whether it's
| "swimming" isn't an interesting question.
|
| We don't know to what extent a machine can reproduce the
| cognitive functions of a human. There are substantive and
| significant questions about whether or to what extent a
| particular machine or program can reproduce human cognitive
| functions.
|
| So I might have phrased my original comment badly. It
| doesn't matter if we use the word "thinking" or not, but it
| does matter if a machine can reproduce the human cognitive
| functions, and if that's what we mean by the question
| whether a machine can think, then it does matter.
| modeless wrote:
| "We know how it moves" is not the reason the question of
| whether a submarine swims is not interesting. It's
| because the question is mainly about the definition of
| the word "swim" rather than about capabilities.
|
| > if that's what we mean by the question whether a
| machine can think
|
| That's the issue. The question of whether a machine can
| think (or reason) is a question of word definitions, not
| capabilities. The capabilities questions are the ones
| that matter.
| griffzhowl wrote:
| > The capabilities questions are the ones that matter.
|
| Yes, that's what I'm saying. I also think there's a clear
| sense in which asking whether machines can think is a
| question about capabilities, even though we would need a
| more precise definition of "thinking" to be able to
| answer it.
|
| So that's how I'd sum it up: we know the capabilities of
| submarines, and whether we say they're swimming or not
| doesn't answer any further question about those
| capabilities. We don't know the capabilities of machines;
| the interesting questions are about what they can do, and
| one (imprecise) way of asking that question is whether
| they can think
| skybrian wrote:
| Mathematical reasoning does sometimes require correct
| calculations, and if you get them wrong your answers will be
| wrong. I wouldn't want someone doing my taxes to be bad at
| calculation or bad at finding mistakes in calculation.
|
| It would be interesting to see if this study's results can be
| reproduced in a more realistic setting.
| slashdave wrote:
| > reasoning probably requires language use
|
| The author has a curious idea of what "reasoning" entails.
| sixdimensional wrote:
| I feel like the fundamental concept of symbolic logic[1] as a
| means of reasoning fits within the capabilities of LLMs.
|
| Whether it's a mirage or not, the ability to produce a
| symbolically logical result that has valuable meaning seems real
| enough to me.
|
| Especially since most meaning is assigned by humans onto the
| world... so too can we choose to assign meaning (or not) to the
| output of a chain of symbolic logic processing?
|
| Edit: maybe it is not so much that an LLM calculates/evaluates
| the result of symbolic logic as it is that it "follows" the
| pattern of logic encoded into the model.
|
| [1] https://en.wikipedia.org/wiki/Logic
| lawrence1 wrote:
| we should be asking if reasoning while speaking is even possible
| for humans. this is why we have the scientific method and that's
| why LLMs write and run unit tests on their reasoning. But yeah
| intelligence is probably in the ear of the believer.
| hungmung wrote:
| Chain of thought is just a way of trying to squeeze more juice
| out of the lemon of LLMs; I suspect we're at the stage of
| running up against diminishing returns and we'll have to move to
| different foundational models to see any serious improvement.
| brunokim wrote:
| I'm unconvinced by the article's criticisms, given they also
| employ feels and few citations.
|
| > I appreciate that research has to be done on small models, but
| we know that reasoning is an emergent capability! (...) Even if
| you grant that what they're measuring is reasoning, I am
| profoundly unconvinced that their results will generalize to a
| 1B, 10B or 100B model.
|
| A fundamental part of applied research is simplifying a real-
| world phenomenon to better understand it. Dismissing the finding
| that, with this many parameters and such a simple problem, the
| LLM can't perform out of distribution, just because the model
| isn't big enough, undermines the very value of independent
| research. Tomorrow
| another model with double the parameters may or may not show the
| same behavior, but that finding will be built on top of this one.
|
| Also, how do _you_ know that reasoning is emergent, and not
| rationalising on top of a compressed version of the web stored in
| 100B parameters?
| ActionHank wrote:
| I think that when you are arguing logic and reason with a group
| who became really attached to the term vibe-coding you've
| likely already lost.
| LudwigNagasena wrote:
| > The first is that reasoning probably requires language use.
| Even if you don't think AI models can "really" reason - more on
| that later - even simulated reasoning has to be reasoning in
| human language.
|
| That is an unreasonable assumption. In the case of LLMs it seems
| wasteful to transform a point from latent space into a random
| token and lose information. In fact, I think in near future it
| will be the norm for MLLMs to "think" and "reason" without
| outputting a single "word".
|
| > Whether AI reasoning is "real" reasoning or just a mirage can
| be an interesting question, but it is primarily a philosophical
| question. It depends on having a clear definition of what "real"
| reasoning is, exactly.
|
| It is not a "philosophical" (by which the author probably meant
| "practically inconsequential") question. If the whole reasoning
| business is just rationalization of pre-computed answers or
| simply a means to do some computations because every token
| provides only a fixed amount of computation to update the model's
| state, then it doesn't make much sense to focus on improving the
| quality of chain-of-thought output from human POV.
| kazinator wrote:
| Not all reasoning requires language. Symbolic reasoning uses
| language.
|
| Real-time spatial reasoning like driving a car and not hitting
| things does not seem linguistic.
|
| Figuring out how to rotate a cabinet so that it will clear
| through a stairwell also doesn't seem like it requires
| language, only to communicate the solution to someone else
| (where language can turn into a hindrance, compared to a
| diagram or model).
| llllm wrote:
| Pivot!
| kazinator wrote:
| Can we be Friends?
| vmg12 wrote:
| Solutions to some of the hardest problems I've had have only
| come after a night of sleep or when I'm out on a walk and I'm
| not even thinking about the problem. Maybe what my brain was
| doing was something different from reasoning?
| andybak wrote:
| This is a very important point and mostly absent from the
| conversation.
|
| We have many words that almost mean the same thing or can
| mean many different things - and conversations about
| intelligence and consciousness are riddled with them.
| tempodox wrote:
| > This is a very important point and mostly absent from the
| conversation.
|
| That's because when humans are mentioned at all in the
| context of coding with "AI", it's mostly as bad and buggy
| simulations of those perfect machines.
| safety1st wrote:
| I'm pretty much a layperson in this field, but I don't
| understand why we're trying to teach a stochastic text
| transformer to reason. Why would anyone expect that approach to
| work?
|
| I would have thought the more obvious approach would be to
| couple it to some kind of symbolic logic engine. It might
| transform plain language statements into fragments conforming
| to a syntax which that engine could then parse
| deterministically. This is the Platonic ideal of reasoning that
| the author of the post pooh-poohs, I guess, but it seems to me
| to be the whole point of reasoning; reasoning is the
| application of logic in evaluating a proposition. The LLM might
| be trained to generate elements of the proposition, but it's
| too random to apply logic.
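|
| Something like the coupling I have in mind, sketched below. The
| deterministic checker is real Python; the step where an LLM
| would translate prose into these little formulas is the
| hypothetical part:
|
|     from itertools import product
|
|     def entails(premises, conclusion, atoms):
|         """Does the conclusion hold in every world where all
|         premises do?"""
|         for values in product([False, True], repeat=len(atoms)):
|             world = dict(zip(atoms, values))
|             if all(p(world) for p in premises) and not conclusion(world):
|                 return False          # countermodel found
|         return True
|
|     # Hand-written here; the LLM's only job would be emitting
|     # formulas like these from plain-language statements.
|     premises = [lambda w: (not w["rain"]) or w["wet"],  # rain -> wet
|                 lambda w: w["rain"]]                    # it is raining
|     print(entails(premises, lambda w: w["wet"],
|                   atoms=["rain", "wet"]))               # True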
| shannifin wrote:
| Problem is, even with symbolic logic, reasoning is not
| completely deterministic. Whether one can get to a given
| proposition from a given set of axioms is sometimes
| undecidable.
| limaoscarjuliet wrote:
| > In fact, I think in near future it will be the norm for MLLMs
| to "think" and "reason" without outputting a single "word".
|
| It will be outputting something, as this is the only way it can
| get more compute - output a token, then all context + the next
| token is fed through the LLM again. It might not be presented
| to the user, but that's a different story.
| potsandpans wrote:
| > It is not a "philosophical" (by which the author probably
| meant "practically inconsequential") question.
|
| I didn't take it that way. I suppose it depends on whether or
| not you believe philosophy is legitimate
| pornel wrote:
| You're looking at this from the perspective of what would make
| sense for the model to produce. Unfortunately, what really
| dictates the design of the models is what we can train the
| models with (efficiently, at scale). The output is then roughly
| just the reverse of the training. We don't even want AI to be
| an "autocomplete", but we've got tons of text, and a relatively
| efficient method of training on all prefixes of a sentence at
| the same time.
|
| There have been experiments with preserving embedding vectors
| of the tokens exactly without loss caused by round-tripping
| through text, but the results were "meh", presumably because it
| wasn't the _input_ format the model was trained on.
|
| It's conceivable that models trained on some vector "neuralese"
| that is completely separate from text would work better, but
| it's a catch 22 for training: the internal representations
| don't exist in a useful sense until the model is trained, so we
| don't have anything to feed into the models to make them use
| them. The internal representations also don't stay stable when
| the model is trained further.
| skywhopper wrote:
| I mostly agree with the point the author makes that "it doesn't
| matter". But then again, it does matter, because LLM-based
| products are marketed based on "IT CAN REASON!" And so, while it
| may not matter, per se, how an LLM comes up with its results, to
| the extent that people choose to rely on LLMs because of
| marketing pitches, it's worth pushing back on those claims if
| they are overblown, using the same frame that the marketers use.
|
| That said, this author says this question of whether models "can
| reason" is the least interesting thing to ask. But I think the
| least interesting thing you can do is to go around taking every
| complaint about LLM performance and saying "but humans do the
| exact same thing!" Which is often not true, but again, _doesn 't
| matter_.
| cess11 wrote:
| Yes, it's a mirage, since this type of software is an opaque
| simulation, perhaps even a simulacrum. It's reasoning in the same
| sense as there are terrorists in a game of Counter-Strike.
| jrm4 wrote:
| Current thought, for me there's a lot of hand-wringing about what
| is "reasoning" and what isn't. But right now perhaps the question
| might be boiled down to -- "is the bottleneck merely hard drive
| space/memory/computing speed?"
|
| I kind of feel like we won't be able to even begin to test this
| until a few more "Moore's law" cycles.
| j45 wrote:
| Currently it feels like it's more simulated chain-of-thought /
| reasoning, sometimes very consistent, but simulated, partially
| because it's statistically generated and non-deterministic (not
| the exact same path to a similar or the same answer on each run).
| js8 wrote:
| I think the LLM's chain of thought is reasoning. When trained, the
| LLM sees a lot of examples like "All men are mortal. Socrates is a
| man." followed by "Therefore, Socrates is mortal.". This causes
| the transformer to learn rule "All A are B. C is A." is often
| followed by "Therefore, C is B." And so it can apply this logical
| rule, predictively. (I have converted the example from latent
| space to human language for clarity.)
|
| Unfortunately, sometimes LLM also learns "All A are C. All B are
| C." is followed by "Therefore, A is B.", due to bad example in
| the training data. (More insidiously, it might learn this rule
| only in a special case.)
|
| So it learns some logic rules but not consistently. This lack of
| consistency will cause it to fail on larger problems.
|
| I think NNs (transformers) could be great in heuristic suggesting
| which valid logical rules (could be even modal or fuzzy logic) to
| apply in order to solve a certain formalized problem, but not so
| great at coming up with the logic rules themselves. They could
| also be great at transforming the original problem/question from
| human language into some formal logic, that would then be
| resolved using heuristic search.
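|
| The gap between the valid template and the bad one can be checked
| mechanically; a small sketch that hunts for countermodels over a
| toy domain of sets (illustration only, not tied to any LLM):
|
|     from itertools import product
|
|     DOMAIN = [frozenset(s) for s in ("", "a", "b", "ab")]  # toy sets
|
|     def always_holds(template):
|         """True if the template has no countermodel over DOMAIN."""
|         return all(template(A, B, C)
|                    for A, B, C in product(DOMAIN, repeat=3))
|
|     # "All A are B, and x is in A, so x is in B"  (valid; here x is
|     # the element "a")
|     valid = lambda A, B, C: (not (A <= B and "a" in A)) or ("a" in B)
|
|     # "All A are C, and all B are C, so all A are B"  (the bad rule)
|     invalid = lambda A, B, C: (not (A <= C and B <= C)) or (A <= B)
|
|     print(always_holds(valid), always_holds(invalid))  # True False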
| gshulegaard wrote:
| > but we know that reasoning is an emergent capability!
|
| Do we though? There is widespread discussion and growing momentum
| of belief in this, but I have yet to see conclusive evidence of
| this. That is, in part, why the subject paper exists...it seeks
| to explore this question.
|
| I think the author's bias is bleeding fairly heavily into his
| analysis and conclusions:
|
| > Whether AI reasoning is "real" reasoning or just a mirage can
| be an interesting question, but it is primarily a philosophical
| question. It depends on having a clear definition of what "real"
| reasoning is, exactly.
|
| I think it's pretty obvious that the researchers are exploring
| whether or not LLMs exhibit evidence of _Deductive_ Reasoning
| [1]. The entire experiment design reflects this. Claiming that
| they haven't defined reasoning and therefore cannot conclude or
| hope to construct a viable experiment is...confusing.
|
| The question of whether or not an LLM can take a set of base
| facts and compose them to solve a novel/previously unseen problem
| is interesting and what most people discussing emergent reasoning
| capabilities of "AI" are tacitly referring to (IMO). Much like
| you can be taught algebraic principles and use them to solve for
| "x" in equations you have never seen before, can an LLM do the
| same?
|
| To which I find this experiment interesting enough. It presents a
| series of facts and then presents the LLM with tasks to see if it
| can use those facts in novel ways not included in the training
| data (something a human might reasonably deduce). To which their
| results and summary conclusions are relevant, interesting, and
| logically sound:
|
| > CoT is not a mechanism for genuine logical inference but rather
| a sophisticated form of structured pattern matching,
| fundamentally bounded by the data distribution seen during
| training. When pushed even slightly beyond this distribution, its
| performance degrades significantly, exposing the superficial
| nature of the "reasoning" it produces.
|
| > The ability of LLMs to produce "fluent nonsense"--plausible but
| logically flawed reasoning chains--can be more deceptive and
| damaging than an outright incorrect answer, as it projects a
| false aura of dependability.
|
| That isn't to say LLMs aren't useful; this just explores their
| boundaries. To use legal services as an example, using an LLM to
| summarize or search for relevant laws, cases, or legal precedent
| is something it would excel at. But don't ask an LLM to formulate
| a logical rebuttal to an opposing counsel's argument using those
| references.
|
| Larger models and larger training corpuses will expand that
| domain and make it more difficult for individuals to discern this
| limit; but just because you can no longer see a limit doesn't
| mean there is none.
|
| And to be clear, this doesn't diminish the value of LLMs. Even
| without true logical reasoning LLMs are quite powerful and useful
| tools.
|
| [1] https://en.wikipedia.org/wiki/Logical_reasoning
| dawnofdusk wrote:
| >but we know that reasoning is an emergent capability!
|
| This is like saying in the 70s that we know only the US is
| capable of sending a man to the moon. That reasoning
| developed in a particular context means very little about what
| the bare minimum requirements for that reasoning are.
|
| Overall I am not a fan of this blogpost. It's telling how long
| the author gets hung up on a paper making "broad philosophical
| claims about reasoning", based on what reads to me as fairly
| typical scientific writing style. It's also telling how highly
| cherry-picked the quotes they criticize from the paper are. Here
| is some fuller context:
|
| >An expanding body of analyses reveals that LLMs tend to rely on
| surface-level semantics and clues rather than logical procedures
| (Chen et al., 2025b; Kambhampati, 2024; Lanham et al., 2023;
| Stechly et al., 2024). LLMs construct superficial chains of logic
| based on learned token associations, often failing on tasks that
| deviate from commonsense heuristics or familiar templates (Tang
| et al., 2023). In the reasoning process, performance degrades
| sharply when irrelevant clauses are introduced, which indicates
| that models cannot grasp the underlying logic (Mirzadeh et al.,
| 2024)
|
| >Minor and semantically irrelevant perturbations such as
| distractor phrases or altered symbolic forms can cause
| significant performance drops in state-of-the-art models
| (Mirzadeh et al., 2024; Tang et al., 2023). Models often
| incorporate such irrelevant details into their reasoning,
| revealing a lack of sensitivity to salient information. Other
| studies show that models prioritize the surface form of reasoning
| over logical soundness; in some cases, longer but flawed
| reasoning paths yield better final answers than shorter, correct
| ones (Bentham et al., 2024). Similarly, performance does not
| scale with problem complexity as expected--models may overthink
| easy problems and give up on harder ones (Shojaee et al., 2025).
| Another critical concern is the faithfulness of the reasoning
| process. Intervention-based studies reveal that final answers
| often remain unchanged even when intermediate steps are falsified
| or omitted (Lanham et al., 2023), a phenomenon dubbed the
| illusion of transparency (Bentham et al., 2024; Chen et al.,
| 2025b).
|
| You don't need to be a philosopher to realize that these problems
| seem quite distinct from the problems with human reasoning. For
| example, "final answers remain unchanged even when intermediate
| steps are falsified or omitted"... can humans do this?
___________________________________________________________________
(page generated 2025-08-14 23:02 UTC)