[HN Gopher] Reasoning models reason well, until they don't
       ___________________________________________________________________
        
       Reasoning models reason well, until they don't
        
       Author : optimalsolver
       Score  : 200 points
       Date   : 2025-10-31 09:23 UTC (13 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | iLoveOncall wrote:
       | > [...] recent studies show that transformers and LLMs fail
       | catastrophically once reasoning problems exceed modest
       | complexity. We revisit these findings through the lens of large
       | reasoning models (LRMs) -- LLMs fine-tuned with incentives for
       | step-by-step argumentation and self-verification
       | 
       | This was the obvious outcome of the study (don't get me wrong,
       | obvious outcomes are still worth having research on).
       | 
       | "LRMs" *are* just LLMs. There's no such thing as a reasoning
       | model, it's just having an LLM write a better prompt than the
       | human would and then sending it to the LLM again.
       | 
       | Despite what Amodei and Altman want Wall Street to believe, they
       | did not suddenly unlock reasoning capabilities in LLMs by
       | essentially just running two different prompts in sequence to
       | answer the user's question.
       | 
       | The truly amazing thing is that reasoning models show ANY
       | improvement at all compared to non-reasoning models, when they're
       | the same exact thing.
        
         | sothatsit wrote:
         | What do you mean by reasoning?
         | 
         | If you mean solving logic problems, then reasoning LLMs seem to
          | pass that bar as they do very well in programming and maths
         | competitions. Reasoning LLMs can also complete problems like
         | multiplying large numbers, which requires applying some sort of
         | algorithm where the results cannot just be memorised. They also
         | do this much better than standard pre-trained LLMs with no RL.
         | 
         | So, that makes me come back to this question of what definition
         | of reasoning do people use that reasoning models do not meet?
         | They're not perfect, obviously, but that is not a requirement
         | of reasoning if you agree that humans can reason. We make
         | mistakes as well, and we also suffer under higher complexity.
         | Perhaps they are less reliable in knowing when they have made
         | mistakes or not than trained humans, but I wouldn't personally
         | include reliability in my definition for reasoning (just look
         | at how often humans make mistakes in tests).
         | 
          | I have yet to see any serious, reasoned arguments that suggest
          | why the amazing achievements of reasoning LLMs in maths and
          | programming competitions, on novel problems, do not count as
          | "real reasoning". It seems much more that people just don't
         | like the idea of LLMs reasoning, and so reject the idea without
         | giving an actual reason themselves, which seems somewhat ironic
         | to me.
        
           | fsloth wrote:
            | I guess we mean here "useful reasoning" instead of the
            | idiot-savant kind. I mean, it's a fair ask since these are
            | marketed as _tools_ you can use to implement _industrial
            | processes_ and even replace your human workers.
            | 
            | In that sense the model does not need to be the most
            | reasonable interpreter of vague and poorly formulated user
            | inputs, but I think it has to improve at least a bit to
            | become a useful general appliance and not just a
            | test-scoring automaton.
           | 
           | The key differentiator here is that tests generally _are made
           | to be unambiguously scoreable_. Real world problems are often
           | more vague from the point of view of optimal outcome.
        
             | sothatsit wrote:
             | Thanks. So, people are extending "reasoning" to include
             | making good decisions, rather than just solving logic
             | problems. That makes sense to me that if people use that
             | definition, LLMs are pretty bad at "reasoning".
             | 
             | Although, I would argue that this is not reasoning at all,
             | but rather "common sense" or the ability to have a broader
             | perspective or think of the future. These are tasks that
             | come with experience. That is why these do not seem like
             | reasoning tasks to me, but rather soft skills that LLMs
             | lack. In my mind these are pretty separate concerns to
             | whether LLMs can logically step through problems or apply
             | algorithms, which is what I would call reasoning.
        
               | hansmayer wrote:
                | Ah yes, then let me unchain my LLM on those nasty
                | unsolved math and logic problems I've absolutely not been
                | struggling with in the course of my career.
        
               | sothatsit wrote:
               | A lot of maths students would also struggle to contribute
               | to frontier math problems, but we would still say they
               | are reasoning. Their skill at reasoning might not be as
               | good as professional mathematicians, but that does not
               | stop us from recognising that they can solve logic
               | problems without memorisation, which is a form of
               | reasoning.
               | 
               | I am just saying that LLMs have demonstrated they can
               | reason, at least a little bit. Whereas it seems other
               | people are saying that LLM reasoning is flawed, which
               | does not negate the fact that they can reason, at least
               | some of the time.
               | 
                | Maybe generalisation is one area where LLMs' reasoning is
                | weakest though. They can deliver near-elite performance
                | on nicely boxed-up competition math problems, but their
               | performance dramatically drops on real-world problems
               | where things aren't so neat. We see similar problems in
               | programming as well. I'd argue the progress on this has
               | been promising, but other people would probably
               | vehemently disagree with that. Time will tell.
        
               | vidarh wrote:
               | Thank you for picking at this.
               | 
               | A lot of people appear to be - often not consciously or
               | intentionally - setting the bar for "reasoning" at a
               | level many or most people would not meet.
               | 
                | Sometimes that is just a reaction to wanting an LLM that
                | produces results that are good enough for their own level.
               | Sometimes it reveals a view of fellow humans that would
               | be quite elitist if stated outright. Sometimes it's a
               | kneejerk attempt at setting the bar at a point that would
               | justify a claim that LLMs aren't reasoning.
               | 
               | Whatever the reason, it's a massive pet peeve of mine
               | that it is rarely made explicit in these conversations,
               | and it makes a lot of these conversations pointless
               | because people keep talking past each other.
               | 
               | For my part a lot of these models often clearly reason by
               | my standard, _even if poorly_. People also often reason
               | poorly, even when they demonstrably attempt to reason
               | step by step. Either because they have motivations to
                | skip over uncomfortable steps, or because they don't know
                | how to do it right. But we still would rarely claim
               | they are not capable of reasoning.
               | 
               | I wish more evaluations of LLMs would establish a human
               | baseline to test them against for much this reason. It
               | would be illuminating in terms of actually telling us
               | more about how LLMs match up to humans in different
               | areas.
        
               | cryptonym wrote:
               | Computers have forever been doing stuff people can't do.
               | 
               | The real question is how useful this tool is and if this
               | is as transformative as investors expect. Understanding
               | its limits is crucial.
        
               | cryptonym wrote:
               | That's the real deal.
               | 
                | They say LLMs are PhD-level. Despite billions of dollars,
                | PhD-LLMs sure are not contributing much to solving known
                | problems. Except, of course, a few limited marketing
                | stunts.
        
               | fsloth wrote:
               | IMHO that's the key differentiator.
               | 
                | You can give a human PhD an _unsolved problem_ in a field
                | adjacent to their expertise and expect some reasonable
                | resolution. LLM PhDs solve only known problems.
               | 
               | That said humans can also be really bad problem solvers.
               | 
                | If you don't care about solving the problem and only want
                | to create paperwork for bureaucracy, I guess you don't
                | care either way ("My team's on it!"), but companies that
                | don't go out of business generally recognize pretty soon
                | a lack of outcomes where it matters.
        
               | nl wrote:
                | > LLM PhDs solve only known problems.
               | 
               | Terry Tao would disagree:
               | https://mathstodon.xyz/@tao/114508029896631083
               | 
               | https://deepmind.google/discover/blog/alphaevolve-a-
               | gemini-p...
        
               | hansmayer wrote:
                | I wish our press were not effectively muted or bought off
                | by money; as it is, none of the journos has the cojones
                | to call out the specific people who were blabbing about
                | PhD levels, AGI etc. They should be goddamn calling them
                | out every single day, essentially doing their job, but
                | they are now too timid for that.
        
               | vidarh wrote:
               | I've "unchained" my LLM on a lot of problems that I
                | probably _could_ solve, but that would take me time I
                | don't have, and that it has solved in many cases faster than
               | I could. It may not be good enough to solve problems that
               | are _beyond_ us for most of us, but it certainly can
               | solve a lot of problems for a lot of us that have gone
               | unsolved for lack of resources.
        
               | cryptonym wrote:
                | It can solve problems you already know how to solve, if
                | you micro-manage it, and it'll BS a lot on the way.
                | 
                | If this is the maximum an AGI-PhD-LRM can do, that'll be
                | disappointing compared to the investments. Curious to see
                | what all this will become in a few years.
        
               | vidarh wrote:
               | I'm not usually micro-managing it, that's the point.
               | 
               | I _sometimes_ do on problems where I have particular
               | insight, but I mostly find it is _far more effective_ to
               | give it test cases and give it instructions on how to
               | approach a task, and then _let it iterate_ with little to
               | no oversight.
               | 
               | I'm letting Claude Code run for longer and longer with
               | --dangerously-skip-permissions, to the point I'm
               | pondering rigging up something to just keep feeding it
               | "continue" and run it in parallel on multiple problems.
               | 
               | Because at least when you have a good way of measuring
               | success, it works.
        
             | hansmayer wrote:
              | ^^This is a great view and it seems generally widely
              | understood by the rank and file techies. I feel pity for
              | the general-public retail investors who are about to be
              | left holding the bag for the VCs, after a certain major
              | <ahem> champion goes into IPO soon.
        
           | js8 wrote:
           | > So, that makes me come back to this question of what
           | definition of reasoning do people use that reasoning models
           | do not meet?
           | 
           | The models can learn reasoning rules, but they are not able
           | to apply them consistently or recognize the rules they have
           | learned are inconsistent. (See also my other comment which
           | references comments I made earlier.)
           | 
           | And I think they can't without a tradeoff, as I commented
           | https://news.ycombinator.com/item?id=45717855 ; the
           | consistency requires certain level of close-mindedness.
        
             | sothatsit wrote:
             | Yes, so I think in this case we use different definitions
             | of reasoning. You include reliability as a part of
             | reasoning, whereas I do not.
             | 
             | I would argue that humans are not 100% reliable in their
             | reasoning, and yet we still claim that they can reason. So,
             | even though I would agree that the reasoning of LLMs is
             | much less reliable, careful, and thoughtful than smart
             | humans, that does not mean that they are not reasoning.
             | Rather, it means that their reasoning is more unreliable
              | and less well-applied than people's. But they are still
             | performing reasoning tasks (even if their application of
             | reasoning can be flawed).
             | 
             | Maybe the problem is that I am holding out a minimum bar
             | for LLMs to jump to count as reasoning (demonstrated
             | application of logical algorithms to solve novel problems
             | in any domain), whereas other people are holding the bar
             | higher (consistent and logical application of rules in
             | all/most domains).
        
               | js8 wrote:
                | The problem is that if you're not able to apply the
                | reasoning rules consistently, then you will always fail
                | on a large enough problem. If you have an inconsistent
                | set of reasoning rules, then you can set up a problem as
                | a trap so that the reasoning fails.
                | 
                | You can argue that a damaged toaster is still a toaster,
                | conceptually. But if it doesn't work, then it's useless.
                | As it stands, models lack the ability to reason because
                | they can fail to reason and you can't do anything about
                | it. In the case of humans, it's valid to say they can
                | reason, because humans can at least fix themselves;
                | models can't.
        
               | sothatsit wrote:
               | The reasoning does not need to be 100% accurate to be
               | useful. Humans are rarely 100% accurate at anything, and
               | yet over time we can build up large models of problems
               | using verification and review. We can do the exact same
               | thing with LLMs.
               | 
               | The best example of this is Sean Heelan, who used o3 to
               | find a real security vulnerability in the Linux kernel:
               | https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-
               | cve-...
               | 
               | Sean Heelan ran o3 100 times, and it found a known
               | vulnerability in 8% of runs. For a security audit, that
               | is immensely useful, since an expert can spend the time
               | to look at the results from a dozen runs and quickly
               | decide if there is anything real. Even more remarkably
               | though, this same testing exposed a zero-day that they
               | were not even looking for. That is pretty incredible for
               | a system that makes mistakes.
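                | 
                | A quick back-of-the-envelope check (my own arithmetic,
                | assuming the 8% per-run hit rate is roughly independent
                | across runs): the chance that a dozen runs surface the
                | bug at least once is
                | 
                |     p_hit, runs = 0.08, 12
                |     print(1 - (1 - p_hit) ** runs)  # ~0.63
                | 
                | i.e. roughly a coin-flip-or-better per dozen-run batch,
                | for the cost of some triage time.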
               | 
               | This is why LLM reasoning absolutely does not need to be
               | perfect to be useful. Human reasoning is inherently
               | flawed as well, and yet through systems like peer review
               | and reproducing results, we can still make tremendous
               | progress over time. It is just about figuring out systems
               | of verification and review so that we don't need to trust
               | any LLM output blindly. That said, greater reliability
               | would be massively beneficial to how easy it is to get
               | good results from LLMs. But it's not required.
        
           | riku_iki wrote:
            | > then reasoning LLMs seem to pass that bar as they do very
            | well in programming and maths competitions.
            | 
            | It could be that this is just the result of good stochastic
            | parroting and not reasoning. Both of those niches are narrow
            | with a high amount of training data (e.g. corps buying
            | solutions from leetcode and training LLMs on them).
            | 
            | On the other hand, we see that LLMs fail in more complex
            | environments: e.g. ask one to build some new feature in the
            | postgres database.
        
             | sothatsit wrote:
              | This is clearly false. LLMs being able to multiply large
              | numbers is the clearest example to me that there is more
              | than just memorisation going on. They cannot have simply
              | memorised the answers for multiplying huge numbers like the
              | ones they handle.
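              | 
              | A rough count of why memorisation alone can't cover this
              | (my own numbers, not from any paper): even restricted to
              | 10-digit operands,
              | 
              |     n = 9 * 10**9             # count of 10-digit integers
              |     print(n * (n + 1) // 2)   # ~4.05e19 unordered pairs
              | 
              | which is vastly more multiplication problems than could
              | ever appear in a training set.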
             | 
             | That's not to mention that these programming competition
             | problems are designed to be novel. They are as novel as the
             | competition designers can get while sticking to the bounds
             | of the competition. This is clearly not stochastic parrot
             | behaviour.
             | 
             | Additionally, them falling over in large codebases is not
             | evidence that they cannot reason over smaller well-defined
             | problems. It is just evidence that their reasoning has
             | limits, which should not be surprising to anyone. Humans
             | also have limits in our reasoning. That does not mean we do
             | not reason.
        
               | riku_iki wrote:
                | I think you just made lots of handwaving statements. Here
                | is a result which says LLMs can't do multi-digit
                | multiplications well: https://arxiv.org/pdf/2510.00184
        
         | sirwhinesalot wrote:
         | > The truly amazing thing is that reasoning models show ANY
         | improvement at all compared to non-reasoning models, when
         | they're the same exact thing.
         | 
         | It's because they do more compute. The more tokens "spent" the
         | better the accuracy. Same reason they spit out a paragraph of
         | text instead of just giving a straight answer in non-reasoning
         | mode.
        
         | jpcompartir wrote:
         | I can't remember which paper it's from, but isn't the variance
         | in performance explained by # of tokens generated? i.e. more
         | tokens generated tends towards better performance.
         | 
         | Which isn't particularly amazing, as # of tokens generated is
         | basically a synonym in this case for computation.
         | 
         | We spend more computation, we tend towards better answers.
        
         | qsort wrote:
         | Don't they have a significant RL component? The "we'll just
         | make it bigger" idea that was peddled a lot after GPT3.5 was
         | nonsense, but that's not the only thing they're doing right
         | now.
        
           | ACCount37 wrote:
           | "We'll just make it bigger" works. RLVR just gives better
           | performance gains and spends less inference compute - as long
           | as you have a solid way of verifying the tasks.
           | 
           | A simplified way of thinking about it is: pretraining gives
           | LLMs useful features, SFT arranges them into useful
           | configurations, RLVR glues them together and makes them work
           | together well, especially in long reasoning traces. Makes
           | sense to combine it all in practice.
           | 
           | How much pretraining gives an LLM depends on the scale of
           | that LLM, among other things. But raw scale is bounded by the
           | hardware capabilities and the economics - of training and
           | especially of inference.
           | 
           | Scale is still quite desirable - GPT-4.5 scale models are
           | going to become the norm for high end LLMs quite soon.
        
             | qsort wrote:
             | I'm not against "we'll make it bigger" (although it's as of
             | yet unknown if it hits diminishing returns, 4.5 isn't
             | exactly remembered as a great release), I'm against "we'll
             | _just_ (i.e.  'only') make it bigger".
             | 
             | I'm doubtful you'd have useful LLMs today if labs hadn't
             | scaled in post-training.
        
         | antonvs wrote:
         | > The truly amazing thing is that reasoning models show ANY
         | improvement at all compared to non-reasoning models, when
         | they're the same exact thing.
         | 
         | Why is that amazing? It seems expected. Use a tool differently,
         | get different results.
        
       | equinox_nl wrote:
       | But I also fail catastrophically once a reasoning problem exceeds
       | modest complexity.
        
         | monkeydust wrote:
          | But you recognise you are likely to fail and thus don't respond
         | or redirect the problem to someone who has a greater likelihood
         | of not failing.
        
           | antonvs wrote:
           | I've had models "redirect the problem to someone who has a
           | greater likelihood of not failing". Gemini in particular will
           | do this when it runs into trouble.
           | 
           | I don't find all these claims that models are somehow worse
           | than humans in such areas convincing. Yes, they're worse in
           | some respects. But when you're talking about things related
           | to failures and accuracy, they're mostly superhuman.
           | 
              | For example, how many humans can write hundreds of lines of
              | code (in seconds, mind you) and regularly not have any syntax
           | errors or bugs?
        
             | ffsm8 wrote:
              | > For example, how many humans can write hundreds of lines
              | of code (in seconds, mind you) and regularly not have any
             | syntax errors or bugs?
             | 
             | Ez, just use codegen.
             | 
             | Also the second part (not having bugs) is unlikely to be
             | true for the LLM generated code, whereas traditional
             | codegen will actually generate code with pretty much no
             | bugs.
        
               | vidarh wrote:
               | I have Claude reducing the number of bugs in my
               | traditional codegen right now.
        
             | pessimizer wrote:
             | > I've had models "redirect the problem to someone who has
             | a greater likelihood of not failing". Gemini in particular
             | will do this when it runs into trouble.
             | 
             | I have too, and I sense that this is something that has
             | been engineered in rather than coming up naturally. I like
             | it very much and they should do it a lot more often.
             | They're allergic to "I can't figure this out" but hearing
             | "I can't figure this out" gives me the alert to help it
             | over the hump.
             | 
             | > But when you're talking about things related to failures
             | and accuracy, they're mostly superhuman.
             | 
             | Only if you consider speed to failure and inaccuracy.
             | They're very much subhuman in output, but you can make them
             | retry a lot in a short time, and refine what you're asking
             | them each time to avoid the mistakes they're repeatedly
             | making. But that's _you_ doing the work.
        
           | exe34 wrote:
           | If that were true, we would live in a utopia. People
           | vote/legislate/govern/live/raise/teach/preach without ever
           | learning to reason correctly.
        
         | davidhs wrote:
         | Do you? Don't you just halt and say this is too complex?
        
           | p_v_doom wrote:
            | Nope, audacity and Dunning-Kruger all the way, baby
        
           | dspillett wrote:
           | Some would consider that to be failing catastrophically. The
           | task is certainly failed.
        
             | carlmr wrote:
             | Halting is sometimes preferable to thrashing around and
             | running in circles.
             | 
              | I feel like if LLMs "knew" when they're out of their depth,
              | they could be much more useful. The question is whether
              | knowing when to stop can be meaningfully learned from
              | examples with RL. From all we've seen, the hallucination
              | problem and this stopping problem boil down to the same
              | issue: you could teach the model to say "I don't know",
              | but if that's part of the training dataset it might just
              | spit out "I don't know" to random questions, because it's
              | a likely response in the realm of possible responses,
              | instead of saying "I don't know" when it doesn't know.
              | 
              | SocratesAI is still unsolved, and LLMs are probably not the
              | path to get to knowing that you know nothing.
        
               | ukuina wrote:
               | > if LLMs "knew" when they're out of their depth, they
               | could be much more useful.
               | 
               | I used to think this, but no longer sure.
               | 
               | Large-scale tasks just grind to a halt with more modern
               | LLMs because of this perception of impassable complexity.
               | 
                | And it's not that they need extensive planning: the LLM
                | knows what needs to be done (it'll even tell you!), it's
                | just more work than will fit within a "session"
                | (arbitrary), and so it would rather refuse than get
                | started.
               | 
               | So you're now looking at TODOs, and hierarchical plans,
               | and all this unnecessary pre-work even when the task
               | scales horizontally very well (if it just jumped into
               | it).
        
             | benterix wrote:
             | This seems to be the stance of creators of agentic coders.
              | They are so bent on creating something, even if this
             | something makes no sense whatsoever.
        
             | LunaSea wrote:
              | I would consider that detecting your own limits when trying
              | to solve a problem is preferable to the illusion that your
              | solution is working and correct.
        
           | moritzwarhier wrote:
           | Ah yes, the function that halts if the input problem would
           | take too long to halt.
           | 
           | But yes, I assume you mean they abort their loop after a
           | while, which they do.
           | 
           | This whole idea of a "reasoning benchmark" doesn't sit well
           | with me. It seems still not well-defined to me.
           | 
           | Maybe it's just bias I have or my own lack of intelligence,
           | but it seems to me that using language models for "reasoning"
           | is still more or less a gimmick and convenience feature (to
           | automate re-prompts, clarifications etc, as far as possible).
           | 
            | But reading this pop-sci article from summer 2022, it seems
            | like this definition problem hasn't changed very much since
            | then.
           | 
           | Although it's about AI progress before ChatGPT and it doesn't
           | even mention the GPT base models. Sure, some of the tasks
           | mentioned in the article seem dated today.
           | 
           | But IMO, there is still no AI model that can be trusted to,
           | for example, accurately summarize a Wikipedia article.
           | 
           | Not all humans can do that either, sure. But humans are
            | better at knowing what they don't know, and deciding which
            | other humans can be trusted. And of course, none of this is
           | an arithmetic or calculation task.
           | 
           | https://www.science.org/content/article/computers-ace-iq-
           | tes...
        
         | AlecSchueler wrote:
         | I also fail catastrophically when trying to push nails through
          | walls, but I expect my hammer to do better.
        
           | moffkalast wrote:
           | I have one hammer and I expect it to work on every nail and
           | screw. If it's not a general hammer, what good is it now?
        
             | arethuza wrote:
             | You don't need a "general hammer" - they are old fashioned
             | - you need a "general-purpose tool-building factory factory
             | factory":
             | 
             | https://www.danstroot.com/posts/2018-10-03-hammer-factories
        
               | code_martial wrote:
               | Reminds me of a 10 letter Greek word that starts with a
               | k.
        
           | hshdhdhehd wrote:
           | Gold and shovels might be a more fitting analogy for AI
        
         | raddan wrote:
         | Yes, but you are not a computer. There is no point building
         | another human. We have plenty of them.
        
           | aoeusnth1 wrote:
            | Others would beg to disagree that we should build a machine
            | which can act as a human.
        
       | WesolyKubeczek wrote:
       | It's because they generate a seeming of reasoning, and don't
       | actually reason!
       | 
       |  _(Slams the door angrily)_
       | 
       |  _(stomps out angrily)_
       | 
       |  _(touches the grass angrily)_
        
         | samuell wrote:
         | Yea, a bit like a cheating student rote memorizing and copying
          | another student's technique for solving a type of problem, and
         | failing hard as soon as there's too much variation from the
         | original problem.
        
           | fsloth wrote:
           | Yes!
           | 
            | That said, the input space of supported problems is quite
            | large and you can configure the problem parameters quite
            | flexibly.
            | 
            | I guess the issue is that what the model _actually_ provides
            | you is this idiot savant who has pre-memorized everything
            | without offering a clear index that would disambiguate well-
            | supported problems from "too difficult" (i.e. novel) ones.
        
         | brap wrote:
         | What is to reason, if not to generate a seeming of reasoning?
         | 
         |  _(tips fedora)_
        
           | hshdhdhehd wrote:
           | You said the quiet part out loud of political debate.
           | 
           | (does something)
        
       | brap wrote:
       | I wonder if we can get models to reason in a structured and
       | verifiable way, like we have formal logic in math.
        
         | Frieren wrote:
         | For that, you already have classical programming. It is great
          | at formal logic and math.
        
           | brap wrote:
           | I think trying to accurately express natural language
           | statements as values and logical steps as operators is going
           | to be very difficult. You also need to take into account
           | ambiguity and subtext and things like that.
           | 
           | I actually believe it is technically possible, but is going
           | to be very hard.
        
             | nl wrote:
             | This is where you get the natural language tool to write
             | the formal logic.
             | 
             | ChatGPT knows WebPPL really well for example.
        
               | brap wrote:
               | You will need a formal language first.
               | 
               | Take this statement for example:
               | 
               | >ChatGPT knows WebPPL really well
               | 
               | What formal language can express this statement? What
               | will the text be parsed into? Which transformations can
               | you use to produce other truthful (and interesting)
               | statements from it? Is this flexible enough to capture
               | everything that can be expressed in English?
               | 
               | The closest that comes to mind is Prolog, but it doesn't
               | really come close.
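                | 
                | As a toy illustration of the gap (my own sketch, not a
                | proposal): a naive encoding keeps the surface form of the
                | sentence but gives you no sound rules to infer anything
                | from it.
                | 
                |     # naive "formalization" of: ChatGPT knows WebPPL really well
                |     fact = ("knows", "ChatGPT", "WebPPL", {"degree": "really well"})
                | 
                |     def entails(known_fact, query):
                |         # placeholder: no inference rules exist for a fuzzy
                |         # predicate like "knows ... really well", which is
                |         # exactly the problem
                |         return known_fact == query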
        
         | measurablefunc wrote:
          | It's doing so already. All code executed on a computer,
          | especially a neural network without any loops, is simply doing
          | boolean arithmetic. In fact, the computer can't do anything
         | else other than boolean arithmetic.
        
       | alyxya wrote:
        | The key point the paper seems to make is that existing benchmarks
        | have relatively low reasoning complexity, so they made a new
        | dataset, DeepRD, with arbitrarily large reasoning complexity and
        | demonstrated that existing models fail at complex enough
        | problems. Complexity is defined by modeling the problem as a
        | graph and measuring the traversal needed to go from some source
        | node to a target node.
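        | 
        | To make that concrete, here is a minimal sketch of that kind of
        | construction (my own toy version, not the paper's actual DeepRD
        | generator): the prompt lists shuffled implication facts, and
        | complexity is just the number of traversal hops from source to
        | target.
        | 
        |     import random
        | 
        |     def make_problem(n_facts=50, seed=0):
        |         rng = random.Random(seed)
        |         nodes = [f"x{i}" for i in range(n_facts)]
        |         rng.shuffle(nodes)
        |         facts = [f"{a} implies {b}" for a, b in zip(nodes, nodes[1:])]
        |         rng.shuffle(facts)  # chain is not presented in order
        |         question = f"Does {nodes[0]} imply {nodes[-1]}?"
        |         return facts, question, len(nodes) - 1  # hops = complexity
        | 
        |     facts, question, hops = make_problem(n_facts=200)
        |     # scale n_facts up to probe where a given model breaks down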
       | 
       | My main critique is that I don't think there's evidence that this
       | issue would persist after continuing to scale models to be larger
       | and doing more RL. With a harness like what coding agents do
       | these days and with sufficient tool use, I bet models could go
       | much further on that reasoning benchmark. Otherwise, if the
       | reasoning problem were entirely done within a single context
       | window, it's expected that a complex enough reasoning problem
       | would be too difficult for the model to solve.
        
         | jeremyjh wrote:
         | The burden of evidence here is on you. They don't need to prove
         | LRMs can't scale to meet these problems; their only claim is
         | current models can't handle these problems. Others will take
         | this up as a challenge - and chances may be good they will
         | overcome it. This is how science works.
        
           | alyxya wrote:
           | They can't claim current models aren't able to handle these
           | problems if they didn't use a setup similar to coding agents
           | like Claude Code and OpenAI Codex. Using a suboptimal setup
           | is akin to verbally telling a person the whole reasoning
           | problem without letting them write down notes and expecting
           | them to memorize and solve it after only hearing it once.
        
             | jeremyjh wrote:
             | If the models can't do it they can make that claim. If you
             | want to make claims about agents then design that
             | experiment, collect the data and write a paper. That is how
             | science works.
        
             | rdedev wrote:
             | The thing they are testing for is reasoning performance. It
             | makes sense to not give tool access.
             | 
              | This is the same as the critiques of the LLM paper by
              | Apple, where they showed that LLMs fail to solve the Tower
              | of Hanoi problem beyond a certain number of disks. The test
              | was to see how well these models can reason out a long
              | task. People online were saying they could solve that
              | problem if they had access to a coding environment. Again,
              | the test was to check reasoning capability, not whether it
              | knew how to code an algorithm to solve the problem.
              | 
              | If model performance degrades a lot after a number of
              | reasoning steps, it's good to know where the limits are.
              | Whether the model had access to tools or not is orthogonal
              | to this problem.
        
         | tomlockwood wrote:
         | So the answer is a few more trillion?
        
           | code_martial wrote:
           | It's a worthwhile answer if it can be proven correct because
           | it means that we've found a way to create intelligence, even
           | if that way is not very efficient. It's still one step better
           | than not knowing how to do so.
        
             | tomlockwood wrote:
             | So we're sending a trillion on faith?
        
               | code_martial wrote:
               | No, that's not what I said.
        
               | tomlockwood wrote:
               | Why are we sending the trillion?
        
               | measurablefunc wrote:
               | It must be deposited into OpenAI's bank account so that
               | they can then deposit it into NVIDIA's account who can
               | then in turn make a deal w/ OpenAI to deposit it back
               | into OpenAI's account for some stock options. I think you
               | can see how it works from here but if not then maybe one
               | of the scaled up "reasoning" AIs will figure it out for
               | you.
        
             | usrbinbash wrote:
             | > if it can be proven correct
             | 
             | Then the first step would be to prove that this works
             | _WITHOUT_ needing to burn through the trillions to do so.
        
         | usrbinbash wrote:
         | > I don't think there's evidence that this issue would persist
         | after continuing to scale models to be larger and doing more RL
         | 
         | And how much larger do we need to make the models? 2x? 3x? 10x?
         | 100x? How large do they need to get before scaling-up _somehow_
         | solves everything?
         | 
         | Because: 2x larger, means 2x more memory and compute required.
         | Double the cost or half the capacity. Would people still pay
         | for this tech if it doubles in price? Bear in mind, much of it
         | is already running at a loss even now.
         | 
         | And what if 2x isn't good enough? Would anyone pay for a 10x
         | larger model? Can we even realistically run such models as
         | anything other than a very expensive PoC and for a very short
          | time? And who's to say that even 10x will finally solve things?
         | What if we need 40x? Or 100x?
         | 
         | Oh, and of course: Larger models also require more data to
         | train them on. And while the Internet is huge, it's still
         | finite. And when things grow geometrically, even
         | `sizeof(internet)` eventually runs out ... and, in fact, may
         | have done so already [1] [2]
         | 
         | What if we actually discover that scaling up doesn't even work
         | at all, because of diminishing returns? Oh wait, looks like we
         | did that already: [3]
         | 
         | [1]: https://observer.com/2024/12/openai-cofounder-ilya-
         | sutskever...
         | 
         | [2]: https://biztechweekly.com/ai-training-data-crisis-how-
         | synthe...
         | 
         | [3]: https://garymarcus.substack.com/p/confirmed-llms-have-
         | indeed...
        
           | alyxya wrote:
           | Scaling applies to multiple dimensions simultaneously over
           | time. A frontier model today could be replicated a year later
           | with a model half the size, with a quarter of the FLOPS, etc.
           | I don't know the real numbers for optimization scaling, but
           | you could check out NanoGPT speedrun [1] as an example.
           | 
           | The best solution in the meantime is giving the LLM a harness
           | that allows tool use like what coding agents have. I suspect
           | current models are fully capable of solving arbitrary
           | complexity artificial reasoning problems here, provided that
           | they're used in the context of a coding agent tool.
           | 
           | [1] https://github.com/KellerJordan/modded-nanogpt
        
             | galaxyLogic wrote:
              | Some problems are just too complex and the effort to solve
              | them increases exponentially. No LLM can keep up with
              | exponentially increasing effort unless you run it for an
              | adequate number of years.
        
             | Infinity315 wrote:
             | What? Fundamentally, information can only be so dense.
             | Current models may be inefficient w.r.t. information
             | density, however, there is a lower bound of compute
             | required. As a pathological example, we shouldn't expect a
             | megabyte worth of parameters to be able to encode the
             | entirety of Wikipedia.
        
         | BriggyDwiggs42 wrote:
         | The issue is that no matter how much you train them they don't
          | generalize to arbitrary-sized problems. Sure, you can push out
          | the horizon, but you won't make something that can always
          | solve the problem (assuming resources permit, and that isn't
          | the issue here).
        
         | galaxyLogic wrote:
         | > complexity of a graph created by modeling the problem as a
         | graph and determining the traversals needed to go from some
         | source node to a target node
         | 
          | Sounds interesting: formalizing a problem once you know the
          | solution. Seems like LLMs can't do that, or if they could,
          | they would be able to evaluate where their own problem solving
          | is inadequate?
        
       | js8 wrote:
       | I think the explanation is pretty simple, as I said in my earlier
       | comment: https://news.ycombinator.com/item?id=44904107
       | 
       | I also believe the problem is we don't know what we want:
       | https://news.ycombinator.com/item?id=45509015
       | 
        | If we could make LLMs apply a modest set of logic rules
       | consistently, it would be a win.
        
         | Sharlin wrote:
         | That's a pretty big "if". LLMs are by design entirely unlike
         | GoFAI reasoning engines. It's also very debatable whether it
         | makes any sense to try and hack LLMs into reasoning engines
         | when you could just... use a reasoning engine. Or have the LLM
         | to defer to one, which would play to their strength as
         | translators.
        
       | flimflamm wrote:
        | What confused me is the fact that in the paper all logical steps
        | are given. It basically checks: when all relevant facts are
        | provided explicitly as links, how far and how complex a chain
        | can the model correctly follow before it breaks down?
        | 
        | So it's simpler than "reasoning". This is not necessarily a bad
        | thing, as it boils down the reasoning to a simpler, more
        | controlled sub-problem.
        
       | devlogstream wrote:
       | LLMs are like students, they can reason a bit, but real
       | understanding still takes time and practice.
        
         | hansmayer wrote:
         | What? The LLMs are _nothing_ like students (or any other human
         | for that matter).
        
       | anal_reactor wrote:
        | I have yet to see a task that AI fails at that the bottom 10% of
        | the population wouldn't also fail at.
        
         | TheOtherHobbes wrote:
         | How about keeping a conversation going with family over
         | Thanksgiving? (Or local equivalent.)
        
           | randomNumber7 wrote:
           | This is something where the top 10% sometimes horribly fail.
        
         | Earw0rm wrote:
         | If by task you mean the written, intellectual variety, maybe.
        
         | layer8 wrote:
         | If I have the choice of performing an intellectual task myself,
         | or have it performed by someone from the bottom 10% of the
         | population, I'd probably rather perform it myself.
        
           | Der_Einzige wrote:
           | What happens when both choices lead to you doing it yourself?
        
         | acdha wrote:
         | The problem is consistency: AI tools usually produce output
         | which _sounds_ like the top 10% but you have to read it
         | carefully to find the bottom 10% parts. We're not used to that
         | because human performance isn't that inconsistent and we use
         | history and social factors: someone's performance goes down
         | when they're really drunk, but they rarely show up to work in
         | that state and it's obvious enough that other people recognize
         | that they shouldn't be trusted.
        
           | anal_reactor wrote:
           | > We're not used to that because human performance isn't that
           | inconsistent
           | 
           | It is. It's very common for socially apt people to bullshit
           | through things they don't know, or outright want to hide.
        
             | acdha wrote:
             | That's not inconsistent: your bluffer knows they're making
             | something up and is using their model of you to construct
             | something they think you'll believe. Someone who can do
             | that isn't going to suddenly forget how to count the number
             | of letters in a word.
        
               | anal_reactor wrote:
               | You're wrong. Counting the number of letters in a word is
               | a significantly more difficult task than lying, both for
               | humans and LLMs. Imagine going to a ghetto and asking
               | people "have you ever lied to someone and had them
               | believe the lie", and ask them to spell "continuously".
               | Children learn to lie before they learn to spell.
        
               | acdha wrote:
               | > Counting the number of letters in a word is a
               | significantly more difficult task than lying
               | 
               | No, it's not - you don't even need to be literate to
               | count symbols - but also consider the complexity of the
               | second task and how many skills each requires: unlike
               | counting letters, lying isn't simple confabulation and
               | requires a theory of mind and some kind of goal. A child
               | who lies to avoid trouble is doing that because they have
               | enough of a world model to know they are going to get in
               | trouble for something even if they haven't worked out yet
               | that this is unlikely to work.
        
               | anal_reactor wrote:
               | Sure, let's stick to counting symbols. When I need to
               | count something, there's a decent chance I'll get lost if
               | I count beyond 10, and beyond 20 I'll get lost for sure.
               | Even below 10, when I count it's one-two-three-four-five-
               | six-seven-eight-nine items. But when I lie I do it
               | instantaneously, without altering the pace of the
               | conversation. I can come up with a believable lie within
               | the brief period between someone saying something to me,
               | and the moment I'm expected to respond. No way I'd be
               | able to count 10 items that fast.
               | 
                | The Piraha language doesn't even have numerals - that's an
                | extreme case, but there are quite a few languages where
                | people stop counting beyond a certain small number and just
               | say "a lot". Same people though don't have issues lying
               | to one another. Let that sink in for a while - fully
               | grown-ass adults, fully capable of functioning in their
               | society, not capable of counting one-two-three because
               | the concept is beyond them.
               | 
               | What I'm trying to say is that all of those "requires
               | theory of mind" statements are probably true but
               | completely irrelevant because humans (and LLMs) have
               | "hardware acceleration" of whatever it takes to lie,
                | meanwhile counting is an abstract idea that requires
                | using the brain in a way it didn't evolve to be used.
               | Similarly, LLMs cannot count if they aren't connected to
               | a math engine - not because they're stupid, but because
               | counting is really difficult.
        
         | krackers wrote:
         | ARC-AGI v3 is a pretty good benchmark, and it's notably
          | different from the other ARC-AGI versions in that it has a "truer" human
         | baseline (you can go play it right now and add your datapoint),
         | and captures the act of in-context learning better as you start
         | an unfamiliar game then master it over time.
         | 
         | Also bottom 10% feels like a bad comparison, median human would
         | be better. And unlike "specialized" things like programming,
         | game playing is something almost all of us have done.
        
       | My_Name wrote:
       | I find that they know what they know fairly well, but if you move
       | beyond that, into what can be reasoned from what they know, they
       | have a profound lack of ability to do that. They are good at
       | repeating their training data, not thinking about it.
       | 
       | The problem, I find, is that they then don't stop, or say they
       | don't know (unless explicitly prompted to do so) they just make
       | stuff up and express it with just as much confidence.
        
         | ftalbot wrote:
         | Every token in a response has an element of randomness to it.
          | This means they're non-deterministic. Even if you ask about
          | something within their training data, there is some chance that
          | you could get a nonsense, opposite, and/or dangerous result.
          | The chance of that may be low because of things being set up
          | for it to review its result, but there is no way to make a non-
          | deterministic answer reliably solve or reason about anything
          | assuredly, given enough iterations. It is designed to be
          | imperfect.
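          | 
          | For anyone who hasn't seen it spelled out, the per-token
          | randomness looks roughly like this (a minimal sketch of
          | temperature sampling with made-up numbers, not any particular
          | model's decoder):
          | 
          |     import math, random
          | 
          |     # toy next-token distribution (made-up logits)
          |     logits = {"yes": 2.1, "no": 1.9, "maybe": 0.3}
          | 
          |     def sample(logits, temperature=1.0, rng=random):
          |         # softmax with temperature, then draw one token
          |         weights = {t: math.exp(v / temperature) for t, v in logits.items()}
          |         r = rng.random() * sum(weights.values())
          |         for token, w in weights.items():
          |             r -= w
          |             if r <= 0:
          |                 return token
          |         return token
          | 
          |     print([sample(logits, temperature=0.8) for _ in range(10)])
          |     # repeated runs differ; only temperature ~ 0 (greedy) is repeatable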
        
           | yuvalr1 wrote:
            | You are making a wrong leap from non-deterministic process to
            | uncontrollable result. Most parallel algorithms are
            | non-deterministic. There might be no guarantee about the
            | order of calculation, or sometimes even the exact final
            | result. However, even when producing different final results,
            | the algorithm can still guarantee properties of the
            | result.
           | 
           | The hard problem then is not to eliminate non-deterministic
           | behavior, but find a way to control it so that it produces
           | what you want.
        
             | flavaflav2 wrote:
             | Life and a lot in our universe is non-deterministic. Some
             | people assume science and mathematics are some universal
             | truths rather than imperfect agreed upon understandings.
             | Similarly many assume humans can be controlled through
             | laws, penalties, prisons, propaganda, coercion, etc. But
             | terrible things happen. Yes, if you set up the gutter-rails
             | in your bowling lane, you can control the bowling ball
             | unless it is thrown over those rails or in a completely
             | different direction, but those rails are wide with LLMs by
             | default, and the system instructions provided it aren't
             | rules, they are an inherently faulty way to coerce a non-
             | deterministic system. But, yes, if there's absolutely no
             | way to do something, and you're aware of every possible way
             | a response or tool could affect things, and you have taken
             | every possible precaution, you can make it behave. That's
             | not how people are using it though, and we cannot control
             | our tendency to trust that which seems trustworthy even if
             | we are told these things.
        
               | squidbeak wrote:
               | No, Science is a means of searching for those truths -
               | definitely not some 'agreed upon understanding'. It's
               | backed up by experimentation and reproducible proofs. You
               | also make a huge bogus leap from science to humanities.
        
               | iq176 wrote:
               | Scientific method is the process. Science itself includes
               | the study and compendium of understandings, based on a
               | belief system that includes shared understandings just
               | like mathematics. The foundation of these are
               | philosophical beliefs that we can know and understand
               | these things. For example, on a metaphysical level, if
               | the world around us were a simulation, then science could
               | provide understandings about that simulated universe, but
               | not about that which is simulating it.
        
               | squidbeak wrote:
               | This I'm afraid is rubbish. Scientific proofs
               | categorically don't depend on philosophical beliefs.
               | Reality is measurable and the properties measured don't
               | care about philosophy.
        
               | weltensturm wrote:
               | > Reality is measurable
               | 
               | Heisenberg would disagree.
        
               | squidbeak wrote:
               | Are you arguing that the uncertainty principle derives
               | from philosophy rather than math?
        
               | darkwater wrote:
               | But those are still approximations to the actual
               | underlying reality. Because the other option (and yes,
                | it's a dichotomy) is that we have already defined and
               | understood every detail of the physics that applies to
               | our universe.
        
               | squidbeak wrote:
               | Indeed, that is a dichotomy: a false one. Science is
               | exact without being finished.
        
               | darkwater wrote:
               | So, was Newtonian physics exact already?
        
               | squidbeak wrote:
               | > Science is exact without being finished
        
               | darkwater wrote:
               | Being exact doesn't mean it is not an approximation,
               | which was the initial topic. Being exact in science means
               | that 2+2=4 and that can be demonstrated following a
               | logical chain. But that doesn't make our knowledge of the
               | universe exact. It is still an approximation. What can be
               | "exact" is how we obtain and reproduce the current
               | knowledge we have of it.
        
               | squidbeak wrote:
               | The speed of light, or Planck's constant - are these
               | approximations?
        
           | mannykannot wrote:
           | There seems to be more to it than that - in my experience
           | with LLMs, they are good at finding some relevant facts but
           | then quite often present a non-sequitur for a conclusion, and
           | the article's title alone indicates that the problem for LRMs
           | is similar: a sudden fall-off in performance as the task gets
           | more difficult. If the issue was just non-determinism, I
           | would expect the errors to be more evenly distributed, though
           | I suppose one could argue that the sensitivity to non-
           | determinism increases non-linearly.
        
           | squidproquo wrote:
           | The non-determinism is part of the allure of these systems --
           | they operate like slot machines in a casino. The dopamine hit
           | of getting an output that appears intelligent and the
           | variable rewards keeps us coming back. We down-weight and
           | ignore the bad outputs. I'm not saying these systems aren't
           | useful to a degree, but one should understand the statistical
           | implications for how we collectively perceive their
           | usefulness.
        
           | galaxyLogic wrote:
           | > Every token in a response has an element of randomness to
           | it.
           | 
           | I haven't tried this, but so if you ask the LLM the exact
           | same question again, but in a different process, will you get
           | a different answer?
           | 
           | Wouldn't that mean we should most of the time ask the LLM
           | each question multiple times, to see if we get a better
           | answer next time?
           | 
           | A bit like asking the same question of multiple different
           | LLMs just to be sure.
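           | 
           | Something like the sketch below is what I have in mind:
           | sample the same prompt several times (temperature > 0) and
           | keep the most common answer. Assuming the OpenAI Python
           | client; the model name is just a placeholder.
           | 
           |   from collections import Counter
           |   from openai import OpenAI
           | 
           |   client = OpenAI()  # needs OPENAI_API_KEY set
           | 
           |   def ask_many(prompt, n=5):
           |       # temperature > 0, so each sample can differ
           |       outs = []
           |       for _ in range(n):
           |           r = client.chat.completions.create(
           |               model="gpt-4o-mini",  # placeholder
           |               messages=[{"role": "user",
           |                          "content": prompt}],
           |               temperature=1.0,
           |           )
           |           text = r.choices[0].message.content
           |           outs.append(text.strip())
           |       # crude self-consistency: most common answer wins
           |       return Counter(outs).most_common(1)[0][0]
           | 
           |   print(ask_many("Is 2^31 - 1 prime? Answer yes or no."))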
        
         | PxldLtd wrote:
         | I think a good test of this is to provide an image and get the
         | model to predict what will happen next/if x occurs.
         | They fail spectacularly at Rube-Goldberg machines. I think
         | developing some sort of dedicated prediction model would help
         | massively in extrapolating data. The human subconscious is
         | filled with all sorts of parabolic prediction, gravity,
         | momentum and various other fast-thinking paths that embed these
         | calculations.
        
           | yanis_t wrote:
           | Any example of that? One would think that predicting what
           | comes next from an image is basically video generation, which
           | works not perfectly, but works somehow (Veo/Sora/Grok)
        
             | PxldLtd wrote:
             | Here's one I made in Veo3.1 since gemini is the only
             | premium AI I have access to.
             | 
             | Using this image - https://www.whimsicalwidgets.com/wp-
             | content/uploads/2023/07/... and the prompt: "Generate a
             | video demonstrating what will happen when a ball rolls down
             | the top left ramp in this scene."
             | 
             | You'll see it struggles - https://streamable.com/5doxh2 ,
             | which is often the case with video gen. You have to
             | describe carefully and orchestrate natural feeling motion
             | and interactions.
             | 
             | You're welcome to try with any other models but I suspect
             | very similar results.
        
               | chamomeal wrote:
               | I love how it still copies the slow pan and zoom from
               | rube goldberg machine videos, but it's just following
               | along with utter nonsense lol
        
               | galaxyLogic wrote:
               | A Rube Goldberg machine was not part of their training
               | data. We humans, however, have seen such things.
        
               | autoexec wrote:
               | Physics textbooks are, though, so it should know how
               | they'd work, or at least know that balls don't
               | spontaneously appear and disappear and that gears don't
               | work when they aren't connected.
        
             | mannykannot wrote:
             | It is video generation, but succeeding at this task
             | involves detailed reasoning about cause and effect to
             | construct chains of events, and may not be something that
             | can be readily completed by applying "intuitions" gained
             | from "watching" lots of typical movies, where most of the
             | events are stereotypical.
        
           | pfortuny wrote:
           | Most amazing is asking any of the models to draw an 11-sided
           | polygon and number the edges.
        
             | Torkel wrote:
             | I asked gpt5, and it worked really well with a correct
             | result. Did you expect it to fail?
        
         | pistoriusp wrote:
         | I saw a meme that I think about fairly often: Great apes have
         | learnt sign language, and communicated with humans, since the
         | 1960's. In all that time they've never asked humans questions.
         | They've never tried to learn anything new! The theory is that
         | they don't know that there are entities that know things they
         | don't.
         | 
         | I like to think that AI are the great apes of the digital
         | world.
        
           | 20k wrote:
           | It's worth noting that the idea that great apes have learnt
           | sign language is largely a fabrication by a single person,
           | and nobody has ever been able to replicate it. All the
           | communication has to be interpreted through that individual,
           | and everyone else (including people who speak sign language)
           | has confirmed that they're just making random hand motions
           | in exchange for food.
           | 
           | They don't have the dexterity to really sign properly
        
             | krapht wrote:
             | Citation needed.
        
               | joncrocks wrote:
               | https://en.wikipedia.org/wiki/Great_ape_language#Criticis
               | m_a... - Not word for word, but certainly casting doubt
               | that apes were ever really communicating in the way that
               | people may have thought.
        
               | mkl wrote:
               | That article does completely refute 20k's claim that it
               | was all done by one person though.
        
               | MangoToupe wrote:
               | The way linguists define communication via language?
               | Sure. Let's not drag the rest of humanity into this
               | presumption.
        
               | conception wrote:
               | Searching for koko ape fraud seems to produce a lot.
        
               | ralfd wrote:
               | > In his lecture, Sapolsky alleges that Patterson
               | spontaneously corrects Koko's signs: "She would ask,
               | 'Koko, what do you call this thing?' and [Koko] would
               | come up with a completely wrong sign, and Patterson would
               | say, 'Oh, stop kidding around!' And then Patterson would
               | show her the next one, and Koko would get it wrong, and
               | Patterson would say, 'Oh, you funny gorilla.' "
               | 
               | Weirder still was this lawsuit against Patterson:
               | 
               | > The lawsuit alleged that in response to signing from
               | Koko, Patterson pressured Keller and Alperin (two of the
               | female staff) to flash the ape. "Oh, yes, Koko, Nancy has
               | nipples. Nancy can show you her nipples," Patterson
               | reportedly said on one occasion. And on another: "Koko,
               | you see my nipples all the time. You are probably bored
               | with my nipples. You need to see new nipples. I will turn
               | my back so Kendra can show you her nipples."[47] Shortly
               | thereafter, a third woman filed suit, alleging that upon
               | being first introduced to Koko, Patterson told her that
               | Koko was communicating that she wanted to see the woman's
               | nipples
               | 
               | There was a bonobo named Kanzi who learned hundreds of
               | lexigrams. The main criticism here seems to be that while
               | Kanzi truly did know the symbol for "Strawberry" he "used
               | the symbol for "strawberry" as the name for the object,
               | as a request to go where the strawberries are, as a
               | request to eat some strawberries". So no object-verb
               | sentences and so no grammar which means no true language
               | according to linguists.
               | 
               | https://linguisticdiscovery.com/posts/kanzi/
        
               | galaxyLogic wrote:
               | > So no object-verb sentences and so no grammar which
               | means no true language
               | 
               | Great distinction. The stuff about showing nipples sounds
               | creepy.
        
               | pegasus wrote:
               | You only need a citation for the idea that apes _aren't_
               | able to speak sign language?
        
               | acdha wrote:
               | They claimed fraud by a single person, with zero
               | replication. Both claims are testable, so they should be
               | able to support them.
               | 
               | At the very least, more than one researcher was involved
               | and more than one ape was alleged to have learned ASL.
               | There is a better discussion to be had about what our
               | threshold is for speech, along with our threshold for
               | saying that research is fraud vs. mistaken, but we don't
               | fix sloppiness by engaging in more of it.
        
               | galaxyLogic wrote:
               | So why wasn't the research continued further if the
               | results were good? My assumption is it was because of the
               | Fear of the Planet of the Apes!
        
             | rightbyte wrote:
             | I mean dogs can learn a simple sign language?
        
               | leptons wrote:
               | Can the dogs sign back? Even dogs that learn to press
               | buttons are mostly just pressing them to get treats. They
               | don't ask questions, and it's not really a conversation.
        
           | BOOSTERHIDROGEN wrote:
           | Does that mean intelligence is the soul? Then we will never
           | achieve AGI.
        
           | MangoToupe wrote:
           | > The theory is that they don't know that there are entities
           | that know things they don't.
           | 
           | This seems like a rather awkward way of putting it. They may
           | just lack conceptualization or abstraction, making the above
           | statement meaningless.
        
             | sodality2 wrote:
             | The exact title of the capacity is 'theory of mind' - for
             | example, chimpanzees have a limited capacity for it in that
             | they can understand others' intentions, but they seemingly
             | do not understand false beliefs (this is what GP
             | mentioned).
             | 
             | https://doi.org/10.1016/j.tics.2008.02.010
        
               | MangoToupe wrote:
               | Theory of mind is a distinct concept that isn't necessary
               | to explain this behavior. Of course, it may follow
               | naturally, but it strikes me as ham-fisted projection of
               | our own cognition onto others. Ironically, a rather
               | greedy theory of mind!
        
               | galaxyLogic wrote:
               | If apes started communicating among themselves with sign
               | language they learned from humans, that would mean they
               | would get more practice using it and they could evolve it
               | over aeons. Hey, isn't that what actually happened?
        
         | pimeys wrote:
         | I just got this from codex yesterday:
         | 
         | "I wasn't able to finish; no changes were shipped."
         | 
         | And it's not the first time.
        
           | conception wrote:
           | This is my most impressive response from Codex "* I started
           | wiring up module-scoped mocks as requested, but while digging
           | in I realised we're sitting on a moving target: the repo
           | already has a large pile of pending, unrelated changes (whole
           | modules reshuffled, new ps1 files, etc.), and the
           | Enhanced.WorkflowOrchestration.Tests.ps1 under review isn't
           | even tracked in the baseline branch. Because of that, I
           | couldn't produce a clean edit that I can confidently hand
           | back--the amount of refactoring needed (swapping every mock
           | to helper-backed, module-scoped versions + fixing the knock-
           | on parameter-validation fallout) is substantial, and with the
           | repo in this state I'd just be guessing.
           | Rather than check in something half-broken, I'm pausing
           | here. Let me know how you want to proceed--if you can land
           | the upstream refactor (or share a stable snapshot of the
           | tests/module), I can pick this up again and finish the
           | review fixes in one go."
        
           | darkwater wrote:
           | Have you threatened it with a 2 in the next round of
           | performance reviews?
        
         | amelius wrote:
         | The problem is that the training data doesn't contain a lot of
         | "I don't know".
        
           | pegasus wrote:
           | The bigger problem is that the benchmarks / multiple-choice
           | tests they are trained to optimize for don't distinguish
           | between a wrong answer and "I don't know". Which is stupid
           | and surprising. There was a thread here on HN about this
           | recently.
        
           | astrange wrote:
           | That's not important compared to the post-training RL, which
           | isn't "training data".
        
         | usrbinbash wrote:
         | > They are good at repeating their training data, not thinking
         | about it.
         | 
         | Which shouldn't come as a surprise, considering that this is,
         | at the core of things, what language models do: Generate
         | sequences that are statistically likely according to their
         | training data.
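         | 
         | As a toy illustration of that framing - a bigram counter
         | rather than a transformer, so purely a sketch of the
         | "sample whatever the training data makes likely" loop, not
         | of a real LLM:
         | 
         |   import random
         |   from collections import Counter, defaultdict
         | 
         |   corpus = "the cat sat on the mat the cat ate".split()
         | 
         |   # "training": count which token follows which
         |   counts = defaultdict(Counter)
         |   for prev, nxt in zip(corpus, corpus[1:]):
         |       counts[prev][nxt] += 1
         | 
         |   def generate(token, n=8):
         |       out = [token]
         |       for _ in range(n):
         |           followers = counts[out[-1]]
         |           if not followers:
         |               break
         |           # sample the next token in proportion to how
         |           # often it followed this one in the corpus
         |           toks, freqs = zip(*followers.items())
         |           out.append(random.choices(toks, freqs)[0])
         |       return " ".join(out)
         | 
         |   print(generate("the"))  # differs from run to run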
        
           | dymk wrote:
           | This is too large of an oversimplification of how an LLM
           | works. I hope the meme that they are just next token
           | predictors dies out soon, before it becomes a permanent
           | fixture of incorrect but often stated "common sense". They're
           | not Markov chains.
        
             | adastra22 wrote:
             | They are next token predictors though. That is literally
             | what they are. Nobody is saying they are simple Markov
             | chains.
        
               | dymk wrote:
               | It's a uselessly reductive statement. A person at a
               | keyboard is also a next token predictor, then.
        
               | HarHarVeryFunny wrote:
               | Yes, but it's not ALL they are.
        
               | daveguy wrote:
               | They are designed, trained, and evaluated by how well
               | they can predict the next token. It's literally what they
               | do. "Reasoning" models just build up additional context
               | of next-token predictions, and RL is used to bias output
               | options toward ones more appealing to human judges.
               | It's not a meme. It's an accurate description of their
               | fundamental computational nature.
        
             | gpderetta wrote:
             | Indeed, they are next token predictors, but this is a
             | vacuous statement because the predictor can be arbitrarily
             | complex.
        
               | HarHarVeryFunny wrote:
               | Sure, but a complex predictor is still a predictor. It
               | would be a BAD predictor if everything it output was not
               | based on "what would the training data say?".
               | 
               | If you ask it to innovate and come up with something not
               | in its training data, what do you think it will do ....
               | it'll "look at" its training data and regurgitate
               | (predict) something labelled as innovative
               | 
               | You can put a reasoning cap on a predictor, but it's
               | still a predictor.
        
         | Workaccount2 wrote:
         | To be fair, we don't actually know what is and isn't in their
         | training data. So instead we just assign successes to "in the
         | training set" and failures to "not in the training set".
         | 
         | But this is unlikely to be accurate, because they can still
         | fall over pretty badly on things that are definitely in the
         | training set, and can still succeed on things that definitely
         | are not in the training set.
        
       | nakamoto_damacy wrote:
       | LLMs falter because likelihood-driven pattern completion doesn't
       | enforce coherence across uncertainty (probability),
       | representation (geometry), composition (category), and search
       | (reasoning). To get robust reasoning, we need these layers to be
       | explicit, typed, and mutually constraining--with verification and
       | calibrated belief updates in the loop.
       | 
       | I was interviewed about this recently, and mentioned the great
       | work of a professor of CS and Law who has been building the
       | foundations for this approach. My own article about it was
       | recently un-linked due to a Notion mishap (but available if
       | anyone is interested - I have to publish it again)
       | 
       | https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...
        
         | CuriouslyC wrote:
         | Richard Sutton's interview on Dwarkesh's podcast hit at this
         | same point. The implicit world models in LLMs are insufficient.
        
           | jampekka wrote:
           | Sutton still hasn't learned his own Bitter Lesson? ;)
        
             | creativeSlumber wrote:
             | what do you mean?
        
               | nakamoto_damacy wrote:
               | Not sure why he capitalized bitter...
        
               | jampekka wrote:
               | It was a joke referring to his essay.
               | 
               | https://en.wikipedia.org/wiki/Bitter_lesson
        
       | hirako2000 wrote:
       | Has anyone ever found an ML/AI paper that makes the claim that
       | RLMs can reason?
       | 
       | When I prompt an RLM, I can see it spit out reasoning steps.
       | But I don't find that to be evidence that RLMs are capable of
       | reasoning.
        
         | Sharlin wrote:
         | Semantics schemantics.
        
           | hirako2000 wrote:
           | It's a statistical imitation of a reasoning pattern; the
           | underlying mechanism is pattern matching. The ability to
           | create a model that can determine that two radically
           | different words have strong similarity in meaning doesn't
           | imply the emergence of some generalizable, logical model
           | that can suddenly Reason to solve novel problems.
           | 
           | Pattern matching is a component of reason. Not === reason.
        
         | _heimdall wrote:
         | That would require the ability to understand what happens
         | inside the system during inference when the output is created
         | and they can't do that today.
         | 
         | There's no evidence to be had when we only know the inputs and
         | outputs of a black box.
        
         | tempfile wrote:
         | I don't understand what point you are making. Doesn't the name
         | "Reasoning language models" claim that they can reason? Why do
         | you want to see it explicitly written down in a paper?
        
           | hirako2000 wrote:
           | This very paper sits on the assumption that reasoning (to
           | solve puzzles) is at play. It calls those LLMs RLMs.
           | 
           | Imo the paper itself should have touched on the lack of
           | papers discussing what's in the black box that makes them
           | Reasoning LMs. It does mention some tree algorithm
           | supposedly key to reasoning capabilities.
           | 
           | By no means attacking the paper, as its intent is to
           | demonstrate the lack of success at solving even simple-to-
           | formulate, complex puzzles.
           | 
           | I was not making a point; I was genuinely asking in case
           | someone knows of papers I could read that make claims, with
           | evidence, that these RLMs actually reason, and how.
        
           | tekno45 wrote:
           | By renaming this binary to a "Mind reading language model" we
           | can now read your mind and predict your choices just by
           | chatting.
           | 
           | Don't ask how it works cuz its called a "Mind reading
           | language model" duh.
        
       | egberts1 wrote:
       | It's simple. Don't ingest more than 40KB at a time into the
       | LLM's RAG pipe and its hallucination rate goes way, way down.
       | 
       | Preferably not at the start, and best not to do more than 40KB
       | at a time at all.
       | 
       | That's how I learned to deal with nftables' 120KB
       | parser_bison.y file: by breaking it up into clean sections.
       | 
       | All of a sudden, a fully-deterministic LL(1) full semantic
       | pathway of nftables' CLI syntax appeared before my very eyes
       | (and I spent hours validating it): 100%, and test generators
       | can now permutate crazy test cases with relative ease.
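       | 
       | The pre-chunking step is roughly this (a minimal sketch; the
       | blank-line split is a stand-in - for a bison grammar you would
       | really cut on rule/section boundaries):
       | 
       |   MAX_BYTES = 40 * 1024  # rough per-ingest budget
       | 
       |   def chunk_file(path, max_bytes=MAX_BYTES):
       |       # split on blank lines, then greedily pack sections
       |       # into chunks that stay under the size budget
       |       text = open(path, encoding="utf-8").read()
       |       sections = text.split("\n\n")
       |       chunks, cur = [], ""
       |       for sec in sections:
       |           if cur and len(cur) + len(sec) > max_bytes:
       |               chunks.append(cur)
       |               cur = ""
       |           cur += sec + "\n\n"
       |       if cur:
       |           chunks.append(cur)
       |       return chunks
       | 
       |   for i, c in enumerate(chunk_file("parser_bison.y")):
       |       print(i, len(c), "chars")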
       | 
       | Cue in Joe Walsh's "Life's Been Good To Me".
        
         | bob_theslob646 wrote:
         | Why 40kb?
        
           | igravious wrote:
           | and doesn't it depend on the LLM?
        
             | egberts1 wrote:
             | If you have your Pro or private LLM, then it's a tad bit
             | bigger.
        
           | egberts1 wrote:
           | For the cheap public offerings of their expensive data
           | centers, that sweet spot and cutoff is about 40KB.
        
       | lingrush4 wrote:
       | Is that really the best title the authors could come up with?
       | 
       | Up next: "Lawn mowers are good at cutting grass until they
       | aren't"
        
         | andy99 wrote:
         | I think that would be a good title if we'd previously thought
         | lawn mowers had solved generalized grass cutting and assumed
         | that because one worked on my lawn it could cut hayfields or
         | harvest bamboo (a grass, I believe) effectively.
        
         | tekno45 wrote:
         | When the news cycle has been "lawnmowers can now do anything,
         | throw away your KitchenAid", it's a pretty relevant title.
        
       | moritzwarhier wrote:
       | From the abstract:
       | 
       | > some even claiming they are capable of generalized reasoning
       | and innovation in reasoning-intensive fields such as mathematics,
       | physics, medicine, and law. However, by more carefully scaling
       | the complexity of reasoning problems, we show existing benchmarks
       | actually have limited complexity
       | 
       | Can someone ELI5 what the definitions of reasoning and complexity
       | are here?
       | 
       | I see they seem to focus on graph problems and representing
       | problems as graph problems. But I didn't completely read the
       | paper or understand it in depth. I skimmed some parts that seem
       | to address this question (e.g. section 5 and the Introduction),
       | but maybe there are simpler definitions that elude me.
       | 
       | Surely they don't mean "computational complexity"?
       | 
       | And what exactly is "reasoning"?
       | 
       | I'm aware of philosophical logic and strict logic that can be
       | applied to natural language arguments.
       | 
       | But have we already agreed on a universal scale that grades
       | answers to questions about the physical world? Or is this about
       | mathematical reasoning?
       | 
       | Mixing all of this together always irks me when it comes to these
       | AI "benchmarks". But apparently people see value in these?
       | 
       | I know my question isn't new.
       | 
       | To me it seems that when we leave the mathematical realms, it
       | quickly becomes fuzzy what correct "reasoning" should be.
       | 
       | People can be convincing and avoid obvious logical fallacies, and
       | still make wrong conclusions... or conclusions that run counter
       | to assumed goals.
        
         | dcre wrote:
         | Even in the mathematical/formal realm, the meaning of reasoning
         | is not as clear as it seems. The _result_ of the activity of
         | reasoning may be a formal argument that can be evaluated
         | according to well-defined rules, but the actual process your
         | mind went through to get there is just as opaque as (or more
         | opaque than) whatever is going on inside LLMs. It seems likely,
         | as you suggest, that we are going to have to define reasoning
         | in terms of the ability to solve certain classes of problems
         | while leaving the character of the process unspecified.
        
       | kordlessagain wrote:
       | What specific reasoning capabilities matter for what real-world
       | applications?
       | 
       | Nobody knows.
       | 
       | Moreover, nobody talks about that because it's boring and non-
       | polarizing. Instead, supposedly smart people post stupid comments
       | that prevent anyone from understanding this paper is worthless.
       | 
       | The paper is worthless because it has a click-bait title. Blog
       | posts get voted down for that, why not this?
       | 
       | The implicit claim is worthless. Failure to navigate a synthetic
       | graph == failure to solve real world problems. False.
       | 
       | Absolutely no connection to real world examples. Just losing the
       | model in endless graphs.
        
         | wavemode wrote:
         | > The implicit claim is worthless. Failure to navigate a
         | synthetic graph == failure to solve real world problems. False.
         | 
         | This statement is the dictionary definition of attacking a
         | strawman.
         | 
         | Every new model that is sold to us, is sold on the basis that
         | it performs better than the old model on synthetic benchmarks.
         | This paper presents a different benchmark that those same LLMs
         | perform much worse on.
         | 
         | You can certainly criticize the methodology if the authors have
         | erred in some way, but I'm not sure why it's hard to understand
         | the relevance of the topic itself. If benchmarks are so
         | worthless then go tell that to the LLM companies.
        
       | riskable wrote:
       | My hypothesis: This is why AI is fantastic as a coding assistant
       | but not so great at other things. A software developer--after
       | watching an AI model fail over and over again, trying to say, fix
       | a difficult bug--will stop and approach the issue from a
       | different angle. They'll take a closer look at what's going on,
       | fiddle things around by hand, and that's usually enough to get
       | over that hump of complexity (that the AI model couldn't work its
       | way through).
       | 
       | We (developers) do this because it's what we've _always_ done
       | with our own code. Everyone's encountered a bug that they just
       | couldn't figure out. So they search the Internet, try different
       | implementations of the same thing, etc but nothing works.
       | Usually, we finally solve such problems when we take a step back
       | and look at it with a different lens.
       | 
       | For example, just the other day--after spending far too long
       | trying to get something working--I realized, "Fuck it! The users
       | don't really need this feature." :thumbsup:
        
         | acuozzo wrote:
         | > AI is fantastic as a coding assistant
         | 
         | The extent to which this is true is a rough measure of how
         | derivative your work is, no?
        
       | dankai wrote:
       | This is not the only paper that scales reasoning complexity /
       | difficulty.
       | 
       | The CogniLoad benchmark does this as well (in addition to scaling
       | reasoning length and distractor ratio). Requiring the LLM to
       | purely reason based on what is in the context (i.e. not based on
       | the information it is pretrained on), it finds that reasoning
       | performance decreases significantly as problems get harder (i.e.
       | require the LLM to hold more information in its hidden state
       | simultaneously), but the bigger challenge for them is length.
       | 
       | https://arxiv.org/abs/2509.18458
       | 
       | Disclaimer: I'm the primary author of CogniLoad so feel free to
       | ask me any questions.
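       | 
       | To give a feel for the general recipe - this is only a toy
       | sketch, not CogniLoad's actual task format - facts exist only
       | in the context, the chain length is adjustable, and distractor
       | facts are mixed in:
       | 
       |   import random, string
       | 
       |   def make_problem(chain_len=5, distractors=10):
       |       names = random.sample(string.ascii_uppercase, 26)
       |       chain = names[:chain_len + 1]
       |       facts = [f"{a} is heavier than {b}."
       |                for a, b in zip(chain, chain[1:])]
       |       rest = names[chain_len + 1:]
       |       for _ in range(distractors):
       |           a, b = random.sample(rest, 2)
       |           facts.append(f"{a} is older than {b}.")
       |       random.shuffle(facts)
       |       q = f"Who is heavier, {chain[-1]} or {chain[0]}?"
       |       return " ".join(facts) + " " + q, chain[0]
       | 
       |   prompt, answer = make_problem()
       |   print(prompt)
       |   print("expected:", answer)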
        
       | kerabatsos wrote:
       | How is that different than human reasoning?
        
         | ares623 wrote:
         | I'd like $500B to just be the way I am thanks.
        
       | j45 wrote:
       | Compared to software that can explicitly reason, reasoning models
       | don't seem to reason at all.
       | 
       | They simulate reasoning through matching patterns.
        
       ___________________________________________________________________
       (page generated 2025-10-31 23:01 UTC)