[HN Gopher] Reasoning models reason well, until they don't
___________________________________________________________________
Reasoning models reason well, until they don't
Author : optimalsolver
Score : 200 points
Date : 2025-10-31 09:23 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| iLoveOncall wrote:
| > [...] recent studies show that transformers and LLMs fail
| catastrophically once reasoning problems exceed modest
| complexity. We revisit these findings through the lens of large
| reasoning models (LRMs) -- LLMs fine-tuned with incentives for
| step-by-step argumentation and self-verification
|
| This was the obvious outcome of the study (don't get me wrong,
| obvious outcomes are still worth having research on).
|
| "LRMs" *are* just LLMs. There's no such thing as a reasoning
| model, it's just having an LLM write a better prompt than the
| human would and then sending it to the LLM again.
|
| Despite what Amodei and Altman want Wall Street to believe, they
| did not suddenly unlock reasoning capabilities in LLMs by
| essentially just running two different prompts in sequence to
| answer the user's question.
|
| The truly amazing thing is that reasoning models show ANY
| improvement at all compared to non-reasoning models, when they're
| the same exact thing.
| sothatsit wrote:
| What do you mean by reasoning?
|
| If you mean solving logic problems, then reasoning LLMs seem to
| pass that bar, as they do very well in programming and maths
| competitions. Reasoning LLMs can also complete problems like
| multiplying large numbers, which requires applying some sort of
| algorithm where the results cannot just be memorised. They also
| do this much better than standard pre-trained LLMs with no RL.
|
| So, that makes me come back to this question of what definition
| of reasoning do people use that reasoning models do not meet?
| They're not perfect, obviously, but that is not a requirement
| of reasoning if you agree that humans can reason. We make
| mistakes as well, and we also suffer under higher complexity.
| Perhaps they are less reliable in knowing when they have made
| mistakes or not than trained humans, but I wouldn't personally
| include reliability in my definition for reasoning (just look
| at how often humans make mistakes in tests).
|
| I have yet to see any serious, reasoned arguments for why the
| amazing achievements of reasoning LLMs in maths and
| programming competitions, on novel problems, do not count as
| "real reasoning". It seems much more that people just don't
| like the idea of LLMs reasoning, and so reject the idea without
| giving an actual reason themselves, which seems somewhat ironic
| to me.
| fsloth wrote:
| I guess we mean here "useful reasoning" instead of the
| idiot-savant kind. I mean it's a fair ask since these are
| marketed as _tools_ you can use to implement _industrial
| processes_ and even replace your human workers.
|
| In that sense I guess the model does not need to be the most
| reasonable interpreter of vague and poorly formulated user
| inputs, but I think it needs to improve at least a bit to
| become a useful general appliance and not just a
| test-scoring automaton.
|
| The key differentiator here is that tests generally _are made
| to be unambiguously scoreable_. Real world problems are often
| more vague from the point of view of optimal outcome.
| sothatsit wrote:
| Thanks. So, people are extending "reasoning" to include
| making good decisions, rather than just solving logic
| problems. It makes sense to me that, if people use that
| definition, LLMs are pretty bad at "reasoning".
|
| Although, I would argue that this is not reasoning at all,
| but rather "common sense" or the ability to have a broader
| perspective or think of the future. These are tasks that
| come with experience. That is why these do not seem like
| reasoning tasks to me, but rather soft skills that LLMs
| lack. In my mind these are pretty separate concerns to
| whether LLMs can logically step through problems or apply
| algorithms, which is what I would call reasoning.
| hansmayer wrote:
| Ah yes, let me then unchain my LLM on those nasty unsolved
| math and logic problems I've absolutely not been struggling
| with in the course of my career.
| sothatsit wrote:
| A lot of maths students would also struggle to contribute
| to frontier math problems, but we would still say they
| are reasoning. Their skill at reasoning might not be as
| good as professional mathematicians, but that does not
| stop us from recognising that they can solve logic
| problems without memorisation, which is a form of
| reasoning.
|
| I am just saying that LLMs have demonstrated they can
| reason, at least a little bit. Whereas it seems other
| people are saying that LLM reasoning is flawed, which
| does not negate the fact that they can reason, at least
| some of the time.
|
| Maybe generalisation is one area where LLMs' reasoning is
| weakest though. They can show near-elite performance on
| nicely boxed-up competition math problems, but their
| performance dramatically drops on real-world problems
| where things aren't so neat. We see similar problems in
| programming as well. I'd argue the progress on this has
| been promising, but other people would probably
| vehemently disagree with that. Time will tell.
| vidarh wrote:
| Thank you for picking at this.
|
| A lot of people appear to be - often not consciously or
| intentionally - setting the bar for "reasoning" at a
| level many or most people would not meet.
|
| Sometimes that is just a reaction to wanting an LLM that
| produces results that are good enough for their own level.
| Sometimes it reveals a view of fellow humans that would
| be quite elitist if stated outright. Sometimes it's a
| kneejerk attempt at setting the bar at a point that would
| justify a claim that LLMs aren't reasoning.
|
| Whatever the reason, it's a massive pet peeve of mine
| that it is rarely made explicit in these conversations,
| and it makes a lot of these conversations pointless
| because people keep talking past each other.
|
| For my part a lot of these models often clearly reason by
| my standard, _even if poorly_. People also often reason
| poorly, even when they demonstrably attempt to reason
| step by step. Either because they have motivations to
| skip over uncomfortable steps, or because they don't
| know how to do it right. But we still would rarely claim
| they are not capable of reasoning.
|
| I wish more evaluations of LLMs would establish a human
| baseline to test them against, for this very reason. It
| would be illuminating in terms of actually telling us
| more about how LLMs match up to humans in different
| areas.
| cryptonym wrote:
| Computers have forever been doing stuff people can't do.
|
| The real question is how useful this tool is and if this
| is as transformative as investors expect. Understanding
| its limits is crucial.
| cryptonym wrote:
| That's the real deal.
|
| They say LLMs are PhD-level. Despite billions of dollars,
| PhD-LLMs sure are not contributing a lot to solving known
| problems. Except of course for a few limited marketing
| stunts.
| fsloth wrote:
| IMHO that's the key differentiator.
|
| You can give a human PhD an _unsolved problem_ in a field
| adjacent to their expertise and expect some reasonable
| resolution. LLM PhDs solve only known problems.
|
| That said humans can also be really bad problem solvers.
|
| If you don't care about solving the problem and only want
| to create paperwork for bureaucracy, I guess you don't
| care either way ("My team's on it!"), but companies that
| don't go out of business generally recognize a lack of
| outcomes where it matters pretty soon.
| nl wrote:
| > LLM PhDs solve only known problems.
|
| Terry Tao would disagree:
| https://mathstodon.xyz/@tao/114508029896631083
|
| https://deepmind.google/discover/blog/alphaevolve-a-
| gemini-p...
| hansmayer wrote:
| I wish our press were not effectively muted or bought off
| by the money; none of the journos has the cojones to call
| out the specific people who were blabbing about PhD levels,
| AGI etc. They should be goddamn calling them out every
| single day, essentially doing their job, but they are now
| too timid for that.
| vidarh wrote:
| I've "unchained" my LLM on a lot of problems that I
| probably _could_ solve, but that would take me time I don
| 't have, and that it has solved in many case faster than
| I could. It may not be good enough to solve problems that
| are _beyond_ us for most of us, but it certainly can
| solve a lot of problems for a lot of us that have gone
| unsolved for lack of resources.
| cryptonym wrote:
| It can solve problems you already know how to solve, if
| you micro-manage it, and it'll BS a lot along the way.
|
| If this is the maximum an AGI-PhD-LRM can do, that'll be
| disappointing compared to the investments. Curious to see
| what all this will become in a few years.
| vidarh wrote:
| I'm not usually micro-managing it, that's the point.
|
| I _sometimes_ do on problems where I have particular
| insight, but I mostly find it is _far more effective_ to
| give it test cases and give it instructions on how to
| approach a task, and then _let it iterate_ with little to
| no oversight.
|
| I'm letting Claude Code run for longer and longer with
| --dangerously-skip-permissions, to the point I'm
| pondering rigging up something to just keep feeding it
| "continue" and run it in parallel on multiple problems.
|
| Because at least when you have a good way of measuring
| success, it works.
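|
| For what it's worth, the rig I have in mind is barely more
| than a loop. A rough sketch in Python; the claude flags and
| the "make test" success check are assumptions about my own
| setup, not a recommendation:
|
|   import subprocess
|   from concurrent.futures import ThreadPoolExecutor
|
|   TASKS = ["worktrees/problem-a", "worktrees/problem-b"]
|
|   def run(path, rounds=20):
|       for _ in range(rounds):
|           # assumed flags: -p = one non-interactive turn,
|           # --dangerously-skip-permissions = no approval prompts
|           subprocess.run(["claude", "-p", "continue",
|                           "--dangerously-skip-permissions"], cwd=path)
|           # stand-in success metric; swap in your real test harness
|           if subprocess.run(["make", "test"], cwd=path).returncode == 0:
|               return path, True
|       return path, False
|
|   with ThreadPoolExecutor() as pool:
|       for path, ok in pool.map(run, TASKS):
|           print(path, "solved" if ok else "needs a human")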
| hansmayer wrote:
| ^^This is a great view and it seems generally widely
| understood by the rank-and-file techies. I feel pity for
| the general-public retail investors who are about to be
| left holding the bag for the VCs, after a certain major
| <ahem> champion goes IPO soon.
| js8 wrote:
| > So, that makes me come back to this question of what
| definition of reasoning do people use that reasoning models
| do not meet?
|
| The models can learn reasoning rules, but they are not able
| to apply them consistently or recognize the rules they have
| learned are inconsistent. (See also my other comment which
| references comments I made earlier.)
|
| And I think they can't without a tradeoff, as I commented
| https://news.ycombinator.com/item?id=45717855 ; the
| consistency requires certain level of close-mindedness.
| sothatsit wrote:
| Yes, so I think in this case we use different definitions
| of reasoning. You include reliability as a part of
| reasoning, whereas I do not.
|
| I would argue that humans are not 100% reliable in their
| reasoning, and yet we still claim that they can reason. So,
| even though I would agree that the reasoning of LLMs is
| much less reliable, careful, and thoughtful than smart
| humans, that does not mean that they are not reasoning.
| Rather, it means that their reasoning is more unreliable
| and less well-applied than people's. But they are still
| performing reasoning tasks (even if their application of
| reasoning can be flawed).
|
| Maybe the problem is that I am setting a minimum bar
| for LLMs to clear to count as reasoning (demonstrated
| application of logical algorithms to solve novel problems
| in any domain), whereas other people are holding the bar
| higher (consistent and logical application of rules in
| all/most domains).
| js8 wrote:
| The problem is that if you're not able to apply the
| reasoning rules consistently, then you will always fail on
| a large enough problem. If you have an inconsistent set of
| reasoning rules, then you can set up a problem as a trap
| so that the reasoning fails.
|
| You can argue that a damaged toaster is still a toaster,
| conceptually. But if it doesn't work, then it's useless.
| As it stands, models lack the ability to reason because
| they can fail to reason and you can't do anything about
| it. In the case of humans, it's valid to say they can
| reason, because humans can at least fix themselves; models
| can't.
| sothatsit wrote:
| The reasoning does not need to be 100% accurate to be
| useful. Humans are rarely 100% accurate at anything, and
| yet over time we can build up large models of problems
| using verification and review. We can do the exact same
| thing with LLMs.
|
| The best example of this is Sean Heelan, who used o3 to
| find a real security vulnerability in the Linux kernel:
| https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-
| cve-...
|
| Sean Heelan ran o3 100 times, and it found a known
| vulnerability in 8% of runs. For a security audit, that
| is immensely useful, since an expert can spend the time
| to look at the results from a dozen runs and quickly
| decide if there is anything real. Even more remarkably
| though, this same testing exposed a zero-day that they
| were not even looking for. That is pretty incredible for
| a system that makes mistakes.
|
| This is why LLM reasoning absolutely does not need to be
| perfect to be useful. Human reasoning is inherently
| flawed as well, and yet through systems like peer review
| and reproducing results, we can still make tremendous
| progress over time. It is just about figuring out systems
| of verification and review so that we don't need to trust
| any LLM output blindly. That said, greater reliability
| would be massively beneficial to how easy it is to get
| good results from LLMs. But it's not required.
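|
| To put rough numbers on that (a back-of-the-envelope,
| assuming the runs are independent): with an 8% per-run hit
| rate, even a dozen runs give decent odds of at least one
| catch.
|
|   # P(at least one hit in n runs) with per-run hit rate p
|   p = 0.08
|   for n in (1, 12, 50, 100):
|       print(n, round(1 - (1 - p) ** n, 4))
|   # 1: 0.08, 12: 0.6323, 50: 0.9845, 100: 0.9998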
| riku_iki wrote:
| > then reasoning LLMs seem to pass that bar as they do very
| well in programming and maths competitions.
|
| it could be that this is just the result of good stochastic
| parroting and not reasoning. Both of those niches are
| narrow, with a high amount of training data (e.g. corps
| buying solutions from LeetCode and training LLMs on them).
|
| On the other hand, we see that LLMs fail in more complex
| environments: e.g. ask one to build some new feature in the
| Postgres database.
| sothatsit wrote:
| This is clearly false. LLMs being able to multiply large
| numbers is the clear example to me that there is more than
| just memorisation going on. They cannot have simply
| memorised the answers to multiplying huge numbers.
|
| That's not to mention that these programming competition
| problems are designed to be novel. They are as novel as the
| competition designers can get while sticking to the bounds
| of the competition. This is clearly not stochastic parrot
| behaviour.
|
| Additionally, them falling over in large codebases is not
| evidence that they cannot reason over smaller well-defined
| problems. It is just evidence that their reasoning has
| limits, which should not be surprising to anyone. Humans
| also have limits in our reasoning. That does not mean we do
| not reason.
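|
| A quick counting argument for why memorisation alone can't
| cover multiplication (my own back-of-the-envelope, not from
| the paper): the space of n-digit problems outgrows any
| training set very quickly.
|
|   # number of ordered pairs of n-digit numbers
|   for n in (5, 10, 15):
|       pairs = (9 * 10 ** (n - 1)) ** 2
|       print(f"{n}-digit x {n}-digit: ~{pairs:.1e} problems")
|   # 5: ~8.1e+09, 10: ~8.1e+19, 15: ~8.1e+29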
| riku_iki wrote:
| I think you just made lots of handwaving statements. Here
| is a result which says LLMs can't do multi-digit
| multiplication well: https://arxiv.org/pdf/2510.00184
| sirwhinesalot wrote:
| > The truly amazing thing is that reasoning models show ANY
| improvement at all compared to non-reasoning models, when
| they're the same exact thing.
|
| It's because they do more compute. The more tokens "spent" the
| better the accuracy. Same reason they spit out a paragraph of
| text instead of just giving a straight answer in non-reasoning
| mode.
| jpcompartir wrote:
| I can't remember which paper it's from, but isn't the variance
| in performance explained by # of tokens generated? i.e. more
| tokens generated tends towards better performance.
|
| Which isn't particularly amazing, as # of tokens generated is
| basically a synonym in this case for computation.
|
| We spend more computation, we tend towards better answers.
| qsort wrote:
| Don't they have a significant RL component? The "we'll just
| make it bigger" idea that was peddled a lot after GPT3.5 was
| nonsense, but that's not the only thing they're doing right
| now.
| ACCount37 wrote:
| "We'll just make it bigger" works. RLVR just gives better
| performance gains and spends less inference compute - as long
| as you have a solid way of verifying the tasks.
|
| A simplified way of thinking about it is: pretraining gives
| LLMs useful features, SFT arranges them into useful
| configurations, RLVR glues them together and makes them work
| together well, especially in long reasoning traces. Makes
| sense to combine it all in practice.
|
| How much pretraining gives an LLM depends on the scale of
| that LLM, among other things. But raw scale is bounded by the
| hardware capabilities and the economics - of training and
| especially of inference.
|
| Scale is still quite desirable - GPT-4.5 scale models are
| going to become the norm for high end LLMs quite soon.
| qsort wrote:
| I'm not against "we'll make it bigger" (although it's as of
| yet unknown if it hits diminishing returns, 4.5 isn't
| exactly remembered as a great release), I'm against "we'll
| _just_ (i.e. 'only') make it bigger".
|
| I'm doubtful you'd have useful LLMs today if labs hadn't
| scaled in post-training.
| antonvs wrote:
| > The truly amazing thing is that reasoning models show ANY
| improvement at all compared to non-reasoning models, when
| they're the same exact thing.
|
| Why is that amazing? It seems expected. Use a tool differently,
| get different results.
| equinox_nl wrote:
| But I also fail catastrophically once a reasoning problem exceeds
| modest complexity.
| monkeydust wrote:
| But you recognise you are likely to fail and thus don't
| respond, or you redirect the problem to someone who has a
| greater likelihood of not failing.
| antonvs wrote:
| I've had models "redirect the problem to someone who has a
| greater likelihood of not failing". Gemini in particular will
| do this when it runs into trouble.
|
| I don't find all these claims that models are somehow worse
| than humans in such areas convincing. Yes, they're worse in
| some respects. But when you're talking about things related
| to failures and accuracy, they're mostly superhuman.
|
| For example, how many humans can write hundreds of lines of
| code (in seconds, mind you) and regularly not have any syntax
| errors or bugs?
| ffsm8 wrote:
| > For example, how many humans can write hundreds of lines
| of code (in seconds, mind you) and regularly not have any
| syntax errors or bugs?
|
| Ez, just use codegen.
|
| Also the second part (not having bugs) is unlikely to be
| true for the LLM generated code, whereas traditional
| codegen will actually generate code with pretty much no
| bugs.
| vidarh wrote:
| I have Claude reducing the number of bugs in my
| traditional codegen right now.
| pessimizer wrote:
| > I've had models "redirect the problem to someone who has
| a greater likelihood of not failing". Gemini in particular
| will do this when it runs into trouble.
|
| I have too, and I sense that this is something that has
| been engineered in rather than coming up naturally. I like
| it very much and they should do it a lot more often.
| They're allergic to "I can't figure this out" but hearing
| "I can't figure this out" gives me the alert to help it
| over the hump.
|
| > But when you're talking about things related to failures
| and accuracy, they're mostly superhuman.
|
| Only if you consider speed to failure and inaccuracy.
| They're very much subhuman in output, but you can make them
| retry a lot in a short time, and refine what you're asking
| them each time to avoid the mistakes they're repeatedly
| making. But that's _you_ doing the work.
| exe34 wrote:
| If that were true, we would live in a utopia. People
| vote/legislate/govern/live/raise/teach/preach without ever
| learning to reason correctly.
| davidhs wrote:
| Do you? Don't you just halt and say this is too complex?
| p_v_doom wrote:
| Nope, audacity and Dunning-Kruger all the way, baby
| dspillett wrote:
| Some would consider that to be failing catastrophically. The
| task is certainly failed.
| carlmr wrote:
| Halting is sometimes preferable to thrashing around and
| running in circles.
|
| I feel like if LLMs "knew" when they're out of their depth,
| they could be much more useful. The question is whether
| knowing when to stop can be meaningfully learned from
| examples with RL. From all we've seen, the hallucination
| problem and this stopping problem boil down to the same
| issue: you could teach the model to say "I don't know",
| but if that's part of the training dataset it might just
| spit out "I don't know" for random questions, because it's
| a likely response in the realm of possible responses,
| rather than saying "I don't know" because it doesn't know.
|
| SocratesAI is still unsolved, and LLMs are probably not the
| path to get knowing that you know nothing.
| ukuina wrote:
| > if LLMs "knew" when they're out of their depth, they
| could be much more useful.
|
| I used to think this, but no longer sure.
|
| Large-scale tasks just grind to a halt with more modern
| LLMs because of this perception of impassable complexity.
|
| And it's not that they need extensive planning, the LLM
| knows what needs to be done (it'll even tell you!), it's
| just more work than will fit within a "session"
| (arbitrary) and so it would rather refuse than get
| started.
|
| So you're now looking at TODOs, and hierarchical plans,
| and all this unnecessary pre-work even when the task
| scales horizontally very well (if it just jumped into
| it).
| benterix wrote:
| This seems to be the stance of the creators of agentic
| coders. They are so bent on creating something, even if
| this something makes no sense whatsoever.
| LunaSea wrote:
| I would consider that detecting your own limits when trying
| to solve a problem is preferable to having the illusion of
| thinking that your solution is working and correct.
| moritzwarhier wrote:
| Ah yes, the function that halts if the input problem would
| take too long to halt.
|
| But yes, I assume you mean they abort their loop after a
| while, which they do.
|
| This whole idea of a "reasoning benchmark" doesn't sit well
| with me. It seems still not well-defined to me.
|
| Maybe it's just bias I have or my own lack of intelligence,
| but it seems to me that using language models for "reasoning"
| is still more or less a gimmick and convenience feature (to
| automate re-prompts, clarifications etc, as far as possible).
|
| But reading this pop-sci article from summer 2022, it seems
| like this definition problem hasn't changed very much since
| then.
|
| Although it's about AI progress before ChatGPT and it doesn't
| even mention the GPT base models. Sure, some of the tasks
| mentioned in the article seem dated today.
|
| But IMO, there is still no AI model that can be trusted to,
| for example, accurately summarize a Wikipedia article.
|
| Not all humans can do that either, sure. But humans are
| better at knowing what they don't know, and deciding which
| other humans can be trusted. And of course, none of this is
| an arithmetic or calculation task.
|
| https://www.science.org/content/article/computers-ace-iq-
| tes...
| AlecSchueler wrote:
| I also fail catastrophically when trying to push nails through
| walls, but I expect my hammer to do better.
| moffkalast wrote:
| I have one hammer and I expect it to work on every nail and
| screw. If it's not a general hammer, what good is it now?
| arethuza wrote:
| You don't need a "general hammer" - they are old fashioned
| - you need a "general-purpose tool-building factory factory
| factory":
|
| https://www.danstroot.com/posts/2018-10-03-hammer-factories
| code_martial wrote:
| Reminds me of a 10 letter Greek word that starts with a
| k.
| hshdhdhehd wrote:
| Gold and shovels might be a more fitting analogy for AI
| raddan wrote:
| Yes, but you are not a computer. There is no point building
| another human. We have plenty of them.
| aoeusnth1 wrote:
| Others would beg to disagree that we should build a
| machine which can act as a human.
| WesolyKubeczek wrote:
| It's because they generate a seeming of reasoning, and don't
| actually reason!
|
| _(Slams the door angrily)_
|
| _(stomps out angrily)_
|
| _(touches the grass angrily)_
| samuell wrote:
| Yea, a bit like a cheating student rote-memorizing and copying
| another student's technique for solving a type of problem, and
| failing hard as soon as there's too much variation from the
| original problem.
| fsloth wrote:
| Yes!
|
| That said, the input space of supported problems is quite
| large and you can configure the problem parameters quite
| flexibly.
|
| I guess the issue is that what the model _actually_ provides
| you is this idiot savant who has pre-memorized everything
| without offering a clear index that would disambiguate well-
| supported problems from "too difficult" (i.e. novel) ones.
| brap wrote:
| What is to reason, if not to generate a seeming of reasoning?
|
| _(tips fedora)_
| hshdhdhehd wrote:
| You said the quiet part out loud of political debate.
|
| (does something)
| brap wrote:
| I wonder if we can get models to reason in a structured and
| verifiable way, like we have formal logic in math.
| Frieren wrote:
| For that, you already have classical programming. It is great
| at formal logic and math.
| brap wrote:
| I think trying to accurately express natural language
| statements as values and logical steps as operators is going
| to be very difficult. You also need to take into account
| ambiguity and subtext and things like that.
|
| I actually believe it is technically possible, but is going
| to be very hard.
| nl wrote:
| This is where you get the natural language tool to write
| the formal logic.
|
| ChatGPT knows WebPPL really well for example.
| brap wrote:
| You will need a formal language first.
|
| Take this statement for example:
|
| >ChatGPT knows WebPPL really well
|
| What formal language can express this statement? What
| will the text be parsed into? Which transformations can
| you use to produce other truthful (and interesting)
| statements from it? Is this flexible enough to capture
| everything that can be expressed in English?
|
| The closest that comes to mind is Prolog, but it doesn't
| really come close.
| measurablefunc wrote:
| It's doing so already. All code executed on a computer,
| especially neural networks w/o any loops, is simply doing
| boolean arithmetic. In fact, the computer can't do anything
| other than boolean arithmetic.
| alyxya wrote:
| The key point the paper seems to make is that existing benchmarks
| have relatively low reasoning complexity, so they made a new
| dataset, DeepRD, with arbitrarily large reasoning complexity and
| demonstrated that existing models fail once a problem is complex
| enough. Complexity is defined by modeling the problem as a graph
| and measuring the traversals needed to go from some source node
| to a target node.
|
| My main critique is that I don't think there's evidence that this
| issue would persist after continuing to scale models to be larger
| and doing more RL. With a harness like what coding agents do
| these days and with sufficient tool use, I bet models could go
| much further on that reasoning benchmark. Otherwise, if the
| reasoning problem were entirely done within a single context
| window, it's expected that a complex enough reasoning problem
| would be too difficult for the model to solve.
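|
| To make "complexity as graph traversal" concrete, here is a toy
| generator in that spirit (my own sketch, not the paper's DeepRD
| code): the facts are edges, and the question needs k hops to
| answer, so k is the knob you turn up until models break.
|
|   import random
|
|   def make_problem(k, n_nodes=50, seed=0):
|       rng = random.Random(seed)
|       nodes = [f"N{i}" for i in range(n_nodes)]
|       path = rng.sample(nodes, k + 1)  # chain the model must follow
|       facts = [f"{a} points to {b}." for a, b in zip(path, path[1:])]
|       # distractor edges from off-path nodes keep the walk unique
|       off_path = [n for n in nodes if n not in path]
|       for a in rng.sample(off_path, min(3 * k, len(off_path))):
|           facts.append(f"{a} points to {rng.choice(nodes)}.")
|       rng.shuffle(facts)
|       question = f"Starting at {path[0]}, where are you after {k} hops?"
|       return "\n".join(facts), question, path[-1]
|
|   facts, question, answer = make_problem(k=8)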
| jeremyjh wrote:
| The burden of evidence here is on you. They don't need to prove
| LRMs can't scale to meet these problems; their only claim is
| current models can't handle these problems. Others will take
| this up as a challenge - and chances may be good they will
| overcome it. This is how science works.
| alyxya wrote:
| They can't claim current models aren't able to handle these
| problems if they didn't use a setup similar to coding agents
| like Claude Code and OpenAI Codex. Using a suboptimal setup
| is akin to verbally telling a person the whole reasoning
| problem without letting them write down notes and expecting
| them to memorize and solve it after only hearing it once.
| jeremyjh wrote:
| If the models can't do it they can make that claim. If you
| want to make claims about agents then design that
| experiment, collect the data and write a paper. That is how
| science works.
| rdedev wrote:
| The thing they are testing for is reasoning performance. It
| makes sense to not give tool access.
|
| This is the same as the critiques of the LLM paper by Apple
| where they showed that LLMs fail to solve the Tower of
| Hanoi problem after a certain number of disks. The test was
| to see how well these models can reason out a long task.
| People online were like "they could solve that problem if
| they had access to a coding environment". Again, the test
| was to check reasoning capability, not whether the model
| knew how to code an algorithm to solve the problem.
|
| If model performance degrades a lot after a number of
| reasoning steps, it's good to know where the limits are.
| Whether the model had access to tools or not is orthogonal
| to this problem.
| tomlockwood wrote:
| So the answer is a few more trillion?
| code_martial wrote:
| It's a worthwhile answer if it can be proven correct because
| it means that we've found a way to create intelligence, even
| if that way is not very efficient. It's still one step better
| than not knowing how to do so.
| tomlockwood wrote:
| So we're sending a trillion on faith?
| code_martial wrote:
| No, that's not what I said.
| tomlockwood wrote:
| Why are we sending the trillion?
| measurablefunc wrote:
| It must be deposited into OpenAI's bank account so that
| they can then deposit it into NVIDIA's account who can
| then in turn make a deal w/ OpenAI to deposit it back
| into OpenAI's account for some stock options. I think you
| can see how it works from here but if not then maybe one
| of the scaled up "reasoning" AIs will figure it out for
| you.
| usrbinbash wrote:
| > if it can be proven correct
|
| Then the first step would be to prove that this works
| _WITHOUT_ needing to burn through the trillions to do so.
| usrbinbash wrote:
| > I don't think there's evidence that this issue would persist
| after continuing to scale models to be larger and doing more RL
|
| And how much larger do we need to make the models? 2x? 3x? 10x?
| 100x? How large do they need to get before scaling-up _somehow_
| solves everything?
|
| Because: 2x larger, means 2x more memory and compute required.
| Double the cost or half the capacity. Would people still pay
| for this tech if it doubles in price? Bear in mind, much of it
| is already running at a loss even now.
|
| And what if 2x isn't good enough? Would anyone pay for a 10x
| larger model? Can we even realistically run such models as
| anything other than a very expensive PoC and for a very short
| time? And who's to say that even 10x will finally solve things?
| What if we need 40x? Or 100x?
|
| Oh, and of course: Larger models also require more data to
| train them on. And while the Internet is huge, it's still
| finite. And when things grow geometrically, even
| `sizeof(internet)` eventually runs out ... and, in fact, may
| have done so already [1] [2]
|
| What if we actually discover that scaling up doesn't even work
| at all, because of diminishing returns? Oh wait, looks like we
| did that already: [3]
|
| [1]: https://observer.com/2024/12/openai-cofounder-ilya-
| sutskever...
|
| [2]: https://biztechweekly.com/ai-training-data-crisis-how-
| synthe...
|
| [3]: https://garymarcus.substack.com/p/confirmed-llms-have-
| indeed...
| alyxya wrote:
| Scaling applies to multiple dimensions simultaneously over
| time. A frontier model today could be replicated a year later
| with a model half the size, with a quarter of the FLOPS, etc.
| I don't know the real numbers for optimization scaling, but
| you could check out NanoGPT speedrun [1] as an example.
|
| The best solution in the meantime is giving the LLM a harness
| that allows tool use like what coding agents have. I suspect
| current models are fully capable of solving arbitrary
| complexity artificial reasoning problems here, provided that
| they're used in the context of a coding agent tool.
|
| [1] https://github.com/KellerJordan/modded-nanogpt
| galaxyLogic wrote:
| Some problems are just too complex and the effort to solve
| them increases exponentially. No LLM can keep up with
| exponentially increasing effort unless you run it for an
| adequate number of years.
| Infinity315 wrote:
| What? Fundamentally, information can only be so dense.
| Current models may be inefficient w.r.t. information
| density; however, there is a lower bound on the compute
| required. As a pathological example, we shouldn't expect a
| megabyte worth of parameters to be able to encode the
| entirety of Wikipedia.
| BriggyDwiggs42 wrote:
| The issue is that no matter how much you train them, they
| don't generalize to arbitrarily sized problems. Sure, you
| can push out the horizon, but you won't make something that
| can always solve the problem (assuming resources permit,
| and that isn't the issue here).
| galaxyLogic wrote:
| > complexity of a graph created by modeling the problem as a
| graph and determining the traversals needed to go from some
| source node to a target node
|
| Sounds interesting: formalizing a problem once you know the
| solution. It seems like LLMs can't do that, or if they could,
| they would be able to evaluate where their problem solving is
| inadequate?
| js8 wrote:
| I think the explanation is pretty simple, as I said in my earlier
| comment: https://news.ycombinator.com/item?id=44904107
|
| I also believe the problem is we don't know what we want:
| https://news.ycombinator.com/item?id=45509015
|
| If we could make LLMs apply a modest set of logic rules
| consistently, it would be a win.
| Sharlin wrote:
| That's a pretty big "if". LLMs are by design entirely unlike
| GoFAI reasoning engines. It's also very debatable whether it
| makes any sense to try and hack LLMs into reasoning engines
| when you could just... use a reasoning engine. Or have the LLM
| defer to one, which would play to their strength as
| translators.
| flimflamm wrote:
| What confuses me is the fact that in the paper all logical steps
| are given. It basically checks: when all relevant facts are
| provided explicitly as links, how far and how complex a chain can
| the model correctly follow before it breaks down?
|
| So it's simpler than "reasoning". This is not necessarily a bad
| thing, as it boils down the reasoning to a simpler, more
| controlled sub-problem.
| devlogstream wrote:
| LLMs are like students: they can reason a bit, but real
| understanding still takes time and practice.
| hansmayer wrote:
| What? The LLMs are _nothing_ like students (or any other human
| for that matter).
| anal_reactor wrote:
| I have yet to see a task that AI fails at that the bottom 10%
| of the population wouldn't also fail at.
| TheOtherHobbes wrote:
| How about keeping a conversation going with family over
| Thanksgiving? (Or local equivalent.)
| randomNumber7 wrote:
| This is something where the top 10% sometimes horribly fail.
| Earw0rm wrote:
| If by task you mean the written, intellectual variety, maybe.
| layer8 wrote:
| If I have the choice of performing an intellectual task myself,
| or have it performed by someone from the bottom 10% of the
| population, I'd probably rather perform it myself.
| Der_Einzige wrote:
| What happens when both choices lead to you doing it yourself?
| acdha wrote:
| The problem is consistency: AI tools usually produce output
| which _sounds_ like the top 10% but you have to read it
| carefully to find the bottom 10% parts. We're not used to that
| because human performance isn't that inconsistent and we use
| history and social factors: someone's performance goes down
| when they're really drunk, but they rarely show up to work in
| that state and it's obvious enough that other people recognize
| that they shouldn't be trusted.
| anal_reactor wrote:
| > We're not used to that because human performance isn't that
| inconsistent
|
| It is. It's very common for socially apt people to bullshit
| through things they don't know, or outright want to hide.
| acdha wrote:
| That's not inconsistent: your bluffer knows they're making
| something up and is using their model of you to construct
| something they think you'll believe. Someone who can do
| that isn't going to suddenly forget how to count the number
| of letters in a word.
| anal_reactor wrote:
| You're wrong. Counting the number of letters in a word is
| a significantly more difficult task than lying, both for
| humans and LLMs. Imagine going to a ghetto and asking
| people "have you ever lied to someone and had them
| believe the lie", and ask them to spell "continuously".
| Children learn to lie before they learn to spell.
| acdha wrote:
| > Counting the number of letters in a word is a
| significantly more difficult task than lying
|
| No, it's not - you don't even need to be literate to
| count symbols - but also consider the complexity of the
| second task and how many skills each requires: unlike
| counting letters, lying isn't simple confabulation and
| requires a theory of mind and some kind of goal. A child
| who lies to avoid trouble is doing that because they have
| enough of a world model to know they are going to get in
| trouble for something even if they haven't worked out yet
| that this is unlikely to work.
| anal_reactor wrote:
| Sure, let's stick to counting symbols. When I need to
| count something, there's a decent chance I'll get lost if
| I count beyond 10, and beyond 20 I'll get lost for sure.
| Even below 10, when I count it's one-two-three-four-five-
| six-seven-eight-nine items. But when I lie I do it
| instantaneously, without altering the pace of the
| conversation. I can come up with a believable lie within
| the brief period between someone saying something to me,
| and the moment I'm expected to respond. No way I'd be
| able to count 10 items that fast.
|
| The Piraha language doesn't even have numerals - that's an
| extreme case, but there are quite a few languages where
| people stop counting beyond a certain small number and just
| say "a lot". The same people, though, don't have issues
| lying to one another. Let that sink in for a while - fully
| grown-ass adults, fully capable of functioning in their
| society, not capable of counting one-two-three because
| the concept is beyond them.
|
| What I'm trying to say is that all of those "requires
| theory of mind" statements are probably true but
| completely irrelevant because humans (and LLMs) have
| "hardware acceleration" of whatever it takes to lie,
| meanwhile counting is an abstract idea that requires using
| the brain in a way it didn't evolve to be used.
| Similarly, LLMs cannot count if they aren't connected to
| a math engine - not because they're stupid, but because
| counting is really difficult.
| krackers wrote:
| ARC-AGI v3 is a pretty good benchmark, and it's notably
| different from the other ARC-AGI in that it has a "truer" human
| baseline (you can go play it right now and add your datapoint),
| and captures the act of in-context learning better as you start
| an unfamiliar game then master it over time.
|
| Also bottom 10% feels like a bad comparison, median human would
| be better. And unlike "specialized" things like programming,
| game playing is something almost all of us have done.
| My_Name wrote:
| I find that they know what they know fairly well, but if you move
| beyond that, into what can be reasoned from what they know, they
| have a profound lack of ability to do that. They are good at
| repeating their training data, not thinking about it.
|
| The problem, I find, is that they then don't stop, or say they
| don't know (unless explicitly prompted to do so); they just make
| stuff up and express it with just as much confidence.
| ftalbot wrote:
| Every token in a response has an element of randomness to it.
| This means they're non-deterministic. Even if you ask
| something that is within their training data, there is some
| chance that you get a nonsensical, opposite, and/or dangerous
| result. The chance of that may be low because things are set
| up for the model to review its result, but there is no way to
| guarantee that a non-deterministic answer solves or reasons
| about anything assuredly, given enough iterations. It is
| designed to be imperfect.
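|
| A toy illustration of where the randomness enters (not any
| particular vendor's implementation): at each step the model
| scores every candidate token, and the next token is drawn
| from the resulting distribution. Anything above temperature
| 0 makes repeated runs diverge.
|
|   import math, random
|
|   def sample(logits, temperature=0.8, rng=random):
|       if temperature == 0:              # greedy: deterministic
|           return max(logits, key=logits.get)
|       weights = {t: math.exp(s / temperature)
|                  for t, s in logits.items()}
|       r = rng.random() * sum(weights.values())
|       for token, w in weights.items():
|           r -= w
|           if r <= 0:
|               return token
|       return token                      # float-rounding fallback
|
|   step = {"yes": 2.0, "no": 1.6, "maybe": 1.1}  # made-up scores
|   print([sample(step) for _ in range(10)])          # varies
|   print([sample(step, temperature=0) for _ in range(10)])  # fixed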
| yuvalr1 wrote:
| You are making a wrong leap from non-deterministic process to
| uncontrollable result. Most parallel algorithms are
| non-deterministic. There might be no guarantee about the
| order of calculation or even sometimes the final absolute
| result. However, even when producing different final results,
| the algorithm can still guarantee characteristics about the
| result.
|
| The hard problem then is not to eliminate non-deterministic
| behavior, but find a way to control it so that it produces
| what you want.
| flavaflav2 wrote:
| Life and a lot in our universe is non-deterministic. Some
| people assume science and mathematics are some universal
| truths rather than imperfect agreed upon understandings.
| Similarly many assume humans can be controlled through
| laws, penalties, prisons, propaganda, coercion, etc. But
| terrible things happen. Yes, if you set up the gutter-rails
| in your bowling lane, you can control the bowling ball
| unless it is thrown over those rails or in a completely
| different direction, but those rails are wide with LLMs by
| default, and the system instructions provided to it aren't
| rules, they are an inherently faulty way to coerce a non-
| deterministic system. But, yes, if there's absolutely no
| way to do something, and you're aware of every possible way
| a response or tool could affect things, and you have taken
| every possible precaution, you can make it behave. That's
| not how people are using it though, and we cannot control
| our tendency to trust that which seems trustworthy even if
| we are told these things.
| squidbeak wrote:
| No, Science is a means of searching for those truths -
| definitely not some 'agreed upon understanding'. It's
| backed up by experimentation and reproducible proofs. You
| also make a huge bogus leap from science to humanities.
| iq176 wrote:
| Scientific method is the process. Science itself includes
| the study and compendium of understandings, based on a
| belief system that includes shared understandings just
| like mathematics. The foundation of these are
| philosophical beliefs that we can know and understand
| these things. For example, on a metaphysical level, if
| the world around us were a simulation, then science could
| provide understandings about that simulated universe, but
| not about that which is simulating it.
| squidbeak wrote:
| This I'm afraid is rubbish. Scientific proofs
| categorically don't depend on philosophical beliefs.
| Reality is measurable and the properties measured don't
| care about philosophy.
| weltensturm wrote:
| > Reality is measurable
|
| Heisenberg would disagree.
| squidbeak wrote:
| Are you arguing that the uncertainty principle derives
| from philosophy rather than math?
| darkwater wrote:
| But those are still approximations to the actual
| underlying reality. Because the other option (and yes,
| it's a dichotomy) is that we have already defined and
| understood every detail of the physics that applies to
| our universe.
| squidbeak wrote:
| Indeed, that is a dichotomy: a false one. Science is
| exact without being finished.
| darkwater wrote:
| So, was Newtonian physics exact already?
| squidbeak wrote:
| > Science is exact without being finished
| darkwater wrote:
| Being exact doesn't mean it is not an approximation,
| which was the initial topic. Being exact in science means
| that 2+2=4 and that can be demonstrated following a
| logical chain. But that doesn't make our knowledge of the
| universe exact. It is still an approximation. What can
| be "exact" is how we obtain and reproduce the current
| knowledge we have of it.
| squidbeak wrote:
| The speed of light, or Planck's constant - are these
| approximations?
| mannykannot wrote:
| There seems to be more to it than that - in my experience
| with LLMs, they are good at finding some relevant facts but
| then quite often present a non-sequitur for a conclusion, and
| the article's title alone indicates that the problem for LRMs
| is similar: a sudden fall-off in performance as the task gets
| more difficult. If the issue was just non-determinism, I
| would expect the errors to be more evenly distributed, though
| I suppose one could argue that the sensitivity to non-
| determinism increases non-linearly.
| squidproquo wrote:
| The non-determinism is part of the allure of these systems --
| they operate like slot machines in a casino. The dopamine hit
| of getting an output that appears intelligent and the
| variable rewards keeps us coming back. We down-weight and
| ignore the bad outputs. I'm not saying these systems aren't
| useful to a degree, but one should understand the statistical
| implications on how we are collectively perceiving their
| usefulness.
| galaxyLogic wrote:
| > Every token in a response has an element of randomness to
| it.
|
| I haven't tried this, but if you ask the LLM the exact
| same question again, but in a different process, will you get
| a different answer?
|
| Wouldn't that mean we should most of the time ask the LLM
| each question multiple times, to see if we get a better
| answer next time?
|
| A bit like asking the same question from multiple different
| LLMs just to be sure.
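|
| Something like this, maybe - sample several answers and take
| the majority (I believe this is called self-consistency). A
| minimal sketch, with ask_llm standing in for whatever API you
| actually use:
|
|   import random
|   from collections import Counter
|
|   def ask_llm(question):
|       # stand-in for a real sampled API call; a noisy toy here
|       return random.choice(["yes", "yes", "yes", "no"])
|
|   def ask_with_voting(question, n=5):
|       answers = [ask_llm(question) for _ in range(n)]
|       best, votes = Counter(answers).most_common(1)[0]
|       return best, votes / n   # majority answer + crude confidence
|
|   print(ask_with_voting("Is 7919 prime?"))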
| PxldLtd wrote:
| I think a good test of this is to provide an image and
| get the model to predict what will happen next/if x occurs.
| They fail spectacularly at Rube-Goldberg machines. I think
| developing some sort of dedicated prediction model would help
| massively in extrapolating data. The human subconscious is
| filled with all sorts of parabolic prediction, gravity,
| momentum and various other fast-thinking paths that embed these
| calculations.
| yanis_t wrote:
| Any example of that? One would think that predicting what
| comes next from an image is basically video generation, which
| works not perfectly, but works somehow (Veo/Sora/Grok)
| PxldLtd wrote:
| Here's one I made in Veo3.1 since gemini is the only
| premium AI I have access to.
|
| Using this image - https://www.whimsicalwidgets.com/wp-
| content/uploads/2023/07/... and the prompt: "Generate a
| video demonstrating what will happen when a ball rolls down
| the top left ramp in this scene."
|
| You'll see it struggles - https://streamable.com/5doxh2 ,
| which is often the case with video gen. You have to
| describe carefully and orchestrate natural feeling motion
| and interactions.
|
| You're welcome to try with any other models but I suspect
| very similar results.
| chamomeal wrote:
| I love how it still copies the slow pan and zoom from
| rube goldberg machine videos, but it's just following
| along with utter nonsense lol
| galaxyLogic wrote:
| A Rube Goldberg machine was not part of their training data.
| We humans, however, have seen such things.
| autoexec wrote:
| Physics textbooks are, though, so it should know how they'd
| work, or at least know that balls don't spontaneously
| appear and disappear and that gears don't work when they
| aren't connected.
| mannykannot wrote:
| It is video generation, but succeeding at this task
| involves detailed reasoning about cause and effect to
| construct chains of events, and may not be something that
| can be readily completed by applying "intuitions" gained
| from "watching" lots of typical movies, where most of the
| events are stereotypical.
| pfortuny wrote:
| Most amazing is asking any of the models to draw an 11-sided
| polygon and number the edges.
| Torkel wrote:
| I asked gpt5, and it worked really well with a correct
| result. Did you expect it to fail?
| pistoriusp wrote:
| I saw a meme that I think about fairly often: Great apes have
| learnt sign language, and communicated with humans, since the
| 1960s. In all that time they've never asked humans questions.
| They've never tried to learn anything new! The theory is that
| they don't know that there are entities that know things they
| don't.
|
| I like to think that AI are the great apes of the digital
| world.
| 20k wrote:
| It's worth noting that the idea that great apes have learnt
| sign language is largely a fabrication by a single person,
| and nobody has ever been able to replicate this. All the
| communication has to be interpreted through that individual,
| and others (including people who speak sign language) have
| confirmed that they're just making random hand motions in
| exchange for food.
|
| They don't have the dexterity to really sign properly.
| krapht wrote:
| Citation needed.
| joncrocks wrote:
| https://en.wikipedia.org/wiki/Great_ape_language#Criticis
| m_a... - Not word for word, but certainly casting doubt
| that apes were ever really communicating in the way that
| people may have thought.
| mkl wrote:
| That article does completely refute 20k's claim that it
| was all done by one person though.
| MangoToupe wrote:
| The way linguists define communication via language?
| Sure. Let's not drag the rest of humanity into this
| presumption.
| conception wrote:
| Searching for koko ape fraud seems to produce a lot.
| ralfd wrote:
| > In his lecture, Sapolsky alleges that Patterson
| spontaneously corrects Koko's signs: "She would ask,
| 'Koko, what do you call this thing?' and [Koko] would
| come up with a completely wrong sign, and Patterson would
| say, 'Oh, stop kidding around!' And then Patterson would
| show her the next one, and Koko would get it wrong, and
| Patterson would say, 'Oh, you funny gorilla.' "
|
| More weirdly was this lawsuit against Patterson:
|
| > The lawsuit alleged that in response to signing from
| Koko, Patterson pressured Keller and Alperin (two of the
| female staff) to flash the ape. "Oh, yes, Koko, Nancy has
| nipples. Nancy can show you her nipples," Patterson
| reportedly said on one occasion. And on another: "Koko,
| you see my nipples all the time. You are probably bored
| with my nipples. You need to see new nipples. I will turn
| my back so Kendra can show you her nipples."[47] Shortly
| thereafter, a third woman filed suit, alleging that upon
| being first introduced to Koko, Patterson told her that
| Koko was communicating that she wanted to see the woman's
| nipples
|
| There was a bonobo named Kanzi who learned hundreds of
| lexigrams. The main criticism here seems to be that while
| Kanzi truly did know the symbol for "Strawberry" he "used
| the symbol for "strawberry" as the name for the object,
| as a request to go where the strawberries are, as a
| request to eat some strawberries". So no object-verb
| sentences and so no grammar which means no true language
| according to linguists.
|
| https://linguisticdiscovery.com/posts/kanzi/
| galaxyLogic wrote:
| > So no object-verb sentences and so no grammar which
| means no true language
|
| Great distinction. The stuff about showing nipples sounds
| creepy.
| pegasus wrote:
| You only need a citation for the idea that apes _aren't_
| able to speak sign language?
| acdha wrote:
| They claimed fraud by a single person, with zero
| replication. Both of those claims are testable, so they
| should be able to support them.
|
| At the very least, more than one researcher was involved
| and more than one ape was alleged to have learned ASL.
| There is a better discussion about what our threshold is
| for speech, along with our threshold for saying that
| research is fraud vs. mistaken, but we don't fix
| sloppiness by engaging in more of it.
| galaxyLogic wrote:
| So why wasn't the research continued further if the results
| were good? My assumption is it was because of the - Fear of
| the Planet of the Apes!
| rightbyte wrote:
| I mean dogs can learn a simple sign language?
| leptons wrote:
| Can the dogs sign back? Even dogs that learn to press
| buttons are mostly just pressing them to get treats. They
| don't ask questions, and it's not really a conversation.
| BOOSTERHIDROGEN wrote:
| Does that mean intelligence is soul? Then we will never
| achieve AGI.
| MangoToupe wrote:
| > The theory is that they don't know that there are entities
| that know things they don't.
|
| This seems like a rather awkward way of putting it. They may
| just lack conceptualization or abstraction, making the above
| statement meaningless.
| sodality2 wrote:
| The exact title of the capacity is 'theory of mind' - for
| example, chimpanzees have a limited capacity for it in that
| they can understand others' intentions, but they seemingly
| do not understand false beliefs (this is what GP
| mentioned).
|
| https://doi.org/10.1016/j.tics.2008.02.010
| MangoToupe wrote:
| Theory of mind is a distinct concept that isn't necessary
| to explain this behavior. Of course, it may follow
| naturally, but it strikes me as ham-fisted projection of
| our own cognition onto others. Ironically, a rather
| greedy theory of mind!
| galaxyLogic wrote:
| If apes started communicating among themselves with sign
| language they learned from humans, that would mean they
| would get more practice using it and they could evolve it
| over aeons. Hey, isn't that what actually happened?
| pimeys wrote:
| I just got this from codex yesterday:
|
| "I wasn't able to finish; no changes were shipped."
|
| And it's not the first time.
| conception wrote:
| This is my most impressive response from Codex "* I started
| wiring up module-scoped mocks as requested, but while digging
| in I realised we're sitting on a moving target: the repo
| already has a large pile of pending, unrelated changes (whole
| modules reshuffled, new ps1 files, etc.), and the
| Enhanced.WorkflowOrchestration.Tests.ps1 under review isn't
| even tracked in the baseline branch. Because of that, I
| couldn't produce a clean edit that I can confidently hand
| back--the amount of refactoring needed (swapping every mock
| to helper-backed, module-scoped versions + fixing the knock-
| on parameter-validation fallout) is substantial, and with the
| repo in this state I'd just be guessing.
| Rather than check in something half-broken, I'm pausing here.
| Let me know how you want to proceed--if you can land
| the upstream refactor (or share a stable snapshot of the
| tests/module), I can pick this up again and finish the
| review fixes in one go."
| darkwater wrote:
| Have you threatened it with a 2 in the next round of
| performance reviews?
| amelius wrote:
| The problem is that the training data doesn't contain a lot of
| "I don't know".
| pegasus wrote:
| The bigger problem is that the benchmarks / multiple-choice
| tests they are trained to optimize for don't distinguish
| between a wrong answer and "I don't know". Which is stupid
| and surprising. There was a thread here on HN about this
| recently.
| astrange wrote:
| That's not important compared to the post-training RL, which
| isn't "training data".
| usrbinbash wrote:
| > They are good at repeating their training data, not thinking
| about it.
|
| Which shouldn't come as a surprise, considering that this is,
| at the core of things, what language models do: Generate
| sequences that are statistically likely according to their
| training data.
| dymk wrote:
| This is too large of an oversimplification of how an LLM
| works. I hope the meme that they are just next token
| predictors dies out soon, before it becomes a permanent
| fixture of incorrect but often stated "common sense". They're
| not Markov chains.
| adastra22 wrote:
| They are next token predictors though. That is literally
| what they are. Nobody is saying they are simple Markov
| chains.
| dymk wrote:
| It's a uselessly reductive statement. A person at a
| keyboard is also a next token predictor, then.
| HarHarVeryFunny wrote:
| Yes, but it's not ALL they are.
| daveguy wrote:
| They are both designed, trained, and evaluated by how
| well they can predict the next token. It's literally what
| they do. "Reasoning" models just buildup additional
| context of next token predictions and RL is used to bias
| output options to ones more appealing to human judges.
| It's not a meme. It's an accurate description of their
| fundamental computational nature.
| gpderetta wrote:
| Indeed, they are next token predictors, but this is a
| vacuous statement because the predictor can be arbitrarily
| complex.
| HarHarVeryFunny wrote:
| Sure, but a complex predictor is still a predictor. It
| would be a BAD predictor if everything it output was not
| based on "what would the training data say?".
|
| If you ask it to innovate and come up with something not
| in its training data, what do you think it will do? ...
| it'll "look at" its training data and regurgitate
| (predict) something labelled as innovative.
|
| You can put a reasoning cap on a predictor, but it's
| still a predictor.
| Workaccount2 wrote:
| To be fair, we don't actually know what is and isn't in their
| training data. So instead we just assign successes to "in the
| training set" and failures to "not in the training set".
|
| But that assignment is shaky, because they can still fall
| over pretty badly on things that are definitely in the
| training set, and can still succeed on things that
| definitely are not in the training set.
| nakamoto_damacy wrote:
| LLMs falter because likelihood-driven pattern completion doesn't
| enforce coherence across uncertainty (probability),
| representation (geometry), composition (category), and search
| (reasoning). To get robust reasoning, we need these layers to be
| explicit, typed, and mutually constraining--with verification and
| calibrated belief updates in the loop.
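|
| One way to picture the "verification in the loop" part is a
| minimal propose-verify sketch (hypothetical propose/verify
| helpers, not the framework from the article):
|
|     # keep sampling candidates until an external checker
|     # accepts one, instead of trusting the first
|     # likelihood-driven completion
|     def solve(problem, propose, verify, max_tries=10):
|         for _ in range(max_tries):
|             candidate = propose(problem)      # e.g. an LLM call
|             ok, feedback = verify(candidate)  # e.g. solver/tests
|             if ok:
|                 return candidate
|             # fold the verifier's feedback back into the prompt
|             problem = problem + "\n" + feedback
|         return None  # an explicit "I don't know"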
|
| I was interviewed about this recently, and mentioned the great
| work of a professor of CS and Law who has been building the
| foundations for this approach. My own article about it was
| recently un-linked due to a Notion mishap (but available if
| anyone is interested - I have to publish it again)
|
| https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...
| CuriouslyC wrote:
| Richard Sutton's interview on Dwarkesh's podcast hit at this
| same point. The implicit world models in LLMs are insufficient.
| jampekka wrote:
| Sutton still hasn't learned his own Bitter Lesson? ;)
| creativeSlumber wrote:
| what do you mean?
| nakamoto_damacy wrote:
| Not sure why he capitalized bitter...
| jampekka wrote:
| It was a joke referring to his essay.
|
| https://en.wikipedia.org/wiki/Bitter_lesson
| hirako2000 wrote:
| Has anyone ever found an ML/AI paper that makes the claim
| that RLMs can reason?
|
| When I prompt an RLM, I can see that it spits out reasoning
| steps. But I don't consider that evidence that RLMs are
| capable of reasoning.
| Sharlin wrote:
| Semantics schemantics.
| hirako2000 wrote:
| It's a statistical imitation of a reasoning pattern; the
| underlying mechanism is pattern matching. The ability to
| create a model that can determine that two radically
| different words have a strong similarity in meaning doesn't
| imply the emergence of some generalizable, logical model
| that can suddenly Reason to solve novel problems.
|
| Pattern matching is a component of reason. Not === reason.
| _heimdall wrote:
| That would require the ability to understand what happens
| inside the system during inference when the output is
| created, and they can't do that today.
|
| There's no evidence to be had when we only know the inputs and
| outputs of a black box.
| tempfile wrote:
| I don't understand what point you are making. Doesn't the name
| "Reasoning language models" claim that they can reason? Why do
| you want to see it explicitly written down in a paper?
| hirako2000 wrote:
| This very paper sits on the assumption that reasoning (to
| solve puzzles) is at play. It calls those LLMs RLMs.
|
| Imo the paper itself should have touched on the lack of
| papers discussing what's in the black box that makes them
| Reasoning LMs. It does mention some tree algorithm that is
| supposedly key to reasoning capabilities.
|
| I'm by no means attacking the paper, as its intent is to
| demonstrate the lack of success at solving even complex
| puzzles that are simple to formulate.
|
| I was not making a point; I was genuinely asking in case
| someone knows of papers I could read that make claims, with
| evidence, that those RLMs actually reason, and how.
| tekno45 wrote:
| By renaming this binary to a "Mind reading language model",
| we can now read your mind and predict your choices just by
| chatting.
|
| Don't ask how it works, cuz it's called a "Mind reading
| language model", duh.
| egberts1 wrote:
| It's simple. Don't ingest more than 40KB at a time into the
| LLM's RAG pipe, and its hallucination goes way, way down.
|
| Preferably not at the start, and best not to do more than
| 40KB at a time at all.
|
| That's how I learned to deal with nftables' 120KB
| parser_bison.y file: by breaking it up into clean sections.
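|
| A rough sketch of that kind of byte-budget splitting (a
| hypothetical helper; in practice the split should follow the
| .y file's own section boundaries):
|
|     # split a big file into <=40KB chunks on line boundaries,
|     # then feed each chunk to the RAG pipeline separately
|     def chunk_file(path, limit=40 * 1024):
|         chunks, buf, size = [], [], 0
|         with open(path, "r", encoding="utf-8") as f:
|             for line in f:
|                 n = len(line.encode("utf-8"))
|                 if size + n > limit and buf:
|                     chunks.append("".join(buf))
|                     buf, size = [], 0
|                 buf.append(line)
|                 size += n
|         if buf:
|             chunks.append("".join(buf))
|         return chunks
|
|     # e.g. chunk_file("parser_bison.y") -> list of ~40KB strings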
|
| All of a sudden, a fully-deterministic, full LL(1) semantic
| pathway of nftables' CLI syntax appears before my very eyes
| (I spent hours validating it, 100%), and test generators can
| now permute crazy test cases with relative ease.
|
| Cue in Joe Walsh's "Life's Been Good To Me".
| bob_theslob646 wrote:
| Why 40kb?
| igravious wrote:
| and doesn't it depend on the LLM?
| egberts1 wrote:
| If you have your Pro or private LLM, then it's a tad bit
| bigger.
| egberts1 wrote:
| For the cheap public offering of their expensive data
| centers, that sweet spot and cutoff is 40KB.
| lingrush4 wrote:
| Is that really the best title the authors could come up with?
|
| Up next: "Lawn mowers are good at cutting grass until they
| aren't"
| andy99 wrote:
| I think that would be a good title if we'd previously thought
| lawn mowers had solved generalized grass cutting and assumed
| that because one worked on my lawn they could cut hayfields
| or harvest bamboo (a grass, I believe) effectively.
| tekno45 wrote:
| When the news cycle has been "lawnmowers can now do anything,
| throw away your KitchenAid", it's a pretty relevant title.
| moritzwarhier wrote:
| From the abstract:
|
| > some even claiming they are capable of generalized reasoning
| and innovation in reasoning-intensive fields such as mathematics,
| physics, medicine, and law. However, by more carefully scaling
| the complexity of reasoning problems, we show existing benchmarks
| actually have limited complexity
|
| Can someone ELI5 what the definitions of reasoning and complexity
| are here?
|
| I see they seem to focus on graph problems and representing
| problems as graph problems. But I didn't completely read the
| paper or understand it in depth. I skimmed some parts that seem
| to address this question (e.g. section 5 and the Introduction),
| but maybe there are simpler definitions that elude me.
|
| Surely they don't mean "computational complexity"?
|
| And what exactly is "reasoning"?
|
| I'm aware of philosophical logic and strict logic that can be
| applied to natural language arguments.
|
| But have we already agreed on a universal scale that grades
| answers to questions about the physical world? Or is this about
| mathematical reasoning?
|
| Mixing all of this together always irks me when it comes to these
| AI "benchmarks". But apparently people see value in these?
|
| I know my question isn't new.
|
| To me it seems that when we leave the mathematical realms, it
| quickly becomes fuzzy what correct "reasoning" should be.
|
| People can be convincing and avoid obvious logical fallacies, and
| still make wrong conclusions... or conclusions that run counter
| to assumed goals.
| dcre wrote:
| Even in the mathematical/formal realm, the meaning of reasoning
| is not as clear as it seems. The _result_ of the activity of
| reasoning may be a formal argument that can be evaluated
| according to well-defined rules, but the actual process your
| mind went through to get there is just as opaque as (or more
| opaque than) whatever is going on inside LLMs. It seems
| likely, as you suggest, that we are going to have to define
| reasoning in terms of the ability to solve certain classes
| of problems while leaving the character of the process
| unspecified.
| kordlessagain wrote:
| What specific reasoning capabilities matter for what real-world
| applications?
|
| Nobody knows.
|
| Moreover, nobody talks about that because it's boring and non-
| polarizing. Instead, supposedly smart people post stupid
| comments that prevent anyone from understanding that this
| paper is worthless.
|
| The paper is worthless because it has a click-bait title. Blog
| posts get voted down for that, why not this?
|
| The implicit claim is worthless. Failure to navigate a synthetic
| graph == failure to solve real world problems. False.
|
| Absolutely no connection to real world examples. Just losing the
| model in endless graphs.
| wavemode wrote:
| > The implicit claim is worthless. Failure to navigate a
| synthetic graph == failure to solve real world problems. False.
|
| This statement is the dictionary definition of attacking a
| strawman.
|
| Every new model that is sold to us, is sold on the basis that
| it performs better than the old model on synthetic benchmarks.
| This paper presents a different benchmark that those same LLMs
| perform much worse on.
|
| You can certainly criticize the methodology if the authors have
| erred in some way, but I'm not sure why it's hard to understand
| the relevance of the topic itself. If benchmarks are so
| worthless then go tell that to the LLM companies.
| riskable wrote:
| My hypothesis: This is why AI is fantastic as a coding assistant
| but not so great at other things. A software developer--after
| watching an AI model fail over and over again trying to, say,
| fix a difficult bug--will stop and approach the issue from a
| different angle. They'll take a closer look at what's going on,
| fiddle things around by hand, and that's usually enough to get
| over that hump of complexity (that the AI model couldn't work its
| way through).
|
| We (developers) do this because it's what we've _always_ done
| with our own code. Everyone's encountered a bug that they just
| couldn't figure out. So they search the Internet, try different
| implementations of the same thing, etc but nothing works.
| Usually, we finally solve such problems when we take a step back
| and look at it with a different lens.
|
| For example, just the other day--after spending far too long
| trying to get something working--I realized, "Fuck it! The users
| don't really need this feature." :thumbsup:
| acuozzo wrote:
| > AI is fantastic as a coding assistant
|
| The extent to which this is true is a rough measure of how
| derivative your work is, no?
| dankai wrote:
| This is not the only paper that scales reasoning complexity /
| difficulty.
|
| The CogniLoad benchmark does this as well (in addition to scaling
| reasoning length and distractor ratio). Requiring the LLM to
| purely reason based on what is in the context (i.e. not based on
| the information it's pretrained on), it finds that reasoning
| performance decreases significantly as problems get harder (i.e.
| require the LLM to hold more information in its hidden state
| simultaneously), but the bigger challenge for them is length.
|
| https://arxiv.org/abs/2509.18458
|
| Disclaimer: I'm the primary author of CogniLoad so feel free to
| ask me any questions.
| kerabatsos wrote:
| How is that different than human reasoning?
| ares623 wrote:
| I'd like $500B to just be the way I am thanks.
| j45 wrote:
| Compared to software that can explicitly reason, reasoning models
| don't seem to reason at all.
|
| They simulate reasoning through matching patterns.
___________________________________________________________________
(page generated 2025-10-31 23:01 UTC)