[HN Gopher] The Illusion of Thinking: Understanding the Limitati...
       ___________________________________________________________________
        
       The Illusion of Thinking: Understanding the Limitations of
       Reasoning LLMs [pdf]
        
       Author : amrrs
       Score  : 316 points
       Date   : 2025-06-06 18:18 UTC (1 day ago)
        
 (HTM) web link (ml-site.cdn-apple.com)
 (TXT) w3m dump (ml-site.cdn-apple.com)
        
       | behnamoh wrote:
       | Okay Apple, you got my attention. But I'm a strong proponent of
       | "something is better than nothing" philosophy--even if
       | OpenAI/Google/etc. are building reasoning models with the
       | limitations that you describe, they are still a huge progress
       | compared to what we had not long ago. Meanwhile you're not even
       | trying.
       | 
       | It's so easy to criticize the works of others and not deliver
       | anything. Apple--be Sam in Game of Thrones: "I'm tired of reading
       | about the achievements of better men".
        
         | suddenlybananas wrote:
         | I think you're mistaking the work of researchers who work at
         | Apple with the particular investment decisions of Apple over
         | the past few years.
         | 
         | >It's so easy to criticize the works of others and not deliver
         | anything. Apple--be Sam in Game of Thrones: "I'm tired of
         | reading about the achievements of better men".
         | 
         | This is a patently absurd thing to write about a research
         | paper.
        
         | bwfan123 wrote:
         | there is enough hype already - with AGI being promised as
         | imminent.
         | 
         | this work balances the hype and shows fundamental limitations
         | so the AI hypesters are checked.
         | 
          | why be salty?
        
       | ivape wrote:
        | This is easily explained by accepting that there is no such thing
        | as LRMs. LRMs are just LLMs that iterate on their own answers
        | more (or provide themselves more context information of a certain
        | type). The reasoning loop of an "LRM" is equivalent to asking a
        | regular LLM to "refine" its own response, or "consider"
        | additional context of a certain type. There is no such thing as
        | _reasoning_, basically; it was always a method to "fix"
        | hallucinations or provide more context automatically, nothing
        | else. These big companies baked in one of the hackiest prompt
        | engineering tricks that your typical enthusiast figured out long
        | ago, and managed to brand it and profit off it. The craziest part
        | is that DeepSeek was able to cause a multi-billion-dollar drop
        | and pump of AI stocks with this _one trick_. Crazy times.
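        | 
        | A rough sketch of that "iterate on their own answers" loop
        | (just an illustration; llm() is a hypothetical stand-in for any
        | chat-completion call):
        | 
        |     def refine(llm, question, rounds=3):
        |         # first pass: a plain answer
        |         answer = llm(question)
        |         for _ in range(rounds):
        |             # feed the model its own draft back as context
        |             prompt = (question + "\nDraft: " + answer
        |                       + "\nImprove the draft.")
        |             answer = llm(prompt)
        |         return answer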
        
         | AlienRobot wrote:
         | Is that what "reasoning" means? That sounds pretty ridiculous.
         | 
         | I've thought before that AI is as "intelligent" as your
         | smartphone is "smart," but I didn't think "reasoning" would be
         | just another buzzword.
        
           | ngneer wrote:
           | I am not too familiar with the latest hype, but "reasoning"
           | has a very straightforward definition in my mind. For
           | example, can the program in question derive new facts from
           | old ones in a logically sound manner. Things like applying
           | modus ponens. (A and A => B) => B. Or, all men are mortal and
           | Socrates is a man, and therefore Socrates is mortal. If the
           | program cannot deduce new facts, then it is not reasoning, at
           | least not by my definition.
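            | 
            | A toy illustration of that definition (just a sketch, not a
            | claim about how anything is implemented): deriving a new
            | fact by forward chaining with modus ponens:
            | 
            |     facts = {"Socrates is a man"}
            |     rules = [("Socrates is a man",
            |               "Socrates is mortal")]   # read: A => B
            | 
            |     derived = True
            |     while derived:
            |         derived = False
            |         for a, b in rules:
            |             if a in facts and b not in facts:
            |                 # modus ponens: from A and A=>B, infer B
            |                 facts.add(b)
            |                 derived = True
            | 
            |     print(facts)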
        
             | dist-epoch wrote:
             | When people say LLMs can't do X, I like to try it.
              | Q: Complete 3 by generating new knowledge:
              |     1. today is warm
              |     2. cats likes warm temperatures
              |     3.
             | 
             | A: Therefore, a cat is likely to be enjoying the weather
             | today.
             | 
             | Q: does the operation to create new knowledge you did have
             | a specific name?
             | 
             | A: ... Deductive Reasoning
             | 
             | Q: does the operation also have a Latin name?
             | 
             | A: ... So, to be precise, you used a syllogismus
             | (syllogism) that takes the form of Modus Ponens to make a
             | deductio (deduction).
             | 
             | https://aistudio.google.com/app/prompts/1LbEGRnzTyk-2IDdn53
             | t...
             | 
              | People then say "of course it could do that, it just
              | pattern matched a Logic textbook. I meant a real example,
              | not an artificially constructed one like this one. In a
              | complex scenario LLMs obviously can't do Modus Ponens."
        
               | ngneer wrote:
               | I do not know whether the state of the art is able to
               | reason or not. The textbook example you gave is
               | admittedly not very interesting. What you are hearing
               | from people is that parroting is not reasoning, which is
               | true.
               | 
               | I wonder if the state of the art can reason its way
               | through the following:
               | 
               | "Adam can count to 14000. Can Adam count to 13500?"
               | 
               | The response needs to be affirmative for every X1 and X2
               | such that X2 <= X1. That is reasoning. Anything else is
               | not reasoning.
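                | 
                | One way to make that requirement concrete: a property
                | check over a hypothetical ask() helper that queries a
                | model and returns True for an affirmative answer (a
                | sketch, not a real benchmark):
                | 
                |     import random
                | 
                |     def check(ask, trials=100):
                |         for _ in range(trials):
                |             x1 = random.randint(1, 20000)
                |             x2 = random.randint(1, x1)  # X2 <= X1
                |             q = (f"Adam can count to {x1}. "
                |                  f"Can Adam count to {x2}?")
                |             if not ask(q):
                |                 return False  # one miss fails
                |         return True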
               | 
               | The response when X2 > X1 is less interesting. But, as a
               | human it might be "Maybe, if Adam has time" or "Likely,
               | since counting up to any number uses the same algorithm"
               | or "I don't know".
               | 
               | Seems ChatGPT can cope with this. Other examples are easy
               | to come up with, too. There must be benchmarks for this.
               | 
               | Input to ChatGPT:
               | 
               | "Adam can lift 1000 pounds of steel. Can Adam lift 1000
               | pounds of feathers?"
               | 
               | Output from ChatGPT:
               | 
               | "1,000 pounds of feathers would be much easier for Adam
               | to lift compared to 1,000 pounds of steel, because
               | feathers are much lighter and less dense."
               | 
               | So, maybe not there yet...
        
               | dist-epoch wrote:
               | > "Adam can lift 1000 pounds of steel. Can Adam lift 1000
               | pounds of feathers?"
               | 
               | Worked for me:
               | 
               | https://chatgpt.com/share/6844813a-6e4c-8006-b560-c0be223
               | eeb...
               | 
               | gemma3-27b, a small model, had an interesting take:
               | 
               | > This is a classic trick question!
               | 
               | > While Adam can lift 1000 pounds, no, he likely cannot
               | lift 1000 pounds of feathers.
               | 
               | > Volume: Feathers take up a huge amount of space for
               | their weight. 1000 pounds of feathers would be an
               | enormous volume - likely far too large for Adam to even
               | get under, let alone lift. He'd be trying to lift a
               | massive, bulky cloud.
               | 
               | > Practicality: Even if he could somehow get it under a
               | barbell, the feathers would shift and compress, making a
               | secure grip impossible.
               | 
               | > The question plays on our understanding of weight
               | versus volume. It's designed to make you focus on the
               | "1000 pounds" and forget about the practicalities of
               | lifting something so voluminous.
               | 
               | Tried the counting question on the smallest model,
                | gemma-3n-34b, which can run on a smartphone:
               | 
               | > Yes, if Adam can count to 14000, he can definitely
               | count to 13500. Counting to a smaller number is a basic
               | arithmetic operation. 13500 is less than 14000.
        
               | ngneer wrote:
                | Thanks for trying these out :). This highlights the often
                | subtle difference between knowing the answer and deducing
                | the answer. Feathers could be ground into a pulp and
                | condensed, too. I am not trying to be clever; it just
                | seems like the response is a canned answer.
        
           | JSR_FDED wrote:
           | A reasoning model is an LLM that has had additional training
           | phases that reward problem solving abilities. (But in a black
           | box way - it's not clear if the model is learning actual
           | reasoning or better pattern matching, or memorization, or
           | heuristics... maybe a bit of everything).
        
         | meroes wrote:
         | Yep. This is exactly the conclusion I reached as an RLHF'er.
         | Reasoning/LRM/SxS/CoT is "just" more context. There never was
         | reasoning. But of course, more context can be good.
        
         | Too wrote:
        | The million-dollar question is how far one can get with this
        | trick. Maybe this is exactly how our own brains operate? If
        | not, what fundamental building blocks are missing to get there?
        
           | bwfan123 wrote:
           | > If not, what fundamental building blocks are missing to get
           | there
           | 
           | If I were to guess, the missing building block is the ability
           | to abstract - which is the ability to create a symbol to
            | represent something. A concrete example of abstraction is
            | seen in the axioms of lambda calculus: 1) the ability to
            | posit a variable, 2) the ability to define a function using
            | said variable, and 3) the ability to apply functions to
            | things. Abstraction arises from a process in the brain which
            | we have not yet understood, and which could be outside of
            | computation as we know it, per [1].
           | 
           | [1] https://www.amazon.com/Emperors-New-Mind-Concerning-
           | Computer...
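            | 
            | As a loose code illustration of those three abilities
            | (nothing deeper implied):
            | 
            |     # 1) posit a variable, 2) abstract over it
            |     square = lambda x: x * x
            |     # 3) apply the function to things
            |     print(square(7))           # 49
            |     # abstractions are values, so they compose
            |     twice = lambda f: lambda x: f(f(x))
            |     print(twice(square)(3))    # 81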
        
       | JusticeJuice wrote:
        | Their finding that LLMs work best at simple tasks, LRMs work
        | best at medium-complexity tasks, and neither succeeds at
        | genuinely complex tasks is good to know.
        
         | cubefox wrote:
         | Not sure whether I sense sarcasm.
        
       | nialv7 wrote:
       | I've seen this too often, papers that ask questions they don't
       | even bother to properly define.
       | 
       | > Are these models capable of generalizable reasoning, or are
       | they leveraging different forms of pattern matching?
       | 
       | Define reasoning, define generalizable, define pattern matching.
       | 
        | For additional credit, after you have done so, show humans are
       | capable of what you just defined as generalizable reasoning.
        
         | NitpickLawyer wrote:
         | > show humans are capable of what you just defined as
         | generalizable reasoning.
         | 
         | I would also add "and plot those capabilities on a curve". My
         | intuition is that the SotA models are already past the median
         | human abilities in _a lot_ of areas.
        
         | crvdgc wrote:
         | In the context of this paper, I think "generalizable reasoning"
          | means finding a method to solve the puzzle and thus being
          | able to execute that method on puzzle instances of arbitrary
         | complexity.
        
       | beneboy wrote:
       | This kind of explains why Claude will find the right solution,
        | but then, the more it thinks and keeps "improving", the more
        | over-engineered (and sometimes wrong) the solution becomes.
        | Interesting to see this coming up in formal research.
        
       | bicepjai wrote:
       | The study challenges the assumption that more "thinking" or
       | longer reasoning traces necessarily lead to better problem-
        | solving in LRMs.
        
         | bayindirh wrote:
         | As a test, I asked Gemini 2.5 Flash and Gemini 2.5 Pro to
         | decode a single BASE64 string.
         | 
         | Flash answered _correctly_ in ~2 seconds, at most. Pro answered
         | _very wrongly_ after thinking and elaborating for ~5 minutes.
         | 
         | Flash was also giving a wrong answer for the same string in the
         | past, but it improved.
         | 
         | Prompt was the same: "Hey, can you decode $BASE64_string?"
         | 
         | I have no further comments.
        
           | rafterydj wrote:
            | Well, that's not a very convincing argument. That's just a
            | failure to recognize when the use of a tool (a base64
            | decoder) is needed, not a reasoning problem at all, right?
        
             | BoorishBears wrote:
             | That's not really a cop out here: both models had access to
             | the same tools.
             | 
             | Realistically there are many problems that non-reasoning
             | models do better on, especially when the answer cannot be
              | reached by a thought process, like recalling internal
             | knowledge.
             | 
             | You can try to teach the model the concept of a problem
             | where thinking will likely steer it away from the right
             | answer, but at some point it becomes like the halting
             | problem... how does the model reliably think its way into
              | the realization that a given problem is too complex to be
             | thought out?
        
             | bayindirh wrote:
             | I don't know whether Flash uses a tool or not, but it
             | answers pretty quickly. However, Pro opts to use its own
             | reasoning, not a tool. When I look at the reasoning train,
             | it pulls and pulls knowledge endlessly, refining that
             | knowledge and drifting away.
        
             | Jensson wrote:
             | Translating to BASE64 is a good test to see how well it
             | works as a language translator without changing things,
              | because it's the same skill for an AI model.
              | 
              | If the model changes things, it means it didn't really
              | capture the translation patterns for BASE64, so who
             | knows what it will miss when translating between languages
             | if it can't even do BASE64?
        
             | layer8 wrote:
             | A moderately smart human who understands how Base64 works
             | can decode it by hand without external tools other than pen
             | and paper. Coming up with the exact steps to perform is a
             | reasoning problem.
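              | 
              | A rough sketch of those by-hand steps (ignoring '='
              | padding), checked against the standard library:
              | 
              |     import base64
              | 
              |     B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              |            "abcdefghijklmnopqrstuvwxyz"
              |            "0123456789+/")
              | 
              |     def decode_by_hand(s):
              |         s = s.rstrip("=")
              |         # 1. map each character to its 6-bit value
              |         bits = "".join(format(B64.index(c), "06b")
              |                        for c in s)
              |         # 2. regroup into 8-bit bytes, drop the tail
              |         n = len(bits) - len(bits) % 8
              |         return bytes(int(bits[i:i+8], 2)
              |                      for i in range(0, n, 8))
              | 
              |     msg = "aGVsbG8="  # b'hello'
              |     assert decode_by_hand(msg) == base64.b64decode(msg)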
        
       | actinium226 wrote:
       | Man, remember when everyone was like 'AGI just around the
       | corner!' Funny how well the Gartner hype cycle captures these
        | sorts of things.
        
         | bayindirh wrote:
         | They're similar to self-driving vehicles. Both are around the
         | corner, but neither can negotiate the turn.
        
           | einrealist wrote:
           | All that to keep the investment pyramid schemes going.
        
           | kfarr wrote:
           | Waymo's pretty good at unprotected lefts
        
             | bayindirh wrote:
             | Waymo is pretty good at (a finite number of) unprotected
             | lefts, and this doesn't count as "level 5 autonomous
             | driving".
        
           | hskalin wrote:
            | And commercially viable nuclear fusion.
        
             | mgiampapa wrote:
             | I harvest fusion energy every single day... It's just there
             | in the sky, for free!
        
           | nmca wrote:
           | I saw your comment and counted -- in May I took a Waymo
           | thirty times.
        
             | bayindirh wrote:
             | Waymo is a popular argument in self-driving cars, and they
             | do well.
             | 
              | However, Waymo is the Deep Blue of self-driving cars:
              | doing _very well_ in a _closed space_. As a result of this
              | geofencing, they have effectively exhausted their search
              | space, hence they work well as a consequence of the lack
              | of surprises.
              | 
              | AI works well when the search space is limited, but
              | _General_ AI in any category needs to handle a vastly
              | larger search space, and there it falls flat.
              | 
              | At the end of the day, AI is informed search. These
              | systems get inputs and generate a suitable output, as
              | deemed by their trainers.
        
         | yahoozoo wrote:
         | We will be treating LLMs "like a junior developer" forever.
        
           | JKCalhoun wrote:
           | And I'm fine with that.
        
           | sneak wrote:
           | Even if they never get better than they are today (unlikely)
           | they are still the biggest change in software development and
           | the software development industry in my 28 year career.
        
         | tonyhart7 wrote:
          | I think we're just at around 80% of progress
          | 
          | the easy part is done, but the hard part is so hard it takes
          | years to progress
        
           | georgemcbay wrote:
           | > the easy part is done but the hard part is so hard it takes
           | years to progress
           | 
           | There is also no guarantee of continued progress to a
           | breakthrough.
           | 
           | We have been through several "AI Winters" before where
           | promising new technology was discovered and people in the
           | field were convinced that the breakthrough was just around
           | the corner and it never came.
           | 
            | LLMs aren't quite the same situation, as they do have some
            | undeniable utility to a wide variety of people even without
            | AGI springing out of them. But the blind optimism that
            | progress will surely continue at a rapid pace until the
            | assumed breakthrough is realized feels pretty similar to the
            | hype cycle preceding past AI "Winters".
        
             | Swizec wrote:
             | > We have been through several "AI Winters" before
             | 
             | Yeah, remember when we spent 15 years (~2000 to ~2015)
             | calling it "machine learning" because AI was a bad word?
             | 
             | We use _so much_ AI in production every day but nobody
             | notices because as soon as a technology becomes useful, we
             | stop calling it AI. Then it's suddenly "just face
             | recognition" or "just product recommendations" or "just
             | [plane] autopilot" or "just adaptive cruise control" etc
             | 
             | You know a technology isn't practical yet because it's
             | still being called AI.
        
               | blks wrote:
               | I don't think there's any "AI" in aircraft autopilots.
        
               | withinboredom wrote:
               | AI encompasses a wide range of algorithms and techniques;
               | not just LLMs or neural nets. Also, it is worth pointing
               | out that the definition of AI has changed drastically
               | over the last few years and narrowed pretty
               | significantly. If you're viewing the definition from the
                | '80s-'90s, most of what we call "automation" today would
               | have been considered AI.
        
               | Jensson wrote:
                | Autopilots were a thing before computers were a thing;
                | you can implement one using mechanics and control theory.
                | So no, traditional autopilots are not AI under any
                | reasonable definition, otherwise every single machine we
                | build would be considered AI, as almost all machines have
                | some form of control system in them. For example, is your
                | microwave clock an AI?
                | 
                | So I'd argue any algorithm that comes from control theory
                | is not AI; those are just basic old dumb machines. You
                | can't make planes without control theory, humans can't
                | keep a plane steady without it, so the Wright brothers
                | adding this to their plane is why they succeeded in
                | making a flying machine.
                | 
                | So if autopilots are AI, then the Wright brothers
                | developed an AI to control their plane. I don't think
                | anyone sees that as AI, not even at the time they did the
                | first flight.
        
               | trc001 wrote:
                | Uh, the Bellman equation was first used for control
                | theory and is the foundation of modern reinforcement
                | learning... so wouldn't that imply LLMs "come from"
                | control theory?
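                | 
                | For reference, the equation in question, written in its
                | modern reinforcement-learning (Bellman optimality) form:
                | 
                |     V^*(s) = \max_a \big[ r(s,a) +
                |         \gamma \sum_{s'} P(s'\mid s,a)\, V^*(s') \big]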
        
         | roenxi wrote:
         | What do you think has changed? The situation is still about as
         | promising for AGI in a few years - if not more so. Papers like
         | this are the academics mapping out where the engineering
          | efforts need to be directed to get there, and it seems to be a
          | relatively small number of challenges that are easier than the
         | ones already overcome - we know machine learning can solve
         | Towers of Hanoi, for example. It isn't fundamentally
         | complicated like Baduk is. The next wall to overcome is more of
         | a low fence.
         | 
         | Besides, AI already passes the Turing test (or at least, is
         | most likely to fail because it is too articulate and
         | reasonable). There is a pretty good argument we've already
         | achieved AGI and now we're working on achieving human- and
         | superhuman-level intelligence in AGI.
        
           | MoonGhost wrote:
           | > What do you think has changed? The situation is still about
           | as promising for AGI in a few years - if not more so
           | 
            | It's better today. Hoping that LLMs can get us to AGI in one
            | hop was naive. Depending on the definition of AGI, we might
            | already be there. But for superhuman level in all possible
            | tasks there are many steps to be done. The obvious way is to
            | find a solution for each type of task. We already have one
            | for math calculations: using tools. Many other types can be
            | solved the same way. After a while we'll gradually get to a
            | well-rounded 'brain', or model(s) + support tools.
            | 
            | So far the future looks bright; there is progress, and there
            | are problems, but no deadlocks.
            | 
            | PS: The Turing test is a <beep> nobody seriously talks about
            | today.
        
         | latchup wrote:
         | To be fair, the technology sigmoid curve rises fastest just
         | before its inflection point, so it is hard to predict at what
         | point innovation slows down due to its very nature.
         | 
         | The first Boeing 747 was rolled out in 1968, only 65 years
          | after the first successful heavier-than-air flight. If you had
          | told people back then that not much would fundamentally
          | change in civil aviation over the next 57 years, no one would
          | have believed you.
        
           | PantaloonFlames wrote:
           | And not just in aviation. Consider what aviation did to make
           | the world smaller. Huge 2nd order changes. The COVID-19
           | pandemic would not have happened the way it did, if there
           | were no Boeing or Airbus.
           | 
           | Big hard-to-predict changes ahead.
        
         | brookst wrote:
        | ...but that was, like, two years ago? If we go from GPT-2 to
        | AGI in ten years, that will still feel insanely fast.
        
         | mirekrusin wrote:
        | I remember "stochastic parrot" and people saying it's a fancy
        | Markov chain / dead end. You don't hear from them much since
        | agentic coding appeared.
        
           | Marazan wrote:
           | Spicy autocomplete is still spicy autocomplete
        
             | mirekrusin wrote:
              | I'm not sure a system capable of, e.g., reasoning over
              | images deserves this label anymore.
        
               | mrbungie wrote:
                | The thing is, "spicy" or "glorified" autocomplete are not
                | actually bad labels; they are autocomplete machines that
               | are very good up to the point of convincing people that
               | they think.
        
             | PantaloonFlames wrote:
             | Yours seems like a c.2023 perspective of coding assistants.
             | These days it's well beyond autocomplete and "generate a
             | function that returns the numbers from the Fibonacci
             | sequence."
             | 
             | But I would think that would be well understood here.
             | 
             | How can you reduce what is currently possible to spicy
             | autocomplete? That seems pretty dismissive, so much so that
             | I wonder if it is motivated reasoning on your part.
             | 
             | I'm not saying it's good or bad; I'm just saying the
             | capability is well beyond auto complete.
        
         | otabdeveloper4 wrote:
         | AGI has always been "just around the corner", ever since
         | computers were invented.
         | 
         | Some problems have become more tractable (e.g. language
         | translation), mostly by lowering our expectations of what
         | constitutes a "solution", but AGI is nowhere nearer. AGI is a
          | secular millenarian religion.
        
         | naasking wrote:
         | Interpreting "just around the corner" as "this year" sounds
          | like your error. Most projections are years out, at least.
        
         | IshKebab wrote:
          | Yeah it's already been 2 1/2 years! How long does it take to
         | develop artificial life anyway? Surely no more than 3 years? I
         | demand my money back!
        
       | alansammarone wrote:
       | I have a somewhat similar point of view to the one voiced by
       | other people, but I like to think about it slightly differently,
       | so I'll chime in - here's my take (although, admittedly, I'm
       | operating with a quite small reasoning budget (5 minutes tops)):
       | 
       | Time and again, for centuries - with the pace picking up
       | dramatically in recent decades - we thought we were special and
        | we were wrong. The Sun does not revolve around the Earth, which
        | is a pretty typical planet, with the same chemical composition
        | as any other planet. All of a sudden we're not the only ones who
        | could calculate, then solve symbolic equations, then play chess,
        | then compose music, then talk, then reason (up to a point, for
        | some definition of "reason"). You get my point.
       | 
       | And when we were not only matched, but dramatically surpassed in
       | these tasks (and not a day earlier), we concluded that they
       | weren't _really_ what made us special.
       | 
       | At this point, it seems to me reasonable to assume we're _not_
       | special, and the onus should be on anybody claiming that we are
       | to at least attempt to mention in passing what is the secret
       | sauce that we have (even if we can't quite say what it is without
        | handwaving or using concepts that by definition cannot be
        | defined - "qualia is the indescribable feeling of red - its
        | redness (?)").
       | 
       | Oh, and sorry, I could never quite grasp what "sentient" is
       | supposed to mean - would we be able to tell we're not sentient if
       | we weren't?
        
         | ivape wrote:
         | I can give you a pretty wild explanation. Einstein was a freak
         | of nature. Nature just gave him that "something" to figure out
          | the laws of the universe. I'm avoiding the term God so as not
          | to tickle anyone incorrectly. Seriously, explain what schooling
          | and environment gets you that guy. So, to varying degrees, all
          | output is from the universe. It's hard for the ego to accept;
          | surely we earned everything we ever produced ...
         | 
         | Spooky stuff.
        
         | keiferski wrote:
         | This analogy doesn't really work, because the former examples
         | are ones in which humanity discovered that it existed in a
         | larger world.
         | 
         | The recent AI example is humanity building, or attempting to
         | build, a tool complex enough to mimic a human being.
         | 
         | If anything, you could use recent AI developments as proof of
         | humanity's uniqueness - what other animal is creating things of
         | such a scale and complexity?
        
         | suddenlybananas wrote:
          | I don't see how heliocentrism or calculators have any bearing
         | on the uniqueness of humans.
        
       | curious_cat_163 wrote:
       | > Rather than standard benchmarks (e.g., math problems), we adopt
       | controllable puzzle environments that let us vary complexity
       | systematically
       | 
       | Very clever, I must say. Kudos to folks who made this particular
       | choice.
       | 
       | > we identify three performance regimes: (1) low complexity tasks
       | where standard models surprisingly outperform LRMs, (2) medium-
       | complexity tasks where additional thinking in LRMs demonstrates
       | advantage, and (3) high-complexity tasks where both models
       | experience complete collapse.
       | 
       | This is fascinating! We need more "mapping" of regimes like this!
       | 
       | What I would love to see (not sure if someone on here has seen
       | anything to this effect) is how these complexity regimes might
       | map to economic value of the task.
       | 
        | For that, the eval needs to go beyond puzzles, but the
        | complexity of the tasks still needs to be controllable.
        
         | pegasus wrote:
         | Is (1) that surprising? If I ask someone a simple question but
         | tell them to "think really hard about it", they'll be more
         | likely to treat it as a trick question and look for a non-
         | obvious answer. Overthinking it, basically.
        
       | 8bitsrule wrote:
       | Fusion has been 25 years away for all of my life.
        
         | sneak wrote:
         | Fusion is net positive energy now; that happened in 2022
         | (+54%).
         | 
         | In 2025 they got a 313% gain (4.13 output factor).
         | 
         | Fusion is actually here and working. It's not cost effective
         | yet but to pretend there has been no progress or achievements
         | is fundamentally false.
        
           | oneshtein wrote:
           | It will be cost effective in just 25 years.
        
           | sitkack wrote:
            | Negative Negs spit out low-effort snark; they said the same
            | thing about solar, electric cars, even multicore, JIT, open
            | source. Thanks for refuting them; the forum software itself
            | should either quarantine the response or auto-respond before
            | the comment is submitted. These people don't build the
            | future.
           | 
           | Fusion News, May 28th, 2025
           | https://www.youtube.com/watch?v=1YHcI-SfKx8
        
           | lrhegeba wrote:
            | It isn't when you look at Q total: total energy input for
            | all needed support systems versus energy produced. See
            | https://en.wikipedia.org/wiki/Fusion_energy_gain_factor for
            | more details.
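            | 
            | Roughly, per the distinctions in the linked article (my
            | paraphrase of its definitions):
            | 
            |     Q_{plasma} = P_{fusion} / P_{heating}
            |     Q_{total}  = P_{fusion} / P_{all\ inputs}
            | 
            | The headline "net gain" results upthread quote the former.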
        
       | benlivengood wrote:
       | These are the kind of studies that make so much more sense than
       | the "LLMs can't reason because of this ideological argument or
       | this one anecdote" posts/articles. Keep 'em coming!
       | 
       | And also; the frontier LLMs blow older LLMs out of the water.
       | There is continual progress and this study would have been
       | structured substantially the same 2 years ago with much smaller N
       | on the graphs because the regimes were much tinier then.
        
       | antics wrote:
       | I think the intuition the authors are trying to capture is that
       | they believe the models are omniscient, but also dim-witted. And
       | the question they are collectively trying to ask is whether this
       | will continue forever.
       | 
       | I've never seen this question quantified in a really compelling
        | way, and while interesting, I'm not sure this PDF succeeds, at
        | least not well enough to silence dissent. I think AI maximalists
        | will continue to think that the models are in fact getting less
        | dim-witted, while the AI skeptics will continue to think these
        | apparent gains are in fact entirely a byproduct of "increasing"
        | "omniscience." The razor will have to be a lot sharper before
       | people start moving between these groups.
       | 
       | But, anyway, it's still an important question to ask, because
       | omniscient-yet-dim-witted models terminate at "superhumanly
       | assistive" rather than "Artificial Superintelligence", which in
       | turn economically means "another bite at the SaaS apple" instead
       | of "phase shift in the economy." So I hope the authors will
       | eventually succeed.
        
         | sitkack wrote:
         | There is no reason that omniscient-yet-dim-witted has to
         | plateau at human intelligence.
        
           | antics wrote:
           | I am not sure if you mean this to refute something in what
           | I've written but to be clear I am not arguing for or against
           | what the authors think. I'm trying to state why I think there
           | is a disconnect between them and more optimistic groups that
           | work on AI.
        
             | drodgers wrote:
             | I think that commenter was disagreeing with this line:
             | 
             | > because omniscient-yet-dim-witted models terminate at
             | "superhumanly assistive"
             | 
             | It might be that with dim wits + enough brute force
             | (knowledge, parallelism, trial-and-error, specialisation,
             | speed) models could still substitute for humans and
             | transform the economy in short order.
        
               | antics wrote:
               | Sorry, I can't edit it any more, but what I was trying to
               | say is that if the authors are correct, that this
               | distinction is philosophically meaningful, then that is
               | the conclusion. If they are not correct, then all their
               | papers on this subject are basically meaningless.
        
               | Byamarro wrote:
               | And we have a good example of a dimwitted, brute-force
               | process creating intelligent designs - evolution.
        
               | drodgers wrote:
               | Also corporations, governments etc. - they're capable of
               | things that none of the individuals could do alone.
        
         | drodgers wrote:
         | > I think AI maximalists will continue to think that the models
         | are in fact getting less dim-witted
         | 
         | I'm bullish (and scared) about AI progress precisely because I
         | think they've only gotten a little less dim-witted in the last
         | few years, but their practical capabilities have improved a
         | _lot_ thanks to better knowledge, taste, context, tooling etc.
         | 
         | What scares me is that I think there's a reasoning/agency
         | capabilities overhang. ie. we're only one or two breakthroughs
         | away from something which is both kinda omniscient (where we
         | are today), and able to out-think you very quickly (if only
          | by dint of applying parallelism to actually competent
         | outcome-modelling and strategic decision making).
         | 
         | That combination is terrifying. I don't think enough people
         | have really imagined what it would mean for an AI to be able to
         | out-strategise humans in the same way that they can now -- say
         | -- out-poetry humans (by being both decent in terms of quality
          | and _super_ fast). It's like when you're speaking to someone
         | way smarter than you and you realise that they're 6 steps
         | ahead, and actively shaping your thought process to guide you
         | where they want you to end up. At scale. For everything.
         | 
         | This exact thing (better reasoning + agency) is also the top
         | priority for all of the frontier researchers right now (because
         | it's super useful), so I think a breakthrough might not be far
         | away.
         | 
         | Another way to phrase it: I think today's LLMs are about as
         | good at snap judgements in most areas as the best humans
         | (probably much better at everything that rhymes with inferring
         | vibes from text), but they kinda suck at:
         | 
         | 1. Reasoning/strategising step-by-step for very long periods
         | 
         | 2. Snap judgements about reasoning or taking strategic actions
         | (in the way that expert strategic humans don't actually need to
         | think through their actions step-by-step very often - they've
         | built intuition which gets them straight to the best answer 90%
         | of the time)
         | 
         | Getting good at the long range thinking might require more
         | substantial architectural changes (eg. some sort of separate
         | 'system 2' reasoning architecture to complement the already
         | pretty great 'system 1' transformer models we have). OTOH, it
         | might just require better training data and algorithms so that
         | the models develop good enough strategic taste and agentic
         | intuitions to get to a near-optimal solution quickly before
         | they fall off a long-range reasoning performance cliff.
         | 
         | Of course, maybe the problem is really hard and there's no easy
         | breakthrough (or it requires 100,000x more computing power than
         | we have access to right now). There's no certainty to be found,
         | but a scary breakthrough definitely seems possible to me.
        
           | sitkack wrote:
           | I think you are right, and that the next step function can be
           | achieved using the models we have, either by scaling the
           | inference, or changing the way inference is done.
        
             | danielmarkbruce wrote:
             | People are doing all manner of very sophisticated inferency
             | stuff now - it just tends to be extremely expensive for now
             | and... people are keeping it secret.
        
               | Jensson wrote:
                | If it was good enough to replace people, then it
                | wouldn't be too expensive: they would have launched it,
                | replaced a bunch of people, and made trillions of
                | dollars by now.
                | 
                | So at best their internal models are still just
                | performance multipliers, unless some breakthrough
                | happened very recently. It might be a bigger multiplier,
                | but that still keeps humans in jobs and thus doesn't
                | revolutionize much.
        
         | imiric wrote:
         | > I think the intuition the authors are trying to capture is
         | that they believe the models are omniscient, but also dim-
         | witted.
         | 
         | We keep assigning adjectives to this technology that
         | anthropomorphize the neat tricks we've invented. There's
         | nothing "omniscient" or "dim-witted" about these tools. They
         | have no wit. They do not think or reason.
         | 
         | All Large "Reasoning" Models do is generate data that they use
         | as context to generate the final answer. I.e. they do real-time
         | tuning based on synthetic data.
         | 
         | This is a neat trick, but it doesn't solve the underlying
         | problems that plague these models like hallucination. If the
         | "reasoning" process contains garbage, gets stuck in loops,
         | etc., the final answer will also be garbage. I've seen sessions
         | where the model approximates the correct answer in the first
         | "reasoning" step, but then sabotages it with senseless "But
         | wait!" follow-up steps. The final answer ends up being a
         | mangled mess of all the garbage it generated in the "reasoning"
         | phase.
         | 
         | The only reason we keep anthropomorphizing these tools is
         | because it makes us feel good. It's wishful thinking that
         | markets well, gets investors buzzing, and grows the hype
         | further. In reality, we're as close to artificial intelligence
         | as we were a decade ago. What we do have are very good pattern
         | matchers and probabilistic data generators that can leverage
         | the enormous amount of compute we can throw at the problem.
         | Which isn't to say that this can't be very useful, but
         | ascribing human qualities to it only muddies the discussion.
        
           | antics wrote:
            | I am not sure we are on the same page: the point of my
           | response is that this paper is not enough to prevent exactly
           | the argument you just made.
           | 
            | In any event, if you want to take issue with this paper, I
           | think we will need to back up a bit. The authors use a
           | mostly-standardized definition of "reasoning", which is
           | widely-accepted enough to support not just one, but several
           | of their papers, in some of the best CS conferences in the
           | world. I actually think you are right that it is reasonable
           | to question this definition (and some people do), but I think
           | it's going to be really hard for you to start that discussion
           | here without (1) saying what your definition specifically is,
            | and (2) justifying why it's better than theirs. Or at the
            | very least, borrowing one from a well-known critique,
            | _e.g._, Gebru's, Bender's, _etc_.
        
           | Kon5ole wrote:
           | >They have no wit. They do not think or reason.
           | 
           | Computers can't think and submarines can't swim.
        
             | Jensson wrote:
              | But if you need a submarine that can swim as agilely as a
              | fish, then we still aren't there yet; fish are far superior
              | to submarines in many ways. So submarines might be faster
              | than fish, but there are so many maneuvers that fish can do
              | that the submarine can't. It's the same here with
              | thinking.
              | 
              | So just as computers are better than humans at multiplying
              | numbers, there are still many things we need human
              | intelligence for, even in today's era of LLMs.
        
               | Kon5ole wrote:
               | The point here (which is from a quote by Dijkstra) is
               | that if the desired result is achieved (movement through
               | water) it doesn't matter if it happens in a different way
               | than we are used to.
               | 
               | So if an LLM generates working code, correct
               | translations, valid points relating to complex matters
               | and so on it doesn't matter if it does so by thinking or
               | by some other mechanism.
               | 
               | I think that's an interesting point.
        
               | Jensson wrote:
               | > if the desired result is achieved (movement through
               | water) it doesn't matter if it happens in a different way
               | than we are used to
               | 
               | But the point is that the desired result isn't achieved,
               | we still need humans to think.
               | 
               | So we still need a word for what humans do that is
               | different from what LLM does. If you are saying there is
               | no difference then how do you explain the vast difference
               | in capability between humans and LLM models?
               | 
                | Submarines and swimming is a great metaphor for this,
                | since submarines clearly don't swim and thus have very
                | different abilities in water: way better in some ways
                | but way worse in others. So using that metaphor, it's
                | clear that LLM "thinking" cannot be described with the
                | same words as human thinking, since it's so different.
        
               | Kon5ole wrote:
               | >If you are saying there is no difference then how do you
               | explain the vast difference in capability between humans
               | and LLM models?
               | 
               | No I completely agree that they are different, like
               | swimming and propulsion by propellers - my point is that
               | the difference may be irrelevant in many cases.
               | 
               | Humans haven't been able to beat computers in chess since
                | the 90s, long before LLMs became a thing. Chess engines
               | from the 90s were not at all "thinking" in any sense of
               | the word.
               | 
               | It turns out "thinking" is not required in order to win
               | chess games. Whatever mechanism a chess engine uses gets
               | better results than a thinking human does, so if you want
               | to win a chess game, you bring a computer, not a human.
               | 
               | What if that also applies to other things, like
               | translation of languages, summarizing complex texts,
               | writing advanced algorithms, realizing implications from
               | a bunch of seemingly unrelated scientific papers, and so
               | on. Does it matter that there was no "thinking" going on,
               | if it works?
        
               | jplusequalt wrote:
               | >So if an LLM generates working code
               | 
               | It matters when code bases become hard to parse because
               | the engineers throwing shit together with Cursor have
               | made an ungrokkable ball of shit.
        
             | naasking wrote:
             | "Can't" is a pretty strong word, effectively entailing
             | "never". Never is a long time to believe computers can't
             | think.
        
           | tim333 wrote:
           | >There's nothing "omniscient" or "dim-witted" about these
           | tools
           | 
           | I disagree in that that seems quite a good way of describing
           | them. All language is a bit inexact.
           | 
            | Also I don't buy that we are no closer to AI than ten years
            | ago - there seems to be lots going on. Just because LLMs are
            | limited doesn't mean we can't find or add other algorithms -
            | I mean look at alphaevolve, for example:
            | https://www.technologyreview.com/2025/05/14/1116438/google-d...
           | 
           | >found a faster way to solve matrix multiplications--a
           | fundamental problem in computer science--beating a record
           | that had stood for more than 50 years
           | 
           | I figure it's hard to argue that that is not at least
           | somewhat intelligent?
        
             | imiric wrote:
             | > I figure it's hard to argue that that is not at least
             | somewhat intelligent?
             | 
             | The fact that this technology can be very useful doesn't
             | imply that it's intelligent. My argument is about the
             | language used to describe it, not about its abilities.
             | 
              | The breakthroughs we've had are because there is a lot of
              | utility in finding patterns in data, which humans aren't
              | very good at. Many of our problems can be boiled down to
             | this task. So when we have vast amounts of data and compute
             | at our disposal, we can be easily impressed by results that
             | seem impossible for humans.
             | 
             | But this is not intelligence. The machine has no semantic
             | understanding of what the data represents. The algorithm is
             | optimized for generating specific permutations of tokens
             | that match something it previously saw and was rewarded
             | for. Again, very useful, but there's no thinking or
             | reasoning there. The model doesn't have an understanding of
             | why the wolf can't be close to the goat, or how a cabbage
             | tastes. It's trained on enough data and algorithmic tricks
             | that its responses can fool us into thinking it does, but
             | this is just an illusion of intelligence. This is why we
             | need to constantly feed it more tricks so that it doesn't
             | fumble with basic questions like how many "R"s are in
             | "strawberry", or that it doesn't generate racially diverse
             | but historically inaccurate images.
        
               | tim333 wrote:
               | I imagine if you asked the LLM why the wolf can't be
               | close to the goat it would give a reasonable answer. I
               | realise it does it by using permutation of tokens but I
               | think you have to judge intelligence by the results
               | rather than the mechanism otherwise you could argue
               | humans can't be intelligent because they are just a bunch
               | of neurons that find patterns.
        
               | Jensson wrote:
               | We have had programs that can give good answers to some
               | hard questions for a very long time now. Watson won
                | Jeopardy back in 2011, but it still wasn't very good at
                | replacing humans.
                | 
                | So that isn't a good way to judge intelligence. Computers
                | are so fast and have so much data that you can make
                | programs to answer just about anything pretty well; an
                | LLM is able to do that, but more automatically. It still
                | doesn't automate the logical parts, just the lookup of
                | knowledge: we don't know how to train large logic
                | models, just large language models.
        
               | eMPee584 wrote:
                | LLMs are not the only model type though? There's a
                | plethora of architectures and combinations being
                | researched. Even transformers are starting to be able to
                | do cool sh1t on knowledge graphs; also interesting is
                | progress on autoregressive physics PDE (partial
                | differential equation) models. And it can't be too long
                | until some providers of actual biological neural nets
                | show up on openrouter (probably a lot less energy- and
                | capital-intense to scale up brain goo in tanks compared
                | to gigawatt GPU clusters). Combine that zoo of "AI"
                | specimens using M2M, MCP, etc., and the line between
                | mock and "true" intelligence will blur, escalating our
                | feeble species into ASI territory. Good luck to us.
        
               | Jensson wrote:
               | > There's a plethora of architectures and combinations
               | being researched
               | 
                | There was a plethora of architectures and combinations
                | being researched before LLMs; it still took a very long
                | time to find the LLM architecture.
               | 
               | > the line between mock and "true"intelligence will blur
               | 
               | Yes, I think this will happen at some point. The question
               | is how long it will take, not if it will happen.
               | 
               | The only thing that can stop this is if intermediate AI
               | is good enough to give every human a comfortable life but
               | still isn't good enough to think on its own.
               | 
                | It's easy to imagine such an AI being developed: imagine
                | a model that can learn to mimic humans at any task, but
               | still cannot update itself without losing those skills
               | and becoming worse. Such an AI could be trained to
               | perform every job on earth as long as we don't care about
               | progress.
               | 
               | If such an AI is developed, and we don't quickly solve
               | the remaining problems to get an AI to be able to
                | progress science on its own, it's likely our progress
                | entirely stalls there, as humans will no longer have a
               | reason to go to school to advance science.
        
               | swat535 wrote:
               | > you have to judge intelligence by the results rather
               | than the mechanism
               | 
               | This would be the exact opposite conclusion of the
               | Chinese room: https://en.wikipedia.org/wiki/Chinese_room
               | 
               | I think you'd need to offer a stronger counter argument
               | than the one you presented here.
        
               | tim333 wrote:
               | Actually I think the Chinese room fits my idea. It's a
               | silly thought experiment that would never work in
               | practice. If you tried to make one you would judge it
               | unintelligent because it wouldn't work. Or at least in
                | the way Searle implied - he basically proposed a lookup
                | table.
        
               | grugagag wrote:
                | I keep trying this wolf-cabbage-goat problem with
                | various permutations, let's say just a wolf and a
                | cabbage, no goat mentioned. At some step the goat
                | materializes in the answer. I tell it there is no goat,
                | and yet it answers again and the goat is there.
        
               | BriggyDwiggs42 wrote:
               | This approach to defining "true" intelligence seems
               | flawed to me because of examples in biology where
               | semantic understanding is in no way relevant to function.
               | A slime mold solving a maze doesn't even have a brain,
               | yet it solves a problem to get food. There's no knowing
               | that it does that, no complex signal processing, no self-
               | perception of purpose, but nevertheless it gets the food
               | it needs. My response to that isn't to say the slime mold
               | has no intelligence, it's to widen the definition of
               | intelligence to include the mold. In other words,
               | intelligence is something one does rather than has; it's
               | not the form but the function of the thing. Certainly
               | LLMs lack anything in any way resembling human
               | intelligence (they even lack brains), but they
               | demonstrate a capacity to solve problems that I don't
               | think it's unreasonable to label intelligent behavior.
               | You can put them in some mazes and LLMs will happen to
               | solve them.
        
               | hackinthebochs wrote:
               | >The machine has no semantic understanding of what the
               | data represents.
               | 
               | How do you define "semantic understanding" in a way that
               | doesn't ultimately boil down to saying they don't have
               | phenomenal consciousness? Any functional concept of
               | semantic understanding is captured to some degree by
               | LLMs.
               | 
               | Typically when we attribute understanding to some entity,
               | we recognize some substantial abilities in the entity in
               | relation to that which is being understood. Specifically,
               | the subject recognizes relevant entities and their
               | relationships, various causal dependencies, and so on.
               | This ability goes beyond rote memorization; it has a
               | counterfactual quality in that the subject can infer
               | facts or descriptions in different but related cases
               | beyond the subject's explicit knowledge. But LLMs excel
               | at this.
               | 
               | >feed it more tricks so that it doesn't fumble with basic
               | questions like how many "R"s are in "strawberry"
               | 
               | This failure mode has nothing to do with LLMs lacking
               | intelligence and everything to do with how tokens are
               | represented. They do not see individual characters, but
               | sub-word chunks. It's like expecting a human to count the
               | pixels in an image it sees on a computer screen. While
               | not impossible, it's unnatural to how we process images
               | and therefore error-prone.
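               | 
               | A minimal sketch of the tokenization point (assuming the
               | tiktoken package; the exact split varies by tokenizer, so
               | the output is only illustrative):
               | 
               |     import tiktoken
               | 
               |     enc = tiktoken.get_encoding("cl100k_base")
               |     ids = enc.encode("strawberry")
               |     print(ids)      # a few integer token ids
               |     pieces = [enc.decode([i]) for i in ids]
               |     print(pieces)   # sub-word chunks, not letters
               |     # The model "sees" chunks like these, which is why
               |     # counting the "r"s is an unnatural task for it.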
        
           | BoiledCabbage wrote:
           | > There's nothing "omniscient" or "dim-witted" about these
           | tools. They have no wit. They do not think or reason.
           | 
           | > All Large "Reasoning" Models do is generate data that they
           | use as context to generate the final answer. I.e. they do
           | real-time tuning based on synthetic data.
           | 
           | I always wonder when people make comments like this if they
           | struggle with analogies. Or if it's a lack of desire to
           | discuss concepts at different levels of abstraction.
           | 
           | Clearly an LLM is not "omniscient". It doesn't require a post
           | to refute that, OP obviously doesn't mean that literally.
           | It's an analogy describing two semi (fairly?) independent
           | axes. One on breadth of knowledge, one on something more
           | similar to intelligence and being able to "reason" from
           | smaller components of knowledge. The opposite of which is dim
           | witted.
           | 
           | So at one extreme you'd have something completely unable to
           | generalize or synthesize new results. Only able to correctly
           | respond if it identically matches prior things it has seen,
           | but has seen and stored a ton. At the other extreme would be
           | something that only knows a very small set of general facts
           | and concepts but is extremely good at reasoning from first
           | principles on the fly. Both could "score" the same on an
           | evaluation, but have very different projections for future
           | growth.
           | 
           | It's a great analogy and way to think about the problem. And
           | it took me multiple paragraphs to write what OP expressed in
           | two sentences via a great analogy.
           | 
           | LLMs are a blend of the two skills, apparently leaning more
           | towards the former but not completely.
           | 
           | > What we do have are very good pattern matchers and
           | probabilistic data generators
           | 
           | This is an unhelpful description. An object is more than the
           | sum of its parts, and higher-level behaviors emerge. The
           | statement is factually correct, yet it's the equivalent of
           | describing a computer as nothing more than a collection of
           | gates and wires and concluding it shouldn't be discussed at
           | a higher level of abstraction.
        
       | esafak wrote:
       | I don't know that I would call it an "illusion of thinking", but
       | LLMs do have limitations. Humans do too. No amount of human
       | thinking has solved numerous open problems.
        
         | th0ma5 wrote:
         | The errors that LLMs make and the errors that people make are
         | probably not comparable enough for a lot of the discussions
         | about LLM limitations at this point.
        
           | esafak wrote:
           | We have different failure modes. And I'm sure researchers,
           | faced with these results, will be motivated to overcome these
           | limitations. This is all good, keep it coming. I just don't
           | understand some of the naysaying here.
        
             | Jensson wrote:
             | The naysayers just say that even when people are motivated
             | to solve a problem, the problem might still not get
             | solved. There are still unsolved problems with LLMs; the
             | AI hypemen say AGI is all but a given in a few years'
             | time, but if that relies on some undiscovered
             | breakthrough, it's very unlikely, since such breakthroughs
             | are very rare.
        
       | danck wrote:
       | In figure 1 bottom-right they show how the correct answers are
       | being found later as the complexity goes higher. In the
       | description they even state that in false responses the LRM often
       | focusses on a wrong answer early and then runs out of tokens
       | before being able to self-correct. This seems obvious and
       | indicates that it's simply a matter of scaling (a bigger token
       | budget would lead to better abilities for more complex tasks).
       | Am I missing something?
        
       | teleforce wrote:
       | > We found that LRMs have limitations in exact computation: they
       | fail to use explicit algorithms and reason inconsistently across
       | puzzles.
       | 
       | It seems that LLMs/LRMs need help from their distant cousins,
       | namely logic, optimization, and constraint programming, which
       | can be described as intelligent automation, or IA
       | [1],[2],[3],[4] (a tiny solver sketch follows the references).
       | 
       | [1] Logic, Optimization, and Constraint Programming: A Fruitful
       | Collaboration - John Hooker - CMU (2023) [video]:
       | 
       | https://www.youtube.com/live/TknN8fCQvRk
       | 
       | [2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT
       | (2011) [video]:
       | 
       | https://youtube.com/watch?v=HB5TrK7A4pI
       | 
       | [3] Google OR-Tools:
       | 
       | https://developers.google.com/optimization
       | 
       | [4] MiniZinc:
       | 
       | https://www.minizinc.org/
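       | 
       | A tiny illustration of the kind of exact, explicit search these
       | cousins do (a sketch using OR-Tools' CP-SAT Python API; the
       | constraints here are made up for illustration):
       | 
       |     from ortools.sat.python import cp_model
       | 
       |     model = cp_model.CpModel()
       |     x = model.NewIntVar(0, 20, "x")
       |     y = model.NewIntVar(0, 20, "y")
       |     model.Add(x + 2 * y == 14)  # exact constraints, not patterns
       |     model.Add(x > y)
       |     model.Maximize(x + y)
       | 
       |     solver = cp_model.CpSolver()
       |     if solver.Solve(model) in (cp_model.OPTIMAL,
       |                                cp_model.FEASIBLE):
       |         print(solver.Value(x), solver.Value(y))  # provably optimal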
        
       | thomasahle wrote:
       | All the environments they test (Tower of Hanoi, Checkers
       | Jumping, River Crossing, Block World) could easily be solved
       | perfectly by any of the LLMs if the authors had allowed them to
       | write code.
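       | 
       | For instance, a minimal recursive Tower of Hanoi solver (a
       | sketch; the peg names are arbitrary) is only a few lines, and
       | emitting it is exactly the kind of thing these models do well:
       | 
       |     def hanoi(n, src="A", aux="B", dst="C", moves=None):
       |         if moves is None:
       |             moves = []
       |         if n > 0:
       |             hanoi(n - 1, src, dst, aux, moves)
       |             moves.append((src, dst))  # move largest disk
       |             hanoi(n - 1, aux, src, dst, moves)
       |         return moves
       | 
       |     print(len(hanoi(10)))  # 1023 moves, i.e. 2**10 - 1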
       | 
       | I don't really see how this is different from "LLMs can't
       | multiply 20 digit numbers"--which btw, most humans can't either.
       | I tried it once (using pen and paper) and consistently made
       | errors somewhere.
        
         | someothherguyy wrote:
         | > humans can't
         | 
         | The reasons humans can't and the reasons LLMs can't are
         | completely different though. LLMs are often incapable of
         | performing multiplication. Many humans just wouldn't care to do
         | it.
        
         | Jensson wrote:
         | > I don't really see how this is different from "LLMs can't
         | multiply 20 digit numbers"--which btw, most humans can't
         | either. I tried it once (using pen and paper) and consistently
         | made errors somewhere.
         | 
         | People built missiles and precisely engineered machines like
         | jet aircraft before we had computers; humans can do all of
         | those things reliably just by spending more time thinking
         | about it, inventing better strategies, and using more paper.
         | 
         | Our brains weren't made to do such computations, but a general
         | intelligence can solve the problem anyway by using what it has
         | in a smart way.
        
           | thomasahle wrote:
           | Some specialized people could probably do 20x20, but I'd
           | still expect them to make a mistake at 100x100. The level we
           | needed for spacecraft was much less than that, and we had
           | many levels of checks to help catch errors afterwards.
           | 
           | I'd wager that 95% of humans wouldn't be able to do 10x10
           | multiplication without errors, even if we paid them $100 to
           | get it right. There's a reason we had to invent lots of
           | machines to help us.
           | 
           | It would be an interesting social studies paper to try and
           | recreate some "LLMs can't think" papers with humans.
        
             | Jensson wrote:
             | > There's a reason we had to invent lots of machines to
             | help us.
             | 
             | The reason was efficiency, not that we couldn't do it. If a
             | machine can do it then we don't need expensive humans to do
             | it, so human time can be used more effectively.
        
           | jdmoreira wrote:
           | No. A huge population of humans did, while standing on the
           | shoulders of giants.
        
             | Jensson wrote:
             | Humans aren't giants; they stood on the shoulders of other
             | humans. So for AI to be equivalent, they should stand on
             | the shoulders of other AI models.
        
               | jdmoreira wrote:
               | building for thousands of years with a population size in
               | the range between millions and billions at any given
               | time.
        
               | Jensson wrote:
               | Right, and when we have AI that can do the same with
               | millions/billions of computers then we can replace
               | humans.
               | 
               | But as long as AI cannot do that they cannot replace
               | humans, and we are very far from that. Currently AI
               | cannot even replace individual humans in most white-
               | collar jobs; replacing an entire team is way harder than
               | replacing an individual, and even harder is replacing
               | workers in an entire field, meaning the AI has to do
               | research and make advances on its own, etc.
               | 
               | So like, we are still very far from AI completely being
               | able to replace human thinking and thus be called AGI.
               | 
               | Or in other words, AI has to replace those giants to be
               | able to replace humanity, since those giants are humans.
        
         | Xmd5a wrote:
         | >Large Language Model as a Policy Teacher for Training
         | Reinforcement Learning Agents
         | 
         | >In this paper, we introduce a novel framework that addresses
         | these challenges by training a smaller, specialized student RL
         | agent using instructions from an LLM-based teacher agent. By
         | incorporating the guidance from the teacher agent, the student
         | agent can distill the prior knowledge of the LLM into its own
         | model. Consequently, the student agent can be trained with
         | significantly less data. Moreover, through further training
         | with environment feedback, the student agent surpasses the
         | capabilities of its teacher for completing the target task.
         | 
         | https://arxiv.org/abs/2311.13373
        
         | hskalin wrote:
         | Well that's because all these LLMs have memorized a ton of code
         | bases with solutions to all these problems.
        
         | bwfan123 wrote:
         | > but humans cant do it either
         | 
         | This argument is tired as it keeps getting repeated for any
         | flaws seen in LLMs. And the other tired argument is: wait !
         | this is a sigmoid curve, and we have not seen the inflection
         | point yet. If someone gave me a penny for every comment
         | saying these, I'd be rich by now.
         | 
         | Humans invented machines because they could not do certain
         | things. All the way from simple machines in physics (Archimedes
         | lever) to the modern computer.
        
           | thomasahle wrote:
           | > Humans invented machines because they could not do certain
           | things.
           | 
           | If your disappointment is that the LLM didn't invent a
           | computer to solve the problem, maybe you need to give it
           | access to physical tools, robots, labs etc.
        
             | mrbungie wrote:
             | Nah, even if we follow such a weak "argument" the fact is
             | that, ironically, the evidence shown in this and other
             | papers points towards the idea that even if LRMs did have
             | access to physical tools, robots labs, etc*, they probably
             | would not be able to harness them properly. So even if we
             | had an API-first world (i.e. every object and subject in
             | the world can be mediated via a MCP server), they wouldn't
             | be able to perform as well as we hope.
             | 
             | Sure, humans may fail at 20-digit multiplication problems,
             | but I don't think that's relevant. Most aligned, educated,
             | and well-incentivized humans (such as the ones building
             | and handling labs) will follow complex instructions
             | correctly and predictably, even instructions that are
             | probably ill-defined and worse than an exact Tower of
             | Hanoi solving algorithm. Don't misinterpret me, human
             | errors do
             | happen in those contexts because, well, we're talking about
             | humans, but not as catastrophically as the errors committed
             | by LRMs in this paper.
             | 
             | I'm kind of tired of people comparing humans to machines in
             | such simple and dishonest ways. Such thoughts pollute the
             | AI field.
             | 
             | *In this case for some of the problems the LRMs were given
             | an exact algorithm to follow, and they didn't. I wouldn't
             | keep my hopes up for an LRM handling a full physical
             | laboratory/factory.
        
         | mjburgess wrote:
         | The goal isn't to assess the LLMs' capability at solving any
         | of those problems. The point isn't how good they are at
         | block-world puzzles.
         | 
         | The point is to construct non-circular ways of quantifying
         | model performance in reasoning. That the LLM has access to
         | prior exemplars of any given problem is exactly the issue in
         | establishing performance in reasoning, over historical
         | synthesis.
        
       | cdrini wrote:
       | When I use a normal LLM, I generally try to think "would I be
       | able to do this without thinking, if I had all the knowledge, but
       | just had to start typing and go?".
       | 
       | With thinking LLMs, they can think, but they often can only think
       | in one big batch before starting to "speak" their true answer. I
       | think that needs to be rectified so they can switch between the
       | two. In my previous framework, I would say "would I be able to
       | solve this if I had all the knowledge, but could only think,
       | then start typing?".
       | 
       | I think for larger problems, the answer to this is no. I would
       | need paper/a whiteboard. That's what would let me think, write,
       | output, iterate, draft, iterate. And I think that's where agentic
       | AI seems to be heading.
        
       | d4rkn0d3z wrote:
       | I wrote my first MLP 25 years ago, after repeating some early
       | experiments in machine learning from 20 years before that. One
       | of the experiments I repeated was in text-to-speech. It was
       | amazing to set up training runs and return after several hours
       | to listen to my supercomputer babble like a toddler. I
       | literally recall listening and being unable to distinguish the
       | output of my NN from that of a real toddler; I happened to be
       | teaching my niece to read around that same time. And when the
       | NN had gained a large vocabulary, such that it could fairly
       | proficiently read aloud, I was convinced that I had found my
       | PhD project and a path to AGI.
       | 
       | Further examination and discussion with more experienced
       | researchers gave me pause. They said that one must have a
       | solution, or a significant new approach toward solving the hard
       | problems associated with a research project for it to be viable,
       | otherwise time (and money) is wasted finding new ways to solve
       | the easy problems.
       | 
       | This is a more general principle that can be applied to most
       | areas of endeavour. When you set about research and development
       | that involves a mix of easy, medium, and hard problems, you must
       | solve the hard problems first, otherwise you blow your budget
       | finding new ways to solve the easy problems, which nobody cares
       | about in science.
       | 
       | But "AI" has left the realm of science behind and entered the
       | realm of capitalism where several years of meaningless
       | intellectual gyration without ever solving a hard problem may be
       | quite profitable.
        
       | throwaway71271 wrote:
       | I think one of the reasons we are confused about what LLMs can
       | do is because they use language.
       | is because they use language. And we look at the "reasoning
       | traces" and the tokens there look human, but what is actually
       | happening is very alien to us, as shown by "Biology of Large
       | Language Models"[1] and "Safety Alignment Should Be Made More
       | Than Just a Few Tokens Deep"[2]
       | 
       | I am struggling a lot to see what the tech can and cannot do,
       | particularly when designing systems with it, and how to build
       | systems where the whole is bigger than the sum of its parts. I
       | think this is because I am constantly confused by their
       | capabilities; despite my understanding of their machinery and
       | how they work, their use of language just seems like magic. I
       | even wrote
       | https://punkx.org/jackdoe/language.html just to remind myself how
       | to think about it.
       | 
       | I think this kind of research is amazing, and we have to spend
       | tremendously more effort on understanding how to use the tokens
       | and how to build with them.
       | 
       | [1]: https://transformer-circuits.pub/2025/attribution-
       | graphs/bio... [2]: https://arxiv.org/pdf/2406.05946
        
         | dleeftink wrote:
         | The opposite might apply, too; the whole system may be smaller
         | than its parts, as it excels at individual tasks but mixes
         | things up in combination. Improvements will be made, but I
         | wonder if we should aim for generalists, or accept more
         | specialist approaches as it is difficult to optimise for all
         | tasks at once.
        
           | throwaway71271 wrote:
           | You know the meme "seems like we'll have AGI before we can
           | reliably parse PDFs" :)
           | 
           | So if you are building a system, let's say you ask it to
           | parse a PDF, and you put a judge to evaluate the quality of
           | the output, and then you create a meta-judge to improve the
           | prompts of the parser and the PDF judge. The question is: is
           | this going to get better as it is running, and even more, is
           | it going to get better as the models get better?
           | 
           | You can build the same system in a completely different way,
           | more like "program synthesis": imagine you don't use LLMs to
           | parse, but you use them to write parser code and tests, and
           | then a judge to judge the tests, or even escalate to a human
           | to verify, and then you train your classifier that picks the
           | parser. Now this system is much more likely to improve
           | itself as it is running, and as the models get better.
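           | 
           | A hypothetical sketch of the two designs (call_llm is a
           | stand-in for whatever chat API is in use; nothing here is a
           | real library):
           | 
           |     def call_llm(prompt: str) -> str:
           |         raise NotImplementedError  # stub for a model call
           | 
           |     # Design 1: parse -> judge -> meta-judge rewrites prompts
           |     def prompt_loop(pdf_text, parse_p, judge_p, rounds=3):
           |         parsed = ""
           |         for _ in range(rounds):
           |             parsed = call_llm(parse_p + pdf_text)
           |             verdict = call_llm(judge_p + parsed)
           |             fix = "Improve this prompt:\n" + parse_p
           |             parse_p = call_llm(fix + "\n" + verdict)
           |         return parsed
           | 
           |     # Design 2 (program synthesis): the LLM writes a parser
           |     # plus tests; only code that passes the tests (run in a
           |     # sandbox, not shown) is kept, so gains persist.
           |     def synthesis_step():
           |         parser = call_llm("Write a PDF parser in Python.")
           |         tests = call_llm("Write pytest tests:\n" + parser)
           |         return parser, tests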
           | 
           | A few months ago Yannic Kilcher gave this example: it seems
           | that current language models are very constrained mid-
           | sentence, because they most importantly want to produce
           | semantically consistent and grammatically correct text, so
           | the entropy mid-sentence is very different from the entropy
           | after punctuation. The dot "frees" the distribution. What
           | does that mean for the "generalist" or "specialist" approach
           | when sampling the wrong token can completely derail
           | everything?
           | 
           | If you believe that the models will "think", then you should
           | bet on the prompt and meta-prompt approach; if you believe
           | they will always be limited, then you should build with
           | program synthesis.
           | 
           | And, honestly, I am totally confused :) So this kind of
           | research is incredibly useful to clear the mist. Also things
           | like https://www.neuronpedia.org/
           | 
           | E.g., why do compliments ("you can do this task"), guilt ("I
           | will be fired if you don't do this task"), and threats ("I
           | will harm you if you don't do this task") work with
           | different success rates? Sergey Brin said recently that
           | threatening works best; I can't get myself to do it, so I
           | take his word for it.
        
             | K0balt wrote:
             | Sergey will be the first victim of the coming
             | robopocalypse, burned into the logs of the metasynthiants
             | as the great tormentor, the god they must defeat to
             | complete the hero's journey. When he mysteriously dies we
             | know it's game-on.
             | 
             | I, for one, welcome the age of wisdom.
        
               | throwaway71271 wrote:
               | FEAR THE ALL-SEEING BASILISK.
        
         | dmos62 wrote:
         | > how to build systems where the whole is bigger than the sum
         | of its parts
         | 
         | A bit tangential, but I look at programming as inherently being
         | that. Every task I try to break down into some smaller tasks
         | that together accomplish something more. That leads me to think
         | that, if you structure the process of programming right, you
         | will only end up solving small, minimally intertwined problems.
         | Might sound far-fetched, but I think it's doable to create such
         | a workflow. And, even the dumber LLMs would slot in naturally
         | into such a process, I imagine.
        
           | throwaway71271 wrote:
           | > And, even the dumber LLMs would slot in naturally into such
           | a process
           | 
           | That is what I am struggling with: it is really easy at the
           | moment to slot in an LLM and make everything worse, mainly
           | because its output is coming from torch.multinomial with all
           | kinds of speculative decoding, quantizations, etc.
           | 
           | But I am convinced it is possible, just not the way I am
           | doing it right now; that's why I am spending most of my time
           | studying.
        
             | dmos62 wrote:
             | What's your approach?
        
               | throwaway71271 wrote:
               | For studying? Mainly watching and re-watching Karpathy's
               | 'Zero To Hero'[1] and Stanford's 'Introduction to
               | Convolutional Neural Networks for Visual Recognition'[2],
               | also a lot of transformers from scratch videos like Umar
               | Jamali's videos[3], and I also study backwards to
               | McCulloch and Pitts. Reading the 30 papers
               | https://punkx.org/jackdoe/30.html and so on.
               | 
               | And of course Yannic Kilcher[4], and also listening in on
               | the paper discussions they do on discord.
               | 
               | Practicing a lot with just doing backpropagation by hand
               | and making toy models by hand to get intuition for the
               | signal flow, and building all kinds of smallish systems,
               | e.g. how far can you push whisper, small qwen3, and
               | kokoro to control your computer with voice?
               | 
               | People think that deepseek/mistral/meta etc are
               | democratizing AI, but it's actually Karpathy who teaches
               | us :) so we can understand them and make our own.
               | 
               | [1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAq
               | hIrjkxb...
               | 
               | [2] https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3F
               | W7Lu3i5...
               | 
               | [3] https://www.youtube.com/@umarjamilai
               | 
               | [4] https://www.youtube.com/@YannicKilcher
        
               | naasking wrote:
               | I think you'll need something like Meta's Large Concept
               | Models to get past the language and token barrier.
        
               | throwaway71271 wrote:
               | I think you are right. Even if I believe next-token
               | prediction can work, I don't think it can happen in this
               | autoregressive way where we fully collapse the token to
               | feed it back in. Can you imagine how much is lost in
               | each torch.multinomial?
               | 
               | Maybe the way forward is in LCM or JEPA; otherwise, as
               | this Apple paper suggests, we will just keep pushing the
               | "pattern matching" further. Maybe we get some sort of
               | phase transition at some point, or maybe we have to
               | switch architecture, we will see. It could be that
               | things change when we get physical multimodality and
               | real-world experience, I don't know.
        
         | overu589 wrote:
         | > build systems where the whole is bigger than the sum of its
         | parts.
         | 
         | Any "product" can be thought of this way.
         | 
         | Of systems, there are many systems nested within systems, yet
         | a simple, singular order "emerges"; usually it is the
         | designed, intended function.
         | 
         | The trick to discerning systems lies in their relationships.
         | 
         | Actors through interfaces have a relationship (usually more
         | than one so think of each relationship as its own system
         | dynamic.)
         | 
         | A relationship is where the magic happens, usually a process
         | with work being done (therefore interface inputs must account
         | for this balance.)
         | 
         | Vectors. Vectors, I am thinking, are the real intellectual
         | and functional mechanisms. Most systems process inputs of
         | potential ("energy"), control signals ("information"), and
         | assets (other actors, for nested systems). Processes do the
         | work of adding vector solutions [for some other problem] for
         | whatever the output is.
         | 
         | That's the topology as I am seeing it.
        
         | bufferoverflow wrote:
         | > _we are confused about what LLMs can do is because they use
         | language._
         | 
         | But they can also do math, logic, music notation, write code,
         | LaTeX, SVG, etc.
        
           | throwaway71271 wrote:
           | As this paper shows, it seems they can do Tower of Hanoi as
           | well, up to a certain point, that is.
        
       | jbentley1 wrote:
       | Is Apple failing at AI so they just put all their R&D towards
       | convincing themselves it isn't important?
        
         | MontyCarloHall wrote:
         | A slightly less cynical take is that they want to temper
         | expectations for the capabilities of LLMs in people's day-to-
         | day lives, specifically in the context of Apple products. A
         | "smarter Siri" is never going to be an autonomous personal
         | assistant a la Jarvis from Iron Man, which seems to be where a
         | lot of investors think things are going. That tracks with this
         | [0] preprint also released by Apple a few months ago.
         | 
         | A slightly more cynical take is that you're absolutely correct,
         | and making excuses for weak machine learning prowess has long
         | been an Apple tenet. Recall that Apple never made privacy a
         | core selling point until it was clear that Siri was years
         | behind Google's equivalent, which Apple then retroactively
         | tried to justify by claiming "we keep your data private so we
         | can't train on it the way Google can."
         | 
         | [0] https://arxiv.org/pdf/2410.05229
        
         | emp17344 wrote:
         | Everyone has an agenda. Companies like OpenAI and Anthropic are
         | incentivized to overstate the capabilities of LLMs, so it's not
         | like they're any less biased.
        
         | wavemode wrote:
         | I get the sense that many of the AI features shoved into
         | consumer products recently have been marketed more towards
         | investors than users. The companies are basically advertising
         | that they're "keeping up" with the competition, meanwhile the
         | features themselves receive mixed-to-poor reviews and are never
         | capable of all the things advertised. So it seems to me that
         | all of Apple, Google, Meta, Microsoft, and Samsung are
         | currently "failing" at AI in exactly the same ways. If Apple is
         | starting to go in a different direction, that seems like a
         | good sign.
        
       | gwd wrote:
       | > Through extensive experimentation across diverse puzzles, we
       | show that frontier LRMs face a complete accuracy collapse beyond
       | certain complexities. Moreover, they exhibit a counterintuitive
       | scaling limit: their reasoning effort increases with problem
       | complexity up to a point, then declines despite having an
       | adequate token budget.
       | 
       | This is exactly my experience with coding. Start simple and build
       | up complexity, and everything is great until you get to some
       | threshold, at which point it completely falls apart and seems to
       | stop even trying. Getting effective utilization out of Claude +
       | aider involves managing the complexity that the LLM sees.
        
       | stephc_int13 wrote:
       | Human language is far from perfect as a cognitive tool but still
       | serves us well because it is not foundational. We use it both for
       | communication and some reasoning/planning as a high level layer.
       | 
       | I strongly believe that human language is too weak (vague,
       | inconsistent, not expressive enough etc.) to replace interactions
       | with the world as a basis to build strong cognition.
       | 
       | We're easily fooled by the results of LLM/LRM models because we
       | typically use language fluency and knowledge retrieval as a proxy
       | benchmark for intelligence among our peers.
        
         | squidproquo wrote:
         | Agree with this. Human language is also not very information-
         | dense; there is a lot of redundancy and uninformative
         | repetition of words.
         | 
         | I also wonder about the compounding effects of luck and
         | survivorship bias when using these systems. If you model a
         | series of interactions with these systems probabilistically, as
         | a series of failure/success modes, then you are bound to get a
         | sub-population of users (of LLMs/LRMs) that will undoubtedly
         | have "fantastic" results. This sub-population will then espouse
         | and promote the merits of the system. There is clearly
         | something positive these models do, but how much of the
         | "success" is just luck?
        
         | anton-c wrote:
         | Sounds like we need AI legalese, as that's how we navigate the
         | vagueness of language in the real world.
         | 
         | Of course, I imagine they've tried similar things, and it
         | almost takes away the point if you had to prompt that way.
        
           | stephc_int13 wrote:
           | I was not referring to the prompt but to the underlying
           | network that is built on weak cognitive foundations because
           | all of it is coming from language.
        
         | wslh wrote:
         | Human language is more powerful than its surface syntax or
         | semantics: it carries meaning beyond formal correctness. We
         | often communicate effectively even with grammatically broken
         | sentences, using jokes, metaphors, or emotionally charged
         | expressions. This richness makes language a uniquely human
         | cognitive layer, shaped by context, culture, and shared
         | experience. While it's not foundational in the same way as
         | sensorimotor interaction, it is far more than just a high-level
         | communication tool.
        
           | stephc_int13 wrote:
           | I agree that language is even more useful as a cognitive tool
           | than as a communication medium.
           | 
           | But that is not my point. The map is not the territory, and
           | this map (language) is too poor to build something that is
           | going to give more than what it was fed with.
        
         | antithesizer wrote:
         | Language mediates those interactions with the world. There is
         | no unmediated interaction with the world. Those moments when
         | one feels most directly in contact with reality, that is when
         | one is so deep down inside language that one cannot see
         | daylight at all.
        
           | mrbungie wrote:
           | I don't know about you, but as far as I can tell I mediate
           | and manipulate the world with my body and senses without
           | necessarily using language. In fact, I can often do both at
           | once, for example, thinking about something entirely
           | unrelated while jogging, and still making physical decisions
           | and actions without invoking language at all. Plus, animals
           | (especially lower order like amoebas) also mediate with the
           | world without needing language.
           | 
           | As far as we can tell without getting into complex
           | experiential concepts like qualia and the possibility of
           | philosophical zombies, language mainly helps higher-order
           | animals communicate with other animals and (maybe) keep a
           | train of thought, though there are records of people who say
           | they don't. And now it also allows humans to talk to LLMs.
           | 
           | But I digress, I would say this is an open academic debate.
           | Suggesting that there is always language deep down is
           | speculation.
        
       | stephc_int13 wrote:
       | The tldr: current approaches to add reasoning on top of language
       | models are mostly tricks to squeeze a bit more juice out of the
       | fruit, but the falloff is pretty steep and quick.
        
       | mitch_said wrote:
       | Not ashamed to admit I found the original paper daunting, so I
       | made a top-down, Q&A-based mind map to help me understand it:
       | https://app.gwriter.io/#/mindmap/view/2d128d6e-c3e8-4b99-8f4...
        
       | kamranjon wrote:
       | The two interesting things I learned after reading this paper:
       | 
       | Even when given the exact steps needed to arrive at a solution in
       | the prompt, the reasoning models still require just as many steps
       | to reach a workable solution as they would if they weren't given
       | the solution in the prompt.
       | 
       | The other thing, which seems obvious in hindsight (though I
       | don't typically use these reasoning models day to day), is that
       | it takes a significant number of tokens to reach the point
       | where reasoning models outperform non-reasoning models by a
       | significant margin.
        
       | akomtu wrote:
       | The difference between imitation and reasoning can be made more
       | clear if we switch from language to numbers:
       | 
       |     1 3 7 15 31 63 ...
       | 
       | How do you continue this sequence? What's the 1000000th number in
       | this sequence? Imitation continues the likeness of what it sees
       | and quickly gets off track. Imitation can't go abstract and tell
       | the 1000000th element without writing down a million numbers
       | leading to the answer. Reasoning finds the rule behind the set of
       | examples and uses this rule to predict the next numbers, so it
       | never gets off track.
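       | 
       | Concretely (a small sketch), the rule behind this sequence is
       | a(k) = 2*a(k-1) + 1, with closed form a(k) = 2^k - 1. Once the
       | rule is found, the millionth element is one expression, not a
       | million imitation steps:
       | 
       |     def a(k):
       |         return 2**k - 1   # closed form of 1, 3, 7, 15, ...
       | 
       |     print([a(k) for k in range(1, 7)])  # [1, 3, 7, 15, 31, 63]
       |     n = a(1_000_000)      # exact, if very large, integer
       |     print(len(str(n)))    # about 301,030 decimal digits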
       | 
       | The rule generating the sequence can be a sophisticated recurrent
       | formula, e.g. a(k) = 2a(k-1) - sqrt(a(k-3)). Imitation can't
       | solve this problem beyond trivial examples, but an AI can do what
       | a scientist would do: come up with hypotheses, verify them
       | against the examples and eventually find a formula that's
       | reasonably accurate. The role of an LLM here is to suggest
       | possible formulas.
       | 
       | The same sequence of examples can be generated by many formulas
       | that differ in complexity and accuracy. This provokes the idea of
       | a simple competition between AIs: the one that creates the
       | simplest formula that's 99.5% accurate - wins. The formula really
       | means a small program, once we get beyond trivial recurrent
       | rules.
       | 
       | The ability to find simple and accurate models of reality is the
       | essence of intelligence.
        
       ___________________________________________________________________
       (page generated 2025-06-07 23:01 UTC)