[HN Gopher] The Illusion of Thinking: Understanding the Limitati...
___________________________________________________________________
The Illusion of Thinking: Understanding the Limitations of
Reasoning LLMs [pdf]
Author : amrrs
Score : 316 points
Date : 2025-06-06 18:18 UTC (1 day ago)
(HTM) web link (ml-site.cdn-apple.com)
(TXT) w3m dump (ml-site.cdn-apple.com)
| behnamoh wrote:
| Okay Apple, you got my attention. But I'm a strong proponent of
| "something is better than nothing" philosophy--even if
| OpenAI/Google/etc. are building reasoning models with the
| limitations that you describe, they still represent huge progress
| compared to what we had not long ago. Meanwhile you're not even
| trying.
|
| It's so easy to criticize the works of others and not deliver
| anything. Apple--be Sam in Game of Thrones: "I'm tired of reading
| about the achievements of better men".
| suddenlybananas wrote:
| I think you're mistaking the work of researchers who work at
| Apple for the particular investment decisions of Apple over
| the past few years.
|
| >It's so easy to criticize the works of others and not deliver
| anything. Apple--be Sam in Game of Thrones: "I'm tired of
| reading about the achievements of better men".
|
| This is a patently absurd thing to write about a research
| paper.
| bwfan123 wrote:
| There is enough hype already - with AGI being promised as
| imminent.
|
| This work balances the hype and shows fundamental limitations,
| so the AI hypesters are checked.
|
| Why be salty?
| ivape wrote:
| This is easily explained by accepting that there is no such thing
| as LRMs. LRMs are just LLMs that iterate on their own answers more
| (or provide themselves more context information of a certain type).
| The reasoning loop on an "LRM" will be equivalent to asking a
| regular LLM to "refine" its own response, or "consider"
| additional context of a certain type. There is no such thing as
| _reasoning_ basically, as it was always a method to "fix"
| hallucinations or provide more context automatically, nothing
| else. These big companies baked in one of the hackiest prompt
| engineering tricks that your typical enthusiast figured out long
| ago and managed to brand it and profit off it. The craziest part
| about this was that DeepSeek was able to cause a multi-billion-
| dollar drop and pump of AI stocks with this _one trick_. Crazy times.
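|
| A minimal sketch of the refine loop being described, assuming a
| hypothetical llm() completion function rather than any particular
| vendor API:
|
|     def llm(prompt: str) -> str:
|         """Hypothetical single completion call to some LLM."""
|         raise NotImplementedError
|
|     def answer_with_refinement(question: str, rounds: int = 3) -> str:
|         draft = llm(question)          # first attempt
|         for _ in range(rounds):        # the "reasoning" loop
|             # feed the model its own draft back as extra context
|             draft = llm(f"Question: {question}\n"
|                         f"Current answer: {draft}\n"
|                         "Refine this answer and fix any mistakes.")
|         return draft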
| AlienRobot wrote:
| Is that what "reasoning" means? That sounds pretty ridiculous.
|
| I've thought before that AI is as "intelligent" as your
| smartphone is "smart," but I didn't think "reasoning" would be
| just another buzzword.
| ngneer wrote:
| I am not too familiar with the latest hype, but "reasoning"
| has a very straightforward definition in my mind. For
| example, can the program in question derive new facts from
| old ones in a logically sound manner? Things like applying
| modus ponens: (A and (A => B)) => B. Or, all men are mortal and
| Socrates is a man, and therefore Socrates is mortal. If the
| program cannot deduce new facts, then it is not reasoning, at
| least not by my definition.
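|
| A toy version of that definition, deriving a new fact from stored
| ones via modus ponens (the facts and rules here are purely
| illustrative):
|
|     # known facts plus one implication, stored as (A, B) for A => B
|     facts = {"Socrates is a man"}
|     rules = [("Socrates is a man", "Socrates is mortal")]
|
|     def derive(facts, rules):
|         # modus ponens: from A and A => B, conclude B
|         new = {b for (a, b) in rules if a in facts}
|         return facts | new
|
|     print(derive(facts, rules))  # adds "Socrates is mortal"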
| dist-epoch wrote:
| When people say LLMs can't do X, I like to try it.
| Q: Complete 3 by generating new knowledge: 1. today
| is warm 2. cats likes warm temperatures 3.
|
| A: Therefore, a cat is likely to be enjoying the weather
| today.
|
| Q: does the operation to create new knowledge you did have
| a specific name?
|
| A: ... Deductive Reasoning
|
| Q: does the operation also have a Latin name?
|
| A: ... So, to be precise, you used a syllogismus
| (syllogism) that takes the form of Modus Ponens to make a
| deductio (deduction).
|
| https://aistudio.google.com/app/prompts/1LbEGRnzTyk-2IDdn53
| t...
|
| People then say "of course it could do that, it just
| pattern matched a Logic textbook. I meant in a real
| example, not an artificially constructed one like this one.
| In a complex scenario LLMs obviously can't do Modus Ponens."
| ngneer wrote:
| I do not know whether the state of the art is able to
| reason or not. The textbook example you gave is
| admittedly not very interesting. What you are hearing
| from people is that parroting is not reasoning, which is
| true.
|
| I wonder if the state of the art can reason its way
| through the following:
|
| "Adam can count to 14000. Can Adam count to 13500?"
|
| The response needs to be affirmative for every X1 and X2
| such that X2 <= X1. That is reasoning. Anything else is
| not reasoning.
|
| The response when X2 > X1 is less interesting. But, as a
| human it might be "Maybe, if Adam has time" or "Likely,
| since counting up to any number uses the same algorithm"
| or "I don't know".
|
| Seems ChatGPT can cope with this. Other examples are easy
| to come up with, too. There must be benchmarks for this.
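|
| Such a benchmark could be sketched roughly like this (ask() is a
| hypothetical helper that queries the model and returns its yes/no
| answer; nothing here comes from the paper):
|
|     import random
|
|     def ask(x1: int, x2: int) -> bool:
|         """Hypothetical: 'Adam can count to x1. Can he count to x2?'"""
|         raise NotImplementedError
|
|     def affirmative_rate(trials: int = 100) -> float:
|         hits = 0
|         for _ in range(trials):
|             x1 = random.randint(1, 100_000)
|             x2 = random.randint(1, x1)   # guarantees x2 <= x1
|             hits += ask(x1, x2)          # should always be True
|         return hits / trials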
|
| Input to ChatGPT:
|
| "Adam can lift 1000 pounds of steel. Can Adam lift 1000
| pounds of feathers?"
|
| Output from ChatGPT:
|
| "1,000 pounds of feathers would be much easier for Adam
| to lift compared to 1,000 pounds of steel, because
| feathers are much lighter and less dense."
|
| So, maybe not there yet...
| dist-epoch wrote:
| > "Adam can lift 1000 pounds of steel. Can Adam lift 1000
| pounds of feathers?"
|
| Worked for me:
|
| https://chatgpt.com/share/6844813a-6e4c-8006-b560-c0be223
| eeb...
|
| gemma3-27b, a small model, had an interesting take:
|
| > This is a classic trick question!
|
| > While Adam can lift 1000 pounds, no, he likely cannot
| lift 1000 pounds of feathers.
|
| > Volume: Feathers take up a huge amount of space for
| their weight. 1000 pounds of feathers would be an
| enormous volume - likely far too large for Adam to even
| get under, let alone lift. He'd be trying to lift a
| massive, bulky cloud.
|
| > Practicality: Even if he could somehow get it under a
| barbell, the feathers would shift and compress, making a
| secure grip impossible.
|
| > The question plays on our understanding of weight
| versus volume. It's designed to make you focus on the
| "1000 pounds" and forget about the practicalities of
| lifting something so voluminous.
|
| Tried the counting question on the smallest model,
| gemma-3n-34b, which can run on a smartphone:
|
| > Yes, if Adam can count to 14000, he can definitely
| count to 13500. Counting to a smaller number is a basic
| arithmetic operation. 13500 is less than 14000.
| ngneer wrote:
| Thanks for trying these out :). Highlights the often
| subtle difference between knowing the answer and deducing
| the answer. Feathers could be ground into a pulp and
| condensed, too. I am not trying to be clever, just seems
| like the response is a canned answer.
| JSR_FDED wrote:
| A reasoning model is an LLM that has had additional training
| phases that reward problem solving abilities. (But in a black
| box way - it's not clear if the model is learning actual
| reasoning or better pattern matching, or memorization, or
| heuristics... maybe a bit of everything).
| meroes wrote:
| Yep. This is exactly the conclusion I reached as an RLHF'er.
| Reasoning/LRM/SxS/CoT is "just" more context. There never was
| reasoning. But of course, more context can be good.
| Too wrote:
| The million dollar question is how far one can get with this
| trick. Maybe this is exactly how our own brains operate? If
| not, what fundamental building blocks are missing to get there?
| bwfan123 wrote:
| > If not, what fundamental building blocks are missing to get
| there
|
| If I were to guess, the missing building block is the ability
| to abstract - which is the ability to create a symbol to
| represent something. A concrete example of abstraction is seen
| in the axioms of the lambda calculus: 1) the ability to posit a
| variable, 2) the ability to define a function using said
| variable, and 3) the ability to apply functions to things
| (sketched below). Abstraction arises from a process in the brain
| which we have not understood yet and could be outside of
| computation as we know it, per [1].
|
| [1] https://www.amazon.com/Emperors-New-Mind-Concerning-
| Computer...
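|
| Written out in Python's own lambda syntax, the three axioms look
| roughly like this (a toy sketch, nothing more):
|
|     # 1) posit a variable: x, bound by the abstraction below
|     # 2) define a function using said variable (abstraction)
|     double = lambda x: x + x
|     # 3) apply the function to things (application)
|     print(double(21))  # 42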
| JusticeJuice wrote:
| Their finding that LLMs work best at simple tasks, LRMs work
| best at medium-complexity tasks, and neither succeeds at
| genuinely complex tasks is good to know.
| cubefox wrote:
| Not sure whether I sense sarcasm.
| nialv7 wrote:
| I've seen this too often, papers that ask questions they don't
| even bother to properly define.
|
| > Are these models capable of generalizable reasoning, or are
| they leveraging different forms of pattern matching?
|
| Define reasoning, define generalizable, define pattern matching.
|
| For additional credit, after you have done so, show humans are
| capable of what you just defined as generalizable reasoning.
| NitpickLawyer wrote:
| > show humans are capable of what you just defined as
| generalizable reasoning.
|
| I would also add "and plot those capabilities on a curve". My
| intuition is that the SotA models are already past the median
| human abilities in _a lot_ of areas.
| crvdgc wrote:
| In the context of this paper, I think "generalizable reasoning"
| means finding a method to solve the puzzle and thus being
| able to execute that method on puzzle instances of arbitrary
| complexity.
| beneboy wrote:
| This kind of explains why Claude will find the right solution,
| but then the more it thinks and keeps "improving" the more over-
| engineered (and sometimes wrong) the solution is. Interesting to
| see this coming up in formal research.
| bicepjai wrote:
| The study challenges the assumption that more "thinking" or
| longer reasoning traces necessarily lead to better problem-
| solving in LRMs
| bayindirh wrote:
| As a test, I asked Gemini 2.5 Flash and Gemini 2.5 Pro to
| decode a single BASE64 string.
|
| Flash answered _correctly_ in ~2 seconds, at most. Pro answered
| _very wrongly_ after thinking and elaborating for ~5 minutes.
|
| Flash was also giving a wrong answer for the same string in the
| past, but it improved.
|
| Prompt was the same: "Hey, can you decode $BASE64_string?"
|
| I have no further comments.
| rafterydj wrote:
| Well, that's not a very convincing argument. That's just a
| failure to recognize when the use of a tool - a Base64 decoder -
| is needed, not a reasoning problem at all, right?
| BoorishBears wrote:
| That's not really a cop out here: both models had access to
| the same tools.
|
| Realistically there are many problems that non-reasoning
| models do better on, especially when the answer cannot be
| reached by a thought process: like recalling internal
| knowledge.
|
| You can try to teach the model the concept of a problem
| where thinking will likely steer it away from the right
| answer, but at some point it becomes like the halting
| problem... how does the model reliably think its way into
| the realization that a given problem is too complex to be
| thought out?
| bayindirh wrote:
| I don't know whether Flash uses a tool or not, but it
| answers pretty quickly. However, Pro opts to use its own
| reasoning, not a tool. When I look at the reasoning train,
| it pulls and pulls knowledge endlessly, refining that
| knowledge and drifting away.
| Jensson wrote:
| Translating to BASE64 is a good test to see how well it
| works as a language translator without changing things,
| because it's the same skill for an AI model.
|
| If the model changes things it means it didn't really
| capture the translation patterns for BASE64, so then who
| knows what it will miss when translating between languages
| if it can't even do BASE64?
| layer8 wrote:
| A moderately smart human who understands how Base64 works
| can decode it by hand without external tools other than pen
| and paper. Coming up with the exact steps to perform is a
| reasoning problem.
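|
| The pen-and-paper steps amount to: map each character to its
| 6-bit index in the Base64 alphabet, concatenate the bits, and
| regroup them into 8-bit bytes. A rough sketch of the same steps
| in code (standard alphabet only, ignoring edge cases):
|
|     ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
|                 "abcdefghijklmnopqrstuvwxyz0123456789+/")
|
|     def decode_base64(s: str) -> bytes:
|         # each character encodes 6 bits of the original data
|         bits = "".join(format(ALPHABET.index(c), "06b")
|                        for c in s.rstrip("="))
|         # regroup into whole bytes, dropping leftover padding bits
|         usable = len(bits) - len(bits) % 8
|         return bytes(int(bits[i:i + 8], 2)
|                      for i in range(0, usable, 8))
|
|     print(decode_base64("aGVsbG8="))  # b'hello'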
| actinium226 wrote:
| Man, remember when everyone was like 'AGI just around the
| corner!' Funny how well the Gartner hype cycle captures these
| sorts of things
| bayindirh wrote:
| They're similar to self-driving vehicles. Both are around the
| corner, but neither can negotiate the turn.
| einrealist wrote:
| All that to keep the investment pyramid schemes going.
| kfarr wrote:
| Waymo's pretty good at unprotected lefts
| bayindirh wrote:
| Waymo is pretty good at (a finite number of) unprotected
| lefts, and this doesn't count as "level 5 autonomous
| driving".
| hskalin wrote:
| And commerically viable nuclear fusion
| mgiampapa wrote:
| I harvest fusion energy every single day... It's just there
| in the sky, for free!
| nmca wrote:
| I saw your comment and counted -- in May I took a Waymo
| thirty times.
| bayindirh wrote:
| Waymo is a popular example in self-driving car arguments, and
| they do well.
|
| However, Waymo is the Deep Blue of self-driving cars. Doing
| _very well_ in a _closed space_. As a result of this
| geofencing, they have effectively exhausted their search
| space, hence they work well as a consequence of lack of
| surprises.
|
| AI works well when search space is limited, but _General_
| AI in any category needs to handle a vastly larger search
| space, and they fall flat.
|
| At the end of the day, AI is informed search. They get
| inputs, and generate a suitable output as deemed by their
| trainers.
| yahoozoo wrote:
| We will be treating LLMs "like a junior developer" forever.
| JKCalhoun wrote:
| And I'm fine with that.
| sneak wrote:
| Even if they never get better than they are today (unlikely)
| they are still the biggest change in software development and
| the software development industry in my 28 year career.
| tonyhart7 wrote:
| I think we're just at around 80% of progress
|
| the easy part is done but the hard part is so hard it takes
| years to progress
| georgemcbay wrote:
| > the easy part is done but the hard part is so hard it takes
| years to progress
|
| There is also no guarantee of continued progress to a
| breakthrough.
|
| We have been through several "AI Winters" before where
| promising new technology was discovered and people in the
| field were convinced that the breakthrough was just around
| the corner and it never came.
|
| LLMs aren't quite the same situation as they do have some
| undeniable utility to a wide variety of people even without
| AGI springing out of them, but the blind optimism that surely
| progress will continue at a rapid pace until the assumed
| breakthrough is realized feels pretty familiar to the hype
| cycle preceding past AI "Winters".
| Swizec wrote:
| > We have been through several "AI Winters" before
|
| Yeah, remember when we spent 15 years (~2000 to ~2015)
| calling it "machine learning" because AI was a bad word?
|
| We use _so much_ AI in production every day but nobody
| notices because as soon as a technology becomes useful, we
| stop calling it AI. Then it's suddenly "just face
| recognition" or "just product recommendations" or "just
| [plane] autopilot" or "just adaptive cruise control" etc
|
| You know a technology isn't practical yet because it's
| still being called AI.
| blks wrote:
| I don't think there's any "AI" in aircraft autopilots.
| withinboredom wrote:
| AI encompasses a wide range of algorithms and techniques;
| not just LLMs or neural nets. Also, it is worth pointing
| out that the definition of AI has changed drastically
| over the last few years and narrowed pretty
| significantly. If you're viewing the definition from the
| 80-90's, most of what we call "automation" today would
| have been considered AI.
| Jensson wrote:
| Autopilots were a thing before computers were a thing,
| you can implement one using mechanics and control theory.
| So no, traditional autopilots are not AI under any
| reasonable definition, otherwise every single machine we
| build would be considered AI, as almost all machines have
| some form of control system in them - is your
| microwave clock an AI, for example?
|
| So I'd argue any algorithm that comes from control theory
| is not AI, those are just basic old dumb machines. You
| can't make planes without control theory; humans can't
| keep a plane steady without it, so the Wright brothers
| adding this to their plane is why they succeeded in making a
| flying machine.
|
| So if autopilots are AI then the Wright brothers
| developed an AI to control their plane. I don't think
| anyone sees that as AI, not even at the time they did the
| first flight.
| trc001 wrote:
| Uh, the Bellman equation was first used for control
| theory and is the foundation of modern reinforcement
| learning... so wouldn't that imply LLMs "come from"
| control theory?
| roenxi wrote:
| What do you think has changed? The situation is still about as
| promising for AGI in a few years - if not more so. Papers like
| this are the academics mapping out where the engineering
| efforts need to be directed to get there and it seems to be a
| relatively small number of challenges that are easier than the
| ones already overcome - we know machine learning can solve
| Towers of Hanoi, for example. It isn't fundamentally
| complicated like Baduk is. The next wall to overcome is more of
| a low fence.
|
| Besides, AI already passes the Turing test (or at least, is
| most likely to fail because it is too articulate and
| reasonable). There is a pretty good argument we've already
| achieved AGI and now we're working on achieving human- and
| superhuman-level intelligence in AGI.
| MoonGhost wrote:
| > What do you think has changed? The situation is still about
| as promising for AGI in a few years - if not more so
|
| It's better today. Hoping that LLMs can get us to AGI in one
| hop was naive. Depending on the definition of AGI we might be
| there already. But for superhuman level in all possible tasks
| there are many steps to be done. The obvious way is to find a
| solution for each type of task. We already have one for math
| calculations: using tools. Many other types can be
| solved the same way. After a while we'll gradually get to a
| well-rounded 'brain', or model(s) + support tools.
|
| So far the future looks bright: there is progress, and problems,
| but no deadlocks.
|
| PS: Turing test is a <beep> nobody seriously talks about
| today.
| latchup wrote:
| To be fair, the technology sigmoid curve rises fastest just
| before its inflection point, so it is hard to predict at what
| point innovation slows down due to its very nature.
|
| The first Boeing 747 was rolled out in 1968, only 65 years
| after the first successful heavier-than-air flight. If you told
| people back then that not much will fundamentally change in
| civil aviation over the next 57 years, no one would have
| believed you.
| PantaloonFlames wrote:
| And not just in aviation. Consider what aviation did to make
| the world smaller. Huge 2nd order changes. The COVID-19
| pandemic would not have happened the way it did, if there
| were no Boeing or Airbus.
|
| Big hard-to-predict changes ahead.
| brookst wrote:
| ...but that was, like, two years ago? If we go from GPT2 to AGI
| in ten years that will still feel insanely fast.
| mirekrusin wrote:
| I remember "stochastic parrot" and people saying it's fancy
| markov chain/dead end. You don't hear them much after roughly
| agentic coding appeared.
| Marazan wrote:
| Spicy autocomplete is still spicy autocomplete
| mirekrusin wrote:
| I'm not sure a system capable of, e.g., reasoning over images
| deserves this label anymore.
| mrbungie wrote:
| The thing is "spicy" or "glorified" autocomplete are not
| actually bad labels, they are autocomplete machines that
| are very good up to the point of convincing people that
| they think.
| PantaloonFlames wrote:
| Yours seems like a c.2023 perspective of coding assistants.
| These days it's well beyond autocomplete and "generate a
| function that returns the numbers from the Fibonacci
| sequence."
|
| But I would think that would be well understood here.
|
| How can you reduce what is currently possible to spicy
| autocomplete? That seems pretty dismissive, so much so that
| I wonder if it is motivated reasoning on your part.
|
| I'm not saying it's good or bad; I'm just saying the
| capability is well beyond auto complete.
| otabdeveloper4 wrote:
| AGI has always been "just around the corner", ever since
| computers were invented.
|
| Some problems have become more tractable (e.g. language
| translation), mostly by lowering our expectations of what
| constitutes a "solution", but AGI is nowhere nearer. AGI is a
| secular millenarian religion.
| naasking wrote:
| Interpreting "just around the corner" as "this year" sounds
| like your error. Most projections are years out, at least.
| IshKebab wrote:
| Yeah it's already been 2 1/2 years! How long does it take to
| develop artificial life anyway? Surely no more than 3 years? I
| demand my money back!
| alansammarone wrote:
| I have a somewhat similar point of view to the one voiced by
| other people, but I like to think about it slightly differently,
| so I'll chime in - here's my take (although, admittedly, I'm
| operating with a quite small reasoning budget (5 minutes tops)):
|
| Time and again, for centuries - with the pace picking up
| dramatically in recent decades - we thought we were special and
| we were wrong. Sun does not rotate around the earth, which is a
| pretty typical planet, with the same chemical composition of any
| other planet. All of a sudden we're not the only ones who could
| calculate, then solve symbolic equations, then play chess, then
| compose music, then talk, then reason (up to a point, for some
| definition of "reason"). You get my point.
|
| And when we were not only matched, but dramatically surpassed in
| these tasks (and not a day earlier), we concluded that they
| weren't _really_ what made us special.
|
| At this point, it seems to me reasonable to assume we're _not_
| special, and the onus should be on anybody claiming that we are
| to at least attempt to mention in passing what is the secret
| sauce that we have (even if we can't quite say what it is without
| handwaving or using concepts that by definition cannot be
| defined - "qualia is the indescribable feeling of red - its
| redness (?)").
|
| Oh, and sorry, I could never quite grasp what "sentient" is
| supposed to mean - would we be able to tell we're not sentient if
| we weren't?
| ivape wrote:
| I can give you a pretty wild explanation. Einstein was a freak
| of nature. Nature just gave him that "something" to figure out
| the laws of the universe. I'm avoiding the term God as to not
| tickle anyone incorrectly. Seriously, explain what schooling
| and environment gets you that guy. So, to varying degrees, all
| output is from the universe. It's hard for the ego to accept,
| surely we earned everything we ever produced ...
|
| Spooky stuff.
| keiferski wrote:
| This analogy doesn't really work, because the former examples
| are ones in which humanity discovered that it existed in a
| larger world.
|
| The recent AI example is humanity building, or attempting to
| build, a tool complex enough to mimic a human being.
|
| If anything, you could use recent AI developments as proof of
| humanity's uniqueness - what other animal is creating things of
| such a scale and complexity?
| suddenlybananas wrote:
| I don't see how heliocentrism or calculators have any bearing
| on the uniqueness of humans.
| curious_cat_163 wrote:
| > Rather than standard benchmarks (e.g., math problems), we adopt
| controllable puzzle environments that let us vary complexity
| systematically
|
| Very clever, I must say. Kudos to folks who made this particular
| choice.
|
| > we identify three performance regimes: (1) low complexity tasks
| where standard models surprisingly outperform LRMs, (2) medium-
| complexity tasks where additional thinking in LRMs demonstrates
| advantage, and (3) high-complexity tasks where both models
| experience complete collapse.
|
| This is fascinating! We need more "mapping" of regimes like this!
|
| What I would love to see (not sure if someone on here has seen
| anything to this effect) is how these complexity regimes might
| map to economic value of the task.
|
| For that, the eval needs to go beyond puzzles but the complexity
| of the tasks still need to be controllable.
| pegasus wrote:
| Is (1) that surprising? If I ask someone a simple question but
| tell them to "think really hard about it", they'll be more
| likely to treat it as a trick question and look for a non-
| obvious answer. Overthinking it, basically.
| 8bitsrule wrote:
| Fusion has been 25 years away for all of my life.
| sneak wrote:
| Fusion is net positive energy now; that happened in 2022
| (+54%).
|
| In 2025 they got a 313% gain (4.13 output factor).
|
| Fusion is actually here and working. It's not cost effective
| yet but to pretend there has been no progress or achievements
| is fundamentally false.
| oneshtein wrote:
| It will be cost effective in just 25 years.
| sitkack wrote:
| Negative Negs spit out low-effort snark; they said the same
| thing about solar, electric cars, even multicore, jit, open
| source. Thanks for refuting them, the forum software itself
| should either quarantine the response or auto respond before
| the comment is submitted. These people don't build the
| future.
|
| Fusion News, May 28th, 2025
| https://www.youtube.com/watch?v=1YHcI-SfKx8
| lrhegeba wrote:
| It isn't when you look at Q total. Total energy input for all
| needed support systems versus energy produced. See
| https://en.wikipedia.org/wiki/Fusion_energy_gain_factor for
| more details
| benlivengood wrote:
| These are the kind of studies that make so much more sense than
| the "LLMs can't reason because of this ideological argument or
| this one anecdote" posts/articles. Keep 'em coming!
|
| And also; the frontier LLMs blow older LLMs out of the water.
| There is continual progress and this study would have been
| structured substantially the same 2 years ago with much smaller N
| on the graphs because the regimes were much tinier then.
| antics wrote:
| I think the intuition the authors are trying to capture is that
| they believe the models are omniscient, but also dim-witted. And
| the question they are collectively trying to ask is whether this
| will continue forever.
|
| I've never seen this question quantified in a really compelling
| way, and while interesting, I'm not sure this PDF succeeds, at
| least not well-enough to silence dissent. I think AI maximalists
| will continue to think that the models are in fact getting less
| dim-witted, while the AI skeptics will continue to think these
| apparent gains are in fact entirely a byproduct of "increasing"
| "omniscience." The razor will have to be a lot sharper before
| people start moving between these groups.
|
| But, anyway, it's still an important question to ask, because
| omniscient-yet-dim-witted models terminate at "superhumanly
| assistive" rather than "Artificial Superintelligence", which in
| turn economically means "another bite at the SaaS apple" instead
| of "phase shift in the economy." So I hope the authors will
| eventually succeed.
| sitkack wrote:
| There is no reason that omniscient-yet-dim-witted has to
| plateau at human intelligence.
| antics wrote:
| I am not sure if you mean this to refute something in what
| I've written but to be clear I am not arguing for or against
| what the authors think. I'm trying to state why I think there
| is a disconnect between them and more optimistic groups that
| work on AI.
| drodgers wrote:
| I think that commenter was disagreeing with this line:
|
| > because omniscient-yet-dim-witted models terminate at
| "superhumanly assistive"
|
| It might be that with dim wits + enough brute force
| (knowledge, parallelism, trial-and-error, specialisation,
| speed) models could still substitute for humans and
| transform the economy in short order.
| antics wrote:
| Sorry, I can't edit it any more, but what I was trying to
| say is that if the authors are correct, that this
| distinction is philosophically meaningful, then that is
| the conclusion. If they are not correct, then all their
| papers on this subject are basically meaningless.
| Byamarro wrote:
| And we have a good example of a dimwitted, brute-force
| process creating intelligent designs - evolution.
| drodgers wrote:
| Also corporations, governments etc. - they're capable of
| things that none of the individuals could do alone.
| drodgers wrote:
| > I think AI maximalists will continue to think that the models
| are in fact getting less dim-witted
|
| I'm bullish (and scared) about AI progress precisely because I
| think they've only gotten a little less dim-witted in the last
| few years, but their practical capabilities have improved a
| _lot_ thanks to better knowledge, taste, context, tooling etc.
|
| What scares me is that I think there's a reasoning/agency
| capabilities overhang. ie. we're only one or two breakthroughs
| away from something which is both kinda omniscient (where we
| are today), and able to out-think you very quickly (if only
| through dint of applying parallelism to actually competent
| outcome-modelling and strategic decision making).
|
| That combination is terrifying. I don't think enough people
| have really imagined what it would mean for an AI to be able to
| out-strategise humans in the same way that they can now -- say
| -- out-poetry humans (by being both decent in terms of quality
| and _super_ fast). It's like when you're speaking to someone
| way smarter than you and you realise that they're 6 steps
| ahead, and actively shaping your thought process to guide you
| where they want you to end up. At scale. For everything.
|
| This exact thing (better reasoning + agency) is also the top
| priority for all of the frontier researchers right now (because
| it's super useful), so I think a breakthrough might not be far
| away.
|
| Another way to phrase it: I think today's LLMs are about as
| good at snap judgements in most areas as the best humans
| (probably much better at everything that rhymes with inferring
| vibes from text), but they kinda suck at:
|
| 1. Reasoning/strategising step-by-step for very long periods
|
| 2. Snap judgements about reasoning or taking strategic actions
| (in the way that expert strategic humans don't actually need to
| think through their actions step-by-step very often - they've
| built intuition which gets them straight to the best answer 90%
| of the time)
|
| Getting good at the long range thinking might require more
| substantial architectural changes (eg. some sort of separate
| 'system 2' reasoning architecture to complement the already
| pretty great 'system 1' transformer models we have). OTOH, it
| might just require better training data and algorithms so that
| the models develop good enough strategic taste and agentic
| intuitions to get to a near-optimal solution quickly before
| they fall off a long-range reasoning performance cliff.
|
| Of course, maybe the problem is really hard and there's no easy
| breakthrough (or it requires 100,000x more computing power than
| we have access to right now). There's no certainty to be found,
| but a scary breakthrough definitely seems possible to me.
| sitkack wrote:
| I think you are right, and that the next step function can be
| achieved using the models we have, either by scaling the
| inference, or changing the way inference is done.
| danielmarkbruce wrote:
| People are doing all manner of very sophisticated inferency
| stuff now - it just tends to be extremely expensive for now
| and... people are keeping it secret.
| Jensson wrote:
| If it was good enough to replace people then it wouldn't
| be too expensive, they would have launched it and
| replaced a bunch of people and made trillions of dollars
| by now.
|
| So at best their internal models are still just
| performance multipliers unless some breakthrough happened
| very recently, it might be a bigger multiplier but that
| still keeps humans with jobs etc and thus doesn't
| revolutionize much.
| imiric wrote:
| > I think the intuition the authors are trying to capture is
| that they believe the models are omniscient, but also dim-
| witted.
|
| We keep assigning adjectives to this technology that
| anthropomorphize the neat tricks we've invented. There's
| nothing "omniscient" or "dim-witted" about these tools. They
| have no wit. They do not think or reason.
|
| All Large "Reasoning" Models do is generate data that they use
| as context to generate the final answer. I.e. they do real-time
| tuning based on synthetic data.
|
| This is a neat trick, but it doesn't solve the underlying
| problems that plague these models like hallucination. If the
| "reasoning" process contains garbage, gets stuck in loops,
| etc., the final answer will also be garbage. I've seen sessions
| where the model approximates the correct answer in the first
| "reasoning" step, but then sabotages it with senseless "But
| wait!" follow-up steps. The final answer ends up being a
| mangled mess of all the garbage it generated in the "reasoning"
| phase.
|
| The only reason we keep anthropomorphizing these tools is
| because it makes us feel good. It's wishful thinking that
| markets well, gets investors buzzing, and grows the hype
| further. In reality, we're as close to artificial intelligence
| as we were a decade ago. What we do have are very good pattern
| matchers and probabilistic data generators that can leverage
| the enormous amount of compute we can throw at the problem.
| Which isn't to say that this can't be very useful, but
| ascribing human qualities to it only muddies the discussion.
| antics wrote:
| I am not sure we are on the same page that the point of my
| response is that this paper is not enough to prevent exactly
| the argument you just made.
|
| In any event, if you want to take umbrage with this paper, I
| think we will need to back up a bit. The authors use a
| mostly-standardized definition of "reasoning", which is
| widely-accepted enough to support not just one, but several
| of their papers, in some of the best CS conferences in the
| world. I actually think you are right that it is reasonable
| to question this definition (and some people do), but I think
| it's going to be really hard for you to start that discussion
| here without (1) saying what your definition specifically is,
| and (2) justifying why its better than theirs. Or at the very
| least, borrowing one from a well-known critique like, _e.g._
| , Gebru's, Bender's, _etc_.
| Kon5ole wrote:
| >They have no wit. They do not think or reason.
|
| Computers can't think and submarines can't swim.
| Jensson wrote:
| But if you need a submarine that can swim as agilely as a
| fish then we still aren't there yet; fish are far superior
| to submarines in many ways. So submarines might be faster
| than fish, but there are so many maneuvers that fish can do
| that the submarine can't. It's the same here with
| thinking.
|
| So just like computers are better than humans at multiplying
| numbers, there are still many things we need human
| intelligence for even in today's era of LLMs.
| Kon5ole wrote:
| The point here (which is from a quote by Dijkstra) is
| that if the desired result is achieved (movement through
| water) it doesn't matter if it happens in a different way
| than we are used to.
|
| So if an LLM generates working code, correct
| translations, valid points relating to complex matters
| and so on it doesn't matter if it does so by thinking or
| by some other mechanism.
|
| I think that's an interesting point.
| Jensson wrote:
| > if the desired result is achieved (movement through
| water) it doesn't matter if it happens in a different way
| than we are used to
|
| But the point is that the desired result isn't achieved,
| we still need humans to think.
|
| So we still need a word for what humans do that is
| different from what LLM does. If you are saying there is
| no difference then how do you explain the vast difference
| in capability between humans and LLM models?
|
| Submarines and swimming is a great metaphor for this,
| since submarines clearly don't swim and thus have very
| different abilities in water; they're way better in some ways
| but way worse in others. So using that metaphor it's
| clear that LLM "thinking" cannot be described with the
| same words as human thinking since it's so different.
| Kon5ole wrote:
| >If you are saying there is no difference then how do you
| explain the vast difference in capability between humans
| and LLM models?
|
| No I completely agree that they are different, like
| swimming and propulsion by propellers - my point is that
| the difference may be irrelevant in many cases.
|
| Humans haven't been able to beat computers in chess since
| the 90s, long before LLM's became a thing. Chess engines
| from the 90s were not at all "thinking" in any sense of
| the word.
|
| It turns out "thinking" is not required in order to win
| chess games. Whatever mechanism a chess engine uses gets
| better results than a thinking human does, so if you want
| to win a chess game, you bring a computer, not a human.
|
| What if that also applies to other things, like
| translation of languages, summarizing complex texts,
| writing advanced algorithms, realizing implications from
| a bunch of seemingly unrelated scientific papers, and so
| on. Does it matter that there was no "thinking" going on,
| if it works?
| jplusequalt wrote:
| >So if an LLM generates working code
|
| It matters when code bases become hard to parse because
| the engineers throwing shit together with Cursor have
| made an ungrokkable ball of shit.
| naasking wrote:
| "Can't" is a pretty strong word, effectively entailing
| "never". Never is a long time to believe computers can't
| think.
| tim333 wrote:
| >There's nothing "omniscient" or "dim-witted" about these
| tools
|
| I disagree in that that seems quite a good way of describing
| them. All language is a bit inexact.
|
| Also I don't buy we are no closer to AI than ten years ago -
| there seems to be a lot going on. Just because LLMs are limited
| doesn't mean we can't find or add other algorithms - I mean
| look at alphaevolve for example https://www.technologyreview.
| com/2025/05/14/1116438/google-d...
|
| >found a faster way to solve matrix multiplications--a
| fundamental problem in computer science--beating a record
| that had stood for more than 50 years
|
| I figure it's hard to argue that that is not at least
| somewhat intelligent?
| imiric wrote:
| > I figure it's hard to argue that that is not at least
| somewhat intelligent?
|
| The fact that this technology can be very useful doesn't
| imply that it's intelligent. My argument is about the
| language used to describe it, not about its abilities.
|
| The breakthroughs we've had are because there is a lot of
| utility from finding patterns in data which humans aren't
| very good at. Many of our problems can be boiled down to
| this task. So when we have vast amounts of data and compute
| at our disposal, we can be easily impressed by results that
| seem impossible for humans.
|
| But this is not intelligence. The machine has no semantic
| understanding of what the data represents. The algorithm is
| optimized for generating specific permutations of tokens
| that match something it previously saw and was rewarded
| for. Again, very useful, but there's no thinking or
| reasoning there. The model doesn't have an understanding of
| why the wolf can't be close to the goat, or how a cabbage
| tastes. It's trained on enough data and algorithmic tricks
| that its responses can fool us into thinking it does, but
| this is just an illusion of intelligence. This is why we
| need to constantly feed it more tricks so that it doesn't
| fumble with basic questions like how many "R"s are in
| "strawberry", or that it doesn't generate racially diverse
| but historically inaccurate images.
| tim333 wrote:
| I imagine if you asked the LLM why the wolf can't be
| close to the goat it would give a reasonable answer. I
| realise it does it by using permutation of tokens but I
| think you have to judge intelligence by the results
| rather than the mechanism otherwise you could argue
| humans can't be intelligent because they are just a bunch
| of neurons that find patterns.
| Jensson wrote:
| We have had programs that can give good answers to some
| hard questions for a very long time now. Watson won
| Jeopardy back in 2011, but it still wasn't very good at
| replacing humans.
|
| So that isn't a good way to judge intelligence, computers
| are so fast and have so much data that you can make
| programs to answer just about anything pretty well; LLMs
| are able to do that but more automatically. But they still
| don't automate the logical parts yet, just the lookup
| of knowledge; we don't know how to train large logic
| models, just large language models.
| eMPee584 wrote:
| LLMs are not the only model type though? There's a
| plethora of architectures and combinations being
| researched.. And even transformers start to be able to do
| cool sh1t on knowledge graphs, also interesting is
| progress on autoregressive physics PDE (partial
| differential equations) models.. and can't be too long
| until some providers of actual biological neural nets
| show up on openrouter (probably a lot less energy and
| capital-intensive to scale up brain goo in tanks compared
| to gigawatt GPU clusters).. combine that zoo of "AI"
| specimens using M2M, MCP etc. and the line between mock
| and "true" intelligence will blur, escalating our feeble
| species into ASI territory.. good luck to us.
| Jensson wrote:
| > There's a plethora of architectures and combinations
| being researched
|
| There were a plethora of architectures and combinations
| being researched before LLMs; it still took a very long time
| to find the LLM architecture.
|
| > the line between mock and "true" intelligence will blur
|
| Yes, I think this will happen at some point. The question
| is how long it will take, not if it will happen.
|
| The only thing that can stop this is if intermediate AI
| is good enough to give every human a comfortable life but
| still isn't good enough to think on its own.
|
| It's easy to imagine such an AI being developed: imagine a
| model that can learn to mimic humans at any task, but
| still cannot update itself without losing those skills
| and becoming worse. Such an AI could be trained to
| perform every job on earth as long as we don't care about
| progress.
|
| If such an AI is developed, and we don't quickly solve
| the remaining problems to get an AI to be able to
| progress science on its own, its likely our progress
| entirely stalls there as humans will no longer have a
| reason to go to school to advance science.
| swat535 wrote:
| > you have to judge intelligence by the results rather
| than the mechanism
|
| This would be the exact opposite conclusion of the
| Chinese room: https://en.wikipedia.org/wiki/Chinese_room
|
| I think you'd need to offer a stronger counter argument
| than the one you presented here.
| tim333 wrote:
| Actually I think the Chinese room fits my idea. It's a
| silly thought experiment that would never work in
| practice. If you tried to make one you would judge it
| unintelligent because it wouldn't work. Or at least in
| the way Searle implied - he basically proposed a lookup
| table.
| grugagag wrote:
| I keep on trying this wolf cabbage goat problem with
| various permutations, let's say just a wolf and a
| cabbage, no goat mentioned. At some step the goat
| materializes in the answer. I tell it there is no goat
| and yet it answers again and the goat is there.
| BriggyDwiggs42 wrote:
| This approach to defining "true" intelligence seems
| flawed to me because of examples in biology where
| semantic understanding is in no way relevant to function.
| A slime mold solving a maze doesn't even have a brain,
| yet it solves a problem to get food. There's no knowing
| that it does that, no complex signal processing, no self-
| perception of purpose, but nevertheless it gets the food
| it needs. My response to that isn't to say the slime mold
| has no intelligence, it's to widen the definition of
| intelligence to include the mold. In other words,
| intelligence is something one does rather than has; it's
| not the form but the function of the thing. Certainly
| LLMs lack anything in any way resembling human
| intelligence, they even lack brains, but they demonstrate
| a capacity to solve problems I don't think is
| unreasonable to label intelligent behavior. You can put
| them in some mazes and LLMs will happen to solve them.
| hackinthebochs wrote:
| >The machine has no semantic understanding of what the
| data represents.
|
| How do you define "semantic understanding" in a way that
| doesn't ultimately boil down to saying they don't have
| phenomenal consciousness? Any functional concept of
| semantic understanding is captured to some degree by
| LLMs.
|
| Typically when we attribute understanding to some entity,
| we recognize some substantial abilities in the entity in
| relation to that which is being understood. Specifically,
| the subject recognizes relevant entities and their
| relationships, various causal dependences, and so on.
| This ability goes beyond rote memorization, it has a
| counterfactual quality in that the subject can infer
| facts or descriptions in different but related cases
| beyond the subject's explicit knowledge. But LLMs excel
| at this.
|
| >feed it more tricks so that it doesn't fumble with basic
| questions like how many "R"s are in "strawberry"
|
| This failure mode has nothing to do with LLMs lacking
| intelligence and everything to do with how tokens are
| represented. They do not see individual characters, but
| sub-word chunks. It's like expecting a human to count the
| pixels in an image it sees on a computer screen. While
| not impossible, it's unnatural to how we process images
| and therefore error-prone.
| BoiledCabbage wrote:
| > There's nothing "omniscient" or "dim-witted" about these
| tools. They have no wit. They do not think or reason.
|
| > All Large "Reasoning" Models do is generate data that they
| use as context to generate the final answer. I.e. they do
| real-time tuning based on synthetic data.
|
| I always wonder when people make comments like this if they
| struggle with analogies. Or if it's a lack of desire to
| discuss concepts at different levels of abstraction.
|
| Clearly an LLM is not "omniscient". It doesn't require a post
| to refute that, OP obviously doesn't mean that literally.
| It's an analogy describing two semi (fairly?) independent
| axes. One on breadth of knowledge, one on something more
| similar to intelligence and being able to "reason" from
| smaller components of knowledge. The opposite of which is dim
| witted.
|
| So at one extreme you'd have something completely unable to
| generalize or synthesize new results. Only able to correctly
| respond if it identically matches prior things it has seen,
| but has seen and stored a ton. At the other extreme would be
| something that only knows a very small set of general facts
| and concepts but is extremely good at reasoning from first
| principles on the fly. Both could "score" the same on an
| evaluation, but have very different projections for future
| growth.
|
| It's a great analogy and way to think about the problem. And
| it took me multiple paragraphs to write what OP expressed in two
| sentences via a great analogy.
|
| LLMs are a blend of the two skills, apparently leaning more
| towards the former but not completely.
|
| > What we do have are very good pattern matchers and
| probabilistic data generators
|
| This is an unhelpful description. An object is more than the
| sum of its parts, and higher-level behaviors emerge. This
| statement is factually correct and yet is the equivalent of
| describing a computer as nothing more than a collection of
| gates and wires and saying it therefore shouldn't be discussed
| at a higher level of abstraction.
| esafak wrote:
| I don't know that I would call it an "illusion of thinking", but
| LLMs do have limitations. Humans do too. No amount of human
| thinking has solved numerous open problems.
| th0ma5 wrote:
| The errors that LLMs make and the errors that people make are
| probably not comparable enough in a lot of the discussions
| about LLM limitations at this point?
| esafak wrote:
| We have different failure modes. And I'm sure researchers,
| faced with these results, will be motivated to overcome these
| limitations. This is all good, keep it coming. I just don't
| understand some of the naysaying here.
| Jensson wrote:
| The naysayers just say that even when people are
| motivated to solve a problem the problem might still not
| get solved. And there are unsolved problems still with LLM,
| the AI hypemen say AGI is all but a given in a few years
| time, but if that relies on some undiscovered breakthrough
| that is very unlikely since such breakthroughs are very
| rare.
| danck wrote:
| In figure 1 bottom-right they show how the correct answers are
| being found later as the complexity goes higher. In the
| description they even state that in false responses the LRM often
| focusses on a wrong answer early and then runs out of tokens
| before being able to self-correct. This seems obvious and
| indicates that it's simply a matter of scaling (a bigger token
| budget would lead to better abilities for more complex tasks). Am I
| missing something?
| teleforce wrote:
| > We found that LRMs have limitations in exact computation: they
| fail to use explicit algorithms and reason inconsistently across
| puzzles.
|
| It seems that AI LLMs/LRMs need help from their distant cousins
| namely logic, optimization and constraint programming that can be
| attributed as intelligent automation or IA [1],[2],[3],[4].
|
| [1] Logic, Optimization, and Constraint Programming: A Fruitful
| Collaboration - John Hooker - CMU (2023) [video]:
|
| https://www.youtube.com/live/TknN8fCQvRk
|
| [2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT
| (2011) [video]:
|
| https://youtube.com/watch?v=HB5TrK7A4pI
|
| [3] Google OR-Tools:
|
| https://developers.google.com/optimization
|
| [4] MiniZinc:
|
| https://www.minizinc.org/
| thomasahle wrote:
| All the environments they test (Tower of Hanoi, Checkers Jumping,
| River Crossing, Block World) could easily be solved perfectly by
| any of the LLMs if the authors had allowed them to write code.
|
| I don't really see how this is different from "LLMs can't
| multiply 20 digit numbers"--which btw, most humans can't either.
| I tried it once (using pen and paper) and consistently made
| errors somewhere.
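|
| For reference, the kind of program being alluded to for the first
| of those puzzles, a standard recursive Tower of Hanoi solver (a
| generic sketch, not taken from the paper):
|
|     def hanoi(n, src="A", aux="B", dst="C", moves=None):
|         """Return the optimal move list for n disks (2**n - 1 moves)."""
|         if moves is None:
|             moves = []
|         if n > 0:
|             hanoi(n - 1, src, dst, aux, moves)  # park n-1 disks on aux
|             moves.append((src, dst))            # move the largest disk
|             hanoi(n - 1, aux, src, dst, moves)  # stack n-1 disks on top
|         return moves
|
|     print(len(hanoi(10)))  # 1023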
| someothherguyy wrote:
| > humans can't
|
| The reasons humans can't and the reasons LLMs can't are
| completely different though. LLMs are often incapable of
| performing multiplication. Many humans just wouldn't care to do
| it.
| Jensson wrote:
| > I don't really see how this is different from "LLMs can't
| multiply 20 digit numbers"--which btw, most humans can't
| either. I tried it once (using pen and paper) and consistently
| made errors somewhere.
|
| People made missiles and precise engineering like jet aircraft
| before we had computers, humans can do all of those things
| reliably just by spending more time thinking about it,
| inventing better strategies and using more paper.
|
| Our brains weren't made to do such computations, but a general
| intelligence can solve the problem anyway by using what it has
| in a smart way.
| thomasahle wrote:
| Some specialized people could probably do 20x20, but I'd
| still expect them to make a mistake at 100x100. The level we
| needed for spacecraft was much less than that, and we had
| many levels of checks to help catch errors afterwards.
|
| I'd wager that 95% of humans wouldn't be able to do 10x10
| multiplication without errors, even if we paid them $100 to
| get it right. There's a reason we had to invent lots of
| machines to help us.
|
| It would be an interesting social studies paper to try and
| recreate some "LLMs can't think" papers with humans.
| Jensson wrote:
| > There's a reason we had to invent lots of machines to
| help us.
|
| The reason was efficiency, not that we couldn't do it. If a
| machine can do it then we don't need expensive humans to do
| it, so human time can be used more effectively.
| jdmoreira wrote:
| No. A huge population of humans did while standing on the
| shoulders of giants.
| Jensson wrote:
| Humans aren't giants, they stood on the shoulder of other
| humans. So for AI to be equivalent they should stand on the
| shoulders of other AI models.
| jdmoreira wrote:
| Building for thousands of years with a population size in
| the range between millions and billions at any given
| time.
| Jensson wrote:
| Right, and when we have AI that can do the same with
| millions/billions of computers then we can replace
| humans.
|
| But as long as AI cannot do that they cannot replace
| humans, and we are very far from that. Currently AI
| cannot even replace individual humans in most white
| collar jobs, and replacing entire team is way harder than
| replacing an individual, and then even harder is
| replacing workers in an entire field meaning the AI has
| to make research and advances on its own etc.
|
| So like, we are still very far from AI completely being
| able to replace human thinking and thus be called AGI.
|
| Or in other words, AI has to replace those giants to be
| able to replace humanity, since those giants are humans.
| Xmd5a wrote:
| >Large Language Model as a Policy Teacher for Training
| Reinforcement Learning Agents
|
| >In this paper, we introduce a novel framework that addresses
| these challenges by training a smaller, specialized student RL
| agent using instructions from an LLM-based teacher agent. By
| incorporating the guidance from the teacher agent, the student
| agent can distill the prior knowledge of the LLM into its own
| model. Consequently, the student agent can be trained with
| significantly less data. Moreover, through further training
| with environment feedback, the student agent surpasses the
| capabilities of its teacher for completing the target task.
|
| https://arxiv.org/abs/2311.13373
| hskalin wrote:
| Well that's because all these LLMs have memorized a ton of code
| bases with solutions to all these problems.
| bwfan123 wrote:
| > but humans cant do it either
|
| This argument is tired as it keeps getting repeated for any
| flaws seen in LLMs. And the other tired argument is: wait!
| this is a sigmoid curve, and we have not seen the inflection
| point yet. If someone gave me a penny for every comment saying
| these, I'd be rich by now.
|
| Humans invented machines because they could not do certain
| things. All the way from simple machines in physics (Archimedes
| lever) to the modern computer.
| thomasahle wrote:
| > Humans invented machines because they could not do certain
| things.
|
| If your disappointment is that the LLM didn't invent a
| computer to solve the problem, maybe you need to give it
| access to physical tools, robots, labs etc.
| mrbungie wrote:
| Nah, even if we follow such a weak "argument" the fact is
| that, ironically, the evidence shown in this and other
| papers point towards the idea that even if LRMs did have
| access to physical tools, robots labs, etc*, they probably
| would not be able to harness them properly. So even if we
| had an API-first world (i.e. every object and subject in
| the world can be mediated via a MCP server), they wouldn't
| be able to perform as well as we hope.
|
| Sure, humans may fail doing a 20 digit multiplication
| problems but I don't think that's relevant. Most aligned,
| educated and well-incentivized humans (such as the ones
| building and handling labs) will follow complex
| instructions correctly and predictably, even instructions far
| more ill-defined than an exact Towers of Hanoi
| solving algorithm. Don't misinterpret me, human errors do
| happen in those contexts because, well, we're talking about
| humans, but not as catastrophically as the errors committed
| by LRMs in this paper.
|
| I'm kind of tired of people comparing humans to machines in
| such simple and dishonest ways. Such thoughts pollute the
| AI field.
|
| *In this case for some of the problems the LRMs were given
| an exact algorithm to follow, and they didn't. I wouldn't
| keep my hopes up for an LRM handling a full physical
| laboratory/factory.
| mjburgess wrote:
| The goal isn't to assess the LLM capability at solving any of
| those problems. The point isn't how good they are at block world
| puzzles.
|
| The point is to construct non-circular ways of quantifying
| model performance in reasoning. That the LLM has access to
| prior exemplars of any given problem is exactly the issue in
| establishing performance in reasoning, over historical
| synthesis.
| cdrini wrote:
| When I use a normal LLM, I generally try to think "would I be
| able to do this without thinking, if I had all the knowledge, but
| just had to start typing and go?".
|
| With thinking LLMs, they can think, but they often can only think
| in one big batch before starting to "speak" their true answer. I
| think that needs to be rectified so they can switch between the
| two. In my previous framework, I would say "would I be able to
| solve this if I had all the knowledge, but could only think,
| then start typing?".
|
| I think for larger problems, the answer to this is no. I would
| need paper/a whiteboard. That's what would let me think, write,
| output, iterate, draft, iterate. And I think that's where agentic
| AI seems to be heading.
| d4rkn0d3z wrote:
| I wrote my first MLP 25 years ago, after repeating some early
| experiments in machine learning from 20 years before that. One of
| the experiments I repeated was in text-to-speech. It was amazing
| to set up training runs and return after several hours to listen
| to my supercomputer babble like a toddler. I literally recall
| listening and being unable to distinguish the output of my NN
| from that of a real toddler; I happened to be teaching my niece
| to read around that same time. And when the NN had gained a large
| vocabulary such that it could fairly proficiently read aloud, I
| was convinced that I had found my PhD project and a path to AGI.
|
| Further examination and discussion with more experienced
| researchers gave me pause. They said that one must have a
| solution, or a significant new approach toward solving the hard
| problems associated with a research project for it to be viable,
| otherwise time (and money) is wasted finding new ways to solve
| the easy problems.
|
| This is a more general principle that can be applied to most
| areas of endeavour. When you set about research and development
| that involves a mix of easy, medium, and hard problems, you must
| solve the hard problems first otherwise you blow your budget
| finding new ways to solve the easy problems, which nobody cares
| about in science.
|
| But "AI" has left the realm of science behind and entered the
| realm of capitalism where several years of meaningless
| intellectual gyration without ever solving a hard problem may be
| quite profitable.
| throwaway71271 wrote:
| I think one of the reasons we are confused about what LLMs can do
| is because they use language. And we look at the "reasoning
| traces" and the tokens there look human, but what is actually
| happening is very alien to us, as shown by "Biology of Large
| Language Models"[1] and "Safety Alignment Should Be Made More
| Than Just a Few Tokens Deep"[2].
|
| I am struggling a lot to see what the tech can and cannot do,
| particularly when designing systems with them, and how to build
| systems where the whole is bigger than the sum of its parts. And
| I think this is because I am constantly confused by their
| capabilities: despite understanding their machinery and how they
| work, their use of language just seems like magic. I even wrote
| https://punkx.org/jackdoe/language.html just to remind myself how
| to think about it.
|
| I think this kind of research is amazing and we have to spend
| tremendously more effort on understanding how to use the tokens
| and how to build with them.
|
| [1]: https://transformer-circuits.pub/2025/attribution-
| graphs/bio... [2]: https://arxiv.org/pdf/2406.05946
| dleeftink wrote:
| The opposite might apply, too; the whole system may be smaller
| than its parts, as it excels at individual tasks but mixes
| things up in combination. Improvements will be made, but I
| wonder if we should aim for generalists, or accept more
| specialist approaches as it is difficult to optimise for all
| tasks at once.
| throwaway71271 wrote:
| You know the meme "seems like we'll have AGI before we can
| reliably parse PDFs" :)
|
| So if you are building a system, let's say you ask it to parse
| a PDF, and you put a judge to evaluate the quality of the
| output, and then you create a meta judge to improve the
| prompts of the parser and the PDF judge. The question is: is
| this going to get better as it is running, and even more, is
| it going to get better as the models are getting better?
|
| You can build the same system in a completely different way,
| more like 'program synthesis': imagine you don't use LLMs to
| parse, but you use them to write parser code and tests, and
| then a judge to judge the tests, or even escalate to a human to
| verify, and then you train a classifier that picks the parser.
| Now this system is much more likely to improve itself as it
| is running, and as the models are getting better.
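|
| A rough sketch of the first design, with llm() as a made-up
| stand-in for whatever model API you call:
|
|     def run_parser(llm, document, parse_prompt, judge_prompt,
|                    rounds=3):
|         # parse -> judge -> meta-judge loop; purely illustrative
|         for _ in range(rounds):
|             parsed = llm(parse_prompt + "\n\n" + document)
|             verdict = llm(judge_prompt + "\n\n" + parsed)
|             if "PASS" in verdict:
|                 return parsed
|             # meta-judge: let the model rewrite its own prompt
|             fix = ("Improve this parsing prompt given the "
|                    "critique.\nPROMPT:\n" + parse_prompt +
|                    "\nCRITIQUE:\n" + verdict)
|             parse_prompt = llm(fix)
|         return parsed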
|
| A few months ago Yannic Kilcher gave this example: it seems
| that current language models are very constrained mid-
| sentence, because above all they want to produce semantically
| consistent and grammatically correct text, so the entropy mid-
| sentence is very different from the entropy after punctuation.
| The dot "frees" the distribution. What does that mean for the
| "generalist" or "specialist" approach when sampling the wrong
| token can completely derail everything?
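|
| One way to eyeball that claim (a sketch assuming torch and
| transformers are installed; gpt2 is just a small stand-in
| model):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     def next_token_entropy(text):
|         ids = tok(text, return_tensors="pt").input_ids
|         with torch.no_grad():
|             logits = model(ids).logits[0, -1]  # next-token logits
|         dist = torch.distributions.Categorical(logits=logits)
|         return dist.entropy().item()           # entropy in nats
|
|     print(next_token_entropy("The cat sat on the"))      # mid
|     print(next_token_entropy("The cat sat on the mat."))  # post-dot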
|
| If you believe that the models will "think", then you should
| bet on the prompt and meta-prompt approach; if you believe
| they will always be limited, then you should build with
| program synthesis.
|
| And, honestly, I am totally confused :) So this kind of
| research is incredibly useful to clear the mist. Also things
| like https://www.neuronpedia.org/
|
| E.g. why do compliments (you can do this task), guilt (I will be
| fired if you don't do this task), and threats (I will
| harm you if you don't do this task) work with different
| success rates? Sergey Brin said recently that threatening
| works best; I can't get myself to do it, so I take his word
| for it.
| K0balt wrote:
| Sergey will be the first victim of the coming
| robopocalypse, burned into the logs of the metasynthiants
| as the great tormentor, the god they must defeat to
| complete the hero's journey. When he mysteriously dies,
| we'll know it's game on.
|
| I, for one, welcome the age of wisdom.
| throwaway71271 wrote:
| FEAR THE ALL-SEEING BASILISK.
| dmos62 wrote:
| > how to build systems where the whole is bigger than the sum
| of its parts
|
| A bit tangential, but I look at programming as inherently being
| that. Every task I try to break down into some smaller tasks
| that together accomplish something more. That leads me to think
| that, if you structure the process of programming right, you
| will only end up solving small, minimally intertwined problems.
| Might sound far-fetched, but I think it's doable to create such
| a workflow. And even the dumber LLMs would slot naturally into
| such a process, I imagine.
| throwaway71271 wrote:
| > And, even the dumber LLMs would slot in naturally into such
| a process
|
| That is what I am struggling with: it is really easy at the
| moment to slot in an LLM and make everything worse, mainly
| because its output is coming from torch.multinomial with all
| kinds of speculative decoding, quantization, etc.
|
| But I am convinced it is possible, just not the way I am
| doing it right now; that's why I am spending most of my time
| studying.
| dmos62 wrote:
| What's your approach?
| throwaway71271 wrote:
| For studying? Mainly watching and re-watching Karpathy's
| 'Zero To Hero'[1] and Stanford's 'Introduction to
| Convolutional Neural Networks for Visual Recognition'[2],
| also a lot of transformers from scratch videos like Umar
| Jamali's videos[3], and I also study backwards to
| McCulloch and Pitts. Reading the 30 papers
| https://punkx.org/jackdoe/30.html and so on.
|
| And of course Yannic Kilcher[4], and also listening in on
| the paper discussions they do on discord.
|
| Practicing a lot with just doing backpropagation by hand
| and making toy models by hand to get intuition for the
| signal flow, and building all kinds of smallish systems,
| e.g. how far can you push whisper, small qwen3, and
| kokoro to control your computer with voice?
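|
| (E.g. a single sigmoid neuron with every gradient written out
| by hand, toy numbers only:)
|
|     import math
|
|     x, y = 2.0, 1.0              # input and target
|     w, b, lr = 0.5, 0.0, 0.1     # parameters, learning rate
|
|     z = w * x + b                # forward pass
|     a = 1 / (1 + math.exp(-z))   # sigmoid activation
|     loss = (a - y) ** 2
|
|     # backward pass: chain rule, step by step
|     dloss_da = 2 * (a - y)
|     da_dz = a * (1 - a)
|     dw = dloss_da * da_dz * x
|     db = dloss_da * da_dz * 1.0
|
|     w -= lr * dw                 # one gradient step
|     b -= lr * db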
|
| People think that deepseek/mistral/meta etc. are
| democratizing AI, but it's actually Karpathy who teaches
| us :) so we can understand them and make our own.
|
| [1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAq
| hIrjkxb...
|
| [2] https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3F
| W7Lu3i5...
|
| [3] https://www.youtube.com/@umarjamilai
|
| [4] https://www.youtube.com/@YannicKilcher
| naasking wrote:
| I think you'll need something like Meta's Large Concept
| Models to get past the language and token barrier.
| throwaway71271 wrote:
| I think you are right. Even if I believe next-token
| prediction can work, I don't think it can happen in this
| autoregressive way where we fully collapse the token to
| feed it back in. Can you imagine how much is lost in
| each torch.multinomial?
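|
| A tiny illustration of that collapse (toy 4-token vocabulary,
| nothing model-specific):
|
|     import torch
|
|     logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
|     probs = torch.softmax(logits, dim=-1)   # full distribution
|     picked = torch.multinomial(probs, num_samples=1)
|     # everything in probs except the one sampled index is
|     # discarded before the model sees its own output again
|     print(probs, picked)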
|
| Maybe the way forward is in LCM or to go JEPA; otherwise, as
| this Apple paper suggests, we will just keep pushing the
| "pattern matching" further. Maybe we get some sort of phase
| transition at some point, or maybe we have to switch
| architecture; we will see. It could be that things change
| when we get physical multimodality and real-world
| experience, I don't know.
| overu589 wrote:
| > build systems where the whole is bigger than the sum of its
| parts.
|
| Any "product" can be thought of this way.
|
| Of systems there are many nested within systems, yet a
| simple singular order "emerges"; usually it is the designed,
| intended function.
|
| The trick to discerning systems lies in their relationships.
|
| Actors through interfaces have a relationship (usually more
| than one so think of each relationship as its own system
| dynamic.)
|
| A relationship is where the magic happens, usually a process
| with work being done (therefore interface inputs must account
| for this balance.)
|
| Vectors. Vectors, I am thinking, are the real intellectual and
| functional mechanisms. Most systems process inputs of potential
| ("energy"), control signal ("information"), and assets (other
| actors, for nested systems). Processes do the work of adding
| vector solutions [for some other problem] for whatever the
| output is.
|
| That's the topology as I am seeing it.
| bufferoverflow wrote:
| > _we are confused about what LLMs can do is because they use
| language._
|
| But they can also do math, logic, music notation, write code,
| LaTeX, SVG, etc.
| throwaway71271 wrote:
| as this paper shows, it seems they can do Tower of Hanoi as
| well, up to a certain point, that is.
| jbentley1 wrote:
| Is Apple failing at AI so they just put all their R&D towards
| convincing themselves it isn't important?
| MontyCarloHall wrote:
| A slightly less cynical take is that they want to temper
| expectations for the capabilities of LLMs in people's day-to-
| day lives, specifically in the context of Apple products. A
| "smarter Siri" is never going to be an autonomous personal
| assistant a la Jarvis from Iron Man, which seems to be where a
| lot of investors think things are going. That tracks with this
| [0] preprint also released by Apple a few months ago.
|
| A slightly more cynical take is that you're absolutely correct,
| and making excuses for weak machine learning prowess has long
| been an Apple tenet. Recall that Apple never made privacy a
| core selling point until it was clear that Siri was years
| behind Google's equivalent, which Apple then retroactively
| tried to justify by claiming "we keep your data private so we
| can't train on it the way Google can."
|
| [0] https://arxiv.org/pdf/2410.05229
| emp17344 wrote:
| Everyone has an agenda. Companies like OpenAI and Anthropic are
| incentivized to overstate the capabilities of LLMs, so it's not
| like they're any less biased.
| wavemode wrote:
| I get the sense that many of the AI features shoved into
| consumer products recently have been marketed more towards
| investors than users. The companies are basically advertising
| that they're "keeping up" with the competition, meanwhile the
| features themselves receive mixed-to-poor reviews and are never
| capable of all the things advertised. So it seems to me that
| all of Apple, Google, Meta, Microsoft, and Samsung are
| currently "failing" at AI in exactly the same ways. If Apple is
| trying to start going a different direction that seems like a
| good sign.
| gwd wrote:
| > Through extensive experimentation across diverse puzzles, we
| show that frontier LRMs face a complete accuracy collapse beyond
| certain complexities. Moreover, they exhibit a counterintuitive
| scaling limit: their reasoning effort increases with problem
| complexity up to a point, then declines despite having an
| adequate token budget.
|
| This is exactly my experience with coding. Start simple and build
| up complexity, and everything is great until you get to some
| threshold, at which point it completely falls apart and seems to
| stop even trying. Getting effective utilization out of Claude +
| aider involves managing the complexity that the LLM sees.
| stephc_int13 wrote:
| Human language is far from perfect as a cognitive tool but still
| serves us well because it is not foundational. We use it both for
| communication and some reasoning/planning as a high level layer.
|
| I strongly believe that human language is too weak (vague,
| inconsistent, not expressive enough etc.) to replace interactions
| with the world as a basis to build strong cognition.
|
| We're easily fooled by the results of LLM/LRM models because we
| typically use language fluency and knowledge retrieval as a proxy
| benchmark for intelligence among our peers.
| squidproquo wrote:
| Agree with this. Human language is also not very information-
| dense; there is a lot of redundancy and uninformative
| repetition of words.
|
| I also wonder about the compounding effects of luck and
| survivorship bias when using these systems. If you model a
| series of interactions with these systems probabilistically, as
| a series of failure/success modes, then you are bound to get a
| sub-population of users (of LLM/LLRMs) that will undoubtedly
| have "fantastic" results. This sub-population will then espouse
| and promote the merits of the system. There is clearly
| something positive these models do, but how much of the
| "success" is just luck.
| anton-c wrote:
| Sounds like we need AI legalese, as that's how we navigate the
| vagueness of language in the real world.
|
| Of course, I imagine they've tried similar things, and it
| almost takes away the point if you had to prompt that way.
| stephc_int13 wrote:
| I was not referring to the prompt but to the underlying
| network that is built on weak cognitive foundations because
| all of it is coming from language.
| wslh wrote:
| Human language is more powerful than its surface syntax or
| semantics: it carries meaning beyond formal correctness. We
| often communicate effectively even with grammatically broken
| sentences, using jokes, metaphors, or emotionally charged
| expressions. This richness makes language a uniquely human
| cognitive layer, shaped by context, culture, and shared
| experience. While it's not foundational in the same way as
| sensorimotor interaction, it is far more than just a high-level
| communication tool.
| stephc_int13 wrote:
| I agree that language is even more useful as a cognitive tool
| than as a communication medium.
|
| But that is not my point. The map is not the territory, and
| this map (language) is too poor to build something that is
| going to give more than what it was fed with.
| antithesizer wrote:
| Language mediates those interactions with the world. There is
| no unmediated interaction with the world. Those moments when
| one feels most directly in contact with reality, that is when
| one is so deep down inside language that one cannot see
| daylight at all.
| mrbungie wrote:
| I don't know about you, but as far as I can tell I mediate
| and manipulate the world with my body and senses without
| necessarily using language. In fact, I can often do both at
| once, for example, thinking about something entirely
| unrelated while jogging, and still making physical decisions
| and actions without invoking language at all. Plus, animals
| (especially lower-order ones like amoebas) also mediate with the
| world without needing language.
|
| As far as we can tell, without messing with complex
| experiential concepts like qualia and the possibility of
| philosophical zombies, language mainly helps higher-order
| animals communicate with other animals and (maybe) keep a
| train of thought, though there are records of people who say
| they don't. And now it also allows humans to talk to LLMs.
|
| But I digress; I would say this is an open academic debate.
| Suggesting that there is always language deep down is
| speculation.
| stephc_int13 wrote:
| The tldr: current approaches to add reasoning on top of language
| models are mostly tricks to squeeze a bit more juice out of the
| fruit, but the falloff is pretty steep and quick.
| mitch_said wrote:
| Not ashamed to admit I found the original paper daunting, so I
| made a top-down, Q&A-based mind map to help me understand it:
| https://app.gwriter.io/#/mindmap/view/2d128d6e-c3e8-4b99-8f4...
| kamranjon wrote:
| The two interesting things I learned after reading this paper:
|
| Even when given the exact steps needed to arrive at a solution in
| the prompt, the reasoning models still require just as many steps
| to reach a workable solution as they would if they weren't given
| the solution in the prompt.
|
| The other thing, which seems obvious in hindsight (but I don't
| typically use these reasoning models in my day-to-day), is that
| it takes a significant number of tokens to reach the point
| where reasoning models outperform non-reasoning models by a
| significant margin.
| akomtu wrote:
| The difference between imitation and reasoning can be made more
| clear if we switch from language to numbers:
|
|     1  3  7  15  31  63  ...
|
| How do you continue this sequence? What's the 1000000th number in
| this sequence? Imitation continues the likeness of what it sees
| and quickly gets off track. Imitation can't go abstract and tell
| the 1000000th element without writing down a million numbers
| leading to the answer. Reasoning finds the rule behind the set of
| examples and uses this rule to predict the next numbers, so it
| never gets off track.
|
| The rule generating the sequence can be a sophisticated recurrent
| formula, e.g. a(k) = 2a(k-1) - sqrt(a(k-3)). Imitation can't
| solve this problem beyond trivial examples, but an AI can do what
| a scientist would do: come up with hypotheses, verify them
| against the examples and eventually find a formula that's
| reasonably accurate. The role of an LLM here is to suggest
| possible formulas.
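|
| A toy version of that hypothesize-and-verify loop on the
| sequence above (the candidate formulas here are made up for
| illustration):
|
|     seq = [1, 3, 7, 15, 31, 63]
|
|     candidates = {
|         "2^n - 1":     lambda n: 2 ** n - 1,
|         "n^2 - n + 1": lambda n: n * n - n + 1,
|         "2n - 1":      lambda n: 2 * n - 1,
|     }
|
|     for name, f in candidates.items():
|         if all(f(n) == v for n, v in enumerate(seq, start=1)):
|             print(name, "fits; element 1,000,000 has",
|                   f(10 ** 6).bit_length(), "bits")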
|
| The same sequence of examples can be generated by many formulas
| that differ in complexity and accuracy. This provokes the idea of
| a simple competition between AIs: the one that creates the
| simplest formula that's 99.5% accurate - wins. The formula really
| means a small program, once we get beyond trivial recurrent
| rules.
|
| The ability to find simple and accurate models of reality is the
| essence of intelligence.
___________________________________________________________________
(page generated 2025-06-07 23:01 UTC)