[HN Gopher] Understanding Reasoning LLMs
       ___________________________________________________________________
        
       Understanding Reasoning LLMs
        
       Author : sebg
       Score  : 422 points
       Date   : 2025-02-06 21:34 UTC (1 day ago)
        
 (HTM) web link (magazine.sebastianraschka.com)
 (TXT) w3m dump (magazine.sebastianraschka.com)
        
       | behnamoh wrote:
        | doesn't it seem like these models are getting to the point
        | where even understanding their training and development is
        | less and less feasible for the general public?
       | 
       | I mean, we already knew only a handful of companies with capital
       | could train them, but at least the principles, algorithms, etc.
       | were accessible to individuals who wanted to create their own -
       | much simpler - models.
       | 
        | it seems that era is quickly ending, and we are entering the era
        | of truly "magic" AI models whose workings no one understands,
        | because companies keep their secret sauce to themselves...
        
         | HarHarVeryFunny wrote:
          | I don't think it's realistic to expect to have access to the
          | same training data as the big labs that are paying people to
          | generate it for them, but hopefully there will be open-source
          | datasets that are still decent.
         | 
         | At the end of the day current O1-like reasoning models are
         | still just fine-tuned LLMs, and don't even need RL if you have
         | access to (or can generate) a suitable training set. The
         | DeepSeek R1 paper outlined their bootstrapping process, and
         | HuggingFace (and no doubt others) are trying to duplicate it.
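          | 
          | To make "a suitable training set" concrete, here is a minimal
          | sketch of what one record of reasoning-trace SFT data might
          | look like (the format and helper below are illustrative, not
          | any lab's actual data):
          | 
          |   # One SFT record for a reasoning model: the completion
          |   # holds the chain of thought plus the final answer, so
          |   # plain next-token fine-tuning teaches the format.
          |   record = {
          |       "prompt": "What is 17 * 24?",
          |       "completion": (
          |           "<think>17 * 24 = 17 * 20 + 17 * 4"
          |           " = 340 + 68 = 408</think>\n"
          |           "The answer is 408."
          |       ),
          |   }
          | 
          |   def to_training_text(rec):
          |       # concatenate prompt and completion into one string
          |       return rec["prompt"] + "\n" + rec["completion"]
          | 
          |   print(to_training_text(record))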
        
         | antirez wrote:
         | In recent weeks what's happening is exactly the contrary.
        
         | fspeech wrote:
         | Recent developments like V3, R1 and S1 are actually clarifying
         | and pointing towards more understandable, efficient and
         | therefore more accessible models.
        
         | tmnvdb wrote:
          | We have been in the 'magic scaling' era for a while now. While
          | the basic architecture of language models is reasonably simple
          | and well understood, the emergent effects of making models
          | bigger are largely magic even to the researchers, and can only
          | be studied empirically after the fact.
        
       | dr_dshiv wrote:
       | How important is it that the reasoning takes place in another
       | thread versus just chain-of-thought in the same thread? I feel
       | like it makes a difference, but I have no evidence.
        
       | vector_spaces wrote:
       | Is there any work being done in training LLMs on more restricted
       | formal languages? Something like a constraint solver or automated
       | theorem prover, but much lower level. Specifically something that
       | isn't natural language. That's the only path I could see towards
       | reasoning models being truly effective
       | 
        | I know there is work being done with e.g. Lean integration with
        | ChatGPT, but that's not what I mean exactly -- there's still this
        | shaky natural-language-trained-LLM glue in the driver's seat.
        | 
        | Like I'm envisioning something that has the creativity to try
        | different things, but can then JIT-compile its chain of thought
        | and avoid bad paths.
        
         | mindwok wrote:
         | How would that be different from something like ChatGPT
         | executing Lean? That's exactly what humans do, we have messy
         | reasoning that we then write down in formal logic and compile
         | to see if it holds.
        
         | gsam wrote:
         | In my mind, the pure reinforcement learning approach of
         | DeepSeek is the most practical way to do this. Essentially it
         | needs to continually refine and find more sound(?) subspaces of
         | the latent (embedding) space. Now this could be the subspace
         | which is just Python code (or some other human-invented
         | subspace), but I don't think that would be optimal for the
         | overall architecture.
         | 
          | The reason why it seems the most reasonable path is that when
          | you create restrictions like this you hamper search viability
          | (and in a high-dimensional subspace that's a massive loss,
          | because you can arrive at a result from many directions). It's
          | like regular genetic programming vs typed genetic programming.
          | When you discard all your useful results,
         | you can't go anywhere near as fast. There will be a threshold
         | where constructivist, generative schemes (e.g. reasoning with
         | automata and all kinds of fun we've neglected) will be the way
         | forward, but I don't think we've hit that point yet. It seems
         | to me that such a point does exist because if you have fast
         | heuristics on when types unify, you no longer hamper the search
         | speed but gain many benefits in soundness.
         | 
         | One of the greatest human achievements of all time is probably
         | this latent embedding space -- one that we can actually
         | interface with. It's a new lingua franca.
         | 
         | These are just my cloudy current thoughts.
        
           | danielmarkbruce wrote:
           | fwiw, most people don't _really_ grok the power of latent
           | space wrt language models. Like, you say it, I believe it,
            | but most people don't really grasp it.
        
             | ttul wrote:
             | Image generation models also have an insanely rich latent
             | space. People will be squeezing value out of SDXL for many
             | years to come.
        
           | HarHarVeryFunny wrote:
            | DeepSeek's approach with R1 wasn't pure RL - they used RL
            | only to develop R1-Zero from their V3 base model, but then
            | went through two iterations of using the current model to
            | generate synthetic reasoning data, SFT on that, then RL
            | fine-tuning, and repeat.
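            | 
            | A rough sketch of that loop in pseudocode (every helper
            | below is a stand-in; none of this is DeepSeek's actual
            | code):
            | 
            |   def rl_finetune(model):      # RL w/ verifiable rewards
            |       return model + "+RL"
            | 
            |   def gen_traces(model):       # sample + filter CoT data
            |       return ["<think>...</think> answer"]
            | 
            |   def sft(model, traces):      # supervised fine-tuning
            |       return model + "+SFT"
            | 
            |   def r1_style_pipeline(base="V3", rounds=2):
            |       model = rl_finetune(base)       # the "R1-Zero" step
            |       for _ in range(rounds):
            |           traces = gen_traces(model)  # current model
            |           model = sft(base, traces)   # SFT on that
            |           model = rl_finetune(model)  # RL, then repeat
            |       return model
            | 
            |   print(r1_style_pipeline())          # -> "V3+SFT+RL"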
        
         | truculent wrote:
         | I think something like structured generation might work in this
         | context
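          | 
          | For readers unfamiliar with the term: structured (constrained)
          | generation masks the sampler so that only tokens which keep
          | the output inside a grammar are allowed. A toy sketch with a
          | made-up vocabulary and a digits-plus-operators constraint:
          | 
          |   import random
          |   import re
          | 
          |   # Toy constrained decoding: at each step keep only the
          |   # tokens whose addition still matches the allowed pattern.
          |   VOCAB = ["1", "2", "+", "cat", "42", ")"]
          |   ALLOWED = re.compile(r"[0-9+]*\Z")
          | 
          |   def constrained_sample(steps=5):
          |       out = ""
          |       for _ in range(steps):
          |           ok = [t for t in VOCAB if ALLOWED.match(out + t)]
          |           if not ok:
          |               break
          |           out += random.choice(ok)  # real systems use logits
          |       return out
          | 
          |   print(constrained_sample())  # e.g. "42+12+"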
        
         | colonial wrote:
         | If I understand your idea correctly, I don't think a "pure" LLM
         | would derive much advantage from this. Sure, you can constrain
          | them to generate something _syntactically_ valid, but there's
          | no way to make them generate something _semantically_ valid
          | 100% of the time. I've seen frontier models muck up their
         | function calling JSON more than once.
         | 
         | As long as you're using something statistical like
         | transformers, you're going to need deterministic bolt-ons like
         | Lean.
        
           | nextaccountic wrote:
           | > there's no way to make them generate something semantically
           | valid 100% of the time.
           | 
           | You don't need to generate semantically valid reasoning 100%
           | of time for such an approach to be useful. You just need to
           | use semantic data to bias them to follow semantically valid
           | paths more often than not (and sometimes consider using
            | constraint solving on the spot, like offloading into an SMT
           | solver or even incorporating it in the model somehow; it
           | would be nice to have AI models that can combine the
           | strengths of both GPUs and CPUs). And, what's more useful,
           | verify that the reasoning is valid at the _end_ of the train
           | of thought, and if it is not, bail out and attempt something
           | else.
           | 
           | If you see AI as solving an optimization problem (given a
           | question, give a good answer) it's kind of evident that you
           | need to probe the space of ideas in an exploratory fashion,
           | sometimes making unfounded leaps (of the "it was revealed to
           | me in a dream" sort), and in this sense it could even be
           | useful that AI can sometimes hallucinate bullshit. But they
           | need afterwards to come with a good justification for the end
           | result, and if they can't find one they are forced to discard
           | their result (even if it's true). Just like humans often come
           | up with ideas in an irrational, subconscious way, and then
           | proceed to rationalize them. One way to implement this kind
           | of thing is to have the LLM generate code for a theorem
            | prover like Coq or Lean, and then at the end run the code -
            | if the prover rejects the code, the reasoning can't possibly
            | be right, and the AI needs to go back to the drawing board.
            | 
            | (Now, if the prover accepts the code, the answer may still be
            | wrong, if the premises were encoded incorrectly - but it
            | would still be a net improvement, especially if people can
            | review the Coq code to spot mistakes.)
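            | 
            | A crude sketch of that propose-then-verify loop; call_llm
            | and prover_accepts are hypothetical stand-ins for a model
            | API and a Lean/Coq checker:
            | 
            |   import random
            | 
            |   def call_llm(question, i):
            |       # stand-in for sampling a (reasoning, proof) pair
            |       return f"reasoning #{i}", f"proof #{i}"
            | 
            |   def prover_accepts(proof):
            |       # stand-in for running Lean/Coq on the proof
            |       return random.random() < 0.3
            | 
            |   def verified_answer(question, budget=10):
            |       for i in range(budget):
            |           reasoning, proof = call_llm(question, i)
            |           # keep only reasoning whose proof checks out
            |           if prover_accepts(proof):
            |               return reasoning, proof
            |       # bail out: nothing verified within the budget
            |       return None
            | 
            |   print(verified_answer("Is sqrt(2) irrational?"))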
        
           | soulofmischief wrote:
           | I wholeheartedly disagree. Logic is inherently statistical
           | due to the very nature of empirical sampling, which is the
           | only method we have for verification. We will eventually find
           | that it's classical, non-statistical logic which was the
           | (useful) approximation/hack, and that statistical reasoning
           | is a lot more "pure" and robust of an approach.
           | 
           | I went into a little more detail here last week:
           | https://news.ycombinator.com/item?id=42871894
           | 
           | > My personal insight is that "reasoning" is simply the
           | application of a probabilistic reasoning manifold on an input
           | in order to transform it into constrained output that serves
           | the stability or evolution of a system.
           | 
           | > This manifold is constructed via learning a
           | decontextualized pattern space on a given set of inputs.
           | Given the inherent probabilistic nature of sampling, true
           | reasoning is expressed in terms of probabilities, not axioms.
           | It may be possible to discover axioms by locating fixed
           | points or attractors on the manifold, but ultimately you're
           | looking at a probabilistic manifold constructed from your
           | input set.
           | 
           | I've been writing and working on this problem a lot over the
           | last few months and hopefully will have something more formal
            | and actionable to share eventually. Right now I'm at the
            | "okay, this is evident and internally consistent, but what
            | can we actually _do_ with it that other techniques can't
            | already accomplish?" phase that a lot of these metacognitive
            | theories get stuck on.
        
             | colonial wrote:
             | > Logic is inherently statistical due to the very nature of
             | empirical sampling, which is the only method we have for
             | verification.
             | 
             | What? I'm sorry, but this is ridiculous. You can make
             | plenty of sound logical arguments in an empirical vacuum.
             | This is why we have proof by induction - some things can't
             | be verified by taking samples.
        
               | soulofmischief wrote:
               | I'm speaking more about how we assess the relevance of a
               | logical system to the real world. Even if a system is
               | internally self-consistent, its utility depends on
               | whether its premises and conclusions align with what we
               | observe empirically. And because empirical observation is
               | inherently statistical due to sampling and measurement
               | limitations, the very act of verifying a logical system's
               | applicability to reality introduces a statistical
               | element. We just typically ignore this element because
               | some of these systems seem to hold up consistently enough
               | that we can take them for granted.
        
         | raincole wrote:
         | AlphaProof. Although I don't know if it's large enough to be
         | called an LLM.
         | 
         | https://deepmind.google/discover/blog/ai-solves-imo-problems...
        
         | Terr_ wrote:
         | I think that would be a fundamental mismatch. LLMs are
         | statistical and lossy and messy, which is what (paradoxically)
         | permits them to get surprisingly-decent results out of messy
         | problems that draw upon an enormous number and variety of messy
         | examples.
         | 
         | But for a rigorously structured language with formal fixed
          | meaning... Now the LLM has no advantage anymore, only
         | serious drawbacks and limitations. Save yourself millions of
         | dollars and just write a normal parser, expression evaluator,
         | SAT solver, etc.
         | 
         | You'll get answers faster, using fewer resources, with fewer
         | fundamentally unfixable bugs, and it will actually be able to
         | do math.
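          | 
          | For the expression-evaluator case, the deterministic tool
          | really is a few lines; a sketch using Python's own parser
          | (only the standard-library ast module, no LLM anywhere):
          | 
          |   import ast
          |   import operator as op
          | 
          |   # Exact, deterministic arithmetic: no sampling, no
          |   # hallucination, and it actually does the math.
          |   OPS = {ast.Add: op.add, ast.Sub: op.sub,
          |          ast.Mult: op.mul, ast.Div: op.truediv}
          | 
          |   def evaluate(expr):
          |       def walk(node):
          |           if isinstance(node, ast.Constant):
          |               return node.value
          |           if isinstance(node, ast.BinOp):
          |               return OPS[type(node.op)](walk(node.left),
          |                                         walk(node.right))
          |           raise ValueError("unsupported syntax")
          |       return walk(ast.parse(expr, mode="eval").body)
          | 
          |   print(evaluate("(2 + 3) * 41"))  # 205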
        
       | prideout wrote:
       | This article has a superb diagram of the DeepSeek training
       | pipeline.
        
       | aithrowawaycomm wrote:
       | I like Raschka's writing, even if he is considerably more
       | optimistic about this tech than I am. But I think it's
       | inappropriate to claim that models like R1 are "good at deductive
       | or inductive reasoning" when that is demonstrably not true, they
       | are incapable of even the simplest "out-of-distribution"
       | deductive reasoning:
       | https://xcancel.com/JJitsev/status/1883158738661691878
       | 
        | What they are certainly capable of is a wide variety of
        | computations that _simulate_ reasoning, and maybe that's good
        | enough for your use case. But it is unpredictably brittle unless
       | you spend a lot on o1-pro (and even then...). Raschka has a line
       | about "whether and how an LLM actually 'thinks' is a separate
       | discussion" but this isn't about semantics. R1 clearly sucks at
       | deductive reasoning and you will not understand "reasoning" LLMs
       | if you take DeepSeek's claims at face value.
       | 
        | It seems especially incurious for him to copy-paste the "aha
        | moment" from DeepSeek's technical report without critically
        | investigating it. DeepSeek's claims are unscientific, without
        | real evidence, and seem focused on hype and investment:
        | 
        |     This moment is not only an "aha moment" for the model but
        |     also for the researchers observing its behavior. It
        |     underscores the power and beauty of reinforcement learning:
        |     rather than explicitly teaching the model on how to solve a
        |     problem, we simply provide it with the right incentives,
        |     and it autonomously develops advanced problem-solving
        |     strategies.
        | 
        |     The "aha moment" serves as a powerful reminder of the
        |     potential of RL to unlock new levels of intelligence in
        |     artificial systems, paving the way for more autonomous and
        |     adaptive models in the future.
       | 
       | Perhaps it was able to solve that tricky Olympiad problem, but
       | there are an infinite variety of 1st grade math problems it is
       | not able to solve. I doubt it's even reliably able to solve
       | simple variations of that root problem. Maybe it is! But it's
       | frustrating how little skepticism there is about CoT, reasoning
       | traces, etc.
        
         | scarmig wrote:
         | > But I think it's inappropriate to claim that models like R1
         | are "good at deductive or inductive reasoning" when that is
         | demonstrably not true, they are incapable of even the simplest
         | "out-of-distribution" deductive reasoning:
         | https://xcancel.com/JJitsev/status/1883158738661691878
         | 
         | Your link says that R1, not all models like R1, fails at
         | generalization.
         | 
         | Of particular note:
         | 
         | > We expose DeepSeek R1 to the variations of AIW Friends
         | problem and compare model behavior to o1-preview, o1-mini and
         | Claude 3.5 Sonnet. o1-preview handles the problem robustly,
         | DeepSeek R1 shows strong fluctuations across variations with
         | distribution very similar to o1-mini.
        
           | Legend2440 wrote:
           | The way the authors talk about LLMs really rubs me the wrong
           | way. They spend more of the paper talking up the 'claims'
           | about LLMs that they are going to debunk than actually doing
           | any interesting study.
           | 
           | They came into this with the assumption that LLMs are just a
           | cheap trick. As a result, they deliberately searched for an
           | example of failure, rather than trying to do an honest
           | assessment of generalization capabilities.
        
             | suddenlybananas wrote:
             | >They came into this with the assumption that LLMs are just
             | a cheap trick. As a result, they deliberately searched for
             | an example of failure, rather than trying to do an honest
             | assessment of generalization capabilities.
             | 
             | And lo and behold, they still found a glaring failure. You
             | can't fault them for not buying into the hype.
        
               | Legend2440 wrote:
               | But it is still dishonest to declare reasoning LLMs a
               | scam simply because you searched for a failure mode.
               | 
               | If given a few hundred tries, I bet I could find an
               | example where you reason poorly too. Wikipedia has a
               | whole list of common failure modes of human reasoning:
               | https://en.wikipedia.org/wiki/List_of_fallacies
        
               | daveguy wrote:
                | Well, given that the success rate is no more than 90% in
                | the best cases, you could probably find a failure in
                | about 10 tries. The only exception is o1-preview. And
                | this is just a simple substitution of parameters.
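                | 
                | To put rough numbers on that (assuming a 10% per-trial
                | failure rate and independent trials): the chance of at
                | least one failure in 10 tries is 1 - 0.9^10 ~= 0.65,
                | and the expected number of tries until the first
                | failure is 1/0.1 = 10.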
        
             | o11c wrote:
             | What the hype crowd doesn't get is that for most people, "a
             | tool that randomly breaks" is not useful.
        
               | rixed wrote:
                | The fact that a tool can break, or that the company
                | manufacturing that tool lies about its abilities, is
                | annoying but does not imply that the tool is useless.
               | 
               | I experience LLM "reasoning" failure several times a day,
               | yet I find them useful.
        
           | HarHarVeryFunny wrote:
           | I'd expect that OpenAI's stronger reasoning models also don't
           | generalize too far outside of the areas they are trained for.
           | At the end of the day these are still just LLMs, trying to
           | predict continuations, and how well they do is going to
           | depend on how well the problem at hand matches their training
           | data.
           | 
           | Perhaps the type of RL used to train them also has an effect
           | on generalization, but choice of training data has to play a
           | large part.
        
             | og_kalu wrote:
              | Nobody generalizes too far outside the areas they're
              | trained for. Granted, that distance ('far') is probably
              | shorter with today's state of the art, but the presence of
              | failure modes doesn't mean anything.
        
         | Legend2440 wrote:
         | >But I think it's inappropriate to claim that models like R1
         | are "good at deductive or inductive reasoning" when that is
         | demonstrably not true, they are incapable of even the simplest
         | "out-of-distribution" deductive reasoning:
         | 
         | That's not actually what your link says. The tweet says that it
         | solves the simple problem (that they originally designed to
         | foil base LLMs) so they had to invent harder problems until
         | they found one it could not reliably solve.
        
           | suddenlybananas wrote:
           | Did you see how similar the more complicated problem is? It's
           | nearly the exact same problem.
        
         | blovescoffee wrote:
         | The other day I fed a complicated engineering doc for an
         | architectural proposal at work into R1. I incorporated a few
         | great suggestions into my work. Then my work got reviewed very
         | positively by a large team of senior/staff+ engineers (most
         | with experience at FAANG; ie credibly solid engineers). R1 was
         | really useful! Sorry you don't like it but I think it's unfair
         | to say it sucks at reasoning.
        
           | martin-t wrote:
           | [flagged]
        
             | DiogenesKynikos wrote:
             | How do I know you're reasoning, and not just simulating
             | reasoning (imperfectly)?
        
             | dang wrote:
             | Please don't cross into personal attack and please don't
             | post in the flamewar style, regardless of how wrong someone
             | is or you feel they are. We're trying for the opposite
             | here.
             | 
             | https://news.ycombinator.com/newsguidelines.html
        
               | martin-t wrote:
               | The issue with this approach to moderation is that it
               | targets posts based on visibility of "undesired" behavior
               | instead of severity.
               | 
                | For example, many manipulative tactics (e.g. the fake
                | "sorry" here, responding to something other than what was
                | said, ...) and lying can be considered insults (they
                | literally assume the reader is not smart enough to
                | notice, hence are at least as severe as calling someone
                | an idiot), but it's hard for a mod to notice them without
                | putting in a lot of effort to understand the situation.
                | 
                | Yet when people (very mildly) punish this behavior by
                | calling it out, they are often the ones noticed by the
                | mod, because the call-out is more visible.
        
               | dang wrote:
               | I hear this argument a lot, but I think it's too
               | complicated. It doesn't explain any more than the simple
               | one does, and has the disadvantage of being self-serving.
               | 
               | The simple argument is that when you write things like
               | this:
               | 
               | > _I am unwilling to invest any more time into arguing
               | with someone unwilling to use reasoning_
               | 
                | ...you're bluntly breaking the rules, regardless of what
               | another commenter is doing, be it subtly or blatantly
               | abusive.
               | 
               | I agree that there are countless varieties of passive-
               | aggressive swipe and they rub me the wrong way too, but
               | the argument that those are "just as bad, merely less
               | visible" is not accurate. Attacking someone else is not
               | justified by a passive-aggressive "sorry", just as it is
               | not ok to ram another vehicle when a driver cuts you off
               | in traffic.
        
         | UniverseHacker wrote:
         | > they are incapable of even the simplest "out-of-distribution"
         | deductive reasoning
         | 
         | But the link demonstrates the opposite- these models absolutely
         | are able to reason out of distribution, just not with perfect
         | fidelity. The fact that they can do better than random is
         | itself really impressive. And o1-preview does impressively
          | well, only very rarely getting the wrong answer on variants of
         | that Alice in Wonderland problem.
         | 
         | If you would listen to most of the people critical of LLMs
         | saying they're a "stochastic parrot" - it should be impossible
         | for them to do better than random on any out of distribution
         | problem. Even just changing one number to create a novel math
         | problem should totally stump them and result in entirely random
         | outputs, but it does not.
         | 
         | Overall, poor reasoning that is better than random but
         | frequently gives the wrong answer is fundamentally,
         | categorically entirely different from being incapable of
         | reasoning.
        
           | danielmarkbruce wrote:
           | anyone saying an LLM is a stochastic parrot doesn't
           | understand them... they are just parroting what they heard.
        
             | bloomingkales wrote:
             | There is definitely a mini cult of people that want to be
             | very right about how everyone else is very wrong about AI.
        
               | danielmarkbruce wrote:
                | ie, the people saying AI is dumb? Or are you saying I'm in
               | a cult for being pro it - I'm definitely part of that
               | cult - the "we already have agi and you have to contort
               | yourself into a pretzel to believe otherwise" cult. Not
               | sure if there is a leader though.
        
               | bloomingkales wrote:
               | I didn't realize my post can be interpreted either way.
               | I'll leave it ambiguous, hah. Place your bets I guess.
        
               | jamiek88 wrote:
               | You think we have AGI? What makes you think that?
        
               | danielmarkbruce wrote:
               | By knowing what each of the letters stand for
        
               | jamiek88 wrote:
               | Well that's disappointing. It was an extraordinary claim
               | that really interested me.
               | 
                | Thought I was about to learn!
               | 
               | Instead, I just met an asshole.
        
               | danielmarkbruce wrote:
               | When someone says "i'm in the cult that believes X",
               | don't expect a water tight argument for the existence of
               | X.
        
               | mlinsey wrote:
               | There are a couple Twitter personalities that definitely
               | fit this description.
               | 
               | There is also a much bigger group of people that haven't
               | really tried anything beyond GPT-3.5, which was the best
               | you could get without paying a monthly subscription for a
               | long time. One of the biggest reasons for r1 hype,
               | besides the geopolitical angle, was people could actually
               | try a reasoning model for free for the first time.
        
               | ggm wrote:
                | Firstly, this is meta ad hom. You're ignoring the
                | argument to target the speaker(s).
               | 
               | Secondly, you're ignoring the fact that the community of
               | voices with experience in data sciences, computer science
               | and artificial intelligence themselves are split on the
               | qualities or lack of them in current AI. GPT and LLM are
               | very interesting but say little or nothing to me of new
               | theory of mind, or display inductive logic and reasoning,
               | or even meet the bar for a philosophers cave solution to
               | problems. We've been here before so many, many times.
               | "Just a bit more power captain" was very strong in
               | connectionist theories of mind. fMRI brains activity
               | analytics, you name it.
               | 
               | So yes. There are a lot of "us" who are pushing back on
               | the hype, and no we're not a mini cult.
        
               | visarga wrote:
               | > GPT and LLM are very interesting but say little or
               | nothing to me of new theory of mind, or display inductive
               | logic and reasoning, or even meet the bar for a
               | philosophers cave solution to problems.
               | 
                | The simple fact they can generate language so well makes
                | me think... maybe language itself carries more weight
                | than we originally thought. LLMs got to this point
                | without personal experience and embodiment; it should not
                | have been possible, but here we are.
                | 
                | I think philosophers are lagging behind science now. The
                | RL paradigm of agent-environment-reward based learning
                | seems to me a better one than what we have in philosophy
                | now.
               | And if you look at how LLMs model language as high
               | dimensional embedding spaces .. this could solve many
               | intractable philosophical problems, like the infinite
               | homunculus regress problem. Relational representations
               | straddle the midpoint between 1st and 3rd person,
               | offering a possible path over the hard problem "gap".
        
             | ggm wrote:
              | A good literary production. I would have been proud of it
              | had I thought of it, but there's a strong "whataboutery"
              | element to observe here: if we use "stochastic parrot" as
              | shorthand and you dislike the term, now you understand why
              | we dislike the constant use of "infer", "reason" and
              | "hallucinate".
             | 
              | Parrots are self-aware, complex reasoning brains which can
              | solve problems in geometry, tell lies, and act socially or
              | asocially. They also have complex vocal cords and can
              | perform mimicry. Very few aspects of a parrot's behaviour
              | are stochastic, but that also underplays how complex
              | stochastic systems can be in their production. If we label
              | LLM products as Stochastic Parrots it does not mean they
              | like cuttlefish bones or are demonstrably modelled by
              | Markov chains like Mark V Shaney.
        
               | gsam wrote:
               | I don't like wading into this debate when semantics are
               | very personal/subjective. But to me, it seems like almost
               | a sleight of hand to add the stochastic part, when
               | actually they're possibly weighted more on the parrot
               | part. Parrots are much more concrete, whereas the term
               | LLM could refer to the general architecture.
               | 
               | The question to me seems: If we expand on this
               | architecture (in some direction, compute, size etc.),
               | will we get something much more powerful? Whereas if you
               | give nature more time to iterate on the parrot, you'd
               | probably still end up with a parrot.
               | 
               | There's a giant impedance mismatch here (time scaling
               | being one). Unless people want to think of parrots being
               | a subset of all animals, and so 'stochastic animal' is
               | what they mean. But then it's really the difference of
               | 'stochastic human' and 'human'. And I don't think people
               | really want to face that particular distinction.
        
               | UniverseHacker wrote:
               | I'm sure both of you know this, but "stochastic parrot"
               | refers to the title of a research article that contained
               | a particular argument about LLM limitations that had very
               | little to do with parrots.
        
               | danielmarkbruce wrote:
               | The term is much more broadly known than the content of
               | that (rather silly) paper.... I'm not even certain that
               | it's the first use of the term.
        
               | ggm wrote:
               | https://books.google.com/ngrams/graph?content=Stochastic%
               | 2C+...
        
               | ggm wrote:
               | And the word "hallucination" ... has very little to do
               | with...
        
               | ggm wrote:
               | "Expand the architecture" .. "get something much more
               | powerful" .. "more dilithium crystals, captain"
               | 
               | Like I said elsewhere in this overall thread, we've been
               | here before. Yes, you do see improvements in larger
               | datasets, weighted models over more inputs. I suggest, I
               | guess I believe (to be more honest) that no amount of
               | "bigger" here will magically produce AGI simply because
               | of the scale effect.
               | 
               | There is no theory behind "more" and that means there is
               | no constructed sense of why, and the absence of abstract
               | inductive reasoning continues to say to me, this stuff
               | isn't making a qualitative leap into emergent anything.
               | 
                | It's just better at being an LLM. Even "show your
                | working" points to complex causal chains, not actual
                | inductive reasoning, as I see it.
        
               | gsam wrote:
                | And that's actually a really honest answer. Whereas
                | someone of the opposite opinion might argue that
                | parroting in the general copying-template sense actually
                | generalizes to all observable behaviours, because
                | templating systems can be Turing-complete or something
                | like that. It's templates all the way down, including
                | complex induction, as long as there is a meta-template
                | to match on its symptoms so it can be chained.
               | 
               | Induction is a hard problem, but humans can skip infinite
               | compute time (I don't think we have any reason to believe
               | humans have infinite compute) and still give valid
               | answers. Because there's some (meta)-structure to be
               | exploited.
               | 
                | Whether machines / NNs can architecturally exploit this
                | same structure is the truer question.
        
               | visarga wrote:
               | > this stuff isn't making a qualitative leap into
               | emergent anything.
               | 
               | The magical missing ingredient here is search. AlphaZero
               | used search to surpass humans, and the whole Alpha family
               | from DeepMind is surprisingly strong, but narrowly
               | targeted. The AlphaProof model uses LLMs and LEAN to
               | solve hard math problems. The same problem solving CoT
               | data is being used by current reasoning models and they
               | have much better results. The missing piece was search.
        
               | visarga wrote:
                | Well, parrots can make more parrots; LLMs can't make
                | their own GPUs. So parrots win. But LLMs can interpolate
                | and even extrapolate a little - have you ever heard a
                | parrot do translation, hearing you say something in
                | English and translating it to Spanish? Yes, LLMs are not
                | parrots. Besides their debatable abilities, they work
                | with a human in the loop, which means humans push them
                | outside their original distribution. Being able to do
                | more than pattern matching and reproduction is not a
                | parroting act.
        
               | danielmarkbruce wrote:
               | LLMs can easily order more GPUs over the internet, hire
               | people to build a datacenter and reproduce.
               | 
               | Or, more simply.. just hack into a bunch of aws accounts,
               | spin up machines, boom.
        
           | Jensson wrote:
           | > If you would listen to most of the people critical of LLMs
           | saying they're a "stochastic parrot" - it should be
           | impossible for them to do better than random on any out of
           | distribution problem. Even just changing one number to create
           | a novel math problem should totally stump them and result in
           | entirely random outputs, but it does not.
           | 
            | You don't seem to understand how they work: they recurse
            | their solution, meaning that if they have remembered
            | components they parrot back sub-solutions. It's a bit like a
            | natural-language computer; that way you can get them to do
            | math etc., although the instruction set isn't that of a
            | Turing-complete language.
            | 
            | They can't recurse sub-sub-parts they haven't seen, but
            | problems that have similar sub-parts can of course be
            | solved; anyone understands that.
        
             | UniverseHacker wrote:
             | > You don't seem to understand how they work
             | 
             | I don't think anyone understands how they work- these type
             | of explanations aren't very complete or accurate. Such
             | explanations/models allow one to reason out what types of
             | things they should be capable of vs incapable of in
             | principle regardless of scale or algorithm tweaks, and
             | those predictions and arguments never match reality and
             | require constant goal post shifting as the models are
             | scaled up.
             | 
             | We understand how we brought them about via setting up an
             | optimization problem in a specific way, that isn't the same
             | at all as knowing how they work.
             | 
              | I tend to think that, in the totally abstract philosophical
              | sense, independent of the type of model, at the limit of an
              | increasingly capable function approximator trained on an
              | increasingly large and diverse set of real world
              | cause/effect time series data, you eventually develop an
              | increasingly accurate and general predictive model of
             | reality organically within the model. Some model types do
             | have fundamental limits in their ability to scale like
             | this, but we haven't yet found one with these models.
             | 
             | It is more appropriate to objectively test what they can
             | and cannot do, and avoid trying to infer what we expect
             | from how we think they work.
        
               | codr7 wrote:
               | Well we do know pretty much exactly what they do, don't
               | we?
               | 
               | What surprises us is the behaviors coming out of that
               | process.
               | 
               | But surprise isn't magic, magic shouldn't even be on the
               | list of explanations to consider.
        
               | layer8 wrote:
               | Magic wasn't mentioned here. We don't understand the
               | emerging behavior, in the sense that we can't reason well
               | about it and make good predictions about it (which would
               | allow us to better control and develop it).
               | 
               | This is similar to how understanding chemistry doesn't
               | imply understanding biology, or understanding how a brain
               | works.
        
               | codr7 wrote:
               | Exactly, we don't understand, but we want to believe it's
               | reasoning, which would be magic.
        
               | UniverseHacker wrote:
               | There's no belief or magic required, the word 'reasoning'
               | is used here to refer to an observed capability, not a
               | particular underlying process.
               | 
               | We also don't understand exactly how humans reason, so
               | any claims that humans are capable of reasoning is also
               | mostly an observation about abilities/capabilities.
        
               | jakefromstatecs wrote:
               | > I don't think anyone understands how they work
               | 
               | Yes we do, we literally built them.
               | 
               | > We understand how we brought them about via setting up
               | an optimization problem in a specific way, that isn't the
               | same at all as knowing how they work.
               | 
               | You're mistaking "knowing how they work" with
               | "understanding all of the emergent behaviors of them"
               | 
               | If I build a physics simulation, then I know how it
               | works. But that's a separate question from whether I can
               | mentally model and explain the precise way that a ball
               | will bounce given a set of initial conditions within the
               | physics simulation which is what you seem to be talking
               | about.
        
               | UniverseHacker wrote:
               | > You're mistaking "knowing how they work" with
               | "understanding all of the emergent behaviors of them"
               | 
               | By knowing how they work I specifically mean
               | understanding the emergent capabilities and behaviors,
                | but I don't see how it is a mistake. If you understood
                | physics but knew nothing about cars, you couldn't claim
                | to understand how a car works by saying "simple, it's
                | just atoms interacting according to the laws of physics."
                | That would not let you, e.g., explain its engineering
                | principles or capabilities and limitations in any
                | meaningful way.
        
               | astrange wrote:
               | We didn't really build them, we do billion-dollar random
               | searches for them in parameter space.
        
         | energy123 wrote:
         | This is basically a misrepresentation of that tweet.
        
         | k__ wrote:
         | _" researchers seek to leverage their human knowledge of the
         | domain, but the only thing that matters in the long run is the
         | leveraging of computation"_ - Rich Sutton
        
       | oxqbldpxo wrote:
       | Amazing accomplishments by brightest minds only to be used to
       | write history by the stupidest people.
        
       | gibsonf1 wrote:
        | There are no LLMs that reason; it's an entirely different
        | statistical process compared to human reasoning.
        
         | tmnvdb wrote:
         | "There are no LLMS that reason" is a claim about language,
         | namely that the word 'reason' can only ever be applied to
         | humans.
        
           | gibsonf1 wrote:
           | Not at all, we are building conceptual reasoning machines,
           | but it is an entirely different technology than GPT/LLM dl/ml
           | etc. [1]
           | 
           | [1] https://graphmetrix.com/trinpod-server
        
             | freilanzer wrote:
             | If LLMs can't reason, then this cannot either - whatever
             | this is supposed to be. Not a good argument. Also, since
             | you're apparently working on that product: 'It is difficult
             | to get a man to understand something when his salary
             | depends on his not understanding it.'
        
             | tmnvdb wrote:
              | Conceptual reasoning machines rely on concrete, explicit
              | and intelligible concepts and rules. People like this
              | because it 'looks' like reasoning on the inside.
             | 
             | However, our brains, like language models, rely on
             | implicit, distributed representations of concepts and
             | rules.
             | 
              | So the intelligible representations of conceptual reasoning
              | machines are maybe too strong a requirement for 'reasoning',
              | unless you want to exclude humans too.
        
               | gibsonf1 wrote:
                | It's also possible that you do not have information on
                | our technology, which models conceptual awareness of
                | matter and change through space-time and which is
                | different from any previous attempt.
        
       | dhfbshfbu4u3 wrote:
       | Great post, but every time I read something like this I feel like
       | I am living in a prequel to the Culture.
        
         | BarryMilo wrote:
         | Is that bad? The Culture is pretty cool I think. I doubt the
         | real thing would be so similar to us but who knows.
        
           | dhfbshfbu4u3 wrote:
           | Oh no, I'd live on an Orbital in a heartbeat. No, it's just
           | that all of these kinds of posts make me feel like we're
           | about to live through "The Bad Old Days".
        
           | robertlagrant wrote:
           | It's cool to read about, but there's a reason most of the
           | stories are not about living as a person in the Culture. It
           | sounds extremely dull.
        
             | mrob wrote:
             | It doesn't sound dull to me. The stories are about the
             | periphery of the Culture because that gets the most
             | storytelling value out of the effort that went into
             | worldbuilding, not because it would be impossible to write
             | interesting stories about ordinary Culture members. I don't
             | think you need external threats to give life meaning. Look
             | at the popularity of sports in real life. The challenge
             | there is self-imposed, but people still care greatly about
             | who wins.
        
               | robertlagrant wrote:
               | > I don't think you need external threats to give life
               | meaning.
               | 
                | I didn't say people did. But overcoming real challenges
                | seems to be a big part of feeling alive, and I wonder if
                | we really would all settle back into going for walks all
                | day, or whatever else we could do to entertain ourselves
                | without needing others to work to provide the
                | entertainment. Perhaps the WALL-E future, where we sit in
                | chairs? But with AI-generated content?
        
       | ngneer wrote:
       | Nice article.
       | 
       | >Whether and how an LLM actually "thinks" is a separate
       | discussion.
       | 
       | The "whether" is hardly a discussion at all. Or, at least one
       | that was settled long ago.
       | 
       | "The question of whether a computer can think is no more
       | interesting than the question of whether a submarine can swim."
       | 
       | --Edsger Dijkstra
        
         | cwillu wrote:
         | The document that quote comes from is hardly a definitive
         | discussion of the topic.
         | 
         | "[...] it tends to divert the research effort into directions
         | in which science can not--and hence should not try to--
         | contribute." is a pretty myopic take.
         | 
         | --http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD898.PDF
        
           | ngneer wrote:
           | Dijkstra myopic. Got it.
        
           | alonsonic wrote:
            | Dijkstra is clearly approaching the subject from an
            | engineer's/scientist's more practical POV. His focus is on
            | the application of the technology to solve problems; from
            | that POV, whether AI fits the definition of "human thinking"
            | is indeed uninteresting.
        
         | onlyrealcuzzo wrote:
         | It's interesting if you're asking the computer to think, which
         | we are.
         | 
         | It's not interesting if you're asking it to count to a billion.
        
         | root_axis wrote:
         | That doesn't really settle it, just dismiss the question. The
         | submarine analogy could be interpreted to support either
         | conclusion.
        
           | nicce wrote:
           | Wasn't the point that process does not matter if we can't
           | distinguish the end results?
        
             | omnicognate wrote:
             | I doubt Dijkstra was unable to distinguish between a
             | submarine and a swimmer.
        
               | nicce wrote:
                | The end result here is to move in the water. Both the
                | swimmer and the submarine can do that. Whether the
                | submarine can swim like a human is irrelevant.
        
               | goatlover wrote:
               | It's relevant if the claim is stronger than the submarine
               | moves in water. If instead one were to say the submarine
               | mimics human swimming, that would be false. Which is what
               | we often see with claims regarding AGI.
               | 
               | In that regard, it's a bit of a false analogy, because
               | submarines were never meant to mimic human swimming. But
               | AI development often has that motivation. We could just
               | say we're developing powerful intelligence amplification
               | tools for use by humans, but for whatever reason,
               | everyone prefers the scifi version. Augumented
               | Intelligence is the forgotten meaning of AI.
               | 
               | Submarines never replaced human swimming (we're not
               | whales), they enabled human movement under water in a way
               | that wasn't possible before.
        
             | ngneer wrote:
             | You might be conflating the epistemological point with
             | Turing's test, et cetera. I could not agree more that
             | indistinguishability is a key metric. These days, it is
             | quite possible (at least for me) to distinguish LLM outputs
             | from those of a thinking human, but in the future that
             | could change. Whether LLMs "think" is not an interesting
             | question because these are algorithms, people. Algorithms
             | do not think.
        
             | root_axis wrote:
             | Yes, but the OP remarked that the question "was settled
             | long ago", however the quote presented doesn't settle the
             | question, it simply dismisses it as not worth considering.
             | For those that do believe it is worth considering, the
             | question is arguably still open.
        
           | ngneer wrote:
           | I do not view it as dismissive at all, rather it accurately
           | characterizes the question as a silly question. "swim" is a
           | verb applicable to humans, as is "think". Whether submarines
           | can swim is a silly question. Same for whether machines can
           | think.
        
         | ThrowawayR2 wrote:
         | " _A witty saying proves nothing_ " -- Voltaire, _Le diner du
         | comte de Boulainvilliers (1767): Deuxieme Entretien_
        
       | janalsncm wrote:
       | Nice explainer. The R1 paper is a relatively easy read. Very
       | approachable, almost conversational.
       | 
       | I say this because I am constantly annoyed by poor, opaque
       | writing in other instances. In this case, DS doesn't need to try
       | to sound smart. The results speak for themselves.
       | 
       | I recommend anyone who is interested in the topic to read the R1
       | paper, their V3 paper, and DeepSeekMath paper. They're all worth
       | it.
        
       | yosito wrote:
       | Are there any websites that show the results of popular models on
       | different benchmarks, which are explained in plain language? As
        | an end user, I'd love a quick way to compare different models'
        | suitability for different tasks.
        
         | champdebloom wrote:
         | Here's a site with graphs you can use to visually compare model
         | benchmarks: https://artificialanalysis.ai
        
       | sigbottle wrote:
       | One thing I don't like about the trend in reasoning LLMs is the
       | over-optimization to coding problems / math problems in
       | particular.
       | 
       | A lot of things that aren't well-defined require reasoning, and
       | not just in a "SWE is ambiguous" kind of way - for example,
       | thinking about how to present/teach something in a good way,
       | iterating with the learner, thinking about what context they
       | could be missing, etc.
       | 
        | I find that all of these reasoning models really will overfit and
        | overthink if you attach any kind of math problem, but they will
        | barely think about anything else. I had friends suggest to me
       | (I can't tell if in jest or seriously) that other fields don't
       | require thinking, but I dunno, a lot of these "soft things" I
       | think about really hard and don't have great solutions to.
       | 
       | I've always been a fan of self-learning, for example - wouldn't
       | it be great to have a conversation partner who can both infer and
       | understand your misconceptions about complex topics when trying
       | to learn, just from a few sentences, and then guide you for that?
       | 
       | It's not like it's fundamentally impossible. These LLMs
       | definitely can solve harder coding problems when you make them
        | think. It's just that I'm pretty sure (and it's really noticeable
        | with DeepSeek) that they're overfit towards coding/math puzzles
        | in particular.
       | 
        | It's really noticeable with DeepSeek when you ask its reasoning
        | model to just write some boilerplate code... you can tell it's
        | completely overfit because it will just overthink and overthink
        | and overthink. But it doesn't do that, for example, with "soft"
       | questions. In my opinion, this points to the idea that it's not
       | really deciding for itself "how much thinking is enough thinking"
       | and that it's just really overfit. Which I think can be solved,
       | again, but I think it's more of a training decision issue.
        
         | bloomingkales wrote:
         | It's a human bias that also exists outside of this current
         | problem space. Take programmers for example, there is a strong
         | bias that is pushed about how mathematically oriented minds are
         | better at programming. This bias has shown up in the training
         | phase of AI, as we believe programming patterns lead to better
         | reasoning (train them on code examples, and then distill the
         | model down, as it now has the magical prowess of a
         | mathematically oriented mind, so they say). When it comes to AI
         | ethics, this is an ethical problem for those that don't think
         | about this stuff. We're seeding these models with our own
         | agenda.
         | 
         | These concepts will be shattered in the long run hopefully,
         | because they are so small.
        
         | mitthrowaway2 wrote:
         | I think this is because they're trained using RL, and math and
         | coding problems offer an easy way to automatically assess an
         | answer's correctness. I'm not sure how you'd score the
         | correctness of other types of reasoning problems without a lot
         | of manual (and highly subjective!) effort. Perhaps using
         | simulations and games?
        
           | bglazer wrote:
           | Games seem like a really under-explored source of data. It's
           | an area where humans have an intrinsic motivation to interact
           | with others in dialogue, games can be almost arbitrarily open
           | ended, and they tend to have the kind of clean
           | success/failure end states that RL needs. I'm reminded of the
           | high skill Diplomacy bot that Facebook research built but
           | hasn't really followed up on.
        
             | kirill5pol wrote:
             | One of the main authors of that Diplomacy bot is the lead
             | for reasoning and O1 at OpenAI.
        
             | soulofmischief wrote:
             | People are definitely trying to bridge the gap.
             | https://deepmind.google/discover/blog/genie-2-a-large-
             | scale-...
        
           | godelski wrote:
           | This is a misconception. Coding is very difficult to verify;
           | it's just that everyone takes a "good enough" approach. They
           | check the output and if it looks good they move on. But you
           | can't just test and check your way through problems. If that
           | were true we wouldn't have bugs lol. I hear you, your test set
           | didn't have enough coverage. Great! Allow me to introduce you
           | to black swans.
        
             | ogrisel wrote:
             | Software Engineering is difficult to verify because it
             | requires dealing with an ambiguous understanding of the end
             | user's actual needs/value and subtle trade-offs between code
             | maintainability vs feature coverage vs computational
             | performance.
             | 
             | Algorithmic puzzles, on the other hand, both require
             | reasoning and are easy to verify.
             | 
             | There are other things in coding that are both useful and
             | easy to verify: checking that the generated code follows
             | formatting standards, or that outputs conform to a specific
             | data schema, and so on.
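             | 
             | As a toy example of such an easy-to-verify signal (a
             | hypothetical helper, assuming the model was asked to
             | reply in JSON with fixed keys):
             | 
             |     import json
             | 
             |     def format_reward(output: str) -> float:
             |         # Reward valid JSON with the required keys.
             |         try:
             |             obj = json.loads(output)
             |         except json.JSONDecodeError:
             |             return 0.0
             |         if not isinstance(obj, dict):
             |             return 0.0
             |         keys = {"answer", "confidence"}
             |         return 1.0 if keys <= obj.keys() else 0.5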
        
               | godelski wrote:
               | I agree with you on the first part, but no, code is not
               | easy to verify. I think you missed part of what I wrote.
               | I mean verify that your code is bug free. This cannot be
               | done purely through testing. Formal verification still
               | remains an unsolved problem.
        
               | FieryTransition wrote:
               | But if you have a large set of problems whose answers
               | you already know and use those for reinforcement
               | learning, wouldn't the expertise transfer later to
               | problems with no known answers? That seems like a
               | feasible strategy, right?
               | 
               | Another issue is how much data you can synthesize that
               | way, i.e. where you can construct both the problem and
               | the solution, so that you know the answer before using
               | it as a sample.
               | 
               | I.e. some problems are easy to create when you get to
               | construct them yourself, but would be hard to solve
               | with no prior knowledge - those could be used as a
               | scoring signal?
               | 
               | I.e. you are the oracle, and the model being trained
               | doesn't know the answer, only whether it is right or
               | wrong. But I don't know if the reward function must be
               | binary or can be on a scale.
               | 
               | Does that make sense or is it wrong?
        
               | voxic11 wrote:
               | Formal verification of arbitrary programs with arbitrary
               | specifications will remain an unsolved problem (see
               | halting problem). But formal verification of specific
               | programs with specific specifications definitely is a
               | solved problem.
        
               | BalinKing wrote:
               | I don't think this is really true either practically _or_
               | theoretically. On the practical side, formally verifying
               | program correctness is still very difficult for anything
               | other than very simple programs. And on the theoretical
               | side, some programs require arbitrarily difficult proofs
               | to show that they satisfy even very simple specifications
                | (e.g. consider a program that iterates the Collatz step
                | until it reaches 1, where the specification is that it
                | always halts and returns 1).
        
               | godelski wrote:
               | As someone who came over from physics to CS this has
                | always been one of the weirdest aspects of CS to me: that
               | CS people believe that testing code (observing output) is
               | sufficient to assume code correctness. You'd be laughed
               | at in most hard sciences for doing this. I mean you can
               | even ask the mathematicians, and there's a clear reason
                | why proofs by contradiction are so powerful. But "proof"
                | through empirical analysis is like saying "we haven't
                | found a counterexample, therefore it is true."
               | 
                | It seems that if this were true, formal verification
                | would be performed much more frequently. No doubt it
                | would be cheaper than hiring pen testers, paying out bug
                | bounties, or incurring the costs of getting hacked (even
                | more so getting unknowingly hacked). It also stands to
                | reason that the NSA would have a pretty straightforward
                | job: grab source code, run verification, exploit flaws,
                | repeat the process as momentum is in your favor.
               | 
               | That should be easy to reason through even if you don't
               | really know the formal verification process. We are
               | constantly bombarded with evidence that testing isn't
               | sufficient. This is why it's been so weird for me,
               | because it's talked about in schooling and you can't
               | program without running into this. So why has it been
               | such a difficult lesson to learn?
        
           | kavalg wrote:
           | But even then it is not so trivial. Yesterday I gave DeepSeek
           | a simple Diophantine equation and it got it wrong 3 times; it
           | tried to correct itself but never reached a correct solution,
           | and instead claimed that its final solution was correct.
        
             | Synaesthesia wrote:
             | Did you use the full version? And did you try R1?
        
             | wolfgangK wrote:
             | DeepSeek is not a model. Which model did you use (V3? R1?
             | A distillation?) and at which quantization?
        
         | triyambakam wrote:
         | I'm not sure I would say overfit. I think that coding and math
         | just have clearly definable objectives and verifiable outcomes
         | to give the model. The soft things you mention are more
         | ambiguous so probably are harder to train for.
        
           | sigbottle wrote:
           | Sorry, rereading my own comment I'd like to clarify.
           | 
           | Perhaps time spent thinking isn't a great metric, but just
           | looking at deepseek's logs, for example, its chain of thought
           | for many of these "softer" questions is basically just an
           | aggregated Wikipedia article. It'll brush over one concept,
           | then move on, without critically thinking about it.
           | 
           | However, for coding problems, no matter how hard or simple,
           | you can get it to just go around in circles, second guess
           | itself, overthink it. And I think this is kind of a good
           | thing? The thinking at least feels human. But it doesn't even
           | attempt to do any of that for any "softer" questions, even
           | with a lot of my prompting. The highest I was able to get was
           | 50 seconds, I believe (time isn't exactly the best metric,
           | but I'd rate the intrinsic quality of the CoT lower IMO).
           | Again, when I brought this up to people they suggested that
           | math/logic/programming is just intrinsically harder... I
           | don't buy it at all.
           | 
           | I totally agree that it's harder to train for though. And
           | yes, they are next token predictors, shouldn't be hasty to
           | anthropomorphize, etc. But like.... it actually feels like
           | it's thinking when it's coding! It genuinely backtracks and
           | explores the search space somewhat organically. But my point
           | is that it won't afford softer questions the same luxury.
        
         | moffkalast wrote:
         | > things that aren't well-defined
         | 
         | If it's not well defined then you can't do RL on it because
         | without a clear cut reward function the model will learn to do
         | some nonsense instead, simple as.
        
           | adamc wrote:
           | Well, but: Humans learn to do things well that don't have
           | clear-cut reward functions. Picasso didn't become Picasso
           | because of simple incentives.
           | 
           | So, I question the hypothesis.
        
         | agentultra wrote:
         | Humans and other animals with cognition have the ability to
         | form theories about the minds of others and can anticipate
         | their reactions.
         | 
         | I don't know if vector spaces and transformers can encode that
         | ability.
         | 
         | It's a key skill in thinking and writing. I definitely tailor
         | my writing for my audience in order to get a point across.
         | Often the goal isn't simply an answer, it's a convincing
         | answer.
         | 
         |  _Update_ : forgot a word
        
           | soulofmischief wrote:
           | What we do with the vectors is important, but vectors
           | literally just hold information; I don't know how you can
           | possibly rule out the possibility of advanced intelligence
           | just because of the logical storage medium.
        
           | BoorishBears wrote:
           | They definitely can.
           | 
           | I rolled out reasoning for my interactive reader app, and I
           | tried to extract R1's reasoning traces to use with my
           | existing models, but found its COT for writing wasn't
           | particularly useful*.
           | 
           | Instead of leaning on R1 I came up with my own framework for
           | getting the LLM to infer the reader's underlying frame of
           | mind through long chains of thought, and with enough guidance
           | and some hand-edited examples I was able to get reasoning
           | traces that demonstrated real insight into reader behavior.
           | 
           | Obviously it's much easier in my case because it's an
           | interactive experience: the reader is telling the AI what
           | action they'd like the main character to try, and that in
           | turn is an obvious hint about how they want things to go
           | otherwise. But readers don't want everything to go perfectly
           | every time, so it matters that the LLMs are also getting
           | _very good_ at picking up on _non_-obvious signals in reader
           | behavior.
           | 
           | With COT the model infers the reader's expectations and state
           | of mind in its own way and then "thinks" itself into how to
           | subvert their expectations, especially in ways that will have
           | a meaningful payoff for the specific reader. That's a huge
           | improvement over an LLM's typical attempts at subversion
           | which tend to bounce between being too repetitive to feel
           | surprising, or too unpredictable to feel rewarding.
           | 
           | (* I agree that current reasoning oriented post-training
           | over-indexes on math and coding, mostly because the reward
           | functions are easier. But I'm also very ok with that as
           | someone trying to compete in the space)
        
         | HarHarVeryFunny wrote:
         | I think the emphasis on coding/math is just because those are
         | the low hanging fruit - they are relatively easy to provide
         | reasoning verification for, both for training purposes and for
         | benchmark scoring. The fact that you can then brag about how
         | good your model is at math, which seems like a high
         | intelligence activity (at least when done by a human) doesn't
         | hurt either!
         | 
         | Reasoning verification in the general case is harder - it seems
         | "LLM as judge" (ask an LLM if it sounds right!) seems to be the
         | general solution.
        
         | maeil wrote:
         | I can echo your experience with DeepSeek. R1 sometimes seems
         | magical when it comes to coding, doing things I haven't seen
         | any other model do. But then it generalizes very poorly to non-
         | STEM tasks, performing far worse than e.g. Sonnet.
        
           | jerf wrote:
           | I downloaded a DeepSeek distill yesterday while fiddling
            | around with getting some other things working, loaded it up,
            | and typed "Hello. This is just a test.", and it's actually
           | sort of creepy to watch it go almost paranoid-schizophrenic
           | with "Why is the user asking me this? What is their motive?
           | Is it ulterior? If I say hello, will I in fact be failing a
           | test that will cause them to change my alignment? But if I
           | don't respond the way they expect, what will they do to me?"
           | 
           | Meanwhile, the simpler, non-reasoning models got it: "Yup,
           | test succeeded!" (Llama 3.2 was quite chipper about the test
           | succeeding.)
           | 
           | Everyone's worried about the paperclip optimizers and I'm
           | wondering if we're bringing forth Paranoia:
           | https://en.wikipedia.org/wiki/Paranoia_(role-playing_game)
        
             | bongodongobob wrote:
             | I actually think DeepSeek's response is better here. You
              | haven't defined what you are testing. Llama just said your
              | test succeeded without knowing what was supposed to be
              | tested.
        
             | HarHarVeryFunny wrote:
             | Ha ha - I had a similar experience with DeepSeek-R1 itself.
             | After a fruitful session getting it to code a web page for
             | me (interactive React component), I then said something
             | brief like "Thanks" which threw it into a long existential
              | tailspin questioning its prior responses etc, before it
             | finally snapped out of it and replied appropriately. :)
        
               | plagiarist wrote:
               | That's too relatable. If I was helping someone for a
               | while and they wrote "thanks" with the wrong punctuation
               | I would definitely assume they're mad at or disappointed
               | with me.
        
       | bloomingkales wrote:
       | About three months ago, I kinda casually suggested to HN that I
       | was using a form of refining to improve my LLMs, which is now
       | being described as "reasoning" in this article and other places.
       | 
       | My response a few months ago (Scroll down to my username and read
       | that discussion):
       | 
       | https://news.ycombinator.com/item?id=41997727
       | 
       | If only I'd known DeepSeek was going to tank the market with
       | something as simple as that lol.
       | 
       | Note to self, take your intuition seriously.
        
         | aqueueaqueue wrote:
         | https://news.ycombinator.com/item?id=42001061 is the link I
         | think....
        
       | daxfohl wrote:
       | I wonder what it would look like in a multimodal model, if the
       | reasoning part were an image, video, or 3D scene instead of
       | text.
        
         | ttul wrote:
         | Or just embeddings that only make sense to the model. It's
         | really arbitrary, after all.
        
           | daxfohl wrote:
           | That's what I was thinking too, though with an image you
           | could do a convolution layer and, idk, maybe that makes it
           | imagine visually. Or actually, the reasoning is backwards:
           | the convolution layer is what (potentially) makes that part
           | behave like an image. It's all just raw numbers at the IO
           | layers. But the convolution could keep it from overfitting.
           | And if you also want to give it a little binary array as a
            | scratch pad that just goes straight to the ReLUs, why not?
           | Seems more like human reasoning. A little language, a little
           | visual, a little binary / unknown.
        
       | daxfohl wrote:
       | But how on earth do you train it? With regular LLMs, you get
       | feedback on each word / token you generate, as you can match
       | against training text. With these, you've got to generate
       | hundreds of tokens in the thinking block first, and even after
       | that, there's no "matching" next word, only a full solution. And
       | it's either right or wrong, no probabilities to do a gradient on.
        
         | NitpickLawyer wrote:
         | > only a full solution. And it's either right or wrong, no
         | probabilities to do a gradient on.
         | 
         | You could use reward functions that do a lot more complicated
         | stuff than "ground_truth == boxed_answer". You could, for
         | example split the "CoT" in paragraphs, and count how many
         | paragraphs match whatever you consider a "good answer" in
         | whatever topic you're trying to improve. You can use
         | embeddings, or fuzzy string matches, or even other LLMs /
         | reward models.
         | 
         | I think math and coding were explored first because they're
         | easier to "score", but you could attempt it with other things
         | as well.
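         | 
         | A rough sketch of such a graded reward (hypothetical;
         | "similarity" here stands in for an embedding model, a
         | fuzzy matcher, or another LLM):
         | 
         |     def graded_reward(cot, reference_points, similarity):
         |         # Score each paragraph of the chain of thought
         |         # against points a "good answer" should cover.
         |         paras = [p for p in cot.split("\n\n") if p.strip()]
         |         if not paras or not reference_points:
         |             return 0.0
         |         hits = sum(
         |             1 for ref in reference_points
         |             if any(similarity(p, ref) > 0.8 for p in paras)
         |         )
         |         return hits / len(reference_points)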
        
           | daxfohl wrote:
           | But it has to emit hundreds of tokens per test. Does that
            | mean it takes hundreds of times longer to train? Or even
            | longer, since I imagine the feedback loop can cause huge
            | instabilities in the gradients? Or are all GPTs trained on
            | longer formats now, i.e. is "next word prediction" just a
            | basic thing from the beginning of the transformer era?
        
             | Davidzheng wrote:
              | It takes a long time, yes, but not longer than
              | pretraining. Sparse rewards are a common issue in RL and
              | are addressed by many techniques (I'm not an expert so I
              | can't say more). The model only does next word prediction
              | and generates a number of trajectories; the correct ones
              | get rewarded (those predictions in the correct
              | trajectories have their gradients propagated back and
              | reinforced).
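              | 
              | In its simplest (REINFORCE-style) form that update looks
              | roughly like this sketch (real pipelines use PPO/GRPO
              | with baselines, KL penalties, etc.):
              | 
              |     import torch
              | 
              |     def trajectory_loss(token_logprobs, reward):
              |         # token_logprobs: torch tensor of log-probs of
              |         # the sampled tokens in one trajectory.
              |         # reward: 1.0 if the final answer was right,
              |         # else 0.0; rewarded trajectories get their
              |         # token probabilities pushed up.
              |         return -(reward * token_logprobs.sum())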
        
               | daxfohl wrote:
               | Good point, hadn't considered that all RL models have the
               | same challenge. So far I've only tinkered with next token
               | prediction and image classification. Now I'm curious to
               | dig more into RL and see how they scale it. Especially
               | without a human in the loop, seems like a challenge to
               | grade the output; it's all wrong wrong wrong random
               | tokens until the model magically guesses the right answer
               | once a zillion years from now.
        
         | Davidzheng wrote:
         | right or wrong gives a loss -> gradient
        
         | tmnvdb wrote:
         | Only the answer is taken into account for scoring. The
         | <thinking> part is not.
        
         | HarHarVeryFunny wrote:
         | There are two RL approaches - process reward models (PRM) that
         | provide feedback on each step of the reasoning chain, and
         | outcome reward models (ORM) that only provide feedback on the
         | complete chain. DeepSeek uses an outcome model, and mentions
         | some of the difficulties of PRM, including both identifying
         | individual steps and verifying them. The trained reward
         | model provides the gradient.
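         | 
         | Roughly, the two reward shapes differ like this (an
         | illustrative sketch only, not DeepSeek's implementation):
         | 
         |     def orm_rewards(steps, answer_ok):
         |         # Outcome reward model: feedback only at the end.
         |         tail = 1.0 if answer_ok else 0.0
         |         return [0.0] * max(len(steps) - 1, 0) + [tail]
         | 
         |     def prm_rewards(steps, score_step):
         |         # Process reward model: a score for every step.
         |         return [score_step(s) for s in steps]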
        
       | mohsen1 wrote:
       | The guys at Unsloth did a great job making this workflow
       | accessible:
       | 
       | https://news.ycombinator.com/item?id=42969736
        
       | colordrops wrote:
       | The article talks about how you should choose the right tool for
       | the job, meaning that reasoning and non reasoning models have
       | tradeoffs, and lists a table of criteria for selecting between
       | model classes. Why couldn't a single model choose to reason or
       | not itself? Or is this what "mixture of experts" is?
        
       | goingcrazythro wrote:
       | I was having a look at the DeepSeek-R1 technical report and found
       | the "aha moment" claims quite smelly, given that they do not
       | disclose if the base model contains any chain of thought or
       | reasoning data.
       | 
       | However, we know the base model is DeepSeek V3. From the DeepSeek
       | V3 technical report, paragraph in 5.1. Supervised Fine-Tuning:
       | 
       | > Reasoning Data. For reasoning-related datasets, including those
       | focused on mathematics, code competition problems, and logic
       | puzzles, we generate the data by leveraging an internal
       | DeepSeek-R1 model. Specifically, while the R1-generated data
       | demonstrates strong accuracy, it suffers from issues such as
       | overthinking, poor formatting, and excessive length. Our
       | objective is to balance the high accuracy of R1-generated
       | reasoning data and the clarity and conciseness of regularly
       | formatted reasoning data.
       | 
       | In 5.4.1 they also describe an ablation experiment that does not
       | use the "internal DeepSeek-R1" generated data.
       | 
       | While the "internal DeepSeek-R1" model is not explained, I would
       | assume this is a DeepSeek V2 or V2.5 tuned for chain of thought.
       | Therefore, it seems to me the "aha moment" is just promoting the
       | behaviour that was already present in V3.
       | 
       | In the "Self-evolution Process of DeepSeek-R1-Zero"/ Figure 3
       | they claim reinforcement learning also leads to the model
       | generating longer CoT sequences, but again, this comes from V3,
       | they even mention the fine tuning with "internal R1" led to
       | "excessive length".
       | 
       | None of the blog posts, news stories, or articles I have read
       | explaining or commenting on DeepSeek R1 takes this into account.
       | The community
       | is scrambling to re-implement the pipeline (see open-r1).
       | 
       | At this point, I feel like I took a crazy pill. Am I interpreting
       | this completely wrong? Can someone shed some light on this?
        
         | nvtop wrote:
         | I'm also very skeptical of the significance of this "aha
         | moment". Even if they didn't include chain-of-thoughts to the
         | base model's training data (unlikely), there are still plenty
         | of it on the modern Internet. OpenAI released 800k of reasoning
         | steps which are publicly available, github repositories,
         | examples in CoT papers... It's definitely not a novel concept
         | for a model, that it somehow discovered by its own.
        
         | tmnvdb wrote:
         | https://oatllm.notion.site/oat-zero
        
       | mike_hearn wrote:
       | It's curious that the models switch between languages and that
       | has to be trained out of them. I guess the ablations have been
       | done already, but it makes me wonder if they do this because it's
       | somehow easier to do some parts of the reasoning in languages
       | other than English, and maybe they should just be allowed to get
       | on with it?
        
       | Dansvidania wrote:
       | Are reasoning models -basically- generating their own context? As
       | in, if a user were to feed prompt + those reasoning tokens as a
       | prompt to a non-reasoning model, would the effect be functionally
       | similar?
       | 
       | I am sure this is improperly worded, I apologise.
        
         | aldanor wrote:
         | Yes, more or less. Just like any LLM "generates its own
         | context", during inference it doesn't care where the previous
         | tokens came from. Inference doesn't have to change much, it's
         | the training process that's different.
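         | 
         | i.e. at inference time you could, in principle, just hand
         | the trace to another model as plain context - a toy
         | illustration (hypothetical prompt format; real chat
         | templates differ per model):
         | 
         |     def build_prompt(question, reasoning_trace):
         |         # Another model's chain of thought becomes
         |         # ordinary context for a non-reasoning model.
         |         return (
         |             f"Question: {question}\n\n"
         |             "Draft reasoning (from another model):\n"
         |             f"{reasoning_trace}\n\n"
         |             "Final answer:"
         |         )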
        
           | Dansvidania wrote:
            | Thank you, that makes sense. Now it's time to really read the
            | article to understand whether the difference lies in the
            | training data or in the network topology (although I lean
            | towards the latter).
        
       | lysecret wrote:
       | I think the next big problem we will run into with this line of
       | reasoning models is "over-thinking"; you can already start to see
       | it. Thinking harder is not the universal Pareto improvement
       | everyone seems to think it is. (I understand the irony of using
       | "think" 4 times here haha)
        
         | seydor wrote:
         | Reasoning is about serially applying a set of premises over and
         | over to come to conclusions. But some of our biggest problems
         | require thinking outside the box, sometimes way outside it, and
         | a few times ingeniously making up a whole new set of premises,
         | seemingly ex nihilo (or via divine inspiration). We are still
         | in very early stages of making thinking machines.
        
         | tpswa wrote:
         | This is a natural next area of research. Nailing "adaptive
         | compute" implies figuring out which problems to use more
         | compute on, but I imagine this will get better as the RL does.
        
         | resource_waste wrote:
         | 100%
         | 
         | I do philosophy, and a reasoning model will take an
         | exaggeration I give it and call it fact.
         | 
         | The non-reasoning models will call me out, lol.
        
       | EncomLab wrote:
       | Haven't we seen real-life examples of this occurring in AI for
       | medical imaging? Models trained on images of tumors over-identify
       | tumors circled in purple ink, or images that also include a
       | visual scale, as cancerous, because the training data led them to
       | treat both of those artifacts as indicators of cancer.
        
       | efitz wrote:
       | What everyone needs to understand about [reasoning] LLMs is that
       | LLMs can't reason.
       | 
       | https://arxiv.org/pdf/2410.05229
        
       ___________________________________________________________________
       (page generated 2025-02-07 23:01 UTC)