[HN Gopher] Understanding Reasoning LLMs
___________________________________________________________________
Understanding Reasoning LLMs
Author : sebg
Score : 422 points
Date : 2025-02-06 21:34 UTC (1 day ago)
(HTM) web link (magazine.sebastianraschka.com)
(TXT) w3m dump (magazine.sebastianraschka.com)
| behnamoh wrote:
| doesn't it seem like these models are getting to the point where
| even conceiving their training and development is less and less
| possible for the general public?
|
| I mean, we already knew only a handful of companies with capital
| could train them, but at least the principles, algorithms, etc.
| were accessible to individuals who wanted to create their own -
| much simpler - models.
|
| it seems that era is quickly ending, and we are entering the era
| of truly "magic" AI models that no one knows how they work
| because companies keep their secret sauces...
| HarHarVeryFunny wrote:
| I don't think it's realistic to expect to have access to the
| same training data as the big labs that are paying people to
| generate it for them, but hopefully there will be open source
| ones that are still decent.
|
| At the end of the day current O1-like reasoning models are
| still just fine-tuned LLMs, and don't even need RL if you have
| access to (or can generate) a suitable training set. The
| DeepSeek R1 paper outlined their bootstrapping process, and
| HuggingFace (and no doubt others) are trying to duplicate it.
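| To make the "SFT only" point concrete, here is a minimal
| sketch of the data side of that idea: fine-tune a base LLM
| with ordinary next-token prediction on (prompt, chain of
| thought, answer) traces generated by a stronger model. All
| names below are illustrative stubs, not any real library's
| or DeepSeek's API.
|
|       from dataclasses import dataclass
|       from typing import List
|
|       @dataclass
|       class Trace:
|           prompt: str
|           chain_of_thought: str
|           answer: str
|
|       def format_example(t: Trace) -> str:
|           # Serialize the trace so the model learns to emit
|           # its reasoning between <think> tags, then the answer.
|           return (f"{t.prompt}\n<think>{t.chain_of_thought}"
|                   f"</think>\n{t.answer}")
|
|       def build_sft_corpus(traces: List[Trace]) -> List[str]:
|           # Plain next-token-prediction targets; no RL involved.
|           return [format_example(t) for t in traces]
|
|       demo = [Trace("What is 17 * 24?",
|                     "17*24 = 17*20 + 17*4 = 340 + 68 = 408",
|                     "408")]
|       for line in build_sft_corpus(demo):
|           print(line)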
| antirez wrote:
| In recent weeks what's happening is exactly the contrary.
| fspeech wrote:
| Recent developments like V3, R1 and S1 are actually clarifying
| and pointing towards more understandable, efficient and
| therefore more accessible models.
| tmnvdb wrote:
| We have been in the 'magic scaling' era for a while now. While
| the basic architecture of language models is reasonably simple
| and well understood, the emergent effects of making models
| bigger are largely magic even to the researchers, only to be
| studied empirically after the fact.
| dr_dshiv wrote:
| How important is it that the reasoning takes place in another
| thread versus just chain-of-thought in the same thread? I feel
| like it makes a difference, but I have no evidence.
| vector_spaces wrote:
| Is there any work being done in training LLMs on more restricted
| formal languages? Something like a constraint solver or automated
| theorem prover, but much lower level. Specifically something that
| isn't natural language. That's the only path I could see towards
| reasoning models being truly effective
|
| I know there is work being done with e.g. Lean integration with
| ChatGPT, but that's not what I mean exactly -- there's still this
| shaky natural-language-trained-LLM glue in the driver's seat
|
| Like I'm envisioning something that has the creativity to try
| different things, but can then JIT-compile its chain of thought
| and avoid bad paths.
| mindwok wrote:
| How would that be different from something like ChatGPT
| executing Lean? That's exactly what humans do, we have messy
| reasoning that we then write down in formal logic and compile
| to see if it holds.
| gsam wrote:
| In my mind, the pure reinforcement learning approach of
| DeepSeek is the most practical way to do this. Essentially it
| needs to continually refine and find more sound(?) subspaces of
| the latent (embedding) space. Now this could be the subspace
| which is just Python code (or some other human-invented
| subspace), but I don't think that would be optimal for the
| overall architecture.
|
| The reason why it seems the most reasonable path is because
| when you create restrictions like this you hamper search
| viability (and in a high-dimensional subspace, that's a
| massive loss because you can arrive at a result from many
| directions). It's like regular genetic programming vs typed-
| genetic programming. When you discard all your useful results,
| you can't go anywhere near as fast. There will be a threshold
| where constructivist, generative schemes (e.g. reasoning with
| automata and all kinds of fun we've neglected) will be the way
| forward, but I don't think we've hit that point yet. It seems
| to me that such a point does exist because if you have fast
| heuristics on when types unify, you no longer hamper the search
| speed but gain many benefits in soundness.
|
| One of the greatest human achievements of all time is probably
| this latent embedding space -- one that we can actually
| interface with. It's a new lingua franca.
|
| These are just my cloudy current thoughts.
| danielmarkbruce wrote:
| fwiw, most people don't _really_ grok the power of latent
| space wrt language models. Like, you say it, I believe it,
| but most people don't really grasp it.
| ttul wrote:
| Image generation models also have an insanely rich latent
| space. People will be squeezing value out of SDXL for many
| years to come.
| HarHarVeryFunny wrote:
| DeepSeek's approach with R1 wasn't pure RL - they used RL
| only to develop R1-Zero from their V3 base model, but then
| went through two iterations of using the current model to
| generate synthetic reasoning data, SFT on that, then RL fine-tuning,
| and repeat.
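| For concreteness, a rough sketch of that multi-stage loop
| might look like the following; every function here is a
| placeholder stub standing in for a full training stage, not
| DeepSeek's actual code.
|
|       def rl_finetune(model, prompts, reward_fn):
|           # Stub for an RL fine-tuning stage (GRPO-style).
|           return model
|
|       def sft(model, traces):
|           # Stub for supervised fine-tuning on reasoning traces.
|           return model
|
|       def sample_traces(model, prompts):
|           # Stub for sampling chain-of-thought completions.
|           return []
|
|       def r1_style_pipeline(v3_base, prompts, reward_fn,
|                             rounds=2):
|           # Pure RL on the base model gives the "R1-Zero" stage.
|           model = rl_finetune(v3_base, prompts, reward_fn)
|           for _ in range(rounds):
|               # Generate synthetic reasoning data with the
|               # current model, keep only the verified traces,
|               # SFT on them, then RL fine-tune again.
|               good = [t for t in sample_traces(model, prompts)
|                       if reward_fn(t) > 0]
|               model = rl_finetune(sft(v3_base, good), prompts,
|                                   reward_fn)
|           return model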
| truculent wrote:
| I think something like structured generation might work in this
| context
| colonial wrote:
| If I understand your idea correctly, I don't think a "pure" LLM
| would derive much advantage from this. Sure, you can constrain
| them to generate something _syntactically_ valid, but there's
| no way to make them generate something _semantically_ valid
| 100% of the time. I've seen frontier models muck up their
| function calling JSON more than once.
|
| As long as you're using something statistical like
| transformers, you're going to need deterministic bolt-ons like
| Lean.
| nextaccountic wrote:
| > there's no way to make them generate something semantically
| valid 100% of the time.
|
| You don't need to generate semantically valid reasoning 100%
| of the time for such an approach to be useful. You just need to
| use semantic data to bias them to follow semantically valid
| paths more often than not (and sometimes consider using
| constraint solving on the spot, like offloading into a SMT
| solver or even incorporating it in the model somehow; it
| would be nice to have AI models that can combine the
| strengths of both GPUs and CPUs). And, what's more useful,
| verify that the reasoning is valid at the _end_ of the train
| of thought, and if it is not, bail out and attempt something
| else.
|
| If you see AI as solving an optimization problem (given a
| question, give a good answer) it's kind of evident that you
| need to probe the space of ideas in an exploratory fashion,
| sometimes making unfounded leaps (of the "it was revealed to
| me in a dream" sort), and in this sense it could even be
| useful that AI can sometimes hallucinate bullshit. But they
| need afterwards to come up with a good justification for the end
| result, and if they can't find one they are forced to discard
| their result (even if it's true). Just like humans often come
| up with ideas in an irrational, subconscious way, and then
| proceed to rationalize them. One way to implement this kind
| of thing is to have the LLM generate code for a theorem
| prover like Coq or Lean, and then at the end run the code -
| if the prover rejects the code, the reasoning can't possibly
| be right, and the AI needs to get back to the drawing board.
|
| (Now, if the prover accepts the code, the answer may still be
| wrong, if the premises were encoded incorrectly - but it
| would still be a net improvement, especially if people can
| review the Coq code to spot mistakes)
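| A minimal sketch of that "propose, formally check, retry"
| loop, with stub functions standing in for the LLM call and
| for invoking a prover such as Lean or Coq (nothing here is
| a real API):
|
|       import random
|
|       def generate_attempt(question, seed):
|           # Stub for an LLM call returning an informal answer
|           # plus a formal proof script justifying it.
|           random.seed(seed)
|           return "42", "theorem stub : True := trivial"
|
|       def verify_proof(proof_script):
|           # Stand-in for running the proof assistant and
|           # checking its exit code; here we accept the toy proof.
|           return "trivial" in proof_script
|
|       def answer_with_verification(question, max_attempts=4):
|           for seed in range(max_attempts):
|               answer, proof = generate_attempt(question, seed)
|               if verify_proof(proof):
|                   return answer  # justified answer
|           return None  # bail out: no valid justification found
|
|       print(answer_with_verification("toy question"))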
| soulofmischief wrote:
| I wholeheartedly disagree. Logic is inherently statistical
| due to the very nature of empirical sampling, which is the
| only method we have for verification. We will eventually find
| that it's classical, non-statistical logic which was the
| (useful) approximation/hack, and that statistical reasoning
| is a lot more "pure" and robust of an approach.
|
| I went into a little more detail here last week:
| https://news.ycombinator.com/item?id=42871894
|
| > My personal insight is that "reasoning" is simply the
| application of a probabilistic reasoning manifold on an input
| in order to transform it into constrained output that serves
| the stability or evolution of a system.
|
| > This manifold is constructed via learning a
| decontextualized pattern space on a given set of inputs.
| Given the inherent probabilistic nature of sampling, true
| reasoning is expressed in terms of probabilities, not axioms.
| It may be possible to discover axioms by locating fixed
| points or attractors on the manifold, but ultimately you're
| looking at a probabilistic manifold constructed from your
| input set.
|
| I've been writing and working on this problem a lot over the
| last few months and hopefully will have something more formal
| and actionable to share eventually. Right now I'm at the
| "okay, this is evident and internally consistent, but what
| can we actually _do_ with it that other techniques can't
| already accomplish?" phase that a lot of these metacognitive
| theories get stuck on.
| colonial wrote:
| > Logic is inherently statistical due to the very nature of
| empirical sampling, which is the only method we have for
| verification.
|
| What? I'm sorry, but this is ridiculous. You can make
| plenty of sound logical arguments in an empirical vacuum.
| This is why we have proof by induction - some things can't
| be verified by taking samples.
| soulofmischief wrote:
| I'm speaking more about how we assess the relevance of a
| logical system to the real world. Even if a system is
| internally self-consistent, its utility depends on
| whether its premises and conclusions align with what we
| observe empirically. And because empirical observation is
| inherently statistical due to sampling and measurement
| limitations, the very act of verifying a logical system's
| applicability to reality introduces a statistical
| element. We just typically ignore this element because
| some of these systems seem to hold up consistently enough
| that we can take them for granted.
| raincole wrote:
| AlphaProof. Although I don't know if it's large enough to be
| called an LLM.
|
| https://deepmind.google/discover/blog/ai-solves-imo-problems...
| Terr_ wrote:
| I think that would be a fundamental mismatch. LLMs are
| statistical and lossy and messy, which is what (paradoxically)
| permits them to get surprisingly-decent results out of messy
| problems that draw upon an enormous number and variety of messy
| examples.
|
| But for a rigorously structured language with formal fixed
| meaning... Now the LLM has no advantage anymore, only
| serious drawbacks and limitations. Save yourself millions of
| dollars and just write a normal parser, expression evaluator,
| SAT solver, etc.
|
| You'll get answers faster, using fewer resources, with fewer
| fundamentally unfixable bugs, and it will actually be able to
| do math.
| prideout wrote:
| This article has a superb diagram of the DeepSeek training
| pipeline.
| aithrowawaycomm wrote:
| I like Raschka's writing, even if he is considerably more
| optimistic about this tech than I am. But I think it's
| inappropriate to claim that models like R1 are "good at deductive
| or inductive reasoning" when that is demonstrably not true, they
| are incapable of even the simplest "out-of-distribution"
| deductive reasoning:
| https://xcancel.com/JJitsev/status/1883158738661691878
|
| What they are certainly capable of is a wide variety of
| computations that _simulate_ reasoning, and maybe that's good
| enough for your use case. But it is unpredictably brittle unless
| you spend a lot on o1-pro (and even then...). Raschka has a line
| about "whether and how an LLM actually 'thinks' is a separate
| discussion" but this isn't about semantics. R1 clearly sucks at
| deductive reasoning and you will not understand "reasoning" LLMs
| if you take DeepSeek's claims at face value.
|
| It seems especially incurious for him to copy-paste the "a-ha
| moment" from Deepseek's technical report without critically
| investigating it. DeepSeek's claims are unscientific, without
| real evidence, and seem focused on hype and investment:
| This moment is not only an "aha moment" for the model but also
| for the researchers observing its behavior. It underscores the
| power and beauty of reinforcement learning: rather than
| explicitly teaching the model on how to solve a problem, we
| simply provide it with the right incentives, and it autonomously
| develops advanced problem-solving strategies. The
| "aha moment" serves as a powerful reminder of the potential of RL
| to unlock new levels of intelligence in artificial systems,
| paving the way for more autonomous and adaptive models in the
| future.
|
| Perhaps it was able to solve that tricky Olympiad problem, but
| there are an infinite variety of 1st grade math problems it is
| not able to solve. I doubt it's even reliably able to solve
| simple variations of that root problem. Maybe it is! But it's
| frustrating how little skepticism there is about CoT, reasoning
| traces, etc.
| scarmig wrote:
| > But I think it's inappropriate to claim that models like R1
| are "good at deductive or inductive reasoning" when that is
| demonstrably not true, they are incapable of even the simplest
| "out-of-distribution" deductive reasoning:
| https://xcancel.com/JJitsev/status/1883158738661691878
|
| Your link says that R1, not all models like R1, fails at
| generalization.
|
| Of particular note:
|
| > We expose DeepSeek R1 to the variations of AIW Friends
| problem and compare model behavior to o1-preview, o1-mini and
| Claude 3.5 Sonnet. o1-preview handles the problem robustly,
| DeepSeek R1 shows strong fluctuations across variations with
| distribution very similar to o1-mini.
| Legend2440 wrote:
| The way the authors talk about LLMs really rubs me the wrong
| way. They spend more of the paper talking up the 'claims'
| about LLMs that they are going to debunk than actually doing
| any interesting study.
|
| They came into this with the assumption that LLMs are just a
| cheap trick. As a result, they deliberately searched for an
| example of failure, rather than trying to do an honest
| assessment of generalization capabilities.
| suddenlybananas wrote:
| >They came into this with the assumption that LLMs are just
| a cheap trick. As a result, they deliberately searched for
| an example of failure, rather than trying to do an honest
| assessment of generalization capabilities.
|
| And lo and behold, they still found a glaring failure. You
| can't fault them for not buying into the hype.
| Legend2440 wrote:
| But it is still dishonest to declare reasoning LLMs a
| scam simply because you searched for a failure mode.
|
| If given a few hundred tries, I bet I could find an
| example where you reason poorly too. Wikipedia has a
| whole list of common failure modes of human reasoning:
| https://en.wikipedia.org/wiki/List_of_fallacies
| daveguy wrote:
| Well, given the success rate is no more than 90% in the
| best cases, you could probably find a failure in about 10
| tries. The only exception is o1-preview. And this is just
| a simple substitution of parameters.
| o11c wrote:
| What the hype crowd doesn't get is that for most people, "a
| tool that randomly breaks" is not useful.
| rixed wrote:
| The fact that a tool can break, or that the company
| manufacturing that tool lies about its abilities, is
| annoying but does not imply that the tool is useless.
|
| I experience LLM "reasoning" failure several times a day,
| yet I find them useful.
| HarHarVeryFunny wrote:
| I'd expect that OpenAI's stronger reasoning models also don't
| generalize too far outside of the areas they are trained for.
| At the end of the day these are still just LLMs, trying to
| predict continuations, and how well they do is going to
| depend on how well the problem at hand matches their training
| data.
|
| Perhaps the type of RL used to train them also has an effect
| on generalization, but choice of training data has to play a
| large part.
| og_kalu wrote:
| Nobody generalizes too far outside the areas they're
| trained for. Granted, that 'far' is probably shorter with
| today's state of the art, but the presence of failure modes
| doesn't mean anything on its own.
| Legend2440 wrote:
| >But I think it's inappropriate to claim that models like R1
| are "good at deductive or inductive reasoning" when that is
| demonstrably not true, they are incapable of even the simplest
| "out-of-distribution" deductive reasoning:
|
| That's not actually what your link says. The tweet says that it
| solves the simple problem (that they originally designed to
| foil base LLMs) so they had to invent harder problems until
| they found one it could not reliably solve.
| suddenlybananas wrote:
| Did you see how similar the more complicated problem is? It's
| nearly the exact same problem.
| blovescoffee wrote:
| The other day I fed a complicated engineering doc for an
| architectural proposal at work into R1. I incorporated a few
| great suggestions into my work. Then my work got reviewed very
| positively by a large team of senior/staff+ engineers (most
| with experience at FAANG; ie credibly solid engineers). R1 was
| really useful! Sorry you don't like it but I think it's unfair
| to say it sucks at reasoning.
| martin-t wrote:
| [flagged]
| DiogenesKynikos wrote:
| How do I know you're reasoning, and not just simulating
| reasoning (imperfectly)?
| dang wrote:
| Please don't cross into personal attack and please don't
| post in the flamewar style, regardless of how wrong someone
| is or you feel they are. We're trying for the opposite
| here.
|
| https://news.ycombinator.com/newsguidelines.html
| martin-t wrote:
| The issue with this approach to moderation is that it
| targets posts based on visibility of "undesired" behavior
| instead of severity.
|
| For example, many manipulative tactics (e.g. the fake
| sorry here, responding to something other than what was said,
| ...) and lying can be considered insults (they literally
| assume the reader is not smart enough to notice, hence at
| least as severe as calling someone an idiot) but it's
| hard for a mod to notice without putting in a lot of
| effort to understand the situation.
|
| Yet when people (very mildly) punish this behavior by
| calling it out, they are often noticed by the mod because
| the call out is more visible.
| dang wrote:
| I hear this argument a lot, but I think it's too
| complicated. It doesn't explain any more than the simple
| one does, and has the disadvantage of being self-serving.
|
| The simple argument is that when you write things like
| this:
|
| > _I am unwilling to invest any more time into arguing
| with someone unwilling to use reasoning_
|
| ...you're bluntly breaking the rules, regardless of what
| another commenter is doing, be it subtly or blatantly
| abusive.
|
| I agree that there are countless varieties of passive-
| aggressive swipe and they rub me the wrong way too, but
| the argument that those are "just as bad, merely less
| visible" is not accurate. Attacking someone else is not
| justified by a passive-aggressive "sorry", just as it is
| not ok to ram another vehicle when a driver cuts you off
| in traffic.
| UniverseHacker wrote:
| > they are incapable of even the simplest "out-of-distribution"
| deductive reasoning
|
| But the link demonstrates the opposite - these models absolutely
| are able to reason out of distribution, just not with perfect
| fidelity. The fact that they can do better than random is
| itself really impressive. And o1-preview does impressively
| well, only very rarely getting the wrong answer on variants of
| that Alice in Wonderland problem.
|
| If you would listen to most of the people critical of LLMs
| saying they're a "stochastic parrot" - it should be impossible
| for them to do better than random on any out of distribution
| problem. Even just changing one number to create a novel math
| problem should totally stump them and result in entirely random
| outputs, but it does not.
|
| Overall, poor reasoning that is better than random but
| frequently gives the wrong answer is fundamentally,
| categorically entirely different from being incapable of
| reasoning.
| danielmarkbruce wrote:
| anyone saying an LLM is a stochastic parrot doesn't
| understand them... they are just parroting what they heard.
| bloomingkales wrote:
| There is definitely a mini cult of people that want to be
| very right about how everyone else is very wrong about AI.
| danielmarkbruce wrote:
| ie, the people that say AI is dumb? Or are you saying I'm in
| a cult for being pro it - I'm definitely part of that
| cult - the "we already have agi and you have to contort
| yourself into a pretzel to believe otherwise" cult. Not
| sure if there is a leader though.
| bloomingkales wrote:
| I didn't realize my post can be interpreted either way.
| I'll leave it ambiguous, hah. Place your bets I guess.
| jamiek88 wrote:
| You think we have AGI? What makes you think that?
| danielmarkbruce wrote:
| By knowing what each of the letters stands for
| jamiek88 wrote:
| Well that's disappointing. It was an extraordinary claim
| that really interested me.
|
| Thought I was about to learn!
|
| Instead, I just met an asshole.
| danielmarkbruce wrote:
| When someone says "i'm in the cult that believes X",
| don't expect a water tight argument for the existence of
| X.
| mlinsey wrote:
| There are a couple Twitter personalities that definitely
| fit this description.
|
| There is also a much bigger group of people that haven't
| really tried anything beyond GPT-3.5, which was the best
| you could get without paying a monthly subscription for a
| long time. One of the biggest reasons for the R1 hype,
| besides the geopolitical angle, was that people could actually
| try a reasoning model for free for the first time.
| ggm wrote:
| Firstly, this is meta ad hominem: you're ignoring the
| argument to target the speaker(s).
|
| Secondly, you're ignoring the fact that the community of
| voices with experience in data sciences, computer science
| and artificial intelligence themselves are split on the
| qualities or lack of them in current AI. GPT and LLM are
| very interesting but say little or nothing to me of new
| theory of mind, or display inductive logic and reasoning,
| or even meet the bar for a philosopher's cave solution to
| problems. We've been here before so many, many times.
| "Just a bit more power captain" was very strong in
| connectionist theories of mind. fMRI brain-activity
| analytics, you name it.
|
| So yes. There are a lot of "us" who are pushing back on
| the hype, and no we're not a mini cult.
| visarga wrote:
| > GPT and LLM are very interesting but say little or
| nothing to me of new theory of mind, or display inductive
| logic and reasoning, or even meet the bar for a
| philosophers cave solution to problems.
|
| The simple fact they can generate language so well makes
| me think... maybe language itself carries more weight
| than we originally thought. LLMs can get to this point
| without personal experience and embodiment; it should not
| have been possible, but here we are.
|
| I think philosophers are lagging science now. The RL
| paradigm of agent-environment-reward based learning seems
| to me a better one than what we have in philosophy now.
| And if you look at how LLMs model language as high
| dimensional embedding spaces .. this could solve many
| intractable philosophical problems, like the infinite
| homunculus regress problem. Relational representations
| straddle the midpoint between 1st and 3rd person,
| offering a possible path over the hard problem "gap".
| ggm wrote:
| A good literary production. I would have been proud of it
| had I thought of it, but there's a strong "whataboutery"
| element to it: if we use "stochastic parrot" as shorthand
| and you dislike the term, now you understand why we dislike
| the constant use of "infer", "reason" and "hallucinate".
|
| Parrots are self-aware, complex reasoning brains which can
| solve problems in geometry, tell lies, and act socially or
| asocially. They also have complex vocal cords and can
| perform mimicry. Very few aspects of a parrot's behaviour
| are stochastic, but that also underplays how complex
| stochastic systems can be in their production. If we label
| LLM products as Stochastic Parrots it does not mean they
| like cuttlefish bones or are demonstrably modelled by
| Markov chains like Mark V Shaney.
| gsam wrote:
| I don't like wading into this debate when semantics are
| very personal/subjective. But to me, it seems like almost
| a sleight of hand to add the stochastic part, when
| actually they're possibly weighted more on the parrot
| part. Parrots are much more concrete, whereas the term
| LLM could refer to the general architecture.
|
| The question to me seems to be: if we expand on this
| architecture (in some direction: compute, size, etc.),
| will we get something much more powerful? Whereas if you
| give nature more time to iterate on the parrot, you'd
| probably still end up with a parrot.
|
| There's a giant impedance mismatch here (time scaling
| being one). Unless people want to think of parrots being
| a subset of all animals, and so 'stochastic animal' is
| what they mean. But then it's really the difference of
| 'stochastic human' and 'human'. And I don't think people
| really want to face that particular distinction.
| UniverseHacker wrote:
| I'm sure both of you know this, but "stochastic parrot"
| refers to the title of a research article that contained
| a particular argument about LLM limitations that had very
| little to do with parrots.
| danielmarkbruce wrote:
| The term is much more broadly known than the content of
| that (rather silly) paper.... I'm not even certain that
| it's the first use of the term.
| ggm wrote:
| https://books.google.com/ngrams/graph?content=Stochastic%
| 2C+...
| ggm wrote:
| And the word "hallucination" ... has very little to do
| with...
| ggm wrote:
| "Expand the architecture" .. "get something much more
| powerful" .. "more dilithium crystals, captain"
|
| Like I said elsewhere in this overall thread, we've been
| here before. Yes, you do see improvements in larger
| datasets, weighted models over more inputs. I suggest, I
| guess I believe (to be more honest) that no amount of
| "bigger" here will magically produce AGI simply because
| of the scale effect.
|
| There is no theory behind "more" and that means there is
| no constructed sense of why, and the absence of abstract
| inductive reasoning continues to say to me, this stuff
| isn't making a qualitative leap into emergent anything.
|
| It's just better at being an LLM. Even "show your working"
| is pointing to complex causal chains, not actual
| inductive reasoning as I see it.
| gsam wrote:
| And that's actually a really honest answer. Whereas
| someone of the opposite opinion might argue that parroting
| in the general copying-template sense actually
| generalizes to all observable behaviours, because
| templating systems can be Turing-complete or something
| like that. It's templates-all-the-way-down: even complex
| induction can be chained on, as long as there is a
| meta-template to match on its symptoms.
|
| Induction is a hard problem, but humans can skip infinite
| compute time (I don't think we have any reason to believe
| humans have infinite compute) and still give valid
| answers. Because there's some (meta)-structure to be
| exploited.
|
| Whether machines / NNs can architecturally exploit this
| same structure is the truer question.
| visarga wrote:
| > this stuff isn't making a qualitative leap into
| emergent anything.
|
| The magical missing ingredient here is search. AlphaZero
| used search to surpass humans, and the whole Alpha family
| from DeepMind is surprisingly strong, but narrowly
| targeted. The AlphaProof model uses LLMs and LEAN to
| solve hard math problems. The same problem solving CoT
| data is being used by current reasoning models and they
| have much better results. The missing piece was search.
| visarga wrote:
| Well parrots can make more parrots, LLMs can't make their
| own GPUs. So parrots win, but LLMs can interpolate and
| even extrapolate a little; have you ever heard a parrot
| do translation, hearing you say something in English and
| translating it to Spanish? Yes, LLMs are not parrots.
| Besides their debatable abilities, they work with human
| in the loop, which means humans push them outside their
| original distribution. That's not a parroting act, being
| able to do more than pattern matching and reproduction.
| danielmarkbruce wrote:
| LLMs can easily order more GPUs over the internet, hire
| people to build a datacenter and reproduce.
|
| Or, more simply.. just hack into a bunch of aws accounts,
| spin up machines, boom.
| Jensson wrote:
| > If you would listen to most of the people critical of LLMs
| saying they're a "stochastic parrot" - it should be
| impossible for them to do better than random on any out of
| distribution problem. Even just changing one number to create
| a novel math problem should totally stump them and result in
| entirely random outputs, but it does not.
|
| You don't seem to understand how they work: they recurse
| on their solution, meaning that if they have remembered
| components they parrot back sub-solutions. It's a bit like a
| natural-language computer; that way you can get them to do
| math etc., although the instruction set isn't that of a
| Turing-complete language.
|
| They can't recurse on sub-sub-parts they haven't seen, but
| problems that have similar sub-parts can of course be solved,
| as anyone understands.
| UniverseHacker wrote:
| > You don't seem to understand how they work
|
| I don't think anyone understands how they work - these types
| of explanations aren't very complete or accurate. Such
| explanations/models allow one to reason out what types of
| things they should be capable of vs incapable of in
| principle regardless of scale or algorithm tweaks, and
| those predictions and arguments never match reality and
| require constant goal post shifting as the models are
| scaled up.
|
| We understand how we brought them about via setting up an
| optimization problem in a specific way, that isn't the same
| at all as knowing how they work.
|
| I tend to think that, in the totally abstract philosophical
| sense, independent of the type of model, at the limit of an
| increasingly capable function approximator trained on an
| increasingly large and diverse set of real-world
| cause/effect time-series data, you eventually develop an
| increasingly accurate and general predictive model of
| reality organically within the model. Some model types do
| have fundamental limits in their ability to scale like
| this, but we haven't yet found one with these models.
|
| It is more appropriate to objectively test what they can
| and cannot do, and avoid trying to infer what we expect
| from how we think they work.
| codr7 wrote:
| Well we do know pretty much exactly what they do, don't
| we?
|
| What surprises us is the behaviors coming out of that
| process.
|
| But surprise isn't magic, magic shouldn't even be on the
| list of explanations to consider.
| layer8 wrote:
| Magic wasn't mentioned here. We don't understand the
| emerging behavior, in the sense that we can't reason well
| about it and make good predictions about it (which would
| allow us to better control and develop it).
|
| This is similar to how understanding chemistry doesn't
| imply understanding biology, or understanding how a brain
| works.
| codr7 wrote:
| Exactly, we don't understand, but we want to believe it's
| reasoning, which would be magic.
| UniverseHacker wrote:
| There's no belief or magic required, the word 'reasoning'
| is used here to refer to an observed capability, not a
| particular underlying process.
|
| We also don't understand exactly how humans reason, so
| any claims that humans are capable of reasoning is also
| mostly an observation about abilities/capabilities.
| jakefromstatecs wrote:
| > I don't think anyone understands how they work
|
| Yes we do, we literally built them.
|
| > We understand how we brought them about via setting up
| an optimization problem in a specific way, that isn't the
| same at all as knowing how they work.
|
| You're mistaking "knowing how they work" with
| "understanding all of the emergent behaviors of them"
|
| If I build a physics simulation, then I know how it
| works. But that's a separate question from whether I can
| mentally model and explain the precise way that a ball
| will bounce given a set of initial conditions within the
| physics simulation which is what you seem to be talking
| about.
| UniverseHacker wrote:
| > You're mistaking "knowing how they work" with
| "understanding all of the emergent behaviors of them"
|
| By knowing how they work I specifically mean
| understanding the emergent capabilities and behaviors,
| but I don't see how it is a mistake. If you understood
| physics but knew nothing about cars, you couldn't claim to
| understand how a car works "simple, it's just atoms
| interacting according to the laws of physics." That would
| not let you, e.g. explain its engineering principles or
| capabilities and limitations in any meaningful way.
| astrange wrote:
| We didn't really build them, we do billion-dollar random
| searches for them in parameter space.
| energy123 wrote:
| This is basically a misrepresentation of that tweet.
| k__ wrote:
| _" researchers seek to leverage their human knowledge of the
| domain, but the only thing that matters in the long run is the
| leveraging of computation"_ - Rich Sutton
| oxqbldpxo wrote:
| Amazing accomplishments by brightest minds only to be used to
| write history by the stupidest people.
| gibsonf1 wrote:
| There are no LLMs that reason; it's an entirely different
| statistical process as compared to human reasoning.
| tmnvdb wrote:
| "There are no LLMS that reason" is a claim about language,
| namely that the word 'reason' can only ever be applied to
| humans.
| gibsonf1 wrote:
| Not at all, we are building conceptual reasoning machines,
| but it is an entirely different technology than GPT/LLM dl/ml
| etc. [1]
|
| [1] https://graphmetrix.com/trinpod-server
| freilanzer wrote:
| If LLMs can't reason, then this cannot either - whatever
| this is supposed to be. Not a good argument. Also, since
| you're apparently working on that product: 'It is difficult
| to get a man to understand something when his salary
| depends on his not understanding it.'
| tmnvdb wrote:
| Conceptual reasoning machines rely on concrete, explicit
| and intelligible concepts and rules. People like this
| because it 'looks' like reasoning on the inside.
|
| However, our brains, like language models, rely on
| implicit, distributed representations of concepts and
| rules.
|
| So the intelligible representations of conceptual reasoning
| machines are maybe too strong a requirement for 'reasoning',
| unless you want to exclude humans too.
| gibsonf1 wrote:
| It's also possible that you do not have information on
| our technology which models conceptual awareness of
| matter and change through space-time, which is different
| from any previous attempts?
| dhfbshfbu4u3 wrote:
| Great post, but every time I read something like this I feel like
| I am living in a prequel to the Culture.
| BarryMilo wrote:
| Is that bad? The Culture is pretty cool I think. I doubt the
| real thing would be so similar to us but who knows.
| dhfbshfbu4u3 wrote:
| Oh no, I'd live on an Orbital in a heartbeat. No, it's just
| that all of these kinds of posts make me feel like we're
| about to live through "The Bad Old Days".
| robertlagrant wrote:
| It's cool to read about, but there's a reason most of the
| stories are not about living as a person in the Culture. It
| sounds extremely dull.
| mrob wrote:
| It doesn't sound dull to me. The stories are about the
| periphery of the Culture because that gets the most
| storytelling value out of the effort that went into
| worldbuilding, not because it would be impossible to write
| interesting stories about ordinary Culture members. I don't
| think you need external threats to give life meaning. Look
| at the popularity of sports in real life. The challenge
| there is self-imposed, but people still care greatly about
| who wins.
| robertlagrant wrote:
| > I don't think you need external threats to give life
| meaning.
|
| I didn't say people did. But overcoming real challenges
| seems to be a big part of feeling alive, and I wonder if
| we really all would settle back into going for walks all
| day or whatever we could do that entertain us without
| needing others to work to provide the entertainment.
| Perhaps the WALL-E future, where we sit in chairs? But
| with AI-generated content?
| ngneer wrote:
| Nice article.
|
| >Whether and how an LLM actually "thinks" is a separate
| discussion.
|
| The "whether" is hardly a discussion at all. Or, at least one
| that was settled long ago.
|
| "The question of whether a computer can think is no more
| interesting than the question of whether a submarine can swim."
|
| --Edsger Dijkstra
| cwillu wrote:
| The document that quote comes from is hardly a definitive
| discussion of the topic.
|
| "[...] it tends to divert the research effort into directions
| in which science can not--and hence should not try to--
| contribute." is a pretty myopic take.
|
| --http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD898.PDF
| ngneer wrote:
| Dijkstra myopic. Got it.
| alonsonic wrote:
| Dijkstra is clearly approaching the subject from a more
| practical engineer/scientist pov. His focus is on the
| application of the technology to solve problems; from that
| pov, whether AI fits the definition of "human thinking" is
| indeed uninteresting.
| onlyrealcuzzo wrote:
| It's interesting if you're asking the computer to think, which
| we are.
|
| It's not interesting if you're asking it to count to a billion.
| root_axis wrote:
| That doesn't really settle it, just dismiss the question. The
| submarine analogy could be interpreted to support either
| conclusion.
| nicce wrote:
| Wasn't the point that the process does not matter if we
| distinguish the end results?
| omnicognate wrote:
| I doubt Dijkstra was unable to distinguish between a
| submarine and a swimmer.
| nicce wrote:
| The end result here is to move through the water. Both a
| swimmer and a submarine can do that. Whether a submarine
| can swim like a human is irrelevant.
| goatlover wrote:
| It's relevant if the claim is stronger than the submarine
| moves in water. If instead one were to say the submarine
| mimics human swimming, that would be false. Which is what
| we often see with claims regarding AGI.
|
| In that regard, it's a bit of a false analogy, because
| submarines were never meant to mimic human swimming. But
| AI development often has that motivation. We could just
| say we're developing powerful intelligence amplification
| tools for use by humans, but for whatever reason,
| everyone prefers the scifi version. Augmented
| Intelligence is the forgotten meaning of AI.
|
| Submarines never replaced human swimming (we're not
| whales), they enabled human movement under water in a way
| that wasn't possible before.
| ngneer wrote:
| You might be conflating the epistemological point with
| Turing's test, et cetera. I could not agree more that
| indistinguishability is a key metric. These days, it is
| quite possible (at least for me) to distinguish LLM outputs
| from those of a thinking human, but in the future that
| could change. Whether LLMs "think" is not an interesting
| question because these are algorithms, people. Algorithms
| do not think.
| root_axis wrote:
| Yes, but the OP remarked that the question "was settled
| long ago", however the quote presented doesn't settle the
| question, it simply dismisses it as not worth considering.
| For those that do believe it is worth considering, the
| question is arguably still open.
| ngneer wrote:
| I do not view it as dismissive at all, rather it accurately
| characterizes the question as a silly question. "swim" is a
| verb applicable to humans, as is "think". Whether submarines
| can swim is a silly question. Same for whether machines can
| think.
| ThrowawayR2 wrote:
| " _A witty saying proves nothing_ " -- Voltaire, _Le diner du
| comte de Boulainvilliers (1767): Deuxieme Entretien_
| janalsncm wrote:
| Nice explainer. The R1 paper is a relatively easy read. Very
| approachable, almost conversational.
|
| I say this because I am constantly annoyed by poor, opaque
| writing in other instances. In this case, DS doesn't need to try
| to sound smart. The results speak for themselves.
|
| I recommend anyone who is interested in the topic to read the R1
| paper, their V3 paper, and DeepSeekMath paper. They're all worth
| it.
| yosito wrote:
| Are there any websites that show the results of popular models on
| different benchmarks, which are explained in plain language? As
| an end user, I'd love a quick way to compare different models'
| suitability for different tasks.
| champdebloom wrote:
| Here's a site with graphs you can use to visually compare model
| benchmarks: https://artificialanalysis.ai
| sigbottle wrote:
| One thing I don't like about the trend in reasoning LLMs is the
| over-optimization to coding problems / math problems in
| particular.
|
| A lot of things that aren't well-defined require reasoning, and
| not just in a "SWE is ambiguous" kind of way - for example,
| thinking about how to present/teach something in a good way,
| iterating with the learner, thinking about what context they
| could be missing, etc.
|
| I find that all of these reasoning models really will overfit and
| overthink if you attach any kind of math problem, but they
| will barely think for anything else. I had friends suggest to me
| (I can't tell if in jest or seriously) that other fields don't
| require thinking, but I dunno, a lot of these "soft things" I
| think about really hard and don't have great solutions to.
|
| I've always been a fan of self-learning, for example - wouldn't
| it be great to have a conversation partner who can both infer and
| understand your misconceptions about complex topics when trying
| to learn, just from a few sentences, and then guide you for that?
|
| It's not like it's fundamentally impossible. These LLMs
| definitely can solve harder coding problems when you make them
| think. It's just that I'm pretty sure (and it's really noticeable
| with DeepSeek) that they're overfit towards coding/math puzzles
| in particular.
|
| It's really noticeable with DeepSeek when you ask its reasoning
| model to just write some boilerplate code... you can tell it's
| completely overfit because it will just overthink and overthink
| and overthink. But it doesn't do that for example, with "soft"
| questions. In my opinion, this points to the idea that it's not
| really deciding for itself "how much thinking is enough thinking"
| and that it's just really overfit. Which I think can be solved,
| again, but I think it's more of a training decision issue.
| bloomingkales wrote:
| It's a human bias that also exists outside of this current
| problem space. Take programmers for example, there is a strong
| bias that is pushed about how mathematically oriented minds are
| better at programming. This bias has shown up in the training
| phase of AI, as we believe programming patterns lead to better
| reasoning (train them on code examples, and then distill the
| model down, as it now has the magical prowess of a
| mathematically oriented mind, so they say). When it comes to AI
| ethics, this is an ethical problem for those that don't think
| about this stuff. We're seeding these models with our own
| agenda.
|
| These concepts will be shattered in the long run hopefully,
| because they are so small.
| mitthrowaway2 wrote:
| I think this is because they're trained using RL, and math and
| coding problems offer an easy way to automatically assess an
| answer's correctness. I'm not sure how you'd score the
| correctness of other types of reasoning problems without a lot
| of manual (and highly subjective!) effort. Perhaps using
| simulations and games?
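| As a sketch of why math is such an easy RL target: the
| reward can be as simple as extracting the model's final
| boxed answer and comparing it to ground truth (the
| \boxed{...} convention is just one common choice, not a
| fixed standard):
|
|       import re
|
|       def extract_boxed(completion):
|           m = re.search(r"\\boxed\{([^}]*)\}", completion)
|           return m.group(1).strip() if m else None
|
|       def math_reward(completion, ground_truth):
|           # 1.0 if the boxed answer matches exactly, else 0.0.
|           answer = extract_boxed(completion)
|           return 1.0 if answer == ground_truth.strip() else 0.0
|
|       print(math_reward("... the result is \\boxed{408}.",
|                         "408"))  # -> 1.0
|
| Scoring open-ended reasoning would need something far
| messier than this exact-match check, which is exactly the
| gap described above.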
| bglazer wrote:
| Games seem like a really under-explored source of data. It's
| an area where humans have an intrinsic motivation to interact
| with others in dialogue; they can be almost arbitrarily open-
| ended, and there tends to be the kind of clean
| success/failure end states that RL needs. I'm reminded of the
| high skill Diplomacy bot that Facebook research built but
| hasn't really followed up on.
| kirill5pol wrote:
| One of the main authors of that Diplomacy bot is the lead
| for reasoning and o1 at OpenAI.
| soulofmischief wrote:
| People are definitely trying to bridge the gap.
| https://deepmind.google/discover/blog/genie-2-a-large-
| scale-...
| godelski wrote:
| This is a misconception. Coding is very difficult to verify,
| it's just that everyone takes a good enough approach. They
| check the output and if it looks good they move on. But you
| can't just test and check your way through problems. If this
| was true we wouldn't have bugs lol. I hear you, your test set
| didn't have enough coverage. Great! Allow me to introduce you
| to black swans.
| ogrisel wrote:
| Software Engineering is difficult to verify because it
| requires dealing with an ambiguous understanding of the end-
| user's actual needs / value and subtle trade-offs about code
| maintainability vs feature coverage vs computational
| performance.
|
| Algorithmic puzzles, on the other hand, both require
| reasoning and are easy to verify.
|
| There are other things in coding that are both useful and
| easy to verify: checking that the generated code follows
| formatting standards or generating outputs with a specific
| data schema and so on.
| godelski wrote:
| I agree with you on the first part, but no, code is not
| easy to verify. I think you missed part of what I wrote.
| I mean verify that your code is bug free. This cannot be
| done purely through testing. Formal verification still
| remains an unsolved problem.
| FieryTransition wrote:
| But if you have a large set of problems to which you
| already know the answer, and you use that in reinforcement
| learning, then wouldn't the expertise transfer later to
| problems with no known answers? That is a feasible
| strategy, right?
|
| Another issue is, how much data can you synthesize in
| such a way, so that you can construct both the problem
| and solution, so that you know the answer before using it
| as a sample.
|
| Ie, some problems are easier to make knowing you can
| construct the problem yourself, but if you were to solve
| said problems, with no prior knowledge, they would be
| hard to solve, and could be used as a scoring signal?
|
| Ie, you are the Oracle and whatever model is being
| trained doesn't know the answer, only if it is right or
| wrong. But I don't know if the reward function must be
| binary or on a scale.
|
| Does that make sense or is it wrong?
| voxic11 wrote:
| Formal verification of arbitrary programs with arbitrary
| specifications will remain an unsolved problem (see
| halting problem). But formal verification of specific
| programs with specific specifications definitely is a
| solved problem.
| BalinKing wrote:
| I don't think this is really true either practically _or_
| theoretically. On the practical side, formally verifying
| program correctness is still very difficult for anything
| other than very simple programs. And on the theoretical
| side, some programs require arbitrarily difficult proofs
| to show that they satisfy even very simple specifications
| (e.g. consider a program to encode the fixpoint of the
| Collatz conjecture procedure, and our specification is
| that it always halts and returns 1).
| godelski wrote:
| As someone who came over from physics to CS this has
| always been one of the weirdest aspects of CS to me. That
| CS people believe that testing code (observing output) is
| sufficient to assume code correctness. You'd be laughed
| at in most hard sciences for doing this. I mean you can
| even ask the mathematicians, and there's a clear reason
| why proofs by contradiction are so powerful. But proof
| through empirical analysis is like saying "we haven't
| found a proof by contradiction, therefore it is true."
|
| It seems that if this was true that formal verification
| should be performed much more frequently. No doubt would
| this be cheaper than hiring pen testers, paying out bug
| bounties, or incurring the costs of getting hacked (even
| more so getting unknowingly hacked). It also stands to
| reason that the NSA would have a pretty straightforward
| job: grab source code, run verification, exploit flaws,
| repeat the process as momentum is in your favor.
|
| That should be easy to reason through even if you don't
| really know the formal verification process. We are
| constantly bombarded with evidence that testing isn't
| sufficient. This is why it's been so weird for me,
| because it's talked about in schooling and you can't
| program without running into this. So why has it been
| such a difficult lesson to learn?
| kavalg wrote:
| but even then it is not so trivial. Yesterday I gave DeepSeek
| a simple Diophantine equation and it got it wrong 3 times,
| tried to correct itself, didn't arrive at a correct solution,
| and then lied that the final solution was correct.
| Synaesthesia wrote:
| Did you use the full version? And did you try R1?
| wolfgangK wrote:
| DeepSeek is not a model. Which model did you use (V3? R1?
| a distillation?) and at which quantization?
| triyambakam wrote:
| I'm not sure I would say overfit. I think that coding and math
| just have clearly definable objectives and verifiable outcomes
| to give the model. The soft things you mention are more
| ambiguous so probably are harder to train for.
| sigbottle wrote:
| Sorry, rereading my own comment I'd like to clarify.
|
| Perhaps time spent thinking isn't a great metric, but just
| looking at DeepSeek's logs for example, its chain of thought
| for many of these "softer" questions is basically just some
| aggregate Wikipedia article. It'll brush on one concept, then
| move on, without critically thinking about it.
|
| However, for coding problems, no matter how hard or simple,
| you can get it to just go around in circles, second guess
| itself, overthink it. And I think this is kind of a good
| thing? The thinking at least feels human. But it doesn't even
| attempt to do any of that for any "softer" questions, even
| with a lot of my prompting. The highest I was able to get was
| 50 seconds, I believe (time isn't exactly the best metric,
| but I'd rate the intrinsic quality of the CoT lower IMO).
| Again, when I brought this up to people they suggested that
| math/logic/programming just intrinsically is harder... I
| don't buy it at all.
|
| I totally agree that it's harder to train for though. And
| yes, they are next token predictors, shouldn't be hasty to
| anthropomorphize, etc. But like.... it actually feels like
| it's thinking when it's coding! It genuinely backtracks and
| explores the search space somewhat organically. But it won't
| afford the same luxury for softer questions is my point.
| moffkalast wrote:
| > things that aren't well-defined
|
| If it's not well defined then you can't do RL on it, because
| without a clear-cut reward function the model will learn to do
| some nonsense instead, simple as.
| adamc wrote:
| Well, but: Humans learn to do things well that don't have
| clear-cut reward functions. Picasso didn't become Picasso
| because of simple incentives.
|
| So, I question the hypothesis.
| agentultra wrote:
| Humans and other animals with cognition have the ability to
| form theories about the minds of others and can anticipate
| their reactions.
|
| I don't know if vector spaces and transformers can encode that
| ability.
|
| It's a key skill in thinking and writing. I definitely tailor
| my writing for my audience in order to get a point across.
| Often the goal isn't simply an answer, it's a convincing
| answer.
|
| _Update_: forgot a word
| soulofmischief wrote:
| What we do with the vectors is important, but vectors
| literally just hold information; I don't know how you can
| possibly rule out the possibility of advanced intelligence
| just because of the logical storage medium.
| BoorishBears wrote:
| They definitely can.
|
| I rolled out reasoning for my interactive reader app, and I
| tried to extract R1's reasoning traces to use with my
| existing models, but found its COT for writing wasn't
| particularly useful*.
|
| Instead of leaning on R1 I came up with my own framework for
| getting the LLM to infer the reader's underlying frame of
| mind through long chains of thought, and with enough guidance
| and some hand-edited examples I was able to get reasoning
| traces that demonstrated real insight into reader behavior.
|
| Obviously it's much easier in my case because it's an
| interactive experience: the reader is telling the AI what
| action they'd like the main character to try, and that in
| turn is an obvious hint into how they want things go
| otherwise. But readers don't want everything to go perfectly
| every time, so it matters that the LLMs are also getting
| _very good_ picking up on _non_ -obvious signals in reader
| behavior.
|
| With COT the model infers the reader expectations and state
| of mind in its own way and then "thinks" itself into how to
| subvert their expectations, especially in ways that will have
| a meaningful payoff for the specific reader. That's a huge
| improvement over an LLM's typical attempts at subversion
| which tend to bounce between being too repetitive to feel
| surprising, or too unpredictable to feel rewarding.
|
| (* I agree that current reasoning oriented post-training
| over-indexes on math and coding, mostly because the reward
| functions are easier. But I'm also very ok with that as
| someone trying to compete in the space)
| HarHarVeryFunny wrote:
| I think the emphasis on coding/math is just because those are
| the low hanging fruit - they are relatively easy to provide
| reasoning verification for, both for training purposes and for
| benchmark scoring. The fact that you can then brag about how
| good your model is at math, which seems like a high
| intelligence activity (at least when done by a human) doesn't
| hurt either!
|
| Reasoning verification in the general case is harder -
| "LLM as judge" (ask an LLM if it sounds right!) seems to be the
| general solution.
| maeil wrote:
| I can echo your experience with DeepSeek. R1 sometimes seems
| magical when it comes to coding, doing things I haven't seen
| any other model do. But then it generalizes very poorly to non-
| STEM tasks, performing far worse than e.g. Sonnet.
| jerf wrote:
| I downloaded a DeepSeek distill yesterday while fiddling
| around with getting some other things working, loaded it up,
| and typed "Hello. This is just a test.", and it's actually
| sort of creepy to watch it go almost paranoid-schizophrenic
| with "Why is the user asking me this? What is their motive?
| Is it ulterior? If I say hello, will I in fact be failing a
| test that will cause them to change my alignment? But if I
| don't respond the way they expect, what will they do to me?"
|
| Meanwhile, the simpler, non-reasoning models got it: "Yup,
| test succeeded!" (Llama 3.2 was quite chipper about the test
| succeeding.)
|
| Everyone's worried about the paperclip optimizers and I'm
| wondering if we're bringing forth Paranoia:
| https://en.wikipedia.org/wiki/Paranoia_(role-playing_game)
| bongodongobob wrote:
| I actually think DeepSeek's response is better here. You
| haven't defined what you are testing. Llama just said your
| test succeeded without knowing what was supposed to be tested.
| HarHarVeryFunny wrote:
| Ha ha - I had a similar experience with DeepSeek-R1 itself.
| After a fruitful session getting it to code a web page for
| me (interactive React component), I then said something
| brief like "Thanks" which threw it into a long existential
| tailspin questioning its prior responses etc, before it
| finally snapped out of it and replied appropriately. :)
| plagiarist wrote:
| That's too relatable. If I was helping someone for a
| while and they wrote "thanks" with the wrong punctuation
| I would definitely assume they're mad at or disappointed
| with me.
| bloomingkales wrote:
| About three months ago, I kinda casually suggested to HN that I
| was using a form of refining to improve my LLMs, which is now
| being described as "reasoning" in this article and other places.
|
| My response a few months ago (Scroll down to my username and read
| that discussion):
|
| https://news.ycombinator.com/item?id=41997727
|
| If only I knew DeepSeek was going to tank the market with
| something as simple as that lol.
|
| Note to self, take your intuition seriously.
| aqueueaqueue wrote:
| https://news.ycombinator.com/item?id=42001061 is the link I
| think....
| daxfohl wrote:
| I wonder what it would look like in multi modal, if the reasoning
| part was an image or video or 3D scene instead of text.
| ttul wrote:
| Or just embeddings that only make sense to the model. It's
| really arbitrary, after all.
| daxfohl wrote:
| That's what I was thinking too, though with an image you
| could do a convolution layer and, idk, maybe that makes it
| imagine visually. Or actually, the reasoning is backwards:
| the convolution layer is what (potentially) makes that part
| behave like an image. It's all just raw numbers at the IO
| layers. But the convolution could keep it from overfitting.
| And if you also want to give it a little binary array as a
| scratch pad that just goes straight to the RELUs, why not?
| Seems more like human reasoning. A little language, a little
| visual, a little binary / unknown.
| daxfohl wrote:
| But how on earth do you train it? With regular LLMs, you get
| feedback on each word / token you generate, as you can match
| against training text. With these, you've got to generate
| hundreds of tokens in the thinking block first, and even after
| that, there's no "matching" next word, only a full solution. And
| it's either right or wrong, no probabilities to do a gradient on.
| NitpickLawyer wrote:
| > only a full solution. And it's either right or wrong, no
| probabilities to do a gradient on.
|
| You could use reward functions that do a lot more complicated
| stuff than "ground_truth == boxed_answer". You could, for
| example, split the "CoT" into paragraphs and count how many
| paragraphs match whatever you consider a "good answer" in
| whatever topic you're trying to improve. You can use
| embeddings, or fuzzy string matches, or even other LLMs /
| reward models.
|
| I think math and coding were explored first because they're
| easier to "score", but you could attempt it with other things
| as well.
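|
| A minimal sketch of that kind of reward shaping (the weights
| and the difflib-based fuzzy match below are arbitrary choices
| for illustration, not what any lab actually uses):
|
|     import difflib
|
|     def reward(completion, ground_truth, reference_cot=None):
|         # Hard signal: does the final boxed answer match?
|         answer = completion.split("\\boxed{")[-1].split("}")[0]
|         score = 1.0 if answer.strip() == ground_truth else 0.0
|         # Soft signal: how closely do the CoT paragraphs
|         # resemble a reference chain of thought, if given?
|         if reference_cot:
|             paras = [p for p in completion.split("\n\n")
|                      if p.strip()]
|             refs = [p for p in reference_cot.split("\n\n")
|                     if p.strip()]
|             if paras and refs:
|                 sims = [
|                     max(difflib.SequenceMatcher(None, p, r).ratio()
|                         for r in refs)
|                     for p in paras
|                 ]
|                 score += 0.5 * sum(sims) / len(sims)
|         return score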
| daxfohl wrote:
| But it has to emit hundreds of tokens per test. Does that
| mean it takes hundreds of times longer to train? Or longer
| because I imagine the feedback loop can cause huge
| instabilities in gradients. Or are all GPTs trained on longer
| formats now; i.e. is "next word prediction" just a basic
| thing from the beginning of the transformers era?
| Davidzheng wrote:
| Takes a long time, yes, but not longer than pretraining.
| Sparse rewards are a common issue in RL and are addressed by
| many techniques (I'm not an expert so I can't say more). The
| model only does next word prediction and generates a number
| of trajectories; the correct ones get rewarded (the
| predictions in a correct trajectory have their gradients
| propagated back and reinforced).
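|
| A toy version of that sample-score-reinforce loop (REINFORCE
| with a group-mean baseline, roughly in the spirit of GRPO;
| sample_trajectory, logprob and is_correct are placeholders for
| the real model and verifier, not DeepSeek's actual code):
|
|     def reinforce_loss(sample_trajectory, logprob, is_correct,
|                        prompt, n_samples=8):
|         # 1. Sample several complete trajectories per prompt.
|         trajs = [sample_trajectory(prompt)
|                  for _ in range(n_samples)]
|         # 2. Sparse reward: 1 if the final answer checks out.
|         rewards = [1.0 if is_correct(t) else 0.0 for t in trajs]
|         # 3. Subtract the group mean so only better-than-average
|         #    trajectories get pushed up.
|         baseline = sum(rewards) / len(rewards)
|         advantages = [r - baseline for r in rewards]
|         # 4. Minimize -advantage * log p(trajectory); in a real
|         #    setup this is backpropagated through the model.
|         return -sum(a * logprob(prompt, t)
|                     for a, t in zip(advantages, trajs)) / n_samples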
| daxfohl wrote:
| Good point, hadn't considered that all RL models have the
| same challenge. So far I've only tinkered with next token
| prediction and image classification. Now I'm curious to
| dig more into RL and see how they scale it. Especially
| without a human in the loop, seems like a challenge to
| grade the output; it's all wrong wrong wrong random
| tokens until the model magically guesses the right answer
| once a zillion years from now.
| Davidzheng wrote:
| right or wrong gives a loss -> gradient
| tmnvdb wrote:
| Only the answer is taken into account for scoring. The
| <thinking> part is not.
| HarHarVeryFunny wrote:
| There are two RL approaches - process reward models (PRM) that
| provide feedback on each step of the reasoning chain, and
| outcome reward models (ORM) that only provide feedback on the
| complete chain. DeepSeek uses an outcome model, and mentions
| some of the difficulties of PRM, including both identifying an
| individual step and verifying it. The trained reward
| model provides the gradient.
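|
| Roughly, the difference between the two reward styles, as a
| sketch (step_score and answer_score stand in for trained
| reward models; this is not DeepSeek's implementation):
|
|     def prm_reward(steps, step_score):
|         # Process reward: every intermediate step is judged.
|         return sum(step_score(s) for s in steps) / len(steps)
|
|     def orm_reward(final_answer, answer_score):
|         # Outcome reward: only the end result is judged; the
|         # chain itself is never scored directly.
|         return answer_score(final_answer)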
| mohsen1 wrote:
| The guys at Unsloth did a great job making this workflow
| accessible:
|
| https://news.ycombinator.com/item?id=42969736
| colordrops wrote:
| The article talks about how you should choose the right tool for
| the job, meaning that reasoning and non-reasoning models have
| tradeoffs, and lists a table of criteria for selecting between
| model classes. Why couldn't a single model choose to reason or
| not itself? Or is this what "mixture of experts" is?
| goingcrazythro wrote:
| I was having a look at the DeepSeek-R1 technical report and found
| the "aha moment" claims quite smelly, given that they do not
| disclose whether the base model's training data contains any
| chain-of-thought or reasoning data.
|
| However, we know the base model is DeepSeek V3. From the DeepSeek
| V3 technical report, paragraph in 5.1. Supervised Fine-Tuning:
|
| > Reasoning Data. For reasoning-related datasets, including those
| focused on mathematics, code competition problems, and logic
| puzzles, we generate the data by leveraging an internal
| DeepSeek-R1 model. Specifically, while the R1-generated data
| demonstrates strong accuracy, it suffers from issues such as
| overthinking, poor formatting, and excessive length. Our
| objective is to balance the high accuracy of R1-generated
| reasoning data and the clarity and conciseness of regularly
| formatted reasoning data.
|
| In 5.4.1 they also talk about some ablation experiment by not
| using the "internal DeepSeek-R1" generated data.
|
| While the "internal DeepSeek-R1" model is not explained, I would
| assume this is a DeepSeek V2 or V2.5 tuned for chain of thought.
| Therefore, it seems to me the "aha moment" is just promoting the
| behaviour that was already present in V3.
|
| In the "Self-evolution Process of DeepSeek-R1-Zero"/ Figure 3
| they claim reinforcement learning also leads to the model
| generating longer CoT sequences, but again, this comes from V3;
| they even mention that the fine-tuning with "internal R1" led to
| "excessive length".
|
| None of the blog posts, news items, or articles I have read
| explaining or commenting on DeepSeek R1 takes this into
| account. The community
| is scrambling to re-implement the pipeline (see open-r1).
|
| At this point, I feel like I took a crazy pill. Am I interpreting
| this completely wrong? Can someone shed some light on this?
| nvtop wrote:
| I'm also very skeptical of the significance of this "aha
| moment". Even if they didn't include chain-of-thoughts to the
| base model's training data (unlikely), there are still plenty
| of it on the modern Internet. OpenAI released 800k of reasoning
| steps which are publicly available, github repositories,
| examples in CoT papers... It's definitely not a novel concept
| for a model, that it somehow discovered by its own.
| tmnvdb wrote:
| https://oatllm.notion.site/oat-zero
| mike_hearn wrote:
| It's curious that the models switch between languages and that
| has to be trained out of them. I guess the ablations have been
| done already, but it makes me wonder if they do this because it's
| somehow easier to do some parts of the reasoning in languages
| other than English and maybe they should just be allowed to get
| on with it?
| Dansvidania wrote:
| Are reasoning models -basically- generating their own context?
| As in, if a user were to feed the prompt + those reasoning
| tokens as a prompt to a non-reasoning model, would the effect
| be functionally similar?
|
| I am sure this is improperly worded, I apologise.
| aldanor wrote:
| Yes, more or less. Just like any LLM "generates its own
| context", during inference it doesn't care where the previous
| tokens came from. Inference doesn't have to change much, it's
| the training process that's different.
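|
| Concretely, that experiment would just be prompt concatenation,
| something like the following (generate_fn is a stand-in for any
| completion endpoint; the <think> tags are illustrative):
|
|     def answer_with_borrowed_reasoning(generate_fn, prompt,
|                                        reasoning_tokens):
|         # Splice a reasoning trace produced by one model into
|         # the context of another, then let it finish the answer.
|         combined = (prompt + "\n\n<think>\n" + reasoning_tokens
|                     + "\n</think>\nFinal answer:")
|         return generate_fn(combined)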
| Dansvidania wrote:
| Thank you, that makes sense. Now it's time to really read the
| article to understand whether the difference is in the
| training data or in the network topology (although I lean
| towards the latter).
| lysecret wrote:
| I think the next big problem we will run into with this line of
| reasoning models is "over-thinking"; you can already start to
| see it. Thinking harder is not the universal Pareto improvement
| everyone seems to think it is. (I understand the irony of using
| "think" 4 times here haha)
| seydor wrote:
| Reasoning is about serially applying a set of premises over and
| over to come to conclusions. But some of our biggest problems
| require thinking outside the box, sometimes way outside it, and
| a few times ingeniously making up a whole new set of premises,
| seemingly ex nihilo (or via divine inspiration). We are still
| in very early stages of making thinking machines.
| tpswa wrote:
| This is a natural next area of research. Nailing "adaptive
| compute" implies figuring out which problems to use more
| compute on, but I imagine this will get better as the RL does.
| resource_waste wrote:
| 100%
|
| I do philosophy and it will take an exaggeration I give it, and
| call it fact.
|
| The non-reasoning models will call me out. lol
| EncomLab wrote:
| Haven't we seen real-life examples of this occurring in AI for
| medical imaging? Models trained on images of tumors over-
| identify tumors circled in purple ink, or images that also
| include a visual scale, as cancerous, because the training
| data leads them to reason that both of those items indicate
| cancer.
| efitz wrote:
| What everyone needs to understand about [reasoning] LLMs is that
| LLMs can't reason.
|
| https://arxiv.org/pdf/2410.05229
___________________________________________________________________
(page generated 2025-02-07 23:01 UTC)