[HN Gopher] Understanding R1-Zero-Like Training: A Critical Perspective
___________________________________________________________________
Understanding R1-Zero-Like Training: A Critical Perspective
Author : pama
Score : 147 points
Date : 2025-03-22 14:35 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| scribu wrote:
| If the base models already have the "reasoning" capability, as
| they claim, then it's not surprising that they were able to get
| to SOTA using a relatively negligible amount of compute for RL
| fine-tuning.
|
| I love this sort of "anti-hype" research. We need more of it.
| mirekrusin wrote:
| So they achieved R1-Zero-like performance, without those long CoTs
| that sometimes never end and impact inference time, using a
| fraction of the fine-tuning resources?
| refulgentis wrote:
| No, they still have "<think>", but it's shorter by removing
| part of a term.
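|
| Roughly, the term in question, as I read the paper (my own
| sketch of the GRPO vs. Dr. GRPO difference, not their code):
| GRPO scales each response's token losses by 1/|o_i| and divides
| advantages by the group reward std; Dr. GRPO drops both.
|
|     import numpy as np
|
|     def grpo_token_coeff(rewards, lengths, eps=1e-8):
|         # Group-normalized advantage, then per-token 1/|o_i| scaling.
|         adv = (rewards - rewards.mean()) / (rewards.std() + eps)
|         return adv / lengths
|
|     def dr_grpo_token_coeff(rewards, lengths):
|         # Same advantage without the std division, applied uniformly
|         # to every token (no 1/|o_i| scaling).
|         return (rewards - rewards.mean()) * np.ones_like(lengths)
|
|     # Two wrong answers (reward 0), one short and one long, plus a correct one.
|     rewards = np.array([0.0, 0.0, 1.0])
|     lengths = np.array([100.0, 2000.0, 150.0])
|     print(grpo_token_coeff(rewards, lengths))     # long wrong answer: tiny per-token penalty
|     print(dr_grpo_token_coeff(rewards, lengths))  # every token weighted the same
|
| Under vanilla GRPO the long wrong answer gets a much smaller
| per-token penalty than the short wrong one, which, if I read the
| paper right, is the length bias they blame for ever-growing CoTs.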
| mirekrusin wrote:
| That's what I mean: those CoTs currently never end until you
| run out of context.
| refulgentis wrote:
| I'm not sure if you're talking conversationally and I'm
| taking it as a technical query, or saying CoT never
| terminates for you and asking for input, or asking what the
| paper implies about CoT, or relaying that you understand
| the paper's claim that this method reduces net CoT length.
| mirekrusin wrote:
| Verbose, non-terminating CoT is currently a common problem
| with open-weight models based on R1-Zero-like methods.
|
| Currently it seems that this shift of cost to inference
| time is a necessary tradeoff that we have to live with
| (for now at least).
|
| It's more of a problem for the many people running those
| models locally, because they have constrained hardware
| that can't handle those long contexts.
|
| It seems that this paper shows not only that their method
| is cheaper in terms of fine-tuning but also that it
| significantly reduces the inference-time cost of CoTs.
|
| If what they say is confirmed, that looks to me like quite
| a significant contribution?
| drakenot wrote:
| I've seen the same "Superficial Self-Reflection" mentioned in
| their linked blog post[0] as well, where the conclusion doesn't
| naturally follow from the output of the thinking tokens. I think
| people are fooled by this, but if you take the time to inspect
| the "chain of thought" tokens, they often don't match the final
| answer.
|
| I don't deny that performance for certain logic tasks goes up
| with these models but I don't fully understand what role the
| thinking tokens take in these cases.
|
| [0] https://oatllm.notion.site/oat-zero
| andai wrote:
| I heard that even just getting the model to print a bunch of
| whitespace ("think for longer") improves the quality of the
| final response, because some kind of processing is still
| happening internally?
| MoonGhost wrote:
| Could it be that the model just uses the latent space for
| thinking while generating almost garbage? It would be
| interesting to check whether repeating something at the end of
| the prompt helps, i.e. whether the model uses it for 'thinking'.
| refulgentis wrote:
| Latent space has a certain shape, which may mean I'm
| missing a technical distinction.*
|
| There have been publications with a pause token
| (https://arxiv.org/abs/2310.02226), a backspace token
| (https://arxiv.org/abs/2306.05426), and a think token
| (https://arxiv.org/html/2405.08644v1#Ch0.S4), all of them
| based on the theory that a generic token can act as a
| placeholder for manipulating attention without further
| meaningful output (a rough sketch of the mechanics is at the
| end of this comment).
|
| However, in practice, those approaches haven't been used in
| the training of a large-scale model; I haven't seen it at
| all. The most adventurous thing people have done at scale is
| Mamba (and RL).
|
| * It had a particular technical meaning. The first round of
| the telephone game was when it came to mean "a space like our
| 3 spatial dimensions, but with N dimensions, in an image
| diffusion model, that contains all possible image styles and
| is navigated by a prompt." We're many iterations afield of
| that now, I'm afraid. Now you sort of have to interpret it
| like you would negative space: defined by what is around it.
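|
| To make the pause/filler-token idea concrete, a minimal sketch
| (my own, with a placeholder model; note the cited papers train
| with the token, so just bolting it onto an off-the-shelf model
| at inference won't reproduce the reported benefit):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     model_name = "gpt2"  # placeholder model for illustration
|     tokenizer = AutoTokenizer.from_pretrained(model_name)
|     model = AutoModelForCausalLM.from_pretrained(model_name)
|
|     # Give the model a dedicated <pause> token with its own embedding.
|     tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
|     model.resize_token_embeddings(len(tokenizer))
|
|     prompt = "Q: What is 17 * 24?\nA:"
|     # Filler positions the model can spend on extra attention
|     # computation without emitting meaningful text.
|     inputs = tokenizer(prompt + "<pause>" * 10, return_tensors="pt")
|     out = model.generate(**inputs, max_new_tokens=32)
|     print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
|                            skip_special_tokens=True))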
| integralof6y wrote:
| Printing a bunch of whitespace is a way of entering a new
| state (I am thinking of a state machine), so the LLM can use
| that whitespace as a new token that can be used later to
| refine the state of the system. In math terms, whitespace is
| a tag for a class (or state) in the LLM, and I think RL can
| perhaps take advantage of such tags.
|
| For example, whitespace could indicate a point of low
| gradient (indetermination) or a branching point, and the LLM
| would in some way learn to increase the learning-rate
| parameter. The message in the head of the LLM would be: be
| ready to learn from RL, because in your current state you
| need to take a branch from a branching point that can
| enhance your capabilities. This is similar to tossing a coin
| or a die. The rule could be: on whitespace, increase the
| learning-rate parameter to escape zero-gradient points.
|
| Caveat emptor: this is just speculation; I don't have any
| data to support this hypothesis. It also suggests that
| whitespace could be a "token that reflects the state of
| previous layers" which is not contained in the vocabulary
| used to train the model, so I should say that whitespace is
| a macro-token or neurotoken. If this hypothesis has some
| ground, it could also be plausible that whitespace is an
| enumerated neural tag, in the sense that the length of the
| whitespace reflects or is related to the layer in which the
| zero gradient or branching point occurs.
|
| Finally, my throwaway user needs whitespace, so I will
| change the password to a random one to force myself to stop
| adding new ideas.
| andai wrote:
| Here's one where just appending dots to the output improved
| quality:
|
| https://arxiv.org/abs/2404.15758
| YeGoblynQueenne wrote:
| More tokens = fewer options for the final string. It's not any
| more complicated than that, and it doesn't require any
| reasoning, just an autoregressively trained statistical model
| of text. But no, it has to be "the model thinks harder if it
| outputs more tokens".
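|
| As a toy illustration of the conditioning point (my own example
| on a tiny character-level model, nothing to do with the paper):
| on average, conditioning on more context cannot increase the
| entropy of the next symbol, i.e. H(next given context) <= H(next).
|
|     import math
|     from collections import Counter, defaultdict
|
|     corpus = "the cat sat on the mat and the cat ate the rat " * 20
|
|     def entropy(counter):
|         total = sum(counter.values())
|         return -sum((c / total) * math.log2(c / total) for c in counter.values())
|
|     # Unconditional distribution over characters.
|     unigram = Counter(corpus)
|
|     # Next-character distribution conditioned on the previous two characters.
|     trigram = defaultdict(Counter)
|     for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
|         trigram[a + b][c] += 1
|
|     total_ctx = sum(sum(c.values()) for c in trigram.values())
|     h_uncond = entropy(unigram)
|     h_cond = sum(sum(c.values()) / total_ctx * entropy(c) for c in trigram.values())
|
|     print(f"H(next char)              = {h_uncond:.2f} bits")
|     print(f"H(next char | 2-char ctx) = {h_cond:.2f} bits")  # smaller: fewer options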
| mentalgear wrote:
| Overall the industry needs more review and less hype. I was shocked
| to find out SWE-verified [0] is anything but verified.
|
| [0] The benchmark used by all major vendors to "showcase" coding
| ability turns out to be <10% properly solved:
| https://www.youtube.com/watch?v=QnOc_kKKuac
| belter wrote:
| Failure modes are also interesting for showing what is or is not
| really happening. Take the test of asking GenAI to create clocks
| at specific times, or people drawing with the left hand: all you
| get are clocks at 10 minutes past two, or people drawing with the
| right hand, since that's 99% of what is in the training data.
|
| As Sabine says, if LLMs have already read all the math books in
| the world but are not yet able to do basic math without calling
| upon a calculator, how much reasoning is really emerging?
|
| "The Path to AGI is Coming Into View":
| https://youtu.be/mfbRHhOCgzs?t=219
| refulgentis wrote:
| I'm not sure what Sabine means. It is a somewhat obvious
| category error, and fundamentally in error regardless. (I
| find it hard to believe that, for example, Sabine would beat
| an LLM on a random selection of ten 3-digit-by-3-digit
| multiplication problems, to be completed in 60 seconds max
| by either party.)
| immibis wrote:
| Or overflowing wine glasses.
| fragmede wrote:
| numeracy isn't mathematical reasoning
| fancyfredbot wrote:
| The article starts by saying
|
| "DeepSeek-V3-Base already exhibit 'Aha moment'."
|
| I tried to read the screenshot they present as evidence of this,
| and indeed it does say "Aha!". But both the preceding reasoning
| and the following conclusion look like gibberish to me. I'm not
| sure what we're supposed to conclude here and I gave up reading
| the article after this inauspicious start.
___________________________________________________________________
(page generated 2025-03-23 23:02 UTC)