[HN Gopher] Understanding R1-Zero-Like Training: A Critical Pers...
       ___________________________________________________________________
        
       Understanding R1-Zero-Like Training: A Critical Perspective
        
       Author : pama
       Score  : 147 points
       Date   : 2025-03-22 14:35 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | scribu wrote:
       | If the base models already have the "reasoning" capability, as
       | they claim, then it's not surprising that they were able to get
       | to SOTA using a relatively negligible amount of compute for RL
       | fine-tuning.
       | 
       | I love this sort of "anti-hype" research. We need more of it.
        
       | mirekrusin wrote:
       | So they achieved R1-Zero-like performance without those long
       | CoTs that sometimes never end and blow up inference time, using
       | a fraction of the fine-tuning resources?
        
         | refulgentis wrote:
         | No, they still have "<think>", but it's shorter by removing
         | part of a term.
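         | 
         | Roughly, as I read the paper: GRPO normalizes each sampled
         | response's summed token loss by its length and each advantage
         | by the group's reward std, and their "Dr. GRPO" drops both
         | normalizations. A toy sketch of just that difference (my own
         | code and names, not theirs):
         | 
         |     import torch
         |     
         |     def grpo_advantages(rewards):
         |         # GRPO: center by the group-mean reward AND
         |         # divide by the group std.
         |         return ((rewards - rewards.mean())
         |                 / (rewards.std() + 1e-8))
         |     
         |     def dr_grpo_advantages(rewards):
         |         # Dr. GRPO: keep only the centering term.
         |         return rewards - rewards.mean()
         |     
         |     def response_loss(per_token_terms, length_normalize):
         |         # GRPO also divides each response's summed
         |         # per-token terms by its length |o_i|; Dr. GRPO
         |         # drops that 1/|o_i| factor, which (per the
         |         # paper's argument) removes the bias toward
         |         # ever-longer wrong answers.
         |         total = per_token_terms.sum()
         |         if length_normalize:            # GRPO
         |             return total / per_token_terms.numel()
         |         return total                    # Dr. GRPO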
        
           | mirekrusin wrote:
           | That's what I mean: those CoTs currently keep going until
           | you run out of context.
        
             | refulgentis wrote:
             | I'm not sure if you're talking conversationally and I'm
             | taking it as a technical query, or you're saying CoT never
             | terminate for you and asking for input, or asking what the
             | paper implies about CoT, or relaying that you understand
             | the paper's claim that this method net reduces CoT length.
        
               | mirekrusin wrote:
               | Verbose, non-terminating CoT is currently a common
               | problem with open-weight models based on R1-Zero
               | methods.
               | 
               | Currently it seems that this shift of cost to inference-
               | time is a necessary tradeoff that we have to live with
               | (for now at least).
               | 
               | It's more of a problem for many people running those
               | models locally because they have constrained hardware
               | that can't handle those long contexts.
               | 
               | It seems that this paper shows not only that their
               | method is cheaper in terms of fine-tuning but also that
               | it significantly reduces inference-time cost for CoTs.
               | 
               | If what they say gets confirmed, it looks to me like
               | quite a significant contribution?
        
       | drakenot wrote:
       | I've seen the same "Superficial Self-Reflection" mentioned in
       | their linked blog post[0] as well, where the conclusion doesn't
       | naturally follow from the output of the thinking tokens. I
       | think people are fooled by this, but if you take the time to
       | inspect the "chain of thought" tokens, they often don't match
       | the final output answer.
       | 
       | I don't deny that performance for certain logic tasks goes up
       | with these models but I don't fully understand what role the
       | thinking tokens take in these cases.
       | 
       | [0] https://oatllm.notion.site/oat-zero
        
         | andai wrote:
         | I heard that even just getting the model to print a bunch of
         | whitespace ("think for longer") improves the quality of the
         | final response, because some kind of processing is still
         | happening internally?
        
           | MoonGhost wrote:
           | Could it be that the model just uses latent space for
           | thinking while generating near-garbage text? It would be
           | interesting to check whether adding something repetitive at
           | the end of the prompt helps, i.e. whether the model uses it
           | for 'thinking'.
        
             | refulgentis wrote:
             | Latent space has a certain shape, which may mean I'm
             | missing a technical distinction.*
             | 
             | There's been publications with a pause token
             | (https://arxiv.org/abs/2310.02226), backspace token
             | (https://arxiv.org/abs/2306.05426), or a think token
             | (https://arxiv.org/html/2405.08644v1#Ch0.S4), all of them
             | based on the theory that a generic token can sort of act
             | as a placeholder for manipulating attention further
             | without additional meaningful output.
             | 
             | However, in practice, those approaches haven't been used
             | in the training of a large-scale model; I haven't seen it
             | at all. The most adventurous thing people have done at
             | scale is Mamba (and RL).
             | 
             | * It had a particular technical meaning. The first round of
             | the telephone game was when it came to mean "a 3 spatial
             | dimensions-like space, with N dimensions, in an image
             | diffusion model, that contains all possible image styles,
             | that is navigated by a prompt." We're many iterations
             | afield of it now, I'm afraid. Now, you sort of have to
             | interpret it like you would negative space: defined by
             | what is around it.
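             | 
             | To make the mechanics concrete (purely a toy sketch of
             | the pause-token idea on a stand-in model, not from any of
             | those papers; they train the extra token jointly with the
             | model, so a fresh, untrained <pause> embedding like this
             | won't actually help quality):
             | 
             |     from transformers import (AutoTokenizer,
             |                               AutoModelForCausalLM)
             |     
             |     # Stand-in model; the papers do this during
             |     # training, not as a post-hoc hack.
             |     tok = AutoTokenizer.from_pretrained("gpt2")
             |     model = AutoModelForCausalLM.from_pretrained("gpt2")
             |     
             |     # Register a dedicated placeholder token and grow
             |     # the embedding table to match.
             |     tok.add_special_tokens(
             |         {"additional_special_tokens": ["<pause>"]})
             |     model.resize_token_embeddings(len(tok))
             |     
             |     # Insert a few "no-op" positions before the answer
             |     # so the model gets extra forward passes to attend
             |     # over.
             |     prompt = "Q: What is 17 * 24?\nA:" + " <pause>" * 8
             |     inputs = tok(prompt, return_tensors="pt")
             |     out = model.generate(**inputs, max_new_tokens=16)
             |     print(tok.decode(out[0], skip_special_tokens=True))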
        
           | integralof6y wrote:
           | Printing a bunch of whitespace is a way of entering a new
           | state (I am thinking of a state machine), so the LLM can
           | use that whitespace as a new token that can later be used
           | to refine the state of the system. In math terms,
           | whitespace is a tag for a class (or state) in the LLM, and
           | I think RL can perhaps take advantage of such tags. For
           | example, whitespace could indicate a point of low gradient
           | (indetermination) or a branching point; the LLM would in
           | some way learn to raise its learning-rate parameter, so the
           | message in the head of the LLM is: be ready to learn from
           | RL, because in your current state you need to take a branch
           | from a branching point that can enhance your capabilities.
           | This is similar to tossing a coin or a die. The rule could
           | be: when whitespace occurs, increase the learning-rate
           | parameter to escape from zero-gradient points.
           | 
           | Caveat emptor: this is just speculation; I don't have any
           | data to support this hypothesis. It also suggests that
           | whitespace could be a "token that reflects the state of
           | previous layers" that is not contained in the vocabulary
           | used to train the model, so I should say that whitespace is
           | a macro-token or neuro-token. If this hypothesis has some
           | ground, it could also be plausible that whitespace is an
           | enumerative neural tag, in the sense that the length of the
           | whitespace reflects or is related to the layer in which the
           | zero-gradient or branching point occurs.
           | 
           | Finally, my throwaway user needs whitespace, so I will
           | change the password to a random one to force myself to
           | avoid adding new ideas.
        
           | andai wrote:
           | Here's one where just appending dots to the output improved
           | quality:
           | 
           | https://arxiv.org/abs/2404.15758
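           | 
           | (The filler there is literally just repeated dots rather
           | than a trained special token, so a minimal illustration of
           | the mechanics needs no new vocabulary at all. Again purely
           | a toy sketch on a stand-in model; untrained filler on an
           | off-the-shelf model won't reproduce the paper's gains:)
           | 
           |     from transformers import (AutoTokenizer,
           |                               AutoModelForCausalLM)
           |     
           |     tok = AutoTokenizer.from_pretrained("gpt2")
           |     model = AutoModelForCausalLM.from_pretrained("gpt2")
           |     
           |     # Pad the prompt with meaningless filler before the
           |     # answer slot to buy extra forward passes.
           |     prompt = "Q: Is 391 divisible by 17?\nA:" + " ." * 32
           |     inputs = tok(prompt, return_tensors="pt")
           |     out = model.generate(**inputs, max_new_tokens=8)
           |     print(tok.decode(out[0]))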
        
         | YeGoblynQueenne wrote:
         | More tokens = fewer options for the final string. It's not any
         | more complicated than that and it doesn't require any
         | reasoning, just an autoregressively trained statistical model
         | of text, but no, it has to be "the model thinks harder if it
         | outputs more tokens".
        
       | mentalgear wrote:
       | Overall the industry needs more review, less hype. I was shocked
       | to find out SWE-verified [0] is anything but verified.
       | 
       | [0] benchmark used by all major vendors to "showcase" coding
       | ability, turns out to be <10% properly solved:
       | https://www.youtube.com/watch?v=QnOc_kKKuac
        
         | belter wrote:
         | Failure modes are also interesting for showing what is really
         | happening or not happening. Like the test of asking GenAI to
         | create clocks at specific times, or people drawing with the
         | left hand. All you get are clocks at 10 minutes past two, or
         | people drawing with the right hand, since that's 99% of what
         | is in the training data.
         | 
         | As Sabine says, if the LLM models have already read all the
         | math books in the world but are still not able to do basic
         | math without calling on a calculator, how much reasoning is
         | really emerging?
         | 
         | "The Path to AGI is Coming Into View":
         | https://youtu.be/mfbRHhOCgzs?t=219
        
           | refulgentis wrote:
           | I'm not sure what Sabine means. It is a somewhat obvious
           | category error, and fundamentally in error regardless. (I
           | find it hard to believe that, for example, Sabine would
           | beat an LLM on a random selection of 10 three-digit-by-
           | three-digit multiplication problems, to be completed in 60
           | seconds max by either party.)
        
           | immibis wrote:
           | Or overflowing wine glasses.
        
           | fragmede wrote:
           | numeracy isn't mathematical reasoning
        
       | fancyfredbot wrote:
       | The article starts by saying
       | 
       | "DeepSeek-V3-Base already exhibit 'Aha moment'."
       | 
       | I tried to read the screenshot they present as evidence of this,
       | and indeed it does say "Aha!". But both the preceding reasoning
       | and the following conclusion look like gibberish to me. I'm not
       | sure what we're supposed to conclude here and I gave up reading
       | the article after this inauspicious start.
        
       ___________________________________________________________________
       (page generated 2025-03-23 23:02 UTC)