[HN Gopher] TinyZero
       ___________________________________________________________________
        
       TinyZero
        
       Author : fzliu
       Score  : 189 points
       Date   : 2025-01-25 03:38 UTC (19 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Tepix wrote:
       | Unrolled non-X link with the announcement:
       | https://threadreaderapp.com/thread/1882839370505621655.html
        
       | blackeyeblitzar wrote:
       | What does it mean to reproduce DeepSeek R1-Zero? Like they have a
       | model of equivalent performance? Is there a simple explanation of
       | this post for those who aren't machine learning experts?
       | 
       | Also is the technique here related at all to the technique people
       | think DeepSeek themselves used, where they apparently trained the
       | model using OpenAI outputs?
        
         | 3nthusia5t wrote:
         | Could you provide a source for the claim that the model was
         | trained on OpenAI outputs? I can't find any news about that.
        
           | blackeyeblitzar wrote:
           | I don't have a source to share, but I saw this claim on
           | social media a few times in the last couple days, where
           | people said their conversation with the model revealed that
           | it thought it was some other OpenAI model. I have no idea how
           | such training can work using another model's output, but I
           | saw comments claiming that this is why their training was so
           | cheap.
        
         | suraci wrote:
         | > What does it mean to reproduce DeepSeek R1-Zero?
         | 
         | means it's reproducible
        
           | coolThingsFirst wrote:
           | Westerners can't reproduce Chinese geniuses.
        
         | evertedsphere wrote:
         | Reproducing the AlphaZero-like "model learns to reason on
         | its own without supervised fine-tuning" phenomenon that
         | DeepSeek-R1-Zero exhibited.
        
         | dvh wrote:
         | Reminds me of an old Polish encyclopedia: "Horse - everyone
         | knows what a horse is."
         | 
         | https://en.wikipedia.org/wiki/Nowe_Ateny
        
         | coolThingsFirst wrote:
         | I think there are two levels in the brain.
         | 
         | One is used for programming, the other for language. Doing
         | them in parallel fails for some reason.
         | 
         | A lot of GitHub projects just don't have a solid explanation
         | - I don't know what they built.
        
         | zamadatix wrote:
         | R1-Zero is trained differently than most reasoning models,
         | including the "normal" R1 model, in terms of which steps are
         | done during training. TinyZero applies the same approach
         | (but only on a subset of use cases) to a much smaller model,
         | to show that it works at much smaller scales as well.
         | 
         | The details of how it's trained differently start to get
         | into "machine learning expert" territory, but you can get a
         | decent high-level picture via a casual read-through of the
         | DeepSeek link if you want to dive deeper.
        
       | serialx wrote:
       | So to my understanding, this work reproduces DeepSeek R1's
       | reinforcement learning mechanism in a very small language model.
       | 
       | The AI gets "rewards" (like points) for doing two things
       | correctly:
       | 
       | Accuracy : Getting the right answer. For example, math answers
       | must be in a specific format (e.g., inside a box) so a computer
       | can easily check them. For coding problems, test cases verify if
       | the code works.
       | 
       | Format : Using the <think> and <answer> tags properly. This
       | forces the AI to organize its responses clearly.
       | 
       | So in this case, the training program can extract the model's
       | answer by parsing the <answer> tag. We can then evaluate
       | whether the answer is correct or not: if it's correct, give a
       | reward; otherwise, no reward.
       | 
       | Sample N such answers from a single question, producing an
       | array of N rewards. This is enough for the RL algorithm to
       | guide the model to become smarter.
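       | 
       | A minimal sketch of what such a rule-based reward could look
       | like (the tag names follow the description above; the exact
       | score values are illustrative assumptions, not TinyZero's
       | actual code):
       | 
       |     import re
       | 
       |     def compute_reward(response: str, ground_truth: str) -> float:
       |         """Rule-based reward: format reward + accuracy reward."""
       |         # Format reward: did the model emit <think>...</think>
       |         # followed by <answer>...</answer>?
       |         has_format = bool(re.search(
       |             r"<think>.*?</think>\s*<answer>.*?</answer>",
       |             response, re.DOTALL))
       |         format_reward = 0.1 if has_format else 0.0
       | 
       |         # Accuracy reward: parse the <answer> tag and compare
       |         # against the known ground truth.
       |         m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
       |         answer = m.group(1).strip() if m else None
       |         gt = ground_truth.strip()
       |         accuracy_reward = 1.0 if answer == gt else 0.0
       | 
       |         return format_reward + accuracy_reward
       | 
       |     # Score N sampled answers to one question:
       |     # rewards = [compute_reward(r, truth) for r in responses]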
        
         | suraci wrote:
         | It looks like 'old-school' RL to me, which makes me wonder
         | why it took so long to get here.
        
           | vixen99 wrote:
           | Nothing like acronyms to make me feel dumb and ill-informed.
        
             | basementcat wrote:
             | Reinforcement Learning
             | 
             | https://en.m.wikipedia.org/wiki/Reinforcement_learning
        
         | krackers wrote:
         | I've been trying to follow the literature on PPO/GRPO as
         | applied to LLMs. From what I understand, since reward is
         | only given once the entire CoT sequence is sampled,
         | traditional RL techniques would require some form of credit
         | assignment to distribute that reward amongst individual
         | tokens - which is where the critic/value network comes in,
         | right?
         | 
         | Instead DeepSeek (with GRPO) seems to just omit that value
         | function entirely and use only sparse rewards. How does this
         | end up being more efficient, since I thought the sparse nature
         | of rewards makes it harder to converge to the optimal policy?
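         | 
         | A rough sketch of how I picture the outcome-supervision case
         | (my own toy code, not from either paper): the single
         | end-of-sequence reward is normalized against the other
         | samples for the same prompt and then broadcast to every
         | token, instead of a learned value network assigning
         | per-token credit.
         | 
         |     import numpy as np
         | 
         |     def grpo_advantages(group_rewards, seq_lens):
         |         """Group-relative advantages, broadcast per token."""
         |         # One scalar outcome reward per sampled CoT for the
         |         # same prompt; no learned critic/value network.
         |         r = np.asarray(group_rewards, dtype=float)
         |         norm = (r - r.mean()) / (r.std() + 1e-8)
         |         # Every token of a sequence gets that sequence's
         |         # advantage.
         |         return [np.full(n, a) for n, a in zip(seq_lens, norm)]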
        
           | serialx wrote:
           | I don't think it's only using sparse rewards, because of
           | the format rewards. The training recipe is pretty
           | comprehensive and involves multiple stages.[1] The paper
           | mentions that when only using the RL technique, the output
           | is often not suitable for reading (language mixing, etc.).
           | That feels like an AlphaZero moment for LLMs?
           | 
           | [1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...
        
             | krackers wrote:
             | The R1 paper says that they didn't use "process reward
             | modeling". And the paper that introduced GRPO says that
             | it can be used either with "outcome supervision" or
             | "process supervision", with outcome supervision "only
             | provid[ing] a reward at the end of each output". Put
             | together, doesn't that imply R1 uses sparse rewards
             | provided only at the end of the CoT sequence?
        
               | serialx wrote:
               | Ah sorry, you might be right. I meant "sparse reward" as
               | a reward system that is mostly 0 but occasionally 1. Your
               | "sparse reward" means only providing reward at the end of
               | each output.
        
               | HeatrayEnjoyer wrote:
               | > Ah sorry, you might be right. I meant "sparse reward"
               | as a reward system that is mostly 0 but occasionally 1.
               | 
               | Did we introduce the abusive pressure of Korean
               | educational culture to machines?
        
         | amluto wrote:
         | The part I found strange: these RL formulations give no reward
         | for incorrect solutions, so unless there are training examples
         | that are easy enough for the base model to solve, the RL
         | process won't do anything.
         | 
         | So is the actual magic that the base models are good enough to
         | sometimes generate successful CoT output in their unmodified
         | state? Or did I miss something in the R1 paper and the code
         | here?
        
           | Imanari wrote:
           | I was wondering the same thing. I feel there is too large
           | of a gap between a raw base model and a model that
           | produces fully correct answers and follows a specific
           | format. My guess is their rule-based reward system is more
           | nuanced than just correctness and format.
        
             | krackers wrote:
             | Yeah I find this part not clearly expressed as well. My
             | best guess is that it's not simply binary
             | "correct/incorrect" but rather the reward is made up of
             | multiple parts (e.g. format + correctness) and structured
             | in a way such that "close enough" answers still get some
             | reward. From there I would expect that a base model might
             | at least be able to "autocomplete" the format/style, at
             | which point RL machinery would kick in to tune it to
             | properly obey the format, and once that's mastered
             | eventually correctness.
             | 
             | They did mention something about tuning on an un-SFT'd
             | base model being much slower than 'warming it up' with
             | some existing reasoning traces.
        
           | zby wrote:
           | I think this is where the relative rewards come into play
           | - they sample many thinking traces and reward those that
           | are correct. This works at the current 'cutting edge' for
           | the model - exactly where it could be improved.
        
         | zby wrote:
         | I think the reward is relative to the other sampled answers
         | for the same question. This way the signal is strong at the
         | very margin of what is possible with a given model, and
         | there is less noise from questions that are impossible or
         | too easy.
         | 
         | There is some confusion because they do compute that simple
         | reward, but then they convert it to a relative value and
         | call it the advantage. And I think they use that advantage
         | to update the model - not the base reward.
        
           | krackers wrote:
           | Yes, you're right. In their paper I think they say the
           | process of sampling multiple traces and then taking
           | relative rewards is supposed to be a Monte Carlo
           | approximation of the value network? I don't really have
           | the intuition for that, but it does make sense that rather
           | than simply nudging probabilities in the direction of the
           | trace with the highest absolute reward, you want to favor
           | the trace which had the best reward relative to the
           | current state. E.g. for quick intuition: if absolute
           | rewards for traces were {0, 0, 0, 0.01}, then using
           | absolute rewards would only give a weak signal (nudge
           | weights proportional to 0.01 * logprob) for the last
           | trace, but using relative rewards (based on z-score) would
           | give a much stronger signal of about 1.5 * logprob.
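           | 
           | A quick numeric check of that intuition (just a sketch;
           | whether the normalization uses the sample or population
           | standard deviation is my assumption here - ddof=1 is what
           | reproduces the 1.5 above):
           | 
           |     import numpy as np
           | 
           |     rewards = np.array([0.0, 0.0, 0.0, 0.01])
           | 
           |     # Group-relative advantage: z-score within the group
           |     # of traces sampled for the same question.
           |     adv = (rewards - rewards.mean()) / rewards.std(ddof=1)
           |     print(adv)  # [-0.5 -0.5 -0.5  1.5]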
        
             | zby wrote:
             | Not only that - if you have {0, 0, 0, 0.01}, then the
             | probability that you would get any reward in one shot
             | would be very low. And I also have the intuition that
             | giving the rewards to traces at the edge is more
             | efficient - because the model needs only a small
             | perturbation to get it right. If you gave negative
             | rewards to traces that are very far from being right,
             | then the model might be steered in the wrong direction.
        
       | nxobject wrote:
       | The author notes in their Twitter announcement [a] that their
       | model's reasoning abilities are only validated within the
       | domain of their Countdown training material. They admit that
       | the real test of this training method is whether it produces
       | outputs that pass the sniff test in other subject domains, or
       | even abstract reasoning. However, given that there are
       | "standardized test style" abstract reasoning tests with
       | relatively small corpora (e.g. ZebraLogic [b], on the order of
       | 1000 or so cases), I do think they missed an opportunity to...
       | do _some_ small benchmark for abstract reasoning before the
       | announcement.
       | 
       | [a] https://threadreaderapp.com/thread/1882839370505621655.html -
       | thanks @Tepix
       | 
       | [b] https://huggingface.co/blog/yuchenlin/zebra-logic
        
       ___________________________________________________________________
       (page generated 2025-01-25 23:01 UTC)