[HN Gopher] TinyZero
___________________________________________________________________
TinyZero
Author : fzliu
Score : 189 points
Date : 2025-01-25 03:38 UTC (19 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Tepix wrote:
| Unrolled non-X link with the announcement:
| https://threadreaderapp.com/thread/1882839370505621655.html
| blackeyeblitzar wrote:
| What does it mean to reproduce DeepSeek R1-Zero? Like they have a
| model of equivalent performance? Is there a simple explanation of
| this post for those who aren't machine learning experts?
|
| Also is the technique here related at all to the technique people
| think DeepSeek themselves used, where they apparently trained the
| model using OpenAI outputs?
| 3nthusia5t wrote:
| Could you provide a source for the claim that they trained the
| model on OpenAI outputs? I can't find any news about that.
| blackeyeblitzar wrote:
| I don't have a source to share, but I saw this claim on
| social media a few times in the last couple days, where
| people said their conversation with the model revealed that
| it thought it was some other OpenAI model. I have no idea how
| such training can work using another model's output, but I
| saw comments claiming that this is why their training was so
| cheap.
| suraci wrote:
| > What does it mean to reproduce DeepSeek R1-Zero?
|
| means it's reproducible
| coolThingsFirst wrote:
| Westerners can't reproduce Chinese geniuses
| evertedsphere wrote:
| reproducing the AlphaZero-like "model learns to reason on its
| own without supervised fine-tuning" phenomenon that
| DeepSeek-R1-Zero exhibited
| dvh wrote:
| Reminds me of the old Polish encyclopedia: horse - everybody
| knows what a horse is
|
| https://en.wikipedia.org/wiki/Nowe_Ateny
| coolThingsFirst wrote:
| I think there are 2 levels in the brain.
|
| One is used for programming, the other for language. Doing them
| in parallel fails for some reason.
|
| A lot of GH projects just don't have a solid explanation - I
| don't know what they built.
| zamadatix wrote:
| R1-Zero is trained differently than most reasoning models, such
| as the "normal" R1 model, in regard to which steps are done in
| training. TinyZero applies the same approach (but only to a
| subset of use cases) to a much smaller model, to show that it
| works at that scale as well.
|
| The details of how the training differs start to get into
| "machine learning expert" territory, but you can get a decent
| high-level picture via a casual read-through of the DeepSeek
| link if you want to dive deeper.
| serialx wrote:
| So to my understanding, this work reproduces DeepSeek R1's
| reinforcement learning mechanism in a very small language model.
|
| The AI gets "rewards" (like points) for doing two things
| correctly:
|
| Accuracy : Getting the right answer. For example, math answers
| must be in a specific format (e.g., inside a box) so a computer
| can easily check them. For coding problems, test cases verify if
| the code works.
|
| Format : Using the <think> and <answer> tags properly. This
| forces the AI to organize its responses clearly.
|
| So in this case, the training program can extract the model's
| answer by parsing the <answer> tag. We can then check whether
| the answer is correct: if it is, give a reward; otherwise, no
| reward.
|
| Sample N such answers from a single question to get an array of
| N rewards. This is enough for the RL algorithm to guide the
| model to become smarter.
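|
| A minimal sketch of that scoring step in Python (function names,
| tag regexes, and reward values here are illustrative, not
| TinyZero's actual implementation):
|
|     import re
|
|     def compute_reward(response: str, ground_truth: str) -> float:
|         """Score one sampled response: format reward + accuracy reward."""
|         reward = 0.0
|         # Format reward: reasoning wrapped in <think>, answer in <answer>.
|         if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
|                      response, re.DOTALL):
|             reward += 0.1
|         # Accuracy reward: extract the answer and compare to ground truth.
|         match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
|         if match and match.group(1).strip() == ground_truth.strip():
|             reward += 1.0
|         return reward
|
|     # One question, N sampled answers -> an array of N rewards for the
|     # RL step (stand-in completions; in training these come from the model).
|     sampled_responses = [
|         "<think>6*7</think> <answer>42</answer>",
|         "<think>6+7</think> <answer>13</answer>",
|     ]
|     rewards = [compute_reward(r, "42") for r in sampled_responses]
|     # rewards == [1.1, 0.1]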
| suraci wrote:
| It looks like 'old-school' RL to me, which makes me wonder
| why it took so long to get here
| vixen99 wrote:
| Nothing like acronyms to make me feel dumb and ill-informed.
| basementcat wrote:
| Reinforcement Learning
|
| https://en.m.wikipedia.org/wiki/Reinforcement_learning
| krackers wrote:
| I've been trying to follow the literature on PPO/GRPO as
| applied to LLMs. From what I understand, since reward is only
| given once the entire COT sequence is sampled, traditional RL
| techniques would require some form of credit-assignment to
| distribute that reward amongst individual tokens - which is
| where the critic/value network comes in, right?
|
| Instead DeepSeek (with GRPO) seems to just omit that value
| function entirely and use only sparse rewards. How does this
| end up being more efficient, since I thought the sparse nature
| of rewards makes it harder to converge to the optimal policy?
| serialx wrote:
| I don't think it's using only sparse rewards, because of the
| format rewards. The training recipe is pretty comprehensive and
| involves multiple stages. [1] The paper mentions that when only
| using the RL technique, the output is often not suitable for
| reading (language mixing, etc). That feels like an AlphaZero
| moment for LLMs?
|
| [1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...
| krackers wrote:
| The R1 paper says that they didn't use "process reward
| modeling". And the paper that introduced GRPO says that it
| can be used either with "outcome supervision" or "process
| supervision", with outcome supervision "only provid[ing] a
| reward at the end of each output". Put together, doesn't
| that imply R1 uses sparse rewards provided only at the end
| of the COT sequence?
| serialx wrote:
| Ah sorry, you might be right. I meant "sparse reward" as
| a reward system that is mostly 0 but occasionally 1. Your
| "sparse reward" means only providing reward at the end of
| each output.
| HeatrayEnjoyer wrote:
| > Ah sorry, you might be right. I meant "sparse reward"
| as a reward system that is mostly 0 but occasionally 1.
|
| Did we introduce the abusive pressure of Korean
| educational culture to machines?
| amluto wrote:
| The part I found strange: these RL formulations give no reward
| for incorrect solutions, so unless there are training examples
| that are easy enough for the base model to solve, the RL
| process won't do anything.
|
| So is the actual magic that the base models are good enough to
| sometimes generate successful CoT output in their unmodified
| state? Or did I miss something in the R1 paper and the code
| here?
| Imanari wrote:
| I was wondering the same thing. I feel there is too large of
| a gap between a raw base model and a model that produces
| fully correct answers and follows a specific format. My guess
| is their rule-based reward system is more nuanced than just
| correctness and format.
| krackers wrote:
| Yeah, I find this part unclear as well. My best guess is
| that it's not simply binary "correct/incorrect" but rather
| the reward is made up of multiple parts (e.g. format +
| correctness) and structured so that "close enough" answers
| still get some reward. From there I would expect that a base
| model might at least be able to "autocomplete" the
| format/style, at which point the RL machinery would kick in
| to tune it to properly obey the format, and once that's
| mastered, eventually correctness.
|
| They did mention something about tuning on an un-SFT'd base
| model being much slower than 'warming it up' with some
| existing reasoning traces.
| zby wrote:
| I think this is where the relative rewards come into play -
| they sample many thinking traces and reward those that are
| correct. This works at the current 'cutting edge' for the
| model - exactly where it can be improved.
| zby wrote:
| I think the reward is relative to the other sampled answers for
| the same question. This way the signal is strong at the very
| margin of what is possible with a given model, and there is
| less noise from impossible or too-easy questions.
|
| There is some confusion here - they do compute that simple
| reward, but then they convert it to a relative value and call
| it the advantage. And I think they use that advantage to update
| the model - not the raw reward.
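|
| A rough sketch of that reward-to-advantage conversion (assuming
| a simple per-group z-score, which is how the GRPO-style relative
| reward is usually described; DeepSeek's exact normalization may
| differ):
|
|     import numpy as np
|
|     def group_relative_advantages(rewards):
|         """Normalize the raw rewards of N samples of one question
|         within their group, so traces are compared to each other
|         rather than to an absolute scale."""
|         r = np.asarray(rewards, dtype=float)
|         return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids /0
|
| The policy update then scales each trace's log-probabilities by
| this advantage instead of by the raw reward.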
| krackers wrote:
| Yes, you're right. In their paper I think they say the process
| of sampling multiple traces and then taking relative rewards is
| supposed to Monte-Carlo approximate the value network? I
| don't really have the intuition for that, but it does make
| sense that rather than simply nudging probabilities in the
| direction of the trace with the highest absolute reward, you
| want to favor the trace which had the best reward relative to
| the current state. E.g. for quick intuition: if the absolute
| rewards for traces were {0, 0, 0, 0.01}, then using absolute
| rewards would only give a weak signal (nudge weights
| proportional to 0.01 * logprob) for the last trace, but using
| relative rewards (based on z-score) would give a much stronger
| signal of about 1.5 * logprob.
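|
| Working the {0, 0, 0, 0.01} example through a z-score (whether
| the outlier comes out near 1.5 or 1.7 just depends on using the
| sample vs. population standard deviation as the group baseline):
|
|     import numpy as np
|
|     r = np.array([0.0, 0.0, 0.0, 0.01])
|     z_sample = (r - r.mean()) / r.std(ddof=1)  # [-0.5, -0.5, -0.5, 1.5]
|     z_pop    = (r - r.mean()) / r.std()        # [-0.58, -0.58, -0.58, 1.73]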
| zby wrote:
| Not only that - if you have {0,0,0,0.01}, then the
| probability that you would get any reward in one shot would
| be very low. And I also have the intuition that giving the
| rewards to traces at the edge is more efficient - because
| the model needs only a small perturbation to get them right.
| If you gave negative rewards to traces that are very far from
| being right, then the model might be steered in the wrong
| direction.
| nxobject wrote:
| The author notes in their Twitter announcement [a] that their
| model's reasoning abilities are only validated within the domain
| of their Countdown training material. They admit that the real
| test of this training method is whether it produces outputs
| that pass the sniff test in other subject domains, or even in
| abstract reasoning. However, given that there are "standardized
| test style" abstract reasoning tests with relatively small
| corpora (e.g. ZebraLogic [b], on the order of 1000 or so cases),
| I do think they missed an opportunity to... do _some_ small
| benchmark for abstract reasoning before the announcement.
|
| [a] https://threadreaderapp.com/thread/1882839370505621655.html -
| thanks @Tepix
|
| [b] https://huggingface.co/blog/yuchenlin/zebra-logic
___________________________________________________________________
(page generated 2025-01-25 23:01 UTC)