[HN Gopher] Using GRPO to Beat o1, o3-mini and R1 at "Temporal C...
___________________________________________________________________
Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"
Author : kcorbitt
Score : 74 points
Date : 2025-03-06 19:51 UTC (3 hours ago)
(HTM) web link (openpipe.ai)
(TXT) w3m dump (openpipe.ai)
| kcorbitt wrote:
| One of the authors here. Happy to answer any questions about our
| methods/results!
| bydgjohc wrote:
| Any hypotheses on why the performance dropped suddenly while
| training?
| bradhilton wrote:
| Hi, other author here. I think the models converged on
| shallow/greedy strategies that improve performance up to a
| point but are ultimately shortsighted, especially for harder
| puzzles.
|
| Something interesting I noticed in the responses was that for
| shorter puzzles it would make deductions, building up a set of
| additional "clues" for itself, before answering the question.
| However, for harder puzzles with more clues it would often
| merely repeat all the given clues and then try to directly
| answer the questions.
|
| Maybe some form of curriculum learning would help, starting
| with easier puzzles and progressing to more challenging ones.
|
| Other ideas to explore include:
|
| - Distilling responses from stronger models
| - Encouraging exploration with entropy regularization or reward
|   shaping (a rough sketch follows below)
| - Training from base models instead of instruct models, like
|   DeepSeek-R1-Zero
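|
| A rough sketch of the entropy-bonus idea in PyTorch, purely to
| illustrate; this is not our training code, and the tensor
| shapes and the `entropy_coef` name are made up for the example:
|
|     import torch.nn.functional as F
|
|     def pg_loss_with_entropy_bonus(logits, actions, advantages,
|                                    entropy_coef=0.01):
|         # logits:     (batch, seq, vocab) for sampled responses
|         # actions:    (batch, seq) token ids actually sampled
|         # advantages: (batch, seq) per-token advantage estimates
|         log_probs = F.log_softmax(logits, dim=-1)
|         action_log_probs = log_probs.gather(
|             -1, actions.unsqueeze(-1)).squeeze(-1)
|
|         # Policy-gradient term: push up log-probs of tokens
|         # with positive advantage, push down negative ones.
|         pg_loss = -(advantages * action_log_probs).mean()
|
|         # Entropy bonus: subtracting entropy from the loss
|         # rewards keeping the token distribution spread out,
|         # i.e. more exploration.
|         entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
|         return pg_loss - entropy_coef * entropy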
| bradhilton wrote:
| As for why they dropped _suddenly_, I don't really know.
| Sometimes models develop degenerate behaviors, but even when
| forking from the best checkpoint and lowering the learning
| rate or changing other hyperparameters, performance still
| drops. It's as if its fate had already been sealed many
| iterations earlier.
| snovv_crash wrote:
| Do you have any other logic puzzles you could use to see if the
| performance generalises?
| kcorbitt wrote:
| To be honest, I don't expect the performance to generalize to
| other task types with this specific training regime. If we
| had a panel of like 30 logic puzzles and cross-trained
| against all of them simultaneously it might though.
|
| I think there's a lot of benefit to discovering a training
| regime that allows small specialized models to do extremely
| well on one narrow task; if we can figure out how to make
| small models that beat SOTA on a specific task and are cheap
| to train and run, that's in some ways a more useful outcome
| than a very large model that is good at many tasks (but is
| more expensive to run for each of them).
| mdp2021 wrote:
| Can I just wholeheartedly congratulate you for having found a
| critical benchmark to evaluate LLMs. Either they achieve 100%
| accuracy in your game, or they cannot be considered
| trustworthy. I remain very confident that modules must be added
| to the available architectures to achieve the "strict 100%"
| result.
| pama wrote:
| Can you elaborate on this point:
|
| " We discovered that meaningful performance improvements, as
| high as 10-15%, can be achieved with as few as 16 training
| examples."
|
| In particular, did you need to change the hyperparameters much,
| and did this limited recipe show different improvements for the
| larger vs smaller models? Also, how did you select these 16
| examples?
| bradhilton wrote:
| No meaningful changes to the hyperparameters, just changed
| the tasks per iteration to 16 and trained on the same first
| 16 training tasks each iteration.
|
| We only tested this with the 14B model. You can see the run
| here:
|
| https://wandb.ai/bradhilton/rl-experiments/runs/062
|
| Performance peaked after 21 iterations at 45% accuracy,
| compared with the final 59%, but that's still a significant
| improvement from very few samples.
| pama wrote:
| Thanks.
| malcolmgreaves wrote:
| Please define an acronym the first time you use it in the body
| text. I had to scroll about 20% of the way through your article
| just to understand the title.
| bradhilton wrote:
| Great point! Thanks for the feedback.
| behnamoh wrote:
| this is the same team that a few months ago here on hacker news
| talked about how to do fine-tuning on large language models, and
| then made it closed source.
| Imnimo wrote:
| >To speed up our experiments, we omitted the Kullback-Leibler
| (KL) divergence penalty, although our training recipe supports it
| for interested readers.
|
| I am very curious whether omitting the KL penalty helps on narrow
| domains like this, and also whether doing so results in illegible
| reasoning. (From the samples in the post, it looks like it
| doesn't make reasoning illegible?)
|
| >the 32B model's response lengths collapsing, especially after
| reaching peak performance.
|
| I would not have predicted this. Nor that it could collapse its
| response length to near zero yet lose only a few percentage
| points of accuracy. If you do SFT to get a model of the same size
| to solve these puzzles with no reasoning (just output answers
| directly), how well can it do?
| bradhilton wrote:
| Yeah, it may help. In this paper[1], the author used a KL
| penalty of 0.01 for general tasks and 0.001 for mathematical
| ones. I tend to think it's probably not very important unless
| you're trying to optimize for human preferences.
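|
| For reference, one common way to fold the KL penalty in (just
| a sketch, not our recipe; inputs are torch tensors and the
| names are illustrative) is to subtract a per-token KL estimate
| from the reward before computing advantages:
|
|     def kl_penalized_rewards(policy_logprobs, ref_logprobs,
|                              rewards, beta=0.01):
|         # policy_logprobs: (batch, seq) log-probs of the sampled
|         #                  tokens under the model being trained
|         # ref_logprobs:    (batch, seq) log-probs of the same
|         #                  tokens under a frozen reference model
|         # rewards:         (batch,) scalar reward per response
|         # The summed log-ratio grows as the policy drifts away
|         # from the reference, so it acts as a penalty.
|         kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
|         return rewards - beta * kl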
|
| As for response length, I think the model internalizes the
| logic and doesn't deliberate over its answers by generating
| reasoning in context. I don't think this is necessarily good
| for general reasoning, but for a specific task it would cut
| down inference costs. It just depends on what you're optimizing
| for. To encourage more general reasoning, I think a broader
| training and validation set would be helpful.
|
| [1] https://arxiv.org/html/2501.03262v1
| jstanley wrote:
| I keep seeing people mention "illegible reasoning" but I'd be
| fascinated to see an example of what it actually looks like. Do
| you have any examples?
|
| Apparently DeepSeek-R1 can switch between English, Chinese, and
| gibberish, and even the gibberish helps it think! That's
| fascinating, but all I can find is people _saying_ it, nobody
| showing it.
| Imnimo wrote:
| Here's an example of language switching:
|
| https://gr.inc/question/although-a-few-years-ago-the-
| fundame...
|
| In the dropdown set to DeepSeek-R1, switch to the LIMO model
| (which apparently has a high frequency of language
| switching).
|
| I'm not sure about examples of gibberish or totally illegible
| reasoning. My guess is that since R1-Zero still had the KL
| penalty, it should all be _somewhat_ legible - the KL penalty
| encourages the model to not move too far from what the base
| model would say in any given context.
| jstanley wrote:
| Thanks, that's cool to see. I hadn't seen this site before
| but browsing around I also found this example:
| https://gr.inc/question/why-does-the-professor-say-this-
| good... - also with LIMO.
| Tostino wrote:
| I couldn't quickly find it by searching your github, but what
| layers did you end up targeting for training? Would be
| interesting to see an ablation on targeting different sets of
| layers (train only attention layers, freeze the first 30% of the
| layers and train the remaining 70%, etc).
| bradhilton wrote:
| We trained all the parameters. Those would definitely be
| interesting ablations. I would also like to see how much of a
| performance hit we would take with PEFT methods like LoRA.
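|
| For anyone curious, a LoRA setup with Hugging Face's peft
| library would look roughly like this; the checkpoint name and
| hyperparameters are placeholders, not settings we've tested:
|
|     from peft import LoraConfig, get_peft_model
|     from transformers import AutoModelForCausalLM
|
|     model = AutoModelForCausalLM.from_pretrained(
|         "Qwen/Qwen2.5-14B-Instruct")
|
|     # Adapters on the attention projections only; the base
|     # weights stay frozen.
|     config = LoraConfig(
|         r=16,
|         lora_alpha=32,
|         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
|         lora_dropout=0.05,
|         task_type="CAUSAL_LM",
|     )
|     model = get_peft_model(model, config)
|     model.print_trainable_parameters()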
| layer8 wrote:
| GRPO = Group Relative Policy Optimization
|
| https://arxiv.org/abs/2402.03300
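|
| The core idea, roughly sketched (illustrative code based on the
| paper, not the OpenPipe implementation): sample a group of
| responses per prompt and normalize each response's reward
| against its own group, instead of training a separate value
| model.
|
|     import torch
|
|     def group_relative_advantages(rewards, eps=1e-6):
|         # rewards: (num_prompts, group_size) scalar reward for
|         # each of the group_size responses sampled per prompt
|         mean = rewards.mean(dim=1, keepdim=True)
|         std = rewards.std(dim=1, keepdim=True)
|         return (rewards - mean) / (std + eps)
|
|     # e.g. 2 prompts, 4 sampled responses each, reward=accuracy
|     rewards = torch.tensor([[0.25, 0.75, 1.00, 0.50],
|                             [0.00, 0.00, 0.50, 0.25]])
|     advantages = group_relative_advantages(rewards)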
| randomcatuser wrote:
| Wait, what's the difference between using GRPO and traditional
| fine-tuning of Qwen using your provided dataset?
|
| Would be super interesting to see which one is more data-
| efficient!
| bradhilton wrote:
| Great question! So the dataset includes prompts and solutions,
| but no "gold" responses per se to use for SFT. You could sample
| responses from larger models and then train the smaller model
| on their answers, but as outlined in the benchmarks there is
| still a lot of headroom on this task and I wouldn't expect that
| to get the same results. At the very least you would probably
| want to do rejection sampling to discard bad results. It would
| definitely be a good experiment!
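|
| Rejection sampling here could be as simple as the sketch below;
| the sample_fn and extract_answer interfaces are hypothetical:
|
|     def build_sft_dataset(puzzles, sample_fn, extract_answer,
|                           num_samples=8):
|         # puzzles:        dicts with "prompt" and "solution"
|         # sample_fn:      callable(prompt, n) -> n responses
|         #                 from a stronger model (hypothetical)
|         # extract_answer: callable(response) -> parsed answer
|         dataset = []
|         for puzzle in puzzles:
|             samples = sample_fn(puzzle["prompt"], num_samples)
|             for response in samples:
|                 if extract_answer(response) == puzzle["solution"]:
|                     dataset.append({"prompt": puzzle["prompt"],
|                                     "completion": response})
|         return dataset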
___________________________________________________________________
(page generated 2025-03-06 23:00 UTC)