[HN Gopher] Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"
       ___________________________________________________________________
        
       Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"
        
       Author : kcorbitt
       Score  : 74 points
       Date   : 2025-03-06 19:51 UTC (3 hours ago)
        
 (HTM) web link (openpipe.ai)
 (TXT) w3m dump (openpipe.ai)
        
       | kcorbitt wrote:
       | One of the authors here. Happy to answer any questions about our
       | methods/results!
        
         | bydgjohc wrote:
         | Any hypotheses on why the performance dropped suddenly while
         | training?
        
           | bradhilton wrote:
           | Hi, other author here. I think the models converged on
           | shallow/greedy strategies that improved performance up to a
           | point, but are ultimately shortsighted, especially for harder
           | puzzles.
           | 
           | Something interesting I noticed in the responses was that for
            | shorter puzzles it would make deductions, building up a set of
            | additional "clues" for itself, before answering the question.
           | However, for harder puzzles with more clues it would often
           | merely repeat all the given clues and then try to directly
           | answer the questions.
           | 
           | Maybe some form of curriculum learning would help, starting
           | with easier puzzles and progressing to more challenging ones.
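            | 
            | A very rough sketch of what that curriculum could look like
            | (hypothetical Python, not the recipe from the post; the
            | puzzle fields and names are assumptions):
            | 
            |     import random
            | 
            |     def curriculum_batch(puzzles, iteration, total_iterations,
            |                          batch_size=16):
            |         # Rank puzzles from easy to hard by clue count.
            |         ranked = sorted(puzzles, key=lambda p: len(p["clues"]))
            |         # Unlock a growing slice of the difficulty range, but
            |         # always keep at least one batch's worth available.
            |         frac = min(1.0, (iteration + 1) / total_iterations)
            |         pool = ranked[:max(batch_size, int(frac * len(ranked)))]
            |         return random.sample(pool, batch_size)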
           | 
           | Other ideas to explore include:
           | 
            | - Distilling responses from stronger models
            | - Encouraging exploration with entropy regularization or
            |   reward shaping
            | - Training from base models instead of instruct models, like
            |   DeepSeek-R1-Zero
        
           | bradhilton wrote:
            | As for why they dropped _suddenly_, I don't really know.
            | Sometimes models develop degenerate behaviors, but even when
            | forking from the best checkpoint and lowering the learning
            | rate or changing other hyperparameters, performance still
            | drops. It's as if its fate had already been sealed many
            | iterations earlier.
        
         | snovv_crash wrote:
         | Do you have any other logic puzzles you could use to see if the
         | performance generalises?
        
           | kcorbitt wrote:
           | To be honest, I don't expect the performance to generalize to
           | other task types with this specific training regime. If we
           | had a panel of like 30 logic puzzles and cross-trained
           | against all of them simultaneously it might though.
           | 
           | I think there's a lot of benefit to discovering a training
           | regime that allows small specialized models to do extremely
           | well in one narrow task; if we can figure out how to make
           | small models that beat SOTA on a specific task and are cheap
           | to train and run, that's in some ways a more useful outcome
           | than a very large model that is good at many tasks (but is
           | more expensive to run for each of them).
        
         | mdp2021 wrote:
         | Can I just wholeheartedly congratulate you for having found a
         | critical benchmark to evaluate LLMs. Either they achieve 100%
         | accuracy in your game, or they cannot be considered
         | trustworthy. I remain very confident that modules must be added
         | to the available architectures to achieve the "strict 100%"
         | result.
        
         | pama wrote:
         | Can you elaborate on this point:
         | 
         | " We discovered that meaningful performance improvements, as
         | high as 10-15%, can be achieved with as few as 16 training
         | examples."
         | 
         | In particular, did you need to change the hyperparameters much,
         | and did this limited recipe show different improvements for the
         | larger vs smaller models? Also, how did you select these 16
         | examples?
        
           | bradhilton wrote:
            | No meaningful changes to the hyperparameters; we just changed
            | the tasks per iteration to 16 and trained on the same first
            | 16 training tasks each iteration.
           | 
           | We only tested this with the 14B model. You can see the run
           | here:
           | 
           | https://wandb.ai/bradhilton/rl-experiments/runs/062
           | 
            | Performance peaked after 21 iterations at 45% accuracy
            | instead of the final 59%, but that is still a significant
            | gain from very few training examples.
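            | 
            | Concretely, the change amounts to something like this
            | (hypothetical sketch; the variable names are assumptions):
            | 
            |     TASKS_PER_ITERATION = 16
            |     # Reuse the same first 16 training tasks every iteration.
            |     fixed_tasks = train_tasks[:TASKS_PER_ITERATION]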
        
             | pama wrote:
             | Thanks.
        
         | malcolmgreaves wrote:
         | Please define an acronym the first time you use it in the body
          | text. I had to scroll about 20% of the way through your article
         | just to understand the title.
        
           | bradhilton wrote:
           | Great point! Thanks for the feedback.
        
       | behnamoh wrote:
        | This is the same team that, a few months ago here on Hacker
        | News, talked about how to do fine-tuning on large language
        | models, and then made it closed source.
        
       | Imnimo wrote:
       | >To speed up our experiments, we omitted the Kullback-Leibler
       | (KL) divergence penalty, although our training recipe supports it
       | for interested readers.
       | 
       | I am very curious whether omitting the KL penalty helps on narrow
       | domains like this, and also whether doing so results in illegible
       | reasoning. (From the samples in the post, it looks like it
       | doesn't make reasoning illegible?)
       | 
       | >the 32B model's response lengths collapsing, especially after
       | reaching peak performance.
       | 
       | I would not have predicted this. Nor that it could collapse its
       | response length to near zero yet lose only a few percentage
       | points of accuracy. If you do SFT to get a model of the same size
       | to solve these puzzles with no reasoning (just output answers
        | directly), how well can it do?
        
         | bradhilton wrote:
          | Yeah, it may help. In this paper[1], the author used a KL
          | penalty of 0.01 for general tasks and 0.001 for mathematical
          | tasks. I tend to think it's probably not very important unless
          | you're trying to optimize for human preferences.
         | 
          | As for response length, I think the model internalizes the
          | logic and no longer deliberates over its answers by writing
          | out reasoning in context. I don't think this is necessarily
          | good for general reasoning, but for a specific task it would
          | cut down inference costs. It just depends on what you're
          | optimizing for. To encourage more general reasoning, I think a
          | broader training and validation set would be helpful.
         | 
         | [1] https://arxiv.org/html/2501.03262v1
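          | 
          | For reference, a minimal sketch of how a KL term is typically
          | folded into a policy-gradient loss (illustrative PyTorch using
          | the common k3-style estimator, not the exact implementation in
          | the training recipe):
          | 
          |     import torch
          | 
          |     def pg_loss_with_kl(logps, ref_logps, advantages,
          |                         kl_coef=0.01):
          |         # Policy-gradient term weighted by per-token advantages.
          |         pg = -(advantages * logps).mean()
          |         # Estimate KL(pi || pi_ref) from sampled tokens:
          |         # E[exp(log r) - log r - 1], with r = pi_ref / pi.
          |         log_ratio = ref_logps - logps
          |         kl = (torch.exp(log_ratio) - log_ratio - 1).mean()
          |         # kl_coef = 0 recovers the no-penalty setup.
          |         return pg + kl_coef * kl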
        
         | jstanley wrote:
         | I keep seeing people mention "illegible reasoning" but I'd be
         | fascinated to see an example of what it actually looks like. Do
         | you have any examples?
         | 
         | Apparently DeepSeek-R1 can switch between English, Chinese, and
         | gibberish, and even the gibberish helps it think! That's
         | fascinating, but all I can find is people _saying_ it, nobody
         | showing it.
        
           | Imnimo wrote:
           | Here's an example of language switching:
           | 
           | https://gr.inc/question/although-a-few-years-ago-the-
           | fundame...
           | 
           | In the dropdown set to DeepSeek-R1, switch to the LIMO model
           | (which apparently has a high frequency of language
           | switching).
           | 
           | I'm not sure about examples of gibberish or totally illegible
           | reasoning. My guess is that since R1-Zero still had the KL
           | penalty, it should all be _somewhat_ legible - the KL penalty
           | encourages the model to not move too far from what the base
           | model would say in any given context.
        
             | jstanley wrote:
             | Thanks, that's cool to see. I hadn't seen this site before
             | but browsing around I also found this example:
             | https://gr.inc/question/why-does-the-professor-say-this-
             | good... - also with LIMO.
        
       | Tostino wrote:
        | I couldn't quickly find it by searching your GitHub, but which
        | layers did you end up targeting for training? It would be
        | interesting to see an ablation on targeting different sets of
        | layers (train only attention layers, freeze the first 30% of the
        | layers and train the remaining 70%, etc.).
        
         | bradhilton wrote:
         | We trained all the parameters. Those would definitely be
         | interesting ablations. I would also like to see how much of a
         | performance hit we would take with PEFT methods like LoRA.
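          | 
          | For anyone curious what a LoRA ablation might look like, here
          | is a minimal sketch with Hugging Face peft (the model id and
          | target module names are assumptions, typical for Qwen-style
          | models):
          | 
          |     from peft import LoraConfig, get_peft_model
          |     from transformers import AutoModelForCausalLM
          | 
          |     model = AutoModelForCausalLM.from_pretrained(
          |         "Qwen/Qwen2.5-14B-Instruct")
          |     lora_config = LoraConfig(
          |         r=16,
          |         lora_alpha=32,
          |         lora_dropout=0.05,
          |         # Adapt only the attention projections; everything
          |         # else stays frozen.
          |         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
          |         task_type="CAUSAL_LM",
          |     )
          |     model = get_peft_model(model, lora_config)
          |     model.print_trainable_parameters()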
        
       | layer8 wrote:
       | GRPO = Group Relative Policy Optimization
       | 
       | https://arxiv.org/abs/2402.03300
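        | 
        | The core idea, roughly: sample a group of responses per prompt
        | and use within-group reward normalization as the advantage,
        | instead of a learned value baseline (illustrative sketch):
        | 
        |     import torch
        | 
        |     def group_relative_advantages(rewards):
        |         # rewards: shape (G,), one scalar reward per response
        |         # sampled for the same prompt.
        |         return (rewards - rewards.mean()) / (rewards.std() + 1e-8)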
        
       | randomcatuser wrote:
       | Wait, what's the difference between using GRPO and traditional
       | fine-tuning of Qwen using your provided dataset?
       | 
       | Would be super interesting to see which one is more data-
       | efficient!
        
         | bradhilton wrote:
         | Great question! So the dataset includes prompts and solutions,
         | but no "gold" answer per se to use for SFT. You could sample
         | responses from larger models and then train the smaller model
         | on their answers, but as outlined in the benchmarks there is
         | still a lot of headroom on this task and I wouldn't expect that
         | to get the same results. At the very least you would probably
         | want to do rejection sampling to discard bad results. It would
         | definitely be a good experiment!
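          | 
          | A rough sketch of that rejection-sampling baseline
          | (hypothetical; sample_fn and extract_solution are stand-in
          | callables, not helpers from the repo):
          | 
          |     def build_sft_dataset(puzzles, sample_fn, extract_solution,
          |                           n_samples=8):
          |         kept = []
          |         for puzzle in puzzles:
          |             # Sample several candidates from a stronger model
          |             # and keep the first whose answers match the known
          |             # solution.
          |             for response in sample_fn(puzzle["prompt"],
          |                                       n=n_samples):
          |                 if extract_solution(response) == puzzle["solution"]:
          |                     kept.append({"prompt": puzzle["prompt"],
          |                                  "completion": response})
          |                     break
          |         return kept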
        
       ___________________________________________________________________
       (page generated 2025-03-06 23:00 UTC)