[HN Gopher] Does RL Incentivize Reasoning in LLMs Beyond the Bas...
       ___________________________________________________________________
        
       Does RL Incentivize Reasoning in LLMs Beyond the Base Model?
        
       Author : leodriesch
       Score  : 71 points
       Date   : 2025-04-22 10:24 UTC (12 hours ago)
        
 (HTM) web link (limit-of-rlvr.github.io)
 (TXT) w3m dump (limit-of-rlvr.github.io)
        
       | yorwba wrote:
       | They write "We manually inspect CoT validity to ensure correct
       | answers stem from valid reasoning, not lucky guesses." but the
       | example answer they show at the end only gets the correct number
       | due to two errors canceling out. The model calculates
       | 195+367+562+900 and gets 1924 instead of 2024, and also turns
       | -437 - 2*234 into -805 instead of -905, but in total 1924-805 =
       | 2024-905 = 1119 and from there the remaining steps are correct
       | again.
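        | 
        | A quick sanity check of that cancellation (the numbers are from
        | the example above; the check itself is a small Python sketch,
        | not something from the paper):
        | 
        |     correct_sum = 195 + 367 + 562 + 900  # 2024 (model got 1924)
        |     correct_sub = -437 - 2 * 234         # -905 (model got -805)
        |     print(1924 - 805)                    # 1119, the model's path
        |     print(correct_sum + correct_sub)     # 1119, the correct path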
       | 
       | It would be interesting to know how much of the sampling
       | efficiency improvement from reinforcement learning is due to
       | being better at basic arithmetic (something which could also be
       | achieved by giving the model access to a calculator tool) and how
       | much is due to choosing the correct approach for solving the
       | problem more often.
        
       | spwa4 wrote:
       | I don't like papers that ask a question in the title, so here's
       | the answer:
       | 
       | "RL boosts sampling efficiency but reduces the reasoning capacity
       | boundary."
       | 
        | Perhaps better to put it like this: given one or a few attempts,
        | RL-trained models beat non-RL models; given many attempts,
        | non-RL models come up with better answers.
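        | 
        | For anyone unfamiliar with the metric, a minimal sketch of the
        | standard pass@k estimator (the Codex-paper formulation; the
        | submission may compute it slightly differently):
        | 
        |     from math import comb
        | 
        |     def pass_at_k(n, c, k):
        |         # chance that at least one of k samples is correct,
        |         # given c correct answers among n drawn samples
        |         if n - c < k:
        |             return 1.0
        |         return 1.0 - comb(n - c, k) / comb(n, k)
        | 
        |     print(pass_at_k(n=256, c=4, k=1))    # ~0.016
        |     print(pass_at_k(n=256, c=4, k=256))  # 1.0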
        
         | sitkack wrote:
          | My gut feeling when using DeepSeek is that its performance is
          | a lot smoother; the responses feel more robust and not as
          | brittle.
        
           | cma wrote:
            | At least with DeepSeek Math (trained with the same RL
            | technique as the later R1), they noted similar things in
            | their paper's "Why RL Works?" section. In his video review
            | of the DeepSeek Math paper, Yannic Kilcher goes over that
            | section starting around the 1:04:00 mark and points to
            | basically the same limitations as the HN submission paper.
            | The segment (1:05:40-1:06:09) ends with:
            | 
            |     the improvement is attributed to boosting the correct
            |     response from Top K rather than the enhancement of
            |     fundamental capabilities. This is something that we've
            |     come to learn in a lot of different ways from
            |     reinforcement learning on language models or even
            |     supervised fine-tuning: what's happening most likely is
            |     more that the capabilities of doing all of these things
            |     are already present in the underlying pre-trained
            |     language model.
           | 
           | https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m
           | 
           | from the paper:
           | 
            | > 5.2.2. Why RL Works?
            | >
            | > In this paper, we conduct reinforcement learning based on
            | > a subset of instruction tuning data, and it achieves
            | > significant performance enhancement upon the instruction
            | > tuning model. To further explain why reinforcement
            | > learning works, we evaluate the Pass@K and Maj@K accuracy
            | > of the Instruct and RL models on two benchmarks. As shown
            | > in Figure 7, RL enhances Maj@K's performance but not
            | > Pass@K. These findings indicate that RL enhances the
            | > model's overall performance by rendering the output
            | > distribution more robust; in other words, it seems that
            | > the improvement is attributed to boosting the correct
            | > response from TopK rather than the enhancement of
            | > fundamental capabilities. Similarly, (Wang et al., 2023a)
            | > identified a misalignment problem in reasoning tasks
            | > within the SFT model, showing that the reasoning
            | > performance of SFT models can be improved through a series
            | > of preference alignment strategies (Song et al., 2023;
            | > Wang et al., 2023a; Yuan et al., 2023b).
           | 
           | In the video he reads into this that these methods alone may
           | not at all get us over the data wall and are still
           | fundamentally limited by the distribution of the base model
           | they augment.
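            | 
            | A rough sketch of the pass@K / maj@K distinction for one
            | question (an illustration of the standard definitions, not
            | code from either paper):
            | 
            |     from collections import Counter
            | 
            |     def pass_at_k(samples, gold):
            |         # solved if any of the k samples is correct
            |         return any(s == gold for s in samples)
            | 
            |     def maj_at_k(samples, gold):
            |         # solved if the most common answer is correct
            |         top = Counter(samples).most_common(1)[0][0]
            |         return top == gold
            | 
            |     samples = ["42", "41", "42", "17", "42"]  # k = 5
            |     print(pass_at_k(samples, "42"))  # True
            |     print(maj_at_k(samples, "42"))   # True
            | 
            | RL that sharpens the distribution mostly helps the majority
            | vote land on the right answer, not whether the right answer
            | appears at all among the samples.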
        
             | whatshisface wrote:
             | I don't know a lot about this but it seems like if the
             | sampling performance was adequate, external checks like
             | theorem verification would work to get "over the data
             | wall."
        
               | cma wrote:
                | There have already been good results there with
                | DeepMind's math Olympiad work. I think the LLM portion
                | was only used for translating from informal to formal
                | statements during training; in the final evaluation they
                | still used a manual translation to a formal description.
                | The solver itself was transformer-based and RL-trained,
                | and I think it did not start from any language base, yet
                | it was able to learn a distribution helpful for solving
                | the problems with just RL, a verifier, and light
                | scaffolding of the tree search.
        
             | GloamingNiblets wrote:
             | Thanks for sharing. I had trouble reading the transcript,
             | so here is Claude's cleaned up version and summary:
             | 
              | This is the last thing I want to highlight: the section on
              | why RL works. Here they evaluate
             | different things - they evaluate specifically pass at K and
             | maj at K. Maj at K is like majority voting, so what you do
             | is you have a model, you have a question, and you output
             | not just one output but an ordered set. So you give your
             | top 20 answers - 0 is your best answer that the model wants
             | to give most, then the second most answer, third most
             | answer, and so on. They could all be correct, just
             | different reformulations of the same answer or different
             | derivations stated in different ways. What you're
             | interested in is how many of the top K results are correct
             | - that's the pass at K. And if you had to vote if majority
             | voting on the top K, how often would you be correct then?
             | There's a slight difference, and that slight difference is
             | actually made more drastic by reinforcement learning. They
             | say, "As shown in figure 7, reinforcement learning enhances
             | majority at K performance but not pass at K." These
             | findings indicate that reinforcement learning enhances the
             | model's overall performance by rendering the output
             | distribution more robust. In other words, it seems that the
             | improvement is attributed to boosting the correct response
             | from Top K rather than the enhancement of fundamental
             | capabilities. This is something we've come to learn in many
             | different ways from reinforcement learning on language
             | models or even supervised fine-tuning - what's happening
             | most likely is that the capabilities of doing all of these
             | things are already present in the underlying pre-trained
              | language model.
              | 
              | Summary: Reinforcement learning improves language model
              | performance not by enhancing fundamental capabilities but
              | by making the output distribution more robust, effectively
              | boosting correct responses within the top results rather
              | than improving the model's inherent abilities.
        
         | cma wrote:
         | I'm pretty sure RL causes catastrophic forgetting of its base
         | knowledge and that's why o3 hallucinates so much more.
         | 
          | If you mess around with trained weights you're going to
          | delete some base knowledge, at least the knowledge that is
          | outside of the tasks you RL on.
        
           | kadushka wrote:
           | Hallucinations usually happen when a model never knew the
           | answer, not when it forgot something.
        
             | cma wrote:
              | I think this is definitely not true of catastrophic
              | forgetting from finetuning. And with other related types
              | of forgetting, such as from model abliteration, there are
              | often extreme increases in hallucination.
             | 
              | The InstructGPT paper also showed that RLHF made
              | hallucination worse (though with more user data rejecting
              | common hallucinations, instruction tuning and RLHF may
              | reduce the specific hallucinations users flag).
             | 
              | Some mention of that here:
              | https://huyenchip.com/2023/05/02/rlhf.html#rlhf_and_hallucin...
        
               | kadushka wrote:
               | RL might be making hallucinations worse, that's true. Why
               | do you think RL is causing catastrophic forgetting? Are
               | there factual knowledge benchmarks showing it for o3 or
               | o4-mini?
        
           | riku_iki wrote:
            | A solution could be to mix RL training with foundational
            | knowledge training, so the LLM can refresh its memory and
            | not forget things.
        
       | macleginn wrote:
       | 'Crucially, all correct solutions from RL-trained models already
       | exist in the base model's distribution, proving RLVR enhances
       | sampling efficiency, not reasoning capacity, while inadvertently
        | shrinking the solution space.' -- wouldn't any kind of RL fail
        | to converge, or even progress at all, if the solution weren't to
        | be found in the base model's distribution? The way training is
        | set up, the models absolutely need to be able to find the right
        | solutions in a reasonable time, otherwise there wouldn't be any
        | training signal.
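        | 
        | A toy illustration (assuming GRPO-style group-normalized
        | rewards, which is an assumption about the setup, not something
        | the submission states): if no sampled solution is correct, every
        | advantage is zero and there is no gradient signal at all.
        | 
        |     import statistics
        | 
        |     def group_advantages(rewards):
        |         # advantage = (reward - group mean) / group std
        |         mu = statistics.mean(rewards)
        |         sd = statistics.pstdev(rewards) or 1e-6
        |         return [(r - mu) / sd for r in rewards]
        | 
        |     print(group_advantages([0, 0, 1, 0]))  # non-zero: signal
        |     print(group_advantages([0, 0, 0, 0]))  # all zero: no signal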
        
         | psb217 wrote:
         | That depends a bit on the length of the RL training and the
         | distribution of problems you're training on. You're correct
         | that RL won't get any "traction" (via positive rewards) on
         | problems where good behavior isn't already in the model's
         | behavior distribution.
         | 
         | However, if you're training on many problems, it's possible in
         | principle that if you have traction on _any_ of the problems,
         | then the learning signal you get from success on those problems
         | will have a positive effect on the model's behavior on other
         | problems. Ie, the learning that you do on problems where the
         | model is already producing positive reward behavior will nudge
         | the model towards producing positive reward behavior on
         | problems where it wasn't previously doing so.
        
           | macleginn wrote:
           | This is an interesting scenario: do you know of any
           | documented examples?
        
             | psb217 wrote:
             | Offhand, I don't know any specific examples for LLMs. In
             | general though, if you google something like "automated
             | curriculum design for reinforcement learning", you should
             | find some relevant references.
             | 
             | Some straightforward scenarios are in, eg, robotics where
             | one can design sequences of increasingly difficult
             | instances of a task like moving objects from one storage
             | bin to another. The basic idea is that the agent would have
             | no reward or learning signal if it jumped straight into the
             | full version of the task, so you let it develop competence
             | on simpler variants and gradually increase difficulty until
             | the agent can get useful learning signal on the full task.
        
       | Der_Einzige wrote:
       | This 100% tracks with my experience.
       | 
        | Also, fun fact many don't know: if you run a regular model's
        | chat template with a reasoning-tuned model, it can go back to
        | acting like the base model, with no "thinking" process.
       | 
       | "Reasoning" models are not any better than non reasoning models.
       | It's a parlor trick, and benchmarks which claimed it wasn't are
       | bad.
        
         | NitpickLawyer wrote:
          | > If you run a regular model's chat template with a
          | > reasoning-tuned model, it can go back to acting like the
          | > base model, with no "thinking" process.
         | 
         | Well, of course. They've been "fine-tuned" with specific chat
         | templates. Remove those and the fine-tune doesn't take
         | precedence anymore. That's expected behaviour I'd say.
         | 
         | > "Reasoning" models are not any better than non reasoning
         | models. It's a parlor trick, and benchmarks which claimed it
         | wasn't are bad.
         | 
         | All of them? Including the closed ones, never public? I highly
         | doubt that.
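          | 
          | On the template point, a minimal sketch with the transformers
          | API (the checkpoint is just an example of a reasoning-tuned
          | model; any model with its own chat template behaves the same):
          | 
          |     from transformers import AutoTokenizer
          | 
          |     name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
          |     tok = AutoTokenizer.from_pretrained(name)
          |     msgs = [{"role": "user", "content": "What is 17 * 23?"}]
          |     prompt = tok.apply_chat_template(
          |         msgs, tokenize=False, add_generation_prompt=True)
          |     # The rendered prompt contains the template's special
          |     # tokens; swap in another model's template and the
          |     # fine-tuned "thinking" behaviour may not be triggered.
          |     print(prompt)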
        
       | whatshisface wrote:
       | If you don't know the answer to a problem, you're not going to be
       | able to repeat sampling until it is correct. Random strings will
       | saturate all benchmarks at k=infinity if tested this way.
        
       | KTibow wrote:
        | I'm a bit skeptical of this until it's shown that the base
        | models are getting the right answers in the right ways. It could
        | be that base models are just more random: when given 200 guesses
        | out of 1000 possible answers, they tend to distribute them more
        | evenly, bringing up the pass@k number.
        
         | energy123 wrote:
         | They should try again with higher temperature on the RL model
         | to introduce more variance.
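          | 
          | For reference, temperature just rescales the logits before the
          | softmax, so higher values flatten the sampling distribution (a
          | small sketch, not tied to any particular model):
          | 
          |     import math
          | 
          |     def probs(logits, temperature):
          |         z = [l / temperature for l in logits]
          |         m = max(z)
          |         e = [math.exp(v - m) for v in z]
          |         return [v / sum(e) for v in e]
          | 
          |     logits = [2.0, 1.0, 0.2]
          |     print(probs(logits, 0.7))  # sharper: less variance
          |     print(probs(logits, 1.5))  # flatter: more variance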
        
       | imtringued wrote:
        | > Our key finding is that all reasoning paths in the RLVR model
        | are already present in the base model.
       | 
       | This is a really good observation. It means that you don't need
       | to RL the full model. You merely need to RL a few LoRAs or maybe
       | a small Mamba model appended to the final layer.
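        | 
        | If that holds, a LoRA-style update is the natural fit: freeze
        | the base weights and train only a small low-rank delta. A
        | minimal PyTorch sketch of the idea (not the paper's setup, and
        | whether LoRA-only RL matches full-model RLVR is untested here):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     class LoRALinear(nn.Module):
        |         def __init__(self, base, r=8, alpha=16):
        |             super().__init__()
        |             self.base = base
        |             for p in self.base.parameters():
        |                 p.requires_grad_(False)  # keep base knowledge
        |             self.A = nn.Parameter(
        |                 0.01 * torch.randn(r, base.in_features))
        |             self.B = nn.Parameter(
        |                 torch.zeros(base.out_features, r))
        |             self.scale = alpha / r
        | 
        |         def forward(self, x):
        |             delta = x @ self.A.T @ self.B.T
        |             return self.base(x) + self.scale * delta
        | 
        |     layer = LoRALinear(nn.Linear(512, 512))
        |     out = layer(torch.randn(2, 512))  # only A and B get grads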
        
       | nialv7 wrote:
       | > we uncover that RL-trained models excel at low k (e.g., pass@1)
       | but are consistently outperformed by base models at high k (e.g.,
       | pass@256).
       | 
        | This is a weak argument. I think I get what they are trying to
        | say, but let's take this to the extreme, say pass@10^10^100.
        | Just as a group of monkeys could write Shakespeare given enough
        | time, a completely random model could probably outperform an
        | RL-trained model at pass@10^10^100. Would we then say the random
        | model can reason too?
       | 
       | Of course the correct reasoning trace will be in the base model's
       | distribution, just like any other well-formed, coherent
       | paragraph. Kind of makes me think, maybe sampling efficiency _is_
       | intelligence?
        
         | Certhas wrote:
          | If this were just the effect you mention, you would not expect
          | the base model to surpass the RL model, though. Plus, their
          | values of k are much smaller than that.
         | 
         | I think it's a very interesting and meaningful study.
        
         | seertaak wrote:
          | The authors of the paper address this argument in the Q&A
          | section.
        
       | iceman_w wrote:
       | RL constrains the space of possible output token sequences to
       | what is likely to lead to the correct answer. So we are
       | inherently making a trade-off to reduce variance. A non-RL model
       | will have higher variance, so given enough attempts, it will come
       | up with some correct answers that an RL model can't.
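        | 
        | A toy simulation of that trade-off with made-up numbers (not
        | data from the paper): the base model has a small chance on every
        | problem, while the RL model is confident on most but has
        | collapsed to zero probability on a few.
        | 
        |     def pass_at_k(p, k):
        |         # chance at least one of k i.i.d. samples is correct
        |         return 1 - (1 - p) ** k
        | 
        |     base = [0.03] * 10             # flat, never zero
        |     rl   = [0.60] * 7 + [0.0] * 3  # sharp, sometimes collapsed
        | 
        |     for k in (1, 32, 1024):
        |         b = sum(pass_at_k(p, k) for p in base) / len(base)
        |         r = sum(pass_at_k(p, k) for p in rl) / len(rl)
        |         print(k, round(b, 2), round(r, 2))
        |     # k=1: RL wins; k=1024: base passes everything, RL caps at 0.7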
        
       ___________________________________________________________________
       (page generated 2025-04-22 23:01 UTC)