[HN Gopher] Does RL Incentivize Reasoning in LLMs Beyond the Bas...
___________________________________________________________________
Does RL Incentivize Reasoning in LLMs Beyond the Base Model?
Author : leodriesch
Score : 71 points
Date : 2025-04-22 10:24 UTC (12 hours ago)
(HTM) web link (limit-of-rlvr.github.io)
(TXT) w3m dump (limit-of-rlvr.github.io)
| yorwba wrote:
| They write "We manually inspect CoT validity to ensure correct
| answers stem from valid reasoning, not lucky guesses." but the
| example answer they show at the end only gets the correct number
| due to two errors canceling out. The model calculates
| 195+367+562+900 and gets 1924 instead of 2024, and also turns
| -437 - 2*234 into -805 instead of -905, but in total 1924-805 =
| 2024-905 = 1119 and from there the remaining steps are correct
| again.
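|
| A quick Python check (not from the paper) that the two slips
| cancel exactly:
|
|     a = 195 + 367 + 562 + 900  # = 2024; model wrote 1924 (-100)
|     b = -437 - 2 * 234         # = -905; model wrote -805 (+100)
|     print(a + b)               # 1119, same as the model's 1924 - 805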
|
| It would be interesting to know how much of the sampling
| efficiency improvement from reinforcement learning is due to
| being better at basic arithmetic (something which could also be
| achieved by giving the model access to a calculator tool) and how
| much is due to choosing the correct approach for solving the
| problem more often.
| spwa4 wrote:
| I don't like papers that ask a question in the title, so here's
| the answer:
|
| "RL boosts sampling efficiency but reduces the reasoning capacity
| boundary."
|
| Perhaps better to put it like this: given one or a few attempts,
| RL-trained models beat non-RL models. Given many attempts, non-RL
| models come up with better answers.
| sitkack wrote:
| My gut feeling when using DeepSeek is that its performance is a
| lot smoother; the responses feel more robust and less brittle.
| cma wrote:
| At least with DeepSeek Math (trained with the same RL technique
| as the later R1), they noted similar things in their paper in
| the "Why RL Works?" section. In his video review of the DeepSeek
| Math paper, Yannic Kilcher goes over that section starting
| around the 1:04:00 mark and points to basically the same
| limitations as the HN submission paper. The segment ends with
| this passage (from about 1:05:40):
|
|     "The improvement is attributed to boosting the correct
|     response from top-k rather than the enhancement of
|     fundamental capabilities. This is something that we've
|     come to learn in a lot of different ways from
|     reinforcement learning on language models, or even
|     supervised fine-tuning: what's happening most likely is
|     that the capabilities of doing all of these things are
|     already present in the underlying pre-trained language
|     model."
|
| https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m
|
| from the paper:
|
| > 5.2.2. Why RL Works?
| >
| > In this paper, we conduct reinforcement learning based on a
| > subset of instruction tuning data, and it achieves significant
| > performance enhancement upon the instruction tuning model. To
| > further explain why reinforcement learning works, we evaluate
| > the Pass@K and Maj@K accuracy of the Instruct and RL models on
| > two benchmarks. As shown in Figure 7, RL enhances Maj@K's
| > performance but not Pass@K. These findings indicate that RL
| > enhances the model's overall performance by rendering the
| > output distribution more robust; in other words, it seems that
| > the improvement is attributed to boosting the correct response
| > from TopK rather than the enhancement of fundamental
| > capabilities. Similarly, Wang et al. (2023a) identified a
| > misalignment problem in reasoning tasks within the SFT model,
| > showing that the reasoning performance of SFT models can be
| > improved through a series of preference alignment strategies
| > (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).
|
| In the video he reads this as suggesting that these methods
| alone may not get us over the data wall at all and are still
| fundamentally limited by the distribution of the base model
| they augment.
| whatshisface wrote:
| I don't know a lot about this but it seems like if the
| sampling performance was adequate, external checks like
| theorem verification would work to get "over the data
| wall."
| cma wrote:
| There have already been good results there with DeepMind's math
| Olympiad work. I think the LLM portion was only used to
| translate problems from informal to formal language during
| training; for the final evaluation they still used a manual
| translation to a formal description. The solver itself was
| transformer-based and RL-trained, and I believe it didn't start
| from any language base, yet it was still able to learn a
| distribution helpful for solving the problems from RL, a
| verifier, and light scaffolding of the tree search alone.
| GloamingNiblets wrote:
| Thanks for sharing. I had trouble reading the transcript,
| so here is Claude's cleaned up version and summary:
|
| Here's the condensed and formatted transcription:
|
| This is the last thing I want to highlight: the section on why
| RL works. Here they evaluate different things - specifically
| pass@k and maj@k. Maj@k is majority voting: you have a model,
| you have a question, and you output not just one answer but an
| ordered set. So you give your top 20 answers - 0 is the answer
| the model most wants to give, then the second, the third, and
| so on. They could all be correct, just different reformulations
| of the same answer or different derivations stated in different
| ways. What you're interested in is how many of the top k
| results are correct - that's pass@k. And if you did majority
| voting on the top k, how often would you be correct? There's a
| slight difference, and that slight difference is actually made
| more drastic by reinforcement learning. They say, "As shown in
| Figure 7, reinforcement learning enhances maj@k performance but
| not pass@k." These findings indicate that reinforcement
| learning enhances the model's overall performance by rendering
| the output distribution more robust. In other words, it seems
| that the improvement is attributed to boosting the correct
| response from top k rather than the enhancement of fundamental
| capabilities. This is something we've come to learn in many
| different ways from reinforcement learning on language models
| or even supervised fine-tuning - what's happening most likely
| is that the capabilities of doing all of these things are
| already present in the underlying pre-trained language model.
|
| Summary: Reinforcement learning improves language model
| performance not by enhancing fundamental capabilities but by
| making the output distribution more robust, effectively
| boosting correct responses within the top results rather than
| improving the model's inherent abilities.
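|
| A minimal sketch of the two metrics as described (function
| names are mine; papers typically compute pass@k with an
| unbiased estimator over n >= k samples, this just shows the
| idea): pass@k asks whether any of k sampled answers is correct,
| maj@k whether the most common of the k answers is correct.
|
|     from collections import Counter
|
|     def pass_at_k(samples, correct):
|         # right if at least one of the k samples matches
|         return any(s == correct for s in samples)
|
|     def maj_at_k(samples, correct):
|         # right if the most frequent sample matches
|         top, _ = Counter(samples).most_common(1)[0]
|         return top == correct
|
|     answers = [3, 7, 7, 2, 9]             # k=5 sampled answers
|     print(pass_at_k(answers, correct=3))  # True: 3 shows up once
|     print(maj_at_k(answers, correct=3))   # False: most common is 7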
| cma wrote:
| I'm pretty sure RL causes catastrophic forgetting of its base
| knowledge and that's why o3 hallucinates so much more.
|
| If you mess around with trained weights you're going to delete
| some base knowledge, at least the knowledge that is outside of
| the tasks you RL on.
| kadushka wrote:
| Hallucinations usually happen when a model never knew the
| answer, not when it forgot something.
| cma wrote:
| I think this is definitely not true of catastrophic
| forgetting from finetuning. And with other related types of
| forgetting, such as from model abliteration, there are often
| extreme increases in hallucination.
|
| The InstructGPT paper also showed that RLHF made hallucination
| worse (though with more user data rejecting common
| hallucinations, instruction tuning and RLHF may reduce the
| specific hallucinations users flag).
|
| Some mention of that here:
| https://huyenchip.com/2023/05/02/rlhf.html#rlhf_and_hallucin...
| kadushka wrote:
| RL might be making hallucinations worse, that's true. Why
| do you think RL is causing catastrophic forgetting? Are
| there factual knowledge benchmarks showing it for o3 or
| o4-mini?
| riku_iki wrote:
| A solution could be to mix RL training with foundational
| knowledge training, so the LLM can refresh its memory and not
| forget things.
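|
| A rough sketch of that kind of interleaving, with a toy model
| and random stand-in data (the "RL" step here is only a
| schematic reward-weighted likelihood update, not a real
| PPO/GRPO loop):
|
|     import torch
|     import torch.nn as nn
|
|     # Toy stand-in for an LLM: embed a token, predict the next.
|     vocab, dim = 50, 32
|     model = nn.Sequential(nn.Embedding(vocab, dim),
|                           nn.Linear(dim, vocab))
|     opt = torch.optim.Adam(model.parameters(), lr=1e-3)
|     ce = nn.CrossEntropyLoss()
|
|     def replay_step(tokens, targets):
|         # ordinary next-token loss on pretraining-style data,
|         # to keep refreshing foundational knowledge
|         loss = ce(model(tokens), targets)
|         opt.zero_grad(); loss.backward(); opt.step()
|
|     def rl_step(tokens, targets, reward):
|         # crude REINFORCE-style update: upweight the sampled
|         # continuation in proportion to its reward
|         loss = reward * ce(model(tokens), targets)
|         opt.zero_grad(); loss.backward(); opt.step()
|
|     for step in range(100):
|         toks = torch.randint(vocab, (8,))
|         tgts = torch.randint(vocab, (8,))
|         if step % 4 == 0:  # e.g. every 4th step replays base data
|             replay_step(toks, tgts)
|         else:
|             rl_step(toks, tgts, reward=1.0)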
| macleginn wrote:
| 'Crucially, all correct solutions from RL-trained models already
| exist in the base model's distribution, proving RLVR enhances
| sampling efficiency, not reasoning capacity, while inadvertently
| shrinking the solution space.' -- wouldn't any kind of RL fail to
| converge or even progress at all if the solution weren't to be
| found in the base model distribution? The way training is set up,
| the models absolutely need to be able to find right solutions in
| a reasonable time; otherwise there wouldn't be any training
| signal.
| psb217 wrote:
| That depends a bit on the length of the RL training and the
| distribution of problems you're training on. You're correct
| that RL won't get any "traction" (via positive rewards) on
| problems where good behavior isn't already in the model's
| behavior distribution.
|
| However, if you're training on many problems, it's possible in
| principle that if you have traction on _any_ of the problems,
| then the learning signal you get from success on those problems
| will have a positive effect on the model's behavior on other
| problems. Ie, the learning that you do on problems where the
| model is already producing positive reward behavior will nudge
| the model towards producing positive reward behavior on
| problems where it wasn't previously doing so.
| macleginn wrote:
| This is an interesting scenario: do you know of any
| documented examples?
| psb217 wrote:
| Offhand, I don't know any specific examples for LLMs. In
| general though, if you google something like "automated
| curriculum design for reinforcement learning", you should
| find some relevant references.
|
| Some straightforward scenarios are in, eg, robotics where
| one can design sequences of increasingly difficult
| instances of a task like moving objects from one storage
| bin to another. The basic idea is that the agent would have
| no reward or learning signal if it jumped straight into the
| full version of the task, so you let it develop competence
| on simpler variants and gradually increase difficulty until
| the agent can get useful learning signal on the full task.
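|
| A toy sketch of such a loop (everything here, including the
| made-up "skill" variable, is purely illustrative): keep
| training at the current difficulty and promote to the next
| level once a recent window of attempts is mostly successful.
|
|     import random
|
|     def train_and_eval(level, skill):
|         # toy stand-in for an RL rollout: practice raises skill,
|         # and success gets likelier as skill approaches the level
|         skill += 0.02
|         success = random.random() < max(0.0, 1.0 - (level - skill))
|         return skill, success
|
|     level, skill, recent = 1, 0.0, []
|     while level <= 5:
|         skill, ok = train_and_eval(level, skill)
|         recent = (recent + [ok])[-50:]
|         if len(recent) == 50 and sum(recent) / 50 >= 0.8:
|             print("promoting to level", level + 1)
|             level, recent = level + 1, []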
| Der_Einzige wrote:
| This 100% tracks with my experience.
|
| Also fun stuff many don't know: if you run a regular model's
| chat template with a reasoning-tuned model, it can go back to
| acting like the base model, with no "thinking" process.
|
| "Reasoning" models are not any better than non reasoning models.
| It's a parlor trick, and benchmarks which claimed it wasn't are
| bad.
| NitpickLawyer wrote:
| > If you run a regular model's chat template with a
| reasoning-tuned model, it can go back to acting like the base
| model, with no "thinking" process.
|
| Well, of course. They've been "fine-tuned" with specific chat
| templates. Remove those and the fine-tune doesn't take
| precedence anymore. That's expected behaviour I'd say.
|
| > "Reasoning" models are not any better than non reasoning
| models. It's a parlor trick, and benchmarks which claimed it
| wasn't are bad.
|
| All of them? Including the closed ones, never public? I highly
| doubt that.
| whatshisface wrote:
| If you don't know the answer to a problem, you're not going to be
| able to repeat sampling until it is correct. Random strings will
| saturate all benchmarks at k=infinity if tested this way.
| KTibow wrote:
| I'm a bit skeptical of this until it's proven that they're
| getting the right answers in the right ways. It could be that
| base models are just more random and when given 200 guesses out
| of 1000 possible answers tend to distribute them more evenly,
| bringing up the pass@k number.
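|
| As a toy illustration of that worry (numbers invented): a
| model guessing uniformly over 1000 candidate answers already
| gets a nontrivial pass@200 by luck alone.
|
|     p_hit = 1 - (999 / 1000) ** 200
|     print(round(p_hit, 3))  # ~0.181 pass@200 from pure guessing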
| energy123 wrote:
| They should try again with higher temperature on the RL model
| to introduce more variance.
| imtringued wrote:
| >Our key finding is that all reasoning paths in the RLVR model
| are already present in the base model.
|
| This is a really good observation. It means that you don't need
| to RL the full model. You merely need to RL a few LoRAs or maybe
| a small Mamba model appended to the final layer.
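|
| For the LoRA part, a minimal sketch with Hugging Face peft
| (the model name and target modules are only examples; the
| actual RL loop, e.g. PPO/GRPO, is left out):
|
|     from transformers import AutoModelForCausalLM
|     from peft import LoraConfig, get_peft_model
|
|     base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
|     cfg = LoraConfig(r=16, lora_alpha=32,
|                      target_modules=["q_proj", "v_proj"])
|     model = get_peft_model(base, cfg)   # base weights stay frozen
|     model.print_trainable_parameters()  # only LoRA params train
|     # ...run the RL updates against `model`; only the small LoRA
|     # matrices change, so the base distribution is mostly kept
|
| Whether LoRA-only RL matches full-parameter RLVR is an open
| question, but it fits the "paths are already there" framing.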
| nialv7 wrote:
| > we uncover that RL-trained models excel at low k (e.g., pass@1)
| but are consistently outperformed by base models at high k (e.g.,
| pass@256).
|
| This is a weak argument. I think I get what they're trying to
| say, but let's take this to the extreme, say pass@10^10^100.
| Just like a group of monkeys could write Shakespeare if given
| enough time, a completely random model could probably outperform
| an RL-trained model at pass@10^10^100. Would we then say the
| random model can reason too?
|
| Of course the correct reasoning trace will be in the base model's
| distribution, just like any other well-formed, coherent
| paragraph. Kind of makes me think, maybe sampling efficiency _is_
| intelligence?
| Certhas wrote:
| If this was just the effect you mention you would not expect
| the base model to surpass the RL model though. Plus their k are
| much smaller than that.
|
| I think it's a very interesting and meaningful study.
| seertaak wrote:
| The authors of the paper address this argument in the QA
| section.
| iceman_w wrote:
| RL constrains the space of possible output token sequences to
| what is likely to lead to the correct answer. So we are
| inherently making a trade-off to reduce variance. A non-RL model
| will have higher variance, so given enough attempts, it will come
| up with some correct answers that an RL model can't.
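|
| A toy model of that trade-off (all probabilities invented):
| suppose the base model puts 20% probability on the right
| answer for every problem, while the RL model puts 90% on its
| preferred answer, which is right for 80% of problems and
| essentially never sampled correctly (0.1%) on the rest.
|
|     def pass_at_k(p, k):
|         # chance that at least one of k independent samples,
|         # each correct with probability p, is correct
|         return 1 - (1 - p) ** k
|
|     for k in (1, 256):
|         base = pass_at_k(0.20, k)
|         rl = 0.8 * pass_at_k(0.90, k) + 0.2 * pass_at_k(0.001, k)
|         print(k, round(base, 3), round(rl, 3))
|     # k=1:   base 0.2,  RL ~0.72  -> RL wins at low k
|     # k=256: base ~1.0, RL ~0.845 -> base wins at high k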
___________________________________________________________________
(page generated 2025-04-22 23:01 UTC)