[HN Gopher] Supervised fine tuning on curated data is reinforcem...
       ___________________________________________________________________
        
       Supervised fine tuning on curated data is reinforcement learning
        
       Author : GabrielBianconi
       Score  : 21 points
       Date   : 2025-07-29 20:18 UTC (2 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | mandevil wrote:
       | Interesting to see two independent researchers on this. Makes me
       | curious as to what the back-story is? Side project?
        
         | babelfish wrote:
         | Especially interesting given they both work for Google
         | DeepMind.
        
         | GabrielBianconi wrote:
         | Yeah, I hadn't noticed!
        
         | jtspringenberg wrote:
         | Author here, just to clarify: we are both no longer working for
         | DeepMind. This was purely an independent effort for the sake of
         | research and understanding! Happy to answer any questions.
        
       | iandanforth wrote:
       | How is this kind of analogy helpful? You can frame _any_
       | optimization problem as RL if you try hard enough. RL is a method
        | of optimization which calls the optimum "reward maximization".
       | You can craft the reward function any which way you want.
       | 
       | The key point about RL is that it is a sequential decision making
       | process. If you don't have something (an agent) making multiple
       | decisions over time while interacting with an environment, then
       | why bother calling it RL?
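        | 
        | (To make the "you can frame any optimization as RL" point
        | concrete, a minimal sketch with a toy linear "policy" head,
        | nothing from the paper: on a fixed batch of curated samples, a
        | REINFORCE-style gradient with the reward fixed at 1 is exactly
        | the ordinary supervised cross-entropy gradient.)
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     torch.manual_seed(0)
        |     vocab, dim = 10, 8
        |     head = torch.nn.Linear(dim, vocab)  # toy "policy" head
        |     x = torch.randn(4, dim)             # 4 curated contexts
        |     y = torch.randint(0, vocab, (4,))   # 4 curated target tokens
        | 
        |     # SFT: mean negative log-likelihood of the curated tokens.
        |     sft_loss = F.cross_entropy(head(x), y)
        |     sft_grad = torch.autograd.grad(sft_loss, head.weight)[0]
        | 
        |     # REINFORCE with a constant reward of 1 on the same samples.
        |     logp = F.log_softmax(head(x), dim=-1)
        |     reward = torch.ones(4)
        |     rl_loss = -(reward * logp[torch.arange(4), y]).mean()
        |     rl_grad = torch.autograd.grad(rl_loss, head.weight)[0]
        | 
        |     print(torch.allclose(sft_grad, rl_grad))  # True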
        
       | anndvision wrote:
       | We recently ran similar experiments and saw that fine-tuning
       | small models on automatically curated high-quality outputs from a
       | large model can beat large-model performance while reducing
       | inference costs by up to 30x and inference time by up to 4x.
       | 
       | We benchmarked closed-source (OpenAI, Google) and open-source
       | (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG
       | (Multi-Hop), and agentic tool use (t-bench).
       | 
       | We're still running a few experiments and plan to update the post
       | with additional results in a few days.
       | 
       | Looking forward to trying out importance weighting soon!
       | 
       | Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x
       | Lower Cost: https://www.tensorzero.com/blog/curated-behavior-
       | cloning-sma...
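        | 
        | (Roughly, the curation step looks like the sketch below; the
        | names `generate`, `passes_check`, and `finetune_small_model`
        | are placeholders for illustration, not actual APIs from the
        | blog post or the paper.)
        | 
        |     def curate_dataset(prompts, generate, passes_check,
        |                        samples_per_prompt=4):
        |         """Keep only large-model outputs that pass a task check
        |         (e.g. maze solved, valid tool call)."""
        |         curated = []
        |         for prompt in prompts:
        |             for _ in range(samples_per_prompt):
        |                 output = generate(prompt)
        |                 if passes_check(prompt, output):
        |                     curated.append({"prompt": prompt,
        |                                     "completion": output})
        |         return curated
        | 
        |     # The curated pairs then serve as ordinary SFT data for the
        |     # small model, e.g. finetune_small_model(curated).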
        
         | chongliqin wrote:
         | Cool! If you are interested, we have open sourced our code:
         | https://github.com/emmyqin/iw_sft
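          | 
          | (For context, a stripped-down sketch of the general idea, not
          | the full implementation in the repo: reweight each curated
          | sequence's log-likelihood by a clipped importance ratio.
          | Names and details here are simplified for illustration.)
          | 
          |     import torch
          | 
          |     def iw_sft_loss(logp_policy, logp_behavior, clip=5.0):
          |         # Per-sequence importance weight, clipped for
          |         # stability and detached so it only rescales the
          |         # usual negative log-likelihood term.
          |         w = torch.exp(logp_policy - logp_behavior)
          |         w = w.clamp(max=clip).detach()
          |         return -(w * logp_policy).mean()
          | 
          |     # Example with dummy per-sequence log-probabilities:
          |     lp = torch.tensor([-12.3, -8.1], requires_grad=True)
          |     lb = torch.tensor([-11.9, -9.0])
          |     iw_sft_loss(lp, lb).backward()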
        
       | henriquegodoy wrote:
        | It's cool to see the perspective that many problems (certain
        | kinds of communication problems; think lawyers, compliance, etc.)
       | can be solved by treating AI less as agents and more as modular
       | components within a larger system. Once we build a working
       | process--monitored through evals--we can then reduce costs by
       | distilling these modules. That means starting with
       | superintelligent models and later distilling them down to just a
       | few billion parameters, instead of needing hundreds of billions.
        
       ___________________________________________________________________
       (page generated 2025-07-29 23:00 UTC)