[HN Gopher] Supervised fine tuning on curated data is reinforcem...
___________________________________________________________________
Supervised fine tuning on curated data is reinforcement learning
Author : GabrielBianconi
Score : 21 points
Date : 2025-07-29 20:18 UTC (2 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| mandevil wrote:
| Interesting to see two independent researchers on this. Makes me
| curious what the back-story is. Side project?
| babelfish wrote:
| Especially interesting given they both work for Google
| DeepMind.
| GabrielBianconi wrote:
| Yeah, I hadn't noticed!
| jtspringenberg wrote:
| Author here, just to clarify: we are both no longer working for
| DeepMind. This was purely an independent effort for the sake of
| research and understanding! Happy to answer any questions.
| iandanforth wrote:
| How is this kind of analogy helpful? You can frame _any_
| optimization problem as RL if you try hard enough: RL is an
| optimization method that labels its objective "reward
| maximization", and you can craft the reward function any way
| you want.
|
| The key point about RL is that it is a sequential decision making
| process. If you don't have something (an agent) making multiple
| decisions over time while interacting with an environment, then
| why bother calling it RL?
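One way to make the analogy in the title concrete (a toy sketch under a one-step "bandit" reading, not the paper's exact derivation; all function names here are illustrative): the gradient of the SFT cross-entropy loss on a curated example is identical to a one-step REINFORCE policy-gradient update with reward 1, so curation acts as a binary reward.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_grad(logits, target):
    # Gradient of the cross-entropy loss -log p(target) w.r.t. the
    # logits: p - onehot(target).
    p = softmax(logits)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

def reinforce_grad(logits, action, reward):
    # One-step REINFORCE in loss-gradient form:
    # -reward * d/dlogits log p(action) = reward * (p - onehot(action)).
    p = softmax(logits)
    return [reward * (pi - (1.0 if i == action else 0.0))
            for i, pi in enumerate(p)]

# Toy "policy": logits over a 4-token vocabulary; token 2 was curated.
logits = [0.5, -1.0, 2.0, 0.1]
g_sft = sft_grad(logits, target=2)
g_rl = reinforce_grad(logits, action=2, reward=1.0)
print(max(abs(a - b) for a, b in zip(g_sft, g_rl)))  # 0.0
```

With reward 1 on kept examples and reward 0 on discarded ones, the two updates coincide term by term, which is the sense in which filtering data "is" a reward function.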
| anndvision wrote:
| We recently ran similar experiments and saw that fine-tuning
| small models on automatically curated high-quality outputs from a
| large model can beat large-model performance while reducing
| inference costs by up to 30x and inference time by up to 4x.
|
| We benchmarked closed-source (OpenAI, Google) and open-source
| (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG
| (Multi-Hop), and agentic tool use (t-bench).
|
| We're still running a few experiments and plan to update the post
| with additional results in a few days.
|
| Looking forward to trying out importance weighting soon!
|
| Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x
| Lower Cost: https://www.tensorzero.com/blog/curated-behavior-
| cloning-sma...
| chongliqin wrote:
| Cool! If you are interested, we have open sourced our code:
| https://github.com/emmyqin/iw_sft
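For readers skimming the thread, the importance weighting mentioned above can be sketched roughly as follows (a toy scalar version under assumed notation, not the exact estimator in the linked iw_sft code): each curated token's negative log-likelihood is weighted by the ratio of the trained policy's probability to the behavior policy's probability.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def iw_sft_loss(policy_logits, behavior_probs, targets):
    """Importance-weighted SFT loss over a token sequence (toy sketch).
    policy_logits: per-token logits from the model being trained.
    behavior_probs: probability the data-generating (behavior) policy
    assigned to each curated token.
    In a real implementation the p/q weight would be treated as a
    constant (stop-gradient); here we only compute the scalar loss."""
    total = 0.0
    for logits, q, t in zip(policy_logits, behavior_probs, targets):
        p = softmax(logits)[t]
        w = p / q  # importance weight between trained and behavior policy
        total += -w * math.log(p)
    return total / len(targets)

# One token, uniform two-way policy, behavior prob 0.5: the weight is 1,
# so the loss reduces to plain cross-entropy, log(2).
print(iw_sft_loss([[0.0, 0.0]], [0.5], [0]))
```

When the trained policy matches the behavior policy the weights are 1 and this collapses to ordinary SFT, which is why the correction only matters off-distribution.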
| henriquegodoy wrote:
| It's cool to see the perspective that many problems (certain
| kinds of communication problems: think lawyers, compliance,
| etc.) can be solved by treating AI less as agents and more as
| modular components within a larger system. Once we build a
| working process, monitored through evals, we can then reduce
| costs by distilling these modules. That means starting with
| superintelligent models and later distilling them down to just
| a few billion parameters, instead of needing hundreds of
| billions.
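The curation step in that eval-gated distillation workflow can be sketched in a few lines (a hypothetical toy: the `score` field stands in for whatever eval or judge the pipeline actually uses):

```python
# Filter teacher (prompt, output) pairs by an eval score; the survivors
# become the supervised fine-tuning set for the small student model.

def curate(samples, threshold):
    """samples: (prompt, output, eval_score) triples from the large model.
    Returns the (prompt, output) pairs whose score clears the bar."""
    return [(prompt, output)
            for prompt, output, score in samples
            if score >= threshold]

samples = [
    ("summarize contract A", "Summary A ...", 0.9),
    ("summarize contract B", "Summary B ...", 0.4),  # fails the eval
    ("check clause 12", "Clause 12 is compliant ...", 0.8),
]
dataset = curate(samples, threshold=0.5)
print(len(dataset))  # 2
```

The threshold (or a top-k variant) is the knob that trades dataset size against quality before the distillation run.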
___________________________________________________________________
(page generated 2025-07-29 23:00 UTC)