[HN Gopher] Fine-tuned small LLMs can beat large ones with progr...
___________________________________________________________________
Fine-tuned small LLMs can beat large ones with programmatic data
curation
Author : GabrielBianconi
Score : 33 points
Date : 2025-08-04 15:55 UTC (7 hours ago)
(HTM) web link (www.tensorzero.com)
(TXT) w3m dump (www.tensorzero.com)
| alchemist1e9 wrote:
| I've been thinking about curating primary sources themselves and
| then using those for fine-tuning.
|
 | Has anyone gone that route and knows of projects with very
 | high-quality curated source materials? Ideally categorized
 | and labeled.
| k8si wrote:
| Maybe this is a nitpick but CoNLL NER is not a "challenging
| task". Even pre-LLM systems were getting >90 F1 on that as far
| back as 2016.
|
| Also, just in case people want to lit review further on this
| topic: they call their method "programmatic data curation" but I
| believe this approach is also called model distillation and/or
| student-teacher training.
| GabrielBianconi wrote:
| Thanks for the feedback!
|
| We chose a set of tasks with different levels of complexity to
| see how this approach would scale. For LLMs, the "challenge"
| with NER is not the task itself but the arbitrariness of the
| labels in the dataset. I agree it's still much simpler than the
| other tasks we present (agentic RAG, agentic tool use, maze
| navigation).
|
| There are definitely strong parallels to model distillation and
| student-teacher training, with the primary difference being
| that we don't simply take all the data from the larger model
| but rather filter the dataset based on metrics from the
| environment. In the "Does curation even matter?" section, we
| show that this generally improves the result by a good margin.
|
| We link to Vicuna, which might be the closest reference as
| prior art: https://lmsys.org/blog/2023-03-30-vicuna/
|
| Thanks!
| mwigdahl wrote:
| Is this just distillation but with a step to filter out low-
| quality responses first?
| GabrielBianconi wrote:
| AFAIK, distillation typically refers to tuning on the logits of
| the larger model, so you wouldn't be able to do that with fine-
| tuning APIs (OpenAI + Google in our blog post). We fine-tune on
| the outputs themselves.
|
| But broadly speaking, yes, we generate data using a large
| model, curate the best samples using metrics from the
| environment, and fine-tune on that data. This isn't a novel
| technique from an academic perspective; our focus is on
| applying it to different use cases (e.g. agentic RAG, agentic
| tool use) and models (OpenAI, Google, Qwen).
|
| Thanks!
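 | [Editor's note: a minimal sketch of the curation loop described
 | above. The function names are illustrative, not TensorZero's
 | actual API: generate candidates with a large "teacher" model,
 | score each with a metric from the task environment, keep the
 | best fraction, and fine-tune the small model on the survivors.]

```python
def curate(samples, score_fn, keep_fraction=0.2):
    """Keep the best-scoring fraction of (prompt, output) pairs."""
    scored = sorted(samples, key=score_fn, reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]

# Toy stand-ins for the real components:
def teacher_generate(prompt):
    return f"answer to {prompt}"      # would call the large model

def environment_score(sample):
    prompt, output = sample
    return len(output)                # would be a task metric (e.g. exact match)

prompts = [f"task {i}" for i in range(10)]
samples = [(p, teacher_generate(p)) for p in prompts]
training_set = curate(samples, environment_score, keep_fraction=0.3)
# `training_set` would then be uploaded to a fine-tuning API
# (OpenAI, Google) or used locally to tune a small model like Qwen.
```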
| mwigdahl wrote:
| Thanks for the explanation and the clarification on
| terminology! I've used a similar approach myself and it
| sounded like you were doing something similar.
| littlestymaar wrote:
| > AFAIK, distillation typically refers to tuning on the
| logits of the larger model
|
| I think this is called "logit distillation" which is a
| particular form of distillation but not the only one.
|
| > so you wouldn't be able to do that with fine-tuning APIs
| (OpenAI + Google in our blog post)
|
 | Distillation from competitors' APIs is so common it has
 | been given a name: it's called " _distealing_ ".
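 | [Editor's note: a toy illustration of the "logit distillation"
 | variant discussed above, with no real models involved: the
 | student is trained to match the teacher's full output
 | distribution (softmax over logits) via KL divergence, rather
 | than just imitating sampled text. Fine-tuning APIs only accept
 | text, which is why this per-token signal is unavailable there.]

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Per-token loss the student would minimize (toy logits over a
# 3-word vocabulary); it is zero only when the distributions match.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```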
| 6510 wrote:
| Noob question: Would it be possible to train a small model for a
| single prompt?
| GabrielBianconi wrote:
| With supervised fine-tuning (SFT), you'll often see good
| results with 100-1000+ datapoints (they can be variations of
| the same prompt template). If you have more limited data,
| reinforcement fine-tuning (RFT) can work well in the 10-100
| range.
|
| Good luck!
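 | [Editor's note: a small sketch of preparing SFT data from
 | variations of one prompt template, in the JSONL chat format
 | used by OpenAI-style fine-tuning APIs. The template and answer
 | pairs are placeholders.]

```python
import json

template = "Extract the entities from: {text}"
pairs = [
    ("Alice met Bob.", "Alice; Bob"),
    ("Carol works at Acme.", "Carol; Acme"),
]

lines = []
for text, answer in pairs:
    record = {"messages": [
        {"role": "user", "content": template.format(text=text)},
        {"role": "assistant", "content": answer},
    ]}
    lines.append(json.dumps(record))

jsonl = "\n".join(lines)  # would be written to a .jsonl file and uploaded
```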
___________________________________________________________________
(page generated 2025-08-04 23:01 UTC)