[HN Gopher] Fine-tuned small LLMs can beat large ones with programmatic data curation
       ___________________________________________________________________
        
       Fine-tuned small LLMs can beat large ones with programmatic data
       curation
        
       Author : GabrielBianconi
       Score  : 33 points
       Date   : 2025-08-04 15:55 UTC (7 hours ago)
        
 (HTM) web link (www.tensorzero.com)
 (TXT) w3m dump (www.tensorzero.com)
        
       | alchemist1e9 wrote:
       | I've been thinking about curating primary sources themselves and
       | then using those for fine-tuning.
       | 
       | Has anyone gone that route and know of projects with very
       | high-quality curated source materials? Ideally categorized and
       | labeled.
        
       | k8si wrote:
       | Maybe this is a nitpick, but CoNLL NER is not a "challenging
       | task". Even pre-LLM systems were getting >90 F1 on that as far
       | back as 2016.
       | 
       | Also, just in case people want to lit review further on this
       | topic: they call their method "programmatic data curation" but I
       | believe this approach is also called model distillation and/or
       | student-teacher training.
        
         | GabrielBianconi wrote:
         | Thanks for the feedback!
         | 
         | We chose a set of tasks with different levels of complexity to
         | see how this approach would scale. For LLMs, the "challenge"
         | with NER is not the task itself but the arbitrariness of the
         | labels in the dataset. I agree it's still much simpler than the
         | other tasks we present (agentic RAG, agentic tool use, maze
         | navigation).
         | 
         | There are definitely strong parallels to model distillation and
         | student-teacher training, with the primary difference being
         | that we don't simply take all the data from the larger model
         | but rather filter the dataset based on metrics from the
         | environment. In the "Does curation even matter?" section, we
         | show that this generally improves the result by a good margin.
         | 
         | We link to Vicuna, which might be the closest reference as
         | prior art: https://lmsys.org/blog/2023-03-30-vicuna/
         | 
         | Thanks!
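
        A minimal sketch of the curation loop described above, assuming
        hypothetical helper names (generate, score) rather than
        TensorZero's actual API:

            from dataclasses import dataclass
            from typing import Callable

            @dataclass
            class Task:
                prompt: str
                expected: str  # reference the environment metric checks against

            def curate_dataset(
                tasks: list[Task],
                generate: Callable[[str], str],       # large "teacher" model
                score: Callable[[Task, str], float],  # metric from the environment
                threshold: float = 0.9,
            ) -> list[dict]:
                """Generate with the large model, then keep only the
                samples whose environment metric clears the threshold."""
                curated = []
                for task in tasks:
                    output = generate(task.prompt)
                    if score(task, output) >= threshold:
                        curated.append({"prompt": task.prompt,
                                        "completion": output})
                return curated

        The curated pairs then serve as ordinary supervised fine-tuning
        data for the small model.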
        
       | mwigdahl wrote:
       | Is this just distillation but with a step to filter out low-
       | quality responses first?
        
         | GabrielBianconi wrote:
         | AFAIK, distillation typically refers to tuning on the logits of
         | the larger model, so you wouldn't be able to do that with fine-
         | tuning APIs (OpenAI + Google in our blog post). We fine-tune on
         | the outputs themselves.
         | 
         | But broadly speaking, yes, we generate data using a large
         | model, curate the best samples using metrics from the
         | environment, and fine-tune on that data. This isn't a novel
         | technique from an academic perspective; our focus is on
         | applying it to different use cases (e.g. agentic RAG, agentic
         | tool use) and models (OpenAI, Google, Qwen).
         | 
         | Thanks!
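
        For contrast, a sketch of classic logit distillation (Hinton et
        al., 2015), which requires access to the teacher's logits and
        therefore open weights; hosted fine-tuning APIs only accept
        text, which is why the post fine-tunes on outputs instead:

            import torch.nn.functional as F

            def distillation_loss(student_logits, teacher_logits, T=2.0):
                """KL divergence between temperature-softened distributions."""
                s = F.log_softmax(student_logits / T, dim=-1)
                t = F.softmax(teacher_logits / T, dim=-1)
                # scale by T^2 so gradients stay comparable across temperatures
                return F.kl_div(s, t, reduction="batchmean") * T ** 2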
        
           | mwigdahl wrote:
           | Thanks for the explanation and the clarification on
           | terminology! I've used a similar approach myself and it
           | sounded like you were doing something similar.
        
           | littlestymaar wrote:
           | > AFAIK, distillation typically refers to tuning on the
           | logits of the larger model
           | 
           | I think this is called "logit distillation" which is a
           | particular form of distillation but not the only one.
           | 
           | > so you wouldn't be able to do that with fine-tuning APIs
           | (OpenAI + Google in our blog post)
           | 
            | Distillation from competitors' APIs is so common that it
            | has been given a name: it's called "_distealing_".
        
       | 6510 wrote:
       | Noob question: Would it be possible to train a small model for a
       | single prompt?
        
         | GabrielBianconi wrote:
         | With supervised fine-tuning (SFT), you'll often see good
         | results with 100-1000+ datapoints (they can be variations of
         | the same prompt template). If you have more limited data,
         | reinforcement fine-tuning (RFT) can work well in the 10-100
         | range.
         | 
         | Good luck!
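
        A sketch of what "variations of the same prompt template" can
        look like as SFT data, written to the JSONL chat format that
        OpenAI's fine-tuning API accepts (illustrative NER-style task;
        adapt the fields to your own):

            import json

            TEMPLATE = "Extract all person names from this text:\n\n{text}"

            examples = [
                {"text": "Ada Lovelace met Charles Babbage.",
                 "answer": "Ada Lovelace; Charles Babbage"},
                # ... 100-1000+ variations of the same template
            ]

            with open("sft_data.jsonl", "w") as f:
                for ex in examples:
                    record = {"messages": [
                        {"role": "user",
                         "content": TEMPLATE.format(text=ex["text"])},
                        {"role": "assistant", "content": ex["answer"]},
                    ]}
                    f.write(json.dumps(record) + "\n")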
        
       ___________________________________________________________________
       (page generated 2025-08-04 23:01 UTC)