[HN Gopher] Show HN: Llama-8B Teaches Itself Baby Steps to Deep ...
       ___________________________________________________________________
        
       Show HN: Llama-8B Teaches Itself Baby Steps to Deep Research Using
       RL
        
       I've been tinkering with getting Llama-8B to bootstrap its own
       research skills through self-play. The model generates questions
       about documents, searches for answers, and then learns from its own
       successes/failures through RL (hacked up Unsloth's GRPO code).
       Started with just 23% accuracy on Apollo 13 mission report
       questions and hit 53% after less than an hour of training.
       Everything runs locally using open-source models. It's cool to see
       the model go from completely botching search queries to iteratively
       researching to get the right answer.
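
The loop described above can be sketched in miniature: generate a (question, answer) pair from a document, let the agent use a search tool, and score the outcome with a scalar reward for GRPO-style updates. This is a toy illustration with made-up function names and a toy retriever, not the author's actual code.

```python
# Minimal sketch of the self-play research loop, with toy stand-ins for
# the LLM and the search tool. All names here are illustrative assumptions.

def generate_question(chunk: str) -> tuple[str, str]:
    """Stand-in for the LLM proposing a (question, gold answer) pair."""
    first_word = chunk.split()[0]
    return "What is the first word of the passage?", first_word

def search(query: str, corpus: list[str]) -> str:
    """Toy search tool: return the chunk with the most word overlap."""
    terms = set(query.lower().split())
    return max(corpus, key=lambda c: len(terms & set(c.lower().split())))

def reward(predicted: str, gold: str) -> float:
    """Binary exact-match reward, the kind of scalar signal GRPO optimizes."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

corpus = [
    "Apollo 13 launched toward the Moon in April 1970.",
    "An oxygen tank ruptured two days into the mission.",
]
question, gold = generate_question(corpus[0])
retrieved = search(gold, corpus)           # agent queries the corpus
print(reward(retrieved.split()[0], gold))  # 1.0 on this toy example
```

In the real setup the reward drives the policy update, so the model is rewarded for whole trajectories (query, retrieve, answer) that end in a correct answer, not for any single step.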
        
       Author : diegocaples
       Score  : 10 points
       Date   : 2025-03-10 16:05 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mdp2021 wrote:
        | It seems like a very well-executed project. Do I understand
        | correctly that this also works as an engine to optimize an
        | interface over a body of knowledge (dataset) you input?
       | 
       | Questions:
       | 
        | -- Does training on one body of data transfer into better
        | performance on subsequent bodies of data, since you should also
        | be training meta-skills?
       | 
        | -- Your benchmark showed growth from 23% to 53% after an hour:
        | what happens with further training? If it plateaus, why?
        
         | diegocaples wrote:
         | Thanks! This is more of an engine to optimize an *LLM to use*
         | an interface over a dataset. End-to-end reinforcement learning
         | of entire agent pipelines will be an important way to increase
         | their reliability.
         | 
          | I haven't tried switching the dataset, but I am fairly certain
          | the LLM is learning meta-skills. The majority of what the
          | model learns seems to be behaving more reasonably: it stops
          | hallucinating and misusing tools, rather than memorizing the
          | data in the body of knowledge.
         | 
          | During the first hour of training, Llama picks most of the
          | low-hanging fruit (it stops botching function calls and stops
          | hallucinating), so learning slows down after that.
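
One way the "stop messing up function calls" signal could show up as reward is through shaping: a small bonus for a well-formed tool call on top of the main correctness reward. The split below (0.2 / 1.0) and the JSON tool-call format are hypothetical choices for illustration, not the author's reward function.

```python
import json

def tool_call_reward(raw_call: str, answer_correct: bool) -> float:
    """Hypothetical shaped reward: small bonus for a parseable tool call
    (the low-hanging fruit), large reward for a correct final answer."""
    try:
        call = json.loads(raw_call)
        well_formed = isinstance(call, dict) and "name" in call and "arguments" in call
    except json.JSONDecodeError:
        well_formed = False
    score = 0.0
    if well_formed:
        score += 0.2   # formatting bonus: the model learns this first
    if answer_correct:
        score += 1.0   # main signal: exact-match on the final answer
    return score
```

Under a reward like this, the fast early gains come from the cheap formatting term, and once every rollout earns it, further improvement requires actually answering correctly, which would explain the observed slowdown.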
        
       ___________________________________________________________________
       (page generated 2025-03-10 23:01 UTC)