[HN Gopher] Show HN: Llama-8B Teaches Itself Baby Steps to Deep ...
___________________________________________________________________
Show HN: Llama-8B Teaches Itself Baby Steps to Deep Research Using
RL
I've been tinkering with getting Llama-8B to bootstrap its own
research skills through self-play. The model generates questions
about documents, searches for answers, and then learns from its own
successes and failures through RL (a hacked-up version of Unsloth's
GRPO code). Starting from just 23% accuracy on Apollo 13 mission
report questions, it hit 53% after less than an hour of training.
Everything runs locally using open-source models. It's cool to see
the model go from completely botching search queries to iteratively
researching its way to the right answer.
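For the curious, the core GRPO step is roughly: sample a group of
rollouts for the same question, score each against the reference
answer, and weight the update by each rollout's group-relative
advantage. A minimal sketch (helper names like run_rollout are
illustrative stand-ins, not the repo's actual code):

    import statistics

    def exact_match_reward(answer: str, reference: str) -> float:
        # 1.0 if the final answer matches the reference, else 0.0
        return float(answer.strip().lower() == reference.strip().lower())

    def grpo_advantages(rewards: list[float]) -> list[float]:
        # GRPO: each rollout's advantage is its reward normalized
        # against the other rollouts sampled for the *same* question.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
        return [(r - mean) / std for r in rewards]

    def run_rollout(question: str) -> tuple[str, str]:
        # Illustrative stand-in: the real rollout lets the model issue
        # search queries over the document before answering.
        return "<search>...</search>\n<answer>stub</answer>", "stub"

    def grpo_step(question: str, reference: str, group_size: int = 8):
        rollouts = [run_rollout(question) for _ in range(group_size)]
        rewards = [exact_match_reward(a, reference) for _, a in rollouts]
        advantages = grpo_advantages(rewards)
        # A real run feeds (transcript, advantage) pairs into the
        # policy-gradient update (e.g. Unsloth / TRL's GRPO trainer).
        return list(zip(rollouts, advantages))

With a binary reward like this, the rollouts that get the answer
right while most of their group misses receive the largest positive
advantage, which is where the training signal comes from.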
Author : diegocaples
Score : 10 points
Date : 2025-03-10 16:05 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mdp2021 wrote:
| It seems like a very well-executed project. Do I understand
| correctly that this also works as an engine to optimize an
| interface over a body of knowledge (dataset) that you supply?
|
| Questions:
|
| -- Does training over one body of data carry over into better
| performance on subsequent bodies of data, since you should also be
| training meta-skills?
|
| -- Your benchmark showed growth from 23% to 53% after an hour:
| what happens with further training? If it plateaus, why?
| diegocaples wrote:
| Thanks! This is more of an engine to optimize an *LLM to use* an
| interface over a dataset. End-to-end reinforcement learning of
| entire agent pipelines (sketched below) will be an important way
| to increase their reliability.
|
| I haven't tried switching the dataset, but I am fairly certain
| the LLM is learning meta-skills. The majority of what the model
| learns seems to be behaving in a more reasonable way and no longer
| hallucinating or misusing tools, rather than memorizing the data
| in the body of knowledge.
|
| During the first hour of training, Llama picks up most of the
| low-hanging fruit (it stops messing up function calls and stops
| hallucinating), so learning slows down after that.
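|
| To make "optimizing the LLM to use an interface" concrete: each
| rollout is a multi-turn loop in which the model either issues a
| search query against the document or commits to a final answer,
| and the whole trajectory is what gets rewarded. Rough sketch (the
| tag format and helper names are illustrative, not the actual
| code):
|
|     def research_rollout(model, question, search_fn, max_turns=6):
|         # The model alternates <search>query</search> turns with a
|         # final <answer>...</answer>; the finished trajectory is
|         # what gets scored against the reference answer.
|         transcript = f"Question: {question}\n"
|         for _ in range(max_turns):
|             step = model(transcript)      # next model message
|             transcript += step
|             if "<answer>" in step:
|                 return transcript         # done: reward this rollout
|             if "<search>" in step:
|                 query = step.split("<search>", 1)[1].split("</search>", 1)[0]
|                 transcript += f"\nResults: {search_fn(query)}\n"
|             else:
|                 break                     # malformed turn, no tool call
|         return transcript                 # ran out of turns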
___________________________________________________________________
(page generated 2025-03-10 23:01 UTC)