[HN Gopher] Reinforcement Pre-Training
       ___________________________________________________________________
        
       Reinforcement Pre-Training
        
       Author : frozenseven
       Score  : 51 points
       Date   : 2025-06-10 05:30 UTC (17 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | hzia wrote:
       | This is very exciting! Existing data will become a lot more
       | valuable, and it brings these models one step closer to how we
       | learn as humans!
       | 
       | The downside is that this is going to be extremely expensive, so
       | the dataset used for RL will need to be curated.
        
         | watsonmusic wrote:
         | Cannot wait to see how it goes beyond the current LLM training
         | pipeline.
        
           | nsagent wrote:
           | It's clear that you're either one of the authors or a friend
           | of theirs. You created this account 8 months ago to comment
           | on another paper [1] that was released by the same authors.
           | 
           | [1]: https://news.ycombinator.com/item?id=41776324
        
       | dgshsg wrote:
       | I notice that you can do this recursively to arbitrary depth. The
       | cost is terrible though.
        
         | watsonmusic wrote:
         | It could be adaptive: only high-value tokens would be allocated
         | more compute.
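
       A minimal sketch of that adaptive idea, assuming a cheap proxy
       model supplies per-token entropies; the names and numbers below
       are illustrative, not the paper's recipe.

          import torch
          import torch.nn.functional as F

          def reasoning_budgets(proxy_logits: torch.Tensor,
                                min_tokens: int = 0,
                                max_tokens: int = 256) -> torch.Tensor:
              """proxy_logits: [seq_len, vocab] from a cheap proxy model.
              Returns an integer rollout-token budget per position."""
              log_probs = F.log_softmax(proxy_logits, dim=-1)
              entropy = -(log_probs.exp() * log_probs).sum(-1)  # nats, [seq_len]
              # Normalize to [0, 1], then map into a token budget.
              scale = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-9)
              return (min_tokens + scale * (max_tokens - min_tokens)).round().long()

          # Positions given a zero budget would skip the reasoning trace
          # entirely; high-entropy positions get the longest rollouts.
          budgets = reasoning_budgets(torch.randn(128, 32000))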
        
       | babelfish wrote:
       | So marginally better (and occasionally worse) performance for an
       | order-of-magnitude increase in training cost...?
        
         | watsonmusic wrote:
         | The 14B model performs comparably to a 32B one. The improvement
         | is huge.
        
           | 85392_school wrote:
           | Are we only comparing them in terms of text-completion
           | accuracy? Does it also improve performance on benchmarks?
        
       | watsonmusic wrote:
       | A new scaling paradigm has finally arrived!
        
       | beauzero wrote:
       | Interesting
        
       | NotAnOtter wrote:
       | I'm interested in how an innovation like this affects the
       | business prospects.
       | 
       | Let's assume this is a paradigm shift on the scale of
       | Transformers / `Attention Is All You Need`. Companies build out
       | new models and pump another $100 billion through them. And then a
       | year from now, another innovation comes out. Same circus. And
       | again.
       | 
       | No one wants to be left behind, but trying to keep up will sink
       | smaller companies.
        
         | curious_cat_163 wrote:
         | I am not sure why this ought to require "pump another $100
         | billion". Could you elaborate?
         | 
         | Yes, the more recent generations of GPUs optimize for attention
         | math. But they are still fairly "general-purpose" accelerators
         | as well. So when I see papers like this (interesting idea,
         | btw!), my mental model for costs suggests that the CapEx to buy
         | up the GPUs and build out the data centers would get reused for
         | this and hundreds of other ideas and experiments.
         | 
         | And then the hope is that the best ideas will occupy more of
         | the available capacity...
        
         | gessha wrote:
         | Sir, this is an arXiv paper
        
           | NotAnOtter wrote:
           | So true, just like this one: https://arxiv.org/abs/1706.03762
        
       | Imnimo wrote:
       | This is an interesting way of squeezing extra feedback from raw
       | text, but I'm a little skeptical that it's the best way to spend
       | training FLOPs. It feels like most "next tokens" are pretty low
       | information (even after filtering for entropy like they do). Does
       | it make sense to spend a bunch of compute on a reasoning trace
       | for them? Maybe if you're severely data-limited but not
       | compute-limited?
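
       Some back-of-the-envelope arithmetic behind that concern, with
       entirely assumed numbers (the paper's actual rollout settings may
       differ): if each predicted token gets G sampled rollouts of T
       reasoning tokens, generation work per training token grows by
       roughly G * T extra forward passes.

          # Illustrative only: G and T are assumptions, not the paper's values.
          rollouts_per_token = 8      # G: sampled reasoning traces per position
          trace_len = 256             # T: tokens generated per trace
          extra_passes = rollouts_per_token * trace_len
          print(f"~{extra_passes}x more generation work per training token")
          # => ~2048x, which is why gating on entropy (or some other value
          # signal) matters so much for the compute budget.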
        
       | rafaelero wrote:
       | This should be used for high-entropy tokens during pre-training.
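
       One way to read that suggestion, sketched under the assumption
       that a per-position entropy tensor has already been computed (e.g.
       by the proxy-model sketch earlier); `rl_loss_fn` is a hypothetical
       stand-in for the per-position RL objective, not the paper's API.

          import torch
          import torch.nn.functional as F

          def mixed_loss(logits, targets, entropy, rl_loss_fn, threshold=3.0):
              """logits: [T, V], targets: [T], entropy: [T] in nats.
              Plain cross-entropy everywhere; the expensive RL objective is
              added only at high-entropy positions."""
              ce = F.cross_entropy(logits, targets, reduction="none")  # [T]
              gate = (entropy > threshold).float()                     # [T]
              rl = rl_loss_fn(logits, targets)                         # [T], hypothetical
              return (ce * (1.0 - gate) + rl * gate).mean()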
        
       | ntonozzi wrote:
       | Is there any work on using some kind of soft tokens for
       | reasoning? It seems so inefficient to squeeze so much information
       | into a single token for the model's next pass when you could
       | output a large vector at each forward pass instead, giving a
       | drastically larger working memory/scratchpad and much higher
       | bandwidth for the model to pass information forward to the next
       | call. If a single token carries about 17 bits of information, a
       | vector of 1024 32-bit floats could carry 32,768 bits.
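
       There is work in this direction, e.g. the "Coconut" line of work
       on reasoning in a continuous latent space. A minimal sketch of the
       basic mechanic with a toy model (nothing here is from the RPT
       paper): instead of sampling a discrete token and re-embedding it,
       the final hidden state is fed straight back in as the next input
       vector.

          import torch
          import torch.nn as nn

          class TinyLM(nn.Module):
              """Toy decoder used only to illustrate 'soft' latent steps."""
              def __init__(self, vocab=32000, d=1024, nhead=8, layers=2):
                  super().__init__()
                  self.embed = nn.Embedding(vocab, d)
                  layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead,
                                                     batch_first=True)
                  self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
                  self.lm_head = nn.Linear(d, vocab)

              def soft_steps(self, prompt_ids: torch.Tensor, n_latent: int = 4):
                  # Each latent step appends the last hidden state (a full
                  # d-dim vector) to the input instead of a sampled,
                  # re-embedded token.
                  x = self.embed(prompt_ids)                    # [B, T, d]
                  for _ in range(n_latent):
                      h = self.backbone(x)                      # causal mask omitted for brevity
                      x = torch.cat([x, h[:, -1:, :]], dim=1)   # pass the raw vector forward
                  return self.lm_head(self.backbone(x)[:, -1])  # logits for the next real token

          logits = TinyLM().soft_steps(torch.randint(0, 32000, (1, 16)))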
        
       ___________________________________________________________________
       (page generated 2025-06-10 23:01 UTC)