[HN Gopher] Scaling Up Test-Time Compute with Latent Reasoning: ...
       ___________________________________________________________________
        
       Scaling Up Test-Time Compute with Latent Reasoning: A Recurrent
       Depth Approach
        
       Author : timbilt
       Score  : 53 points
       Date   : 2025-02-10 19:50 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | timbilt wrote:
       | Twitter thread about this by the author:
       | https://x.com/jonasgeiping/status/1888985929727037514
        
         | danielbln wrote:
         | If you don't have a twitter account and want to read the full
         | thread:
         | https://xcancel.com/jonasgeiping/status/1888985929727037514
        
       | tmnvdb wrote:
        | Interesting stuff. As the authors note, using latent reasoning
        | seems to be a way to sink more compute into the model and get
        | better performance without increasing the model size, which is
        | good news for those on a steady diet of 'scale pills'.
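        | 
        | Back-of-the-envelope sketch of that tradeoff (my illustrative
        | Python, not numbers from the paper; the parameter count and the
        | share of parameters in the recurrent core are assumptions):
        | 
        |   PARAMS = 3.5e9        # rough model scale, held fixed
        |   FLOPS_PER_PARAM = 2   # ~2 forward FLOPs per param per token
        | 
        |   def flops_per_token(r, core_frac=0.8):
        |       # core_frac: assumed share of weights in the recurrent
        |       # core, the only part re-applied r times at test time
        |       fixed = PARAMS * (1 - core_frac) * FLOPS_PER_PARAM
        |       looped = PARAMS * core_frac * FLOPS_PER_PARAM * r
        |       return fixed + looped
        | 
        |   for r in (1, 4, 8, 32):
        |       print(r, f"{flops_per_token(r):.2e}")  # weights unchanged
        | 
        | Same parameter count, more forward compute per token as r grows.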
        
       | HarHarVeryFunny wrote:
       | Latent / embedding-space reasoning seems a step in the right
       | direction, but building recurrence into the model while still
       | relying on gradient descent (i.e. BPTT) to train it seems to
       | create more of a problem (training inefficiency) than it solves,
       | especially since they still end up externally specifying the
        | number of recurrent iterations (r=4, 8, etc.) at inference time.
        | Ideally, having recurrence internal to the model would allow the
        | model itself to decide how long to iterate before outputting
        | anything.
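        | 
        | A toy version of the loop (hypothetical code, not the paper's
        | implementation; the prelude/core/coda names and sizes are made
        | up) makes the point concrete: r is an argument to the forward
        | pass rather than something the model chooses.
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   class RecurrentDepthSketch(nn.Module):
        |       def __init__(self, d=512):
        |           super().__init__()
        |           self.prelude = nn.Linear(d, d)  # embed the input once
        |           self.core = nn.Sequential(      # block applied r times
        |               nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))
        |           self.coda = nn.Linear(d, d)     # decode after the loop
        | 
        |       def forward(self, x, r=4):
        |           e = self.prelude(x)
        |           s = torch.randn_like(e)   # random initial latent state
        |           for _ in range(r):        # r is supplied by the caller
        |               s = self.core(torch.cat([s, e], dim=-1))
        |           return self.coda(s)
        | 
        | Training this with gradient descent means backpropagating
        | through the unrolled loop (BPTT, possibly truncated), and an
        | adaptive variant would need the loop to end on a learned halting
        | signal (ACT-style) instead of a caller-supplied r.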
        
         | thomasahle wrote:
         | > Latent / embedding-space reasoning seems a step in the right
         | direction
         | 
          | Might be good for reasoning, but it's terrible for
          | interpretability / AI safety.
        
         | janalsncm wrote:
         | > seems a step in the right direction
         | 
         | I can't see why. I can't think of any problems where recurrent
         | loops with latent streams would be preferable to tokens. And
         | the downsides are obvious.
         | 
         | > externally specifying the number of recurrent iterations
         | 
          | Yeah, this seems wrong to me. At least with RL training, you
          | saw that the length of the CoT decreased dramatically before
          | climbing again as the model became more proficient.
        
       | janalsncm wrote:
        | One of the benefits of using thinking tokens compared to
        | "thinking in a latent space" is that you can directly observe
        | the quality of the CoT. In R1 they saw it was mixing languages
        | and fixed it with cold-start data.
        | 
        | It would be hard to SFT this because you can only SFT the final
        | result, not the latent space.
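        | 
        | Roughly the contrast, as a toy sketch (invented PyTorch shapes,
        | not any real training code): with token CoT every reasoning
        | token has a target, while the intermediate latent states have
        | no reference to score, so the loss lands only on the answer.
        | 
        |   import torch
        |   import torch.nn.functional as F
        | 
        |   V = 32000                        # assumed vocab size
        | 
        |   # Token CoT: supervise the whole trace (CoT + answer tokens)
        |   logits = torch.randn(200, V)     # 200 trace tokens
        |   targets = torch.randint(0, V, (200,))
        |   loss_cot = F.cross_entropy(logits, targets)
        | 
        |   # Latent reasoning: r=8 latent states carry no token targets,
        |   # so only the decoded final answer can take an SFT loss
        |   latents = torch.randn(8, 4096)   # unsupervised by SFT
        |   ans_logits = torch.randn(20, V)  # 20 answer tokens
        |   ans_targets = torch.randint(0, V, (20,))
        |   loss_latent = F.cross_entropy(ans_logits, ans_targets)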
       | 
       | I also notice the authors only had compute for a single full
       | training run. It's impressive they saw such good results from
       | that, but I wonder if they could get better results by
       | incorporating recent efficiency improvements.
       | 
        | I would personally not use this architecture because 1) it adds
        | a lot of hyperparameters that don't have strong theoretical
        | grounding, and 2) it's not clearly better than simpler methods.
        
       ___________________________________________________________________
       (page generated 2025-02-10 23:00 UTC)