[HN Gopher] Scaling Up Test-Time Compute with Latent Reasoning: ...
___________________________________________________________________
Scaling Up Test-Time Compute with Latent Reasoning: A Recurrent
Depth Approach
Author : timbilt
Score : 53 points
Date : 2025-02-10 19:50 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| timbilt wrote:
| Twitter thread about this by the author:
| https://x.com/jonasgeiping/status/1888985929727037514
| danielbln wrote:
| If you don't have a twitter account and want to read the full
| thread:
| https://xcancel.com/jonasgeiping/status/1888985929727037514
| tmnvdb wrote:
| Interesting stuff. As the authors note, using latent reasoning
| seems to be a way to sink more compute into the model and get
| better performance without increasing the model size; good news
| for those on a steady diet of 'scale pills'.
| HarHarVeryFunny wrote:
| Latent / embedding-space reasoning seems a step in the right
| direction, but building recurrence into the model while still
| relying on gradient descent (i.e. BPTT) to train it seems to
| create more of a problem (training inefficiency) than it solves,
| especially since they still end up externally specifying the
| number of recurrent iterations (r=4, 8, etc) for a given
| inference. Ideally, having recurrence internal to the model would
| allow the model itself to decide how long to iterate before
| outputting anything.
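| A minimal sketch of the scheme being described, not the paper's
| exact architecture; the module names (CoreBlock, RecurrentDepthLM)
| and sizes are illustrative:
|
|     import torch
|     import torch.nn as nn
|
|     class CoreBlock(nn.Module):
|         """Weight-shared block reused every iteration (illustrative
|         stand-in, not the paper's exact block)."""
|         def __init__(self, d_model, n_heads=4):
|             super().__init__()
|             self.attn = nn.MultiheadAttention(d_model, n_heads,
|                                               batch_first=True)
|             self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
|                                      nn.GELU(),
|                                      nn.Linear(4 * d_model, d_model))
|             self.norm1 = nn.LayerNorm(d_model)
|             self.norm2 = nn.LayerNorm(d_model)
|
|         def forward(self, s, x):
|             # s: latent state, x: embedded input, re-injected each step
|             h = self.norm1(s + x)
|             h = h + self.attn(h, h, h, need_weights=False)[0]
|             return h + self.mlp(self.norm2(h))
|
|     class RecurrentDepthLM(nn.Module):
|         def __init__(self, vocab, d_model=256):
|             super().__init__()
|             self.embed = nn.Embedding(vocab, d_model)
|             self.core = CoreBlock(d_model)   # same params every step
|             self.head = nn.Linear(d_model, vocab)
|
|         def forward(self, tokens, r=4):
|             x = self.embed(tokens)
|             s = torch.zeros_like(x)          # initial latent state
|             for _ in range(r):               # unrolled r times; training
|                 s = self.core(s, x)          # this loop is BPTT
|             return self.head(s)
|
|     # Same parameter count, more test-time compute; note that r is
|     # set externally rather than decided by the model:
|     model = RecurrentDepthLM(vocab=1000)
|     tokens = torch.randint(0, 1000, (1, 16))
|     logits_r4 = model(tokens, r=4)
|     logits_r16 = model(tokens, r=16)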
| thomasahle wrote:
| > Latent / embedding-space reasoning seems a step in the right
| direction
|
| Might be good for reasoning, but it's terrible for
| interpretability / AI safety.
| janalsncm wrote:
| > seems a step in the right direction
|
| I can't see why. I can't think of any problems where recurrent
| loops with latent streams would be preferable to tokens. And
| the downsides are obvious.
|
| > externally specifying the number of recurrent iterations
|
| Yeah, this seems wrong to me. At least with RL training you saw
| that the length of the CoT decreased dramatically before climbing
| again as the model became more proficient.
| janalsncm wrote:
| One of the benefits of using thinking tokens compared to
| "thinking in a latent space" is that you can directly observe the
| quality of the CoT. In R1 they saw it was mixing languages and
| fixed it with cold-start data.
|
| It would be hard to SFT this because you can only SFT the final
| result, not the latent space (see the sketch below this comment).
|
| I also notice the authors only had compute for a single full
| training run. It's impressive they saw such good results from
| that, but I wonder if they could get better results by
| incorporating recent efficiency improvements.
|
| I would personally not use this architecture because 1) it adds a
| lot of hyperparameters which don't have a strong theoretical
| grounding and 2) it's not clearly better than simpler methods.
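| A toy sketch of the supervision point above: with explicit CoT
| tokens every intermediate "thought" is text and can carry an SFT
| target, while with latent iterations only the final answer tokens
| can. Function and variable names are illustrative, assuming
| standard per-token cross-entropy:
|
|     import torch
|     import torch.nn.functional as F
|
|     # Shapes: logits (B, T, V); target_ids (B, T); answer_mask (B, T)
|
|     def cot_sft_loss(logits, target_ids):
|         # target_ids contains the chain-of-thought AND the answer,
|         # so every intermediate reasoning token gets a target.
|         return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
|                                target_ids.reshape(-1))
|
|     def latent_sft_loss(logits, target_ids, answer_mask):
|         # The "reasoning" happened inside the latent loop, so only
|         # the final answer positions carry a target; gradients reach
|         # the loop solely by backprop through those positions.
|         per_tok = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
|                                   target_ids.reshape(-1),
|                                   reduction="none")
|         mask = answer_mask.reshape(-1).float()
|         return (per_tok * mask).sum() / mask.sum()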
___________________________________________________________________
(page generated 2025-02-10 23:00 UTC)