[HN Gopher] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
       ___________________________________________________________________
        
       Implicit Actor Critic Coupling via a Supervised Learning Framework
       for RLVR
        
       Author : getnormality
       Score  : 31 points
       Date   : 2025-10-05 17:01 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | getnormality wrote:
       | I stumbled across this AI paper just now. It sounds
       | intimidatingly technical, but if you read the abstract and look
       | at Figures 1 and 2 and Equation 6, I think it's got some neat and
       | accessible conceptual ideas.
       | 
        | Supervised learning is a much more mature technology than
        | reinforcement learning, so leveraging it here seems like a
        | good idea.
        
       | yorwba wrote:
       | I think you meant to link to
       | 
       |  _Implicit Actor Critic Coupling via a Supervised Learning
       | Framework for RLVR_ https://arxiv.org/abs/2509.02522
       | 
       | not
       | 
       |  _Winning Gold at IMO 2025 with a Model-Agnostic Verification-
       | and-Refinement Pipeline_ https://arxiv.org/abs/2507.15855
        
         | dang wrote:
         | We've changed the top link to that from
         | https://arxiv.org/abs/2507.15855. Thanks!
        
         | getnormality wrote:
         | Ack, thank you.
        
          | impossiblefork wrote:
          | That paper is really cool too, though. I'm glad your
          | comment sort of records the old link, since by the time I
          | got here I only saw the correct paper.
        
       | anfego wrote:
       | Is this DPO?
        
         | getnormality wrote:
         | I have no idea. My understanding of this entire field is
         | extremely superficial. I only posted this because I was able to
         | sort of understand the paper despite that.
         | 
         | I can tell you that they cite the DPO paper right before
         | Equation 8.
        
       | impossiblefork wrote:
        | 59.76% on AIME is really appealing. I haven't had time to
        | understand the method or determine whether it's useful, but I
        | read that number as a hint that this could be a stepping
        | stone in something like the o1-to-DeepSeek-R1 progression for
        | thinking, where open source models eventually figured out how
        | o1 worked. Only this time the 'o1' is less definite: it's
        | whatever Google achieved, and OpenAI may have achieved, on
        | the 2025 IMO problems.
        
       | radarsat1 wrote:
       | > By treating the outcome reward as a predictable label, we
       | reformulate the RLVR problem into a supervised learning task over
       | a score function parameterized by the policy model and optimized
       | using cross-entropy loss.
       | 
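        | If I'm reading the abstract right, the core move looks
        | roughly like this (a rough PyTorch sketch, assuming a scalar
        | score s_theta(prompt, response) derived from the policy
        | model; all names here are mine, not the paper's):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     # scores: s_theta(prompt, response), one scalar per
        |     # sampled response, computed from the policy model;
        |     # rewards: 1/0 verifier outcomes used as labels, so the
        |     # RLVR objective becomes plain binary cross-entropy.
        |     scores = torch.randn(4, requires_grad=True)
        |     rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
        |     loss = F.binary_cross_entropy_with_logits(scores, rewards)
        |     loss.backward()  # gradients flow back into the policy
        | 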
       | Isn't this how the Decision Transformer works? I don't see it in
       | the references, so I'll be curious to compare the papers in more
       | depth.
       | 
       | https://arxiv.org/abs/2106.01345
       | 
       | > By conditioning an autoregressive model on the desired return
       | (reward), past states, and actions, our Decision Transformer
       | model can generate future actions that achieve the desired
       | return.
       | 
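        | For comparison, my mental model of the DT setup, again just a
        | sketch with placeholder dimensions rather than anything taken
        | from either paper:
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     B, T, state_dim, act_dim, d = 2, 10, 3, 4, 8
        |     rtg = torch.randn(B, T, 1)    # desired return-to-go
        |     states = torch.randn(B, T, state_dim)
        |     actions = torch.randn(B, T, act_dim)
        | 
        |     embed_r, embed_s, embed_a = (nn.Linear(1, d),
        |         nn.Linear(state_dim, d), nn.Linear(act_dim, d))
        | 
        |     # interleave (return, state, action) tokens into one
        |     # sequence of length 3*T; a causal transformer over it is
        |     # trained to predict each action, and at test time you
        |     # set rtg to the return you want and roll out.
        |     tokens = torch.stack([embed_r(rtg), embed_s(states),
        |         embed_a(actions)], dim=2).reshape(B, 3 * T, d)
        | 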
        | It has crossed my mind that I haven't seen DT brought up much
        | lately; it seemed really interesting when it was first
        | published, but I haven't read much follow-up work.
        
       ___________________________________________________________________
       (page generated 2025-10-05 23:00 UTC)