[HN Gopher] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
___________________________________________________________________
Implicit Actor Critic Coupling via a Supervised Learning Framework
for RLVR
Author : getnormality
Score : 31 points
Date : 2025-10-05 17:01 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| getnormality wrote:
| I stumbled across this AI paper just now. It sounds
| intimidatingly technical, but if you read the abstract and look
| at Figures 1 and 2 and Equation 6, I think it's got some neat and
| accessible conceptual ideas.
|
| Supervised learning is a much more mature technology than
| reinforcement learning, so it seems like a good idea to leverage
| that maturity.
| yorwba wrote:
| I think you meant to link to
|
| _Implicit Actor Critic Coupling via a Supervised Learning
| Framework for RLVR_ https://arxiv.org/abs/2509.02522
|
| not
|
| _Winning Gold at IMO 2025 with a Model-Agnostic Verification-
| and-Refinement Pipeline_ https://arxiv.org/abs/2507.15855
| dang wrote:
| We've changed the top link to that from
| https://arxiv.org/abs/2507.15855. Thanks!
| getnormality wrote:
| Ack, thank you.
| impossiblefork wrote:
| That paper is really cool too, though. I'm glad your comment
| records the old link, because by the time I got here I only saw
| the right paper.
| anfego wrote:
| Is this DPO?
| getnormality wrote:
| I have no idea. My understanding of this entire field is
| extremely superficial. I only posted this because I was able to
| sort of understand the paper despite that.
|
| I can tell you that they cite the DPO paper right before
| Equation 8.
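|
| For anyone comparing: the DPO loss from that cited paper
| (Rafailov et al., 2023) is, roughly, the sketch below. Whether
| the paper's Equation 8 matches it is what I'd want to check;
| the variable names here are just illustrative.
|
|     import torch.nn.functional as F
|
|     def dpo_loss(logp_chosen, logp_rejected,
|                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
|         # Inputs are summed token log-probs per response.
|         # Implicit reward margin relative to a frozen reference
|         margin = beta * ((logp_chosen - ref_logp_chosen)
|                          - (logp_rejected - ref_logp_rejected))
|         # Negative log-sigmoid of the margin, averaged over batch
|         return -F.logsigmoid(margin).mean()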
| impossiblefork wrote:
| 59.76% on AIME is really appealing. I haven't had time to
| understand the method or determine whether it's actually useful,
| but the number suggests this could be a stepping stone in
| something like the o1-to-DeepSeek-R1 progression for reasoning,
| where open-source models eventually figured out how o1 worked.
| Here the target is less well-defined than o1: it's whatever
| Google achieved, and OpenAI may have achieved, on the 2025 IMO
| problems.
| radarsat1 wrote:
| > By treating the outcome reward as a predictable label, we
| reformulate the RLVR problem into a supervised learning task over
| a score function parameterized by the policy model and optimized
| using cross-entropy loss.
|
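| Roughly, as I read it (a sketch assuming a binary verifiable
| reward, with the score simply taken to be the policy's log-prob
| of the sampled response; the paper's Equation 6 parameterizes
| this differently):
|
|     import torch.nn.functional as F
|
|     def rlvr_as_supervised_loss(policy_logp, reward_label):
|         # Treat the 0/1 outcome reward as a label and fit a
|         # policy-derived score with binary cross-entropy.
|         return F.binary_cross_entropy_with_logits(
|             policy_logp, reward_label.float())
|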
| Isn't this how the Decision Transformer works? I don't see it in
| the references, so I'll be curious to compare the papers in more
| depth.
|
| https://arxiv.org/abs/2106.01345
|
| > By conditioning an autoregressive model on the desired return
| (reward), past states, and actions, our Decision Transformer
| model can generate future actions that achieve the desired
| return.
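|
| A toy illustration of that conditioning, just to make the
| contrast concrete (names here are mine, not the paper's): the
| model consumes interleaved (return-to-go, state, action) tokens
| and autoregressively predicts the next action.
|
|     def dt_input_sequence(returns_to_go, states, actions):
|         # Interleave tokens as (R_t, s_t, a_t) triples; an
|         # autoregressive transformer trained on such sequences
|         # can be prompted with a high target return at test time.
|         seq = []
|         for rtg, s, a in zip(returns_to_go, states, actions):
|             seq.extend([("rtg", rtg), ("state", s), ("action", a)])
|         return seq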
|
| It has crossed my mind that I haven't seen DT brought up much
| lately. It seemed really interesting when it was first
| published, but I haven't read much follow-up work.
___________________________________________________________________
(page generated 2025-10-05 23:00 UTC)