[HN Gopher] Show HN: ART - a new open-source RL framework for tr...
       ___________________________________________________________________
        
       Show HN: ART - a new open-source RL framework for training agents
        
        Hey HN, I wanted to share a new project we've been working on
        for the last couple of months called ART
        (https://github.com/OpenPipe/ART). ART is a new open-source
        framework for training agents using reinforcement learning
        (RL). RL allows you to train an agent to perform better at any
        task whose outcome can be measured and quantified.
        
        There are many excellent projects focused on training LLMs with
        RL, such as GRPOTrainer
        (https://huggingface.co/docs/trl/main/en/grpo_trainer) and verl
        (https://github.com/volcengine/verl). We've used these
        frameworks extensively for customer-facing projects at
        OpenPipe, but grew frustrated with some key limitations:
        
        - Multi-turn workflows, where the agent calls a tool, gets a
          response, and calls another, are not well supported (a
          minimal sketch of this kind of loop is included at the end of
          this post). This makes them a non-starter for any task that
          requires an agent to perform a sequence of actions.
        
        - Other frameworks typically have low GPU efficiency. They may
          require multiple H100 GPUs just to train a small 7B-parameter
          model, and aren't able to keep the GPUs busy consistently
          during both the "rollout" and "training" phases of the
          training loop.
        
        - Existing frameworks are typically not a convenient shape for
          integrating with existing agentic codebases. Existing
          trainers expect you to call raw text completion endpoints,
          and don't automatically provide industry-standard chat
          completion APIs.
        
        ART is designed to address these limitations and make it easy
        to train high-quality agents. We've also shared many details
        and practical lessons learned in this post, which walks through
        a demo of training an email research agent that outperforms o3
        (https://openpipe.ai/blog/art-e-mail-agent). You can also find
        out more about ART's architecture in our announcement post
        (https://openpipe.ai/blog/art-trainer-a-new-rl-trainer-for-ag...).
        
        Happy to answer any questions you have!
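        
        Here's the rough sketch of the multi-turn loop mentioned above.
        This is generic code against an OpenAI-compatible chat
        completions endpoint rather than ART's exact interface; the
        base URL, model name, and the search_inbox tool are
        placeholders:
        
            import json
            from openai import OpenAI
            
            # Any OpenAI-compatible endpoint works here.
            client = OpenAI(base_url="http://localhost:8000/v1",
                            api_key="not-needed")
            
            def search_inbox(query: str) -> list[str]:
                """Stub tool; a real agent would query an email index."""
                return ["invoice from Acme <billing@acme.com>"]
            
            tools = [{
                "type": "function",
                "function": {
                    "name": "search_inbox",
                    "description": "Search emails by keyword",
                    "parameters": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    },
                },
            }]
            
            messages = [{"role": "user",
                         "content": "Who sent the latest invoice?"}]
            while True:
                resp = client.chat.completions.create(
                    model="my-agent", messages=messages, tools=tools)
                msg = resp.choices[0].message
                messages.append(msg)
                if not msg.tool_calls:
                    break  # the agent produced a final answer
                for call in msg.tool_calls:
                    args = json.loads(call.function.arguments)
                    result = search_inbox(**args)
                    messages.append({"role": "tool",
                                     "tool_call_id": call.id,
                                     "content": json.dumps(result)})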
        
       Author : kcorbitt
       Score  : 62 points
       Date   : 2025-04-30 15:35 UTC (7 hours ago)
        
        
       | kcorbitt wrote:
       | Figured now was a good time to post this since we recently got
       | surprisingly good results on training an email research agent.
       | Link is above, but will put it here as well since I think it's a
        | good example of RL's promise:
        | https://openpipe.ai/blog/art-e-mail-agent
        
       | bradhilton wrote:
       | Contributor here, we developed the Agent Reinforcement Trainer
       | (ART) library to make it easy to train LLMs for anything.
       | 
       | No callbacks or straitjacket flows. Instead we serve an OpenAI
       | API-compatible endpoint that you can use as a drop-in replacement
       | for any proprietary APIs you may be hitting.
       | 
       | After collecting responses from the inference API, you can tune
       | the model with your own custom rewards and repeat the process as
       | long as you like, until performance converges. We believe this
       | level of flexibility will make it easier for you to train state-
       | of-the-art models for your own use cases, much like Kyle's new
       | email agent[1].
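        | 
        | As a rough sketch of that loop (the base URL, model name, and
        | reward below are placeholders, not ART's exact API):
        | 
        |     from openai import OpenAI
        | 
        |     # Drop-in replacement: point the standard client at the
        |     # ART server instead of a proprietary API.
        |     client = OpenAI(base_url="http://localhost:8000/v1",
        |                     api_key="not-needed")
        | 
        |     prompts = ["What is 934+1208?"]  # your task inputs
        |     trajectories = []
        |     for prompt in prompts:
        |         messages = [{"role": "user", "content": prompt}]
        |         resp = client.chat.completions.create(
        |             model="my-agent", messages=messages)
        |         answer = resp.choices[0].message.content or ""
        |         messages.append({"role": "assistant",
        |                          "content": answer})
        |         # Score the rollout with whatever reward you like.
        |         reward = 1.0 if "2142" in answer else 0.0
        |         trajectories.append({"messages": messages,
        |                              "reward": reward})
        | 
        |     # The scored trajectories are then handed to the trainer,
        |     # and the collect/train loop repeats until convergence.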
       | 
       | Also happy to answer any questions you have about the framework.
       | 
       | [1] https://openpipe.ai/blog/art-e-mail-agent
        
       | tcdent wrote:
       | I really like this concept.
       | 
       | Do you have documentation for the API response from the
       | `/_train_model` endpoint?
        
         | bradhilton wrote:
          | Hi, we don't have reliable documentation for the HTTP API
          | endpoints yet, mostly because they are still subject to
          | change.
          | 
          | To briefly provide some context, though: `/_train_model`
          | returns a stream of line-delimited JSON objects, one per
          | gradient step as the model trains on the provided
          | trajectories, so the client can monitor progress. The final
          | version of this endpoint may offer both streaming and non-
          | streaming responses, and/or return a "training job" that
          | can be polled instead.
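          | 
          | To give a flavor of what consuming that stream looks like
          | (the URL and request body below are illustrative only,
          | since the schema may still change):
          | 
          |     import json
          |     import requests
          | 
          |     resp = requests.post(
          |         "http://localhost:8000/_train_model",
          |         json={"trajectories": []},  # illustrative payload
          |         stream=True,
          |     )
          |     for line in resp.iter_lines():
          |         if not line:
          |             continue
          |         step = json.loads(line)  # one object per gradient step
          |         print(step)              # e.g. log loss / progress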
        
       | someguy101010 wrote:
       | Thanks for sharing this! A couple of questions come to mind:
       | 
       | - How does training with RL differ from fine tuning?
       | 
       | - When would it make sense to fine tune instead of using RL?
        
         | kcorbitt wrote:
         | Ok good questions here.
         | 
         | By fine-tuning in this context I assume you mean "supervised
         | fine-tuning", or SFT. SFT trains a model to produce a specific
         | string of output tokens, given an input. With SFT, if you were
         | trying to train an assistant to solve math problems using a
         | code interpreter, you might train it on a dataset that looks
          | like:
          | 
          |     input: 'What is 934+1208'
          |     output: `print(934+1208)`
          | 
          |     input: 'how many "r"s in strawberry'
          |     output: `print(len([l for l in "strawberry" if l == 'r']))`
         | 
         | etc, etc.
         | 
         | RL, on the other hand, just means training a model not to
         | produce a concrete string of output tokens, but rather to
         | create an output that maximizes some reward function (you get
         | to decide on the reward).
         | 
         | For the example above, you might create the following dataset
          | for RL training:
          | 
          |     input: 'What is 934+1208'
          |     ground_truth: 2142
          | 
          |     input: 'how many "r"s in strawberry'
          |     ground_truth: 3
         | 
         | You would then train the model to write python code that
         | produces the ground_truth output. Your training code would take
         | the model's output, run the python it produced, and then check
         | whether the output matches the expected ground_truth.
          | Importantly, this doesn't require you to actually write the
          | code that solves the problem (you don't even have to know if
          | it's solvable, technically!). Over time, the training loop would
         | make the model more likely to produce outputs that get high
         | rewards, which hopefully means it gets better at producing
         | valid and applicable python.
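          | 
          | A bare-bones sketch of what that check might look like (the
          | function name and 0/1 scoring are just illustrative):
          | 
          |     import subprocess
          | 
          |     def reward(model_output: str, ground_truth) -> float:
          |         # Run the Python the model wrote, capture stdout.
          |         try:
          |             result = subprocess.run(
          |                 ["python", "-c", model_output],
          |                 capture_output=True, text=True, timeout=5,
          |             )
          |         except subprocess.TimeoutExpired:
          |             return 0.0  # runaway code gets no reward
          |         # Reward 1 if the printed answer matches, else 0.
          |         ok = result.stdout.strip() == str(ground_truth)
          |         return 1.0 if ok else 0.0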
         | 
         | This is useful in lots of domains where it's easier to check
         | the answer than actually produce it. In the blog post[1] linked
         | above, we train the agent to effectively use keyword search to
         | try to find the correct emails in an inbox. As the model
         | trainer, I didn't actually know what the right strategy was to
         | choose keywords that would most quickly find the relevant
         | email, but through training with RL, the model was able to
         | figure it out on its own!
         | 
          | [1]: https://openpipe.ai/blog/art-e-mail-agent?refresh=1746030513...
        
           | someguy101010 wrote:
           | Thank you for the detailed response!
        
       | jeffchuber wrote:
       | the table with comparable models is a really great way to show
       | off things here
        
       ___________________________________________________________________
       (page generated 2025-04-30 23:00 UTC)