# LlamaGym

Fine-tune LLM agents with online reinforcement learning

"Agents" originated in reinforcement learning, where they learn by interacting with an environment and receiving a reward signal. However, LLM-based agents today do not learn online (i.e. continuously in real time) via reinforcement.

OpenAI created Gym to standardize and simplify RL environments, but if you try dropping an LLM-based agent into a Gym environment for training, you'd find it's still quite a bit of code to handle LLM conversation context, episode batches, reward assignment, PPO setup, and more. LlamaGym seeks to simplify fine-tuning LLM agents with RL. Right now, it's a single `Agent` abstract class that handles all the issues mentioned above, letting you quickly iterate and experiment with agent prompting & hyperparameters across any Gym environment.
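To make that division of labor concrete, here is a rough sketch of the shape such an abstract class can take. The subclass hooks (`get_system_prompt`, `format_observation`, `extract_action`) and the methods you call from your RL loop (`act`, `assign_reward`, `terminate_episode`) match the usage example below; the internal wiring shown here is an illustrative assumption, not LlamaGym's actual implementation.

```python
# Illustrative sketch only -- the real class lives in llamagym/; the internals
# below are assumptions about how such an abstraction could be wired up.
from abc import ABC, abstractmethod


class Agent(ABC):
    def __init__(self, model, tokenizer, device):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        # Running conversation context, seeded with the system prompt.
        self.messages = [{"role": "system", "content": self.get_system_prompt()}]

    # --- Task-specific pieces you implement when subclassing ---
    @abstractmethod
    def get_system_prompt(self) -> str: ...

    @abstractmethod
    def format_observation(self, observation) -> str: ...

    @abstractmethod
    def extract_action(self, response: str): ...

    # --- Conversation and RL bookkeeping the base class handles ---
    def act(self, observation):
        # Append the formatted observation, generate a reply, parse an action.
        self.messages.append(
            {"role": "user", "content": self.format_observation(observation)}
        )
        prompt = self.tokenizer.apply_chat_template(
            self.messages, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        output = self.model.generate(**inputs, max_new_tokens=32)
        response = self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        self.messages.append({"role": "assistant", "content": response})
        return self.extract_action(response)

    def assign_reward(self, reward):
        # Attach the environment reward to the current episode's turns.
        ...

    def terminate_episode(self):
        # Close out the episode; once enough episodes are batched,
        # run a PPO update over the collected conversations.
        ...
```

In practice you only write the three task-specific methods; everything below them is the boilerplate LlamaGym is meant to take off your hands.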
## Usage

Fine-tuning an LLM-based agent to play in a Gym-style environment with RL has never been easier! Once you install LlamaGym:

```bash
pip install llamagym
```

First, implement 3 abstract methods on the `Agent` class:

```python
from llamagym import Agent


class BlackjackAgent(Agent):
    def get_system_prompt(self) -> str:
        return "You are an expert blackjack player."

    def format_observation(self, observation) -> str:
        return f"Your current total is {observation[0]}"

    def extract_action(self, response: str):
        return 0 if "stay" in response else 1
```

Then, define your base LLM (as you would for any fine-tuning job) and instantiate your agent:

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

device = "cuda"  # or "cpu"
model = AutoModelForCausalLMWithValueHead.from_pretrained("Llama-2-7b").to(device)
tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b")
agent = BlackjackAgent(model, tokenizer, device)
```

Finally, write your RL loop as usual and simply call your agent to act, reward, and terminate:

```python
import gymnasium as gym
from tqdm import trange

env = gym.make("Blackjack-v1")

for episode in trange(5000):
    observation, info = env.reset()
    done = False

    while not done:
        action = agent.act(observation)  # act based on observation
        observation, reward, terminated, truncated, info = env.step(action)
        agent.assign_reward(reward)  # provide reward to agent
        done = terminated or truncated

    train_stats = agent.terminate_episode()  # trains if batch is full
```

Some reminders:

- The code snippets above are mildly simplified, but a fully working example is available in `examples/blackjack.py`.
- Getting online RL to converge is notoriously difficult, so you'll have to mess with hyperparameters to see improvement.
  - Your model may also benefit from a supervised fine-tuning stage on sampled trajectories before running RL (we may add this feature in the future).
- Our implementation values simplicity, so it is not as compute-efficient as e.g. Lamorel, but it is easier to start playing around with.
- LlamaGym is a weekend project and still a WIP, but we love contributions!

## Relevant Work

- Grounding Large Language Models with Online Reinforcement Learning
  - Lamorel: Language Models for Reinforcement Learning
- True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning

## Citation

```bibtex
@misc{pandey2024llamagym,
  title = {LlamaGym: Fine-tune LLM agents with Online Reinforcement Learning},
  author = {Rohan Pandey},
  year = {2024},
  howpublished = {GitHub},
  url = {https://github.com/KhoomeiK/LlamaGym}
}
```