https://github.com/wgryc/phasellm

# PhaseLLM

Large language model evaluation and workflow framework from Phase AI.

The coming months and years will bring thousands of new products and experiences powered by large language models (LLMs) like ChatGPT and its growing number of variants. Whether you're using OpenAI's ChatGPT, Anthropic's Claude, or something else altogether, you'll want to test how well your models and prompts perform against user needs.
As more models are launched, you'll also have a bigger range of options. PhaseLLM is a framework designed to help manage and test LLM-driven experiences -- products, content, or other experiences that product and brand managers might be driving for their users.

Here's what PhaseLLM does:

1. We standardize API calls so you can plug and play models from OpenAI, Cohere, Anthropic, or other providers.
2. We've built evaluation frameworks so you can compare outputs and decide which ones are driving the best experiences for users.
3. We're adding automations so you can use advanced models (e.g., GPT-4) to evaluate simpler models (e.g., GPT-3) and determine which combinations of prompts yield the best experiences, especially when taking into account the cost and speed of model execution.

PhaseLLM is open source and we envision building more features to help with model understanding. We want to help developers, data scientists, and others launch new, robust products as easily as possible. If you're working on an LLM product, please reach out. We'd love to help.

## Example: Evaluating Travel Chatbot Prompts with GPT-3.5, Claude, and more

PhaseLLM makes it incredibly easy to plug and play LLMs and evaluate them, in some cases with other LLMs. Suppose you're building a travel chatbot and want to test Claude and Cohere against each other, using GPT-3.5 as the judge. What's awesome about this approach is that (1) you can plug and play models and prompts as needed, and (2) the entire workflow takes very little code. This simple example can easily be scaled to much more complex workflows.

So, time for the code...

First, load your API keys.

```python
import os  # needed for os.getenv below

from dotenv import load_dotenv

import llms  # llms.py sits at the root of this repository

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
cohere_api_key = os.getenv("COHERE_API_KEY")
```

Next, we set up the evaluator, which takes two LLM outputs and decides which one is better for the objective at hand.

```python
# We'll use GPT-3.5 as the evaluator.
e = llms.GPT35Evaluator(openai_api_key)
```

Now it's time to set up the experiment. We define an objective describing what we're trying to achieve with our chatbot, and provide 5 examples of chats that users have started.

```python
objective = "We're building a chatbot to discuss a user's travel preferences and provide advice."

# Chats that have been launched by users.
travel_chat_starts = [
    "I'm planning to visit Poland in spring.",
    "I'm looking for the cheapest flight to Europe next week.",
    "I am trying to decide between Prague and Paris for a 5-day trip",
    "I want to visit Europe but can't decide if spring, summer, or fall would be better.",
    "I'm unsure if I should visit Spain by flying via the UK or via France."
]
```

Now we set up our Cohere and Claude models.

```python
# Note: the Cohere wrapper name below is assumed by analogy with ClaudeWrapper;
# only the Claude line appeared in the original snippet.
cohere_model = llms.CohereWrapper(cohere_api_key)
claude_model = llms.ClaudeWrapper(anthropic_api_key)
```

Finally, we launch our test. We run an experiment where both models generate a chat response, and then GPT-3.5 evaluates the two responses.

```python
for tcs in travel_chat_starts:
    messages = [{"role": "system", "content": objective},
                {"role": "user", "content": tcs}]
    response_cohere = cohere_model.complete_chat(messages, "assistant")
    response_claude = claude_model.complete_chat(messages, "assistant")
    pref = e.choose(objective, tcs, response_cohere, response_claude)
    print(f"{pref}")
```

In this case, we simply print which of the two models was preferred. Voila! You've got a suite to test your models and can plug and play three major LLMs.
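The loop above prints each preference individually. If you'd rather see an overall head-to-head score, you can tally the evaluator's choices. A minimal sketch, assuming `e.choose(...)` returns `"1"` when the first response passed in is preferred and `"2"` otherwise; check the evaluator's actual return format before relying on this:

```python
from collections import Counter

# Tally which model the evaluator prefers across all chat starts.
# Assumes e.choose() returns "1" (first response wins) or "2" (second wins).
tally = Counter()
for tcs in travel_chat_starts:
    messages = [{"role": "system", "content": objective},
                {"role": "user", "content": tcs}]
    response_cohere = cohere_model.complete_chat(messages, "assistant")
    response_claude = claude_model.complete_chat(messages, "assistant")
    pref = e.choose(objective, tcs, response_cohere, response_claude)
    tally["cohere" if pref.strip() == "1" else "claude"] += 1

print(f"Cohere preferred: {tally['cohere']}, Claude preferred: {tally['claude']}")
```

Aggregating this way makes it easier to compare prompt variants: rerun the tally with a different `objective` and see whether the preference split shifts.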
## Contact Us

If you have questions, requests, ideas, etc., please reach out at w (at) phaseai (dot) com.