# FlexGen

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 or a 24GB RTX 3090 gaming card!).
Large language models (LLMs) are at the heart of applications like ChatGPT and Copilot, but the high computational and memory requirements of LLM inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU (e.g., T4, 3090) and allow flexible deployment for various hardware setups.

The key features of FlexGen include:

- **Lightning-Fast Offloading.** Up to 100x faster than other offloading-based systems for running 175B models on a single GPU.
- **Extreme Compression.** Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
- **Scalability.** Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are given.

| Read Paper | Join Discord |

## Content

- Benchmark Results
- Install
- Get Started with a Single GPU
- Run Chatbot with OPT models up to 175B on a Single GPU
- Scaling to Distributed GPUs
- Roadmap

## Benchmark Results

### Generation Throughput (token/s)

| System | OPT-6.7B | OPT-30B | OPT-175B |
| --- | --- | --- | --- |
| Huggingface Accelerate | 25.12 | 0.62 | 0.01 |
| DeepSpeed ZeRO-Inference | 9.28 | 0.60 | 0.01 |
| Petals* | - | - | 0.05 |
| FlexGen | 25.26 | 7.32 | 0.69 |
| FlexGen with Compression | 29.12 | 8.38 | 1.12 |

- Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
- Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to a value that maximizes the generation throughput for each system.
- Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation).

How to reproduce.

### Latency-Throughput Trade-off

The figure below shows the latency-throughput trade-off of three offloading-based systems on OPT-175B (left) and OPT-30B (right). FlexGen achieves a new Pareto-optimal frontier with a 100x higher maximum throughput for OPT-175B. Other systems cannot increase throughput further because they run out of memory. "(c)" denotes FlexGen with compression.

*(figure: latency-throughput trade-off on OPT-175B and OPT-30B)*

## How It Works

FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and the attention key/value (KV) cache. FlexGen further compresses both the weights and the KV cache to 4 bits with negligible accuracy loss.

One key idea of FlexGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods, but the efficiency of offloading can be greatly boosted in throughput-oriented scenarios (see the figure above). FlexGen uses a block schedule to reuse weights and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.

*(figure: (a) row-by-row schedule vs. (b) block schedule)*

More details can be found in our paper.
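As a toy illustration of the block schedule described above (this is not FlexGen's actual code, and the sizes are made up), the sketch below counts how many times layer weights must be fetched from CPU or disk when micro-batches are generated one at a time (row-by-row) versus when a loaded layer is reused across a block of micro-batches:

```python
# Toy illustration of the block schedule idea (not FlexGen's actual code).

def row_by_row_fetches(num_layers: int, num_batches: int) -> int:
    """Weight fetches when each micro-batch is generated end-to-end before the next."""
    fetches = 0
    for _ in range(num_batches):      # rows: one micro-batch at a time
        for _ in range(num_layers):   # every layer is re-fetched for every micro-batch
            fetches += 1
    return fetches


def block_schedule_fetches(num_layers: int, num_batches: int, block_size: int) -> int:
    """Weight fetches when a loaded layer is reused across a block of micro-batches."""
    fetches = 0
    for _ in range(0, num_batches, block_size):  # iterate over blocks of micro-batches
        for _ in range(num_layers):              # one fetch serves block_size micro-batches
            fetches += 1                         # (the next layer's fetch can overlap compute)
    return fetches


if __name__ == "__main__":
    layers, batches, block = 96, 64, 8           # hypothetical sizes, chosen for illustration
    print("row-by-row fetches:", row_by_row_fetches(layers, batches))             # 6144
    print("block schedule    :", block_schedule_fetches(layers, batches, block))  # 768
```

With 96 layers, 64 micro-batches, and a block size of 8, the block schedule needs 8x fewer weight transfers for the same amount of work; this weight reuse, together with overlapping I/O and compute, is what lets a throughput-oriented schedule amortize slow CPU/disk transfers at the cost of per-token latency.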
## Install

Requirements:

```
torch>=1.12
transformers>=4.24
```

Instructions:

```bash
git clone https://github.com/Ying1123/FlexGen.git
cd FlexGen
pip3 install -e .

# (Optional) Install openmpi for multi-gpu execution
# sudo apt install openmpi-bin
```

## Get Started with a Single GPU

### OPT-1.3B

To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU, so no offloading is required. FlexGen will automatically download the weights from Hugging Face.

```bash
python3 -m flexgen.flex_opt --model facebook/opt-1.3b
```

### OPT-30B

To run large models like OPT-30B, you will need to use CPU offloading. You can try the command below. The `--percent` arguments specify the offloading strategy for the parameters, attention cache, and hidden states separately.

```bash
python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
```

### OPT-175B

To run OPT-175B, you need to download the weights from metaseq and convert them into Alpa format. You can then try CPU/disk offloading with

```bash
python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 0 0 0 0 --offload-dir YOUR_SSD_FOLDER
```

### How to set the offloading strategy?

We will release an automatic policy optimizer later, but for now you have to manually try a few strategies. The idea of high-throughput generation is to offload the parameters and attention cache to the CPU as much as possible, and to disk if necessary. You can see the reference strategies in our benchmark here.
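To build intuition for the `--percent` flag, here is a small illustrative sketch (not FlexGen's implementation). It assumes, based on the example commands above, that the six numbers are GPU and CPU percentages for the weights, the attention cache, and the hidden states, with any remainder spilling to disk; treat that interpretation as an assumption rather than documented behavior:

```python
# Illustrative sketch only, not FlexGen's code. It interprets the six --percent
# values as (GPU %, CPU %) for weights, attention cache, and hidden states,
# with whatever is left over assumed to spill to disk.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Placement:
    gpu: int
    cpu: int

    @property
    def disk(self) -> int:
        # Assumed convention: anything not placed on GPU or CPU goes to disk.
        return 100 - self.gpu - self.cpu


def parse_percent(values: List[int]) -> Dict[str, Placement]:
    if len(values) != 6:
        raise ValueError("expected 6 numbers: w_gpu w_cpu cache_gpu cache_cpu hidden_gpu hidden_cpu")
    names = ["weights", "attention cache", "hidden states"]
    return {name: Placement(gpu, cpu)
            for name, (gpu, cpu) in zip(names, zip(values[0::2], values[1::2]))}


# OPT-30B example from above: weights fully on CPU, cache and hidden states on GPU.
for name, p in parse_percent([0, 100, 100, 0, 100, 0]).items():
    print(f"{name:16s} GPU {p.gpu:3d}%  CPU {p.cpu:3d}%  disk {p.disk:3d}%")

# OPT-175B example from above: everything spills to the --offload-dir SSD folder.
for name, p in parse_percent([0, 0, 0, 0, 0, 0]).items():
    print(f"{name:16s} GPU {p.gpu:3d}%  CPU {p.cpu:3d}%  disk {p.disk:3d}%")
```

Under this assumed reading, the OPT-30B command keeps the memory-hungry weights on the CPU while keeping the smaller per-token tensors on the GPU, and the OPT-175B command pushes everything to disk, which is why it also needs `--offload-dir`.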
## Scaling to Distributed GPUs

If you have more GPUs, FlexGen can combine offloading with pipeline parallelism to allow scaling. For example, if you have 2 GPUs but their aggregate GPU memory is smaller than the model size, you still need offloading. FlexGen allows you to use pipeline parallelism across these 2 GPUs to accelerate generation. See examples here.

## Run Chatbot with OPT models up to 175B on a Single GPU

chatbot.py shows how to build a chatbot with FlexGen and OPT models. While FlexGen is mainly optimized for large-batch, throughput-oriented scenarios like dataset evaluation and information extraction, it can also be used for interactive applications like chatbots, with better performance than other offloading-based systems. Note that FlexGen cannot achieve its best throughput in this single-batch case.

### Commands

```bash
# Chat with OPT-6.7B
python3 chatbot.py --model facebook/opt-6.7b

# Chat with OPT-30B
python3 chatbot.py --model facebook/opt-30b --percent 0 100 100 0 100 0
```

### Example output

```
A chat between a curious human and a knowledgeable artificial intelligence assistant.
Human: Hello! What can you do?
Assistant: As an AI assistant, I can answer questions and chat with you.
Human: What is the name of the tallest mountain in the world?
Assistant: Everest.
Human: I am planning a trip for our anniversary. What things can we do?
Assistant: Well, there are a number of things you can do for your anniversary. First, you can play cards. Second, you can go for a hike. Third, you can go to a museum.
```

## Roadmap

We plan to work on the following features. Community contributions are welcome.

- [ ] Support Apple silicon M1/M2 deployment
- [ ] Support Colab deployment
- [ ] Optimize the latency of the chatbot application
- [ ] Add a text summarization application
- [ ] Support more models (BLOOM, CodeGen, OPT-IML)
- [ ] Release the cost model and policy optimizer
- [ ] Release a pip-installable package

## Acknowledgement

This is a research project developed by HazyResearch@Stanford, DS3Lab@ETH Zurich, CRFM@Stanford, SkyComputing@UC Berkeley, and TogetherCompute.