https://github.com/FMInference/FlexGen

# FlexGen

Running large language models on a single GPU for throughput-oriented scenarios (Apache-2.0 license).

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

## Throughput-Oriented Inference for Large Language Models

In recent years, large language models (LLMs) have shown great performance across a wide range of tasks. Increasingly, LLMs have been applied not only to interactive applications (such as chat), but also to many "back-of-house" tasks, including benchmarking, information extraction, data wrangling, and form processing. One key characteristic of these applications is that they are throughput-oriented: they require running LLM inference over millions of tokens in batches, e.g., all the private documents in a company's corpus, or all the tasks in the HELM benchmark. These workloads are less sensitive to latency - the user starts a job and lets it run overnight - but increasing throughput is critical for reducing costs. Throughput here means the number of tokens processed per second over the job's entire runtime (which can be hours). Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which makes it easier to take advantage of low-cost commodity GPUs.

The goal of FlexGen is to create a high-throughput system that enables new and exciting applications of foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU instead of expensive systems. Check out the examples of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.

**Limitation.** As an offloading-based system running on weak GPUs, FlexGen also has its limitations. FlexGen can be significantly slower than setups with enough powerful GPUs to hold the whole model, especially for small-batch cases. FlexGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches) on a single GPU.

---

This project was made possible thanks to a collaboration with: (collaborator logos)

---

## Install

Requirements:

* PyTorch >= 1.12 (Help)

### Method 1: With pip

```bash
pip install flexgen
```

### Method 2: From source

```bash
git clone https://github.com/FMInference/FlexGen.git
cd FlexGen
pip install -e .
```
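Before the full examples below, here is a minimal single-GPU sketch of how the pieces fit together. It assumes the `flexgen.flex_opt` command-line entry point and a particular reading of the `--percent` flag (six numbers giving the GPU/CPU placement percentages for weights, KV cache, and activations, with the remainder going to disk); neither detail is spelled out in this README excerpt, so verify both against the full documentation before relying on them. The other flags (`--model`, `--gpu-batch-size`, `--num-gpu-batches`) are taken from the HELM command in the Examples section below.

```bash
# Hedged sketch: the entry point (flexgen.flex_opt) and the exact meaning of
# --percent are assumptions, not confirmed by this README excerpt.
# --percent W_GPU W_CPU CACHE_GPU CACHE_CPU ACT_GPU ACT_CPU  (rest -> disk)

# OPT-6.7B kept entirely in GPU memory, small batch (matches the "2 on GPU"
# configuration in the benchmark table below):
python3 -m flexgen.flex_opt --model facebook/opt-6.7b \
  --percent 100 0 100 0 100 0 --gpu-batch-size 2

# OPT-30B with 80% of the weights and all of the KV cache offloaded to CPU,
# reusing the flags from the HELM command below:
python3 -m flexgen.flex_opt --model facebook/opt-30b \
  --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3
```

Larger `--gpu-batch-size` and `--num-gpu-batches` values raise the effective batch size, which is the main lever FlexGen uses to trade latency for throughput. As a rough sanity check on the second command: OPT-30B in fp16 is about 60 GB of weights, so a 20/80 split keeps roughly 12 GB of weights on a 16 GB GPU and about 48 GB in CPU RAM, consistent with the T4 plus 200 GB DRAM setup used for the HELM example.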
## Examples

### HELM Benchmark

FlexGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the command below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.

```bash
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
```

Note that only a subset of HELM scenarios is tested. See more tested scenarios here.

### Data Wrangling

You can run the examples from the paper "Can Foundation Models Wrangle Your Data?" by following the instructions here.

## Performance Benchmark

### Generation Throughput (token/s)

The corresponding effective batch sizes and lowest offloading devices are in parentheses. Please see here for more details.

| System | OPT-6.7B | OPT-30B | OPT-175B |
| --- | --- | --- | --- |
| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
| DeepSpeed ZeRO-Inference | 9.28 (16 on CPU) | 0.60 (4 on CPU) | 0.01 (1 on disk) |
| Petals | 8.25 (2 on GPU) | 2.84 (2 on GPU) | 0.08 (2 on GPU) |
| FlexGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
| FlexGen with Compression | 29.12 (72 on GPU) | 8.38 (512 on CPU) | 1.12 (144 on CPU) |

* Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
* Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to a large value that maximizes the generation throughput for each system.
* Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation).

How to reproduce.
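To make the metric concrete, the FlexGen OPT-30B entry above (7.32 token/s at an effective batch size of 144, with the 32-token outputs from the workload description) can be read backwards into wall-clock time. A small illustrative sketch, using only the numbers reported above:

```python
# Illustrative reading of the throughput metric above; numbers come from the
# OPT-30B column (FlexGen: 7.32 token/s, effective batch size 144) and the
# workload description (output sequence length = 32).

def generation_throughput(num_sequences: int, output_len: int,
                          prompt_seconds: float, gen_seconds: float) -> float:
    """token/s = generated tokens / (prompt processing time + generation time)."""
    return (num_sequences * output_len) / (prompt_seconds + gen_seconds)

# Reading the benchmark backwards: 144 sequences x 32 tokens = 4608 generated
# tokens, so 7.32 token/s implies roughly 4608 / 7.32 ~= 630 seconds of total
# prompt + generation time per effective batch.
print(144 * 32 / 7.32)  # ~629.5 seconds
```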
## Roadmap

We plan to work on the following features.

* [ ] Optimize the performance for multiple GPUs on the same machine
* [ ] Support more models (BLOOM, CodeGen, GLM)
* [ ] Release the cost model and policy optimizer
* [ ] MacBook support (M1 and M2)
* [ ] AMD support