https://github.com/FMInference/FlexGen

# FlexGen

Running large language models on a single GPU for throughput-oriented scenarios (Apache-2.0 license).

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

## Throughput-Oriented Inference for Large Language Models

In recent years, large language models (LLMs) have shown great performance across a wide range of tasks. Increasingly, LLMs have been applied not only to interactive applications (such as chat), but also to many "back-of-house" tasks, including benchmarking, information extraction, data wrangling, and form processing. One key characteristic of these applications is that they are throughput-oriented: they require running LLM inference over millions of tokens in batches, e.g., all the private documents in a company's corpus, or all the tasks in the HELM benchmark. These workloads are less sensitive to latency - the user starts a job and lets it run overnight - but increasing throughput is critical for reducing costs. Throughput here means the number of tokens processed per second over the job's entire runtime (which can be hours). Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which makes it easier to take advantage of low-cost commodity GPUs.

The goal of FlexGen is to create a high-throughput system that enables new and exciting applications of foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU instead of expensive systems. Check out the examples of what you can run on a single commodity GPU with FlexGen, including benchmarking and data wrangling.

**Limitation.** As an offloading-based system running on weak GPUs, FlexGen also has its limitations. FlexGen can be significantly slower than setups with enough powerful GPUs to hold the whole model, especially for small-batch cases. FlexGen is mostly optimized for throughput-oriented batch processing settings (e.g., classifying or extracting information from many documents in batches) on a single GPU.

---

This project was made possible thanks to a collaboration with: (collaborator logos)

---

## Install

Requirements:

* PyTorch >= 1.12 (Help)

### Method 1: With pip

```bash
pip install flexgen
```

### Method 2: From source

```bash
git clone https://github.com/FMInference/FlexGen.git
cd FlexGen
pip install -e .
```
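Before the full examples below, here is a minimal single-GPU sketch of how the pieces fit together. It assumes the `flexgen.flex_opt` command-line entry point and a particular reading of the `--percent` flag (six numbers giving the GPU/CPU placement percentages for weights, KV cache, and activations, with the remainder going to disk); neither detail is spelled out in this README excerpt, so verify both against the full documentation before relying on them. The other flags (`--model`, `--gpu-batch-size`, `--num-gpu-batches`) are taken from the HELM command in the Examples section below.

```bash
# Hedged sketch: the entry point (flexgen.flex_opt) and the exact meaning of
# --percent are assumptions, not confirmed by this README excerpt.
# --percent W_GPU W_CPU CACHE_GPU CACHE_CPU ACT_GPU ACT_CPU  (rest -> disk)

# OPT-6.7B kept entirely in GPU memory, small batch (matches the "2 on GPU"
# configuration in the benchmark table below):
python3 -m flexgen.flex_opt --model facebook/opt-6.7b \
  --percent 100 0 100 0 100 0 --gpu-batch-size 2

# OPT-30B with 80% of the weights and all of the KV cache offloaded to CPU,
# reusing the flags from the HELM command below:
python3 -m flexgen.flex_opt --model facebook/opt-30b \
  --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3
```

Larger `--gpu-batch-size` and `--num-gpu-batches` values raise the effective batch size, which is the main lever FlexGen uses to trade latency for throughput. As a rough sanity check on the second command: OPT-30B in fp16 is about 60 GB of weights, so a 20/80 split keeps roughly 12 GB of weights on a 16 GB GPU and about 48 GB in CPU RAM, consistent with the T4 plus 200 GB DRAM setup used for the HELM example.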
## Examples

### HELM Benchmark

FlexGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the command below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.

```bash
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
```

Note that only a subset of HELM scenarios is tested. See more tested scenarios here.

### Data Wrangling

You can run the examples from the paper "Can Foundation Models Wrangle Your Data?" by following the instructions here.

## Performance Benchmark

### Generation Throughput (token/s)

The corresponding effective batch sizes and lowest offloading devices are in parentheses. Please see here for more details.

| System | OPT-6.7B | OPT-30B | OPT-175B |
| --- | --- | --- | --- |
| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
| DeepSpeed ZeRO-Inference | 9.28 (16 on CPU) | 0.60 (4 on CPU) | 0.01 (1 on disk) |
| Petals | 8.25 (2 on GPU) | 2.84 (2 on GPU) | 0.08 (2 on GPU) |
| FlexGen | 25.26 (2 on GPU) | 7.32 (144 on CPU) | 0.69 (256 on disk) |
| FlexGen with Compression | 29.12 (72 on GPU) | 8.38 (512 on CPU) | 1.12 (144 on CPU) |

* Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
* Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to a large value that maximizes the generation throughput for each system.
* Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation).

How to reproduce.
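To make the metric concrete, the FlexGen OPT-30B entry above (7.32 token/s at an effective batch size of 144, with the 32-token outputs from the workload description) can be read backwards into wall-clock time. A small illustrative sketch, using only the numbers reported above:

```python
# Illustrative reading of the throughput metric above; numbers come from the
# OPT-30B column (FlexGen: 7.32 token/s, effective batch size 144) and the
# workload description (output sequence length = 32).

def generation_throughput(num_sequences: int, output_len: int,
                          prompt_seconds: float, gen_seconds: float) -> float:
    """token/s = generated tokens / (prompt processing time + generation time)."""
    return (num_sequences * output_len) / (prompt_seconds + gen_seconds)

# Reading the benchmark backwards: 144 sequences x 32 tokens = 4608 generated
# tokens, so 7.32 token/s implies roughly 4608 / 7.32 ~= 630 seconds of total
# prompt + generation time per effective batch.
print(144 * 32 / 7.32)  # ~629.5 seconds
```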
## Roadmap

We plan to work on the following features.

* [ ] Optimize the performance for multiple GPUs on the same machine
* [ ] Support more models (BLOOM, CodeGen, GLM)
* [ ] Release the cost model and policy optimizer
* [ ] MacBook support (M1 and M2)
* [ ] AMD support