# FlexGen

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 or a 24GB RTX 3090 gaming card!).
Large language models (LLMs) are at the heart of applications like ChatGPT and Copilot, but the high computational and memory requirements of LLM inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference down to a single commodity GPU (e.g., T4, 3090) and allow flexible deployment for various hardware setups.

The key features of FlexGen include:

- **Lightning-Fast Offloading.** Up to 100x faster than other offloading-based systems for running 175B models on a single GPU.
- **Extreme Compression.** Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
- **Scalability.** Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are given.

| Read Paper | Join Discord |

## Content

- Benchmark Results
- Install
- Get Started with a Single GPU
- Run Chatbot with OPT models up to 175B on a Single GPU
- Scaling to Distributed GPUs
- Roadmap

## Benchmark Results

### Generation Throughput (token/s)

| System | OPT-6.7B | OPT-30B | OPT-175B |
| --- | --- | --- | --- |
| Huggingface Accelerate | 25.12 | 0.62 | 0.01 |
| DeepSpeed ZeRO-Inference | 9.28 | 0.60 | 0.01 |
| Petals* | - | - | 0.05 |
| FlexGen | 25.26 | 7.32 | 0.69 |
| FlexGen with Compression | 29.12 | 8.38 | 1.12 |

- Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
- Workload: input sequence length = 512, output sequence length = 32. The batch size is tuned to a value that maximizes the generation throughput for each system.
- Metric: generation throughput (token/s) = number of generated tokens / (time for processing prompts + time for generation).

How to reproduce.

### Latency-Throughput Trade-off

The figure below shows the latency-throughput trade-off of three offloading-based systems on OPT-175B (left) and OPT-30B (right). FlexGen achieves a new Pareto-optimal frontier with a 100x higher maximum throughput for OPT-175B. Other systems cannot increase throughput further because they run out of memory. "(c)" denotes FlexGen with compression.

*(figure: latency-throughput trade-off on OPT-175B and OPT-30B)*

## How It Works

FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access the tensors, including weights, activations, and the attention key/value (KV) cache. FlexGen further compresses both the weights and the KV cache to 4 bits with negligible accuracy loss.

One key idea of FlexGen is to play the latency-throughput trade-off. Achieving low latency is inherently challenging for offloading methods, but the efficiency of offloading can be greatly boosted in throughput-oriented scenarios (see the figure above). FlexGen uses a block schedule to reuse weights and overlap I/O with computation, as shown in figure (b) below, while other baseline systems use an inefficient row-by-row schedule, as shown in figure (a) below.

*(figure: (a) row-by-row schedule vs. (b) block schedule)*

More details can be found in our paper.
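As a toy illustration of the block schedule described above (this is not FlexGen's actual code, and the sizes are made up), the sketch below counts how many times layer weights must be fetched from CPU or disk when micro-batches are generated one at a time (row-by-row) versus when a loaded layer is reused across a block of micro-batches:

```python
# Toy illustration of the block schedule idea (not FlexGen's actual code).

def row_by_row_fetches(num_layers: int, num_batches: int) -> int:
    """Weight fetches when each micro-batch is generated end-to-end before the next."""
    fetches = 0
    for _ in range(num_batches):      # rows: one micro-batch at a time
        for _ in range(num_layers):   # every layer is re-fetched for every micro-batch
            fetches += 1
    return fetches


def block_schedule_fetches(num_layers: int, num_batches: int, block_size: int) -> int:
    """Weight fetches when a loaded layer is reused across a block of micro-batches."""
    fetches = 0
    for _ in range(0, num_batches, block_size):  # iterate over blocks of micro-batches
        for _ in range(num_layers):              # one fetch serves block_size micro-batches
            fetches += 1                         # (the next layer's fetch can overlap compute)
    return fetches


if __name__ == "__main__":
    layers, batches, block = 96, 64, 8           # hypothetical sizes, chosen for illustration
    print("row-by-row fetches:", row_by_row_fetches(layers, batches))             # 6144
    print("block schedule    :", block_schedule_fetches(layers, batches, block))  # 768
```

With 96 layers, 64 micro-batches, and a block size of 8, the block schedule needs 8x fewer weight transfers for the same amount of work; this weight reuse, together with overlapping I/O and compute, is what lets a throughput-oriented schedule amortize slow CPU/disk transfers at the cost of per-token latency.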
## Install

Requirements:

```
torch>=1.12
transformers>=4.24
```

Instructions:

```bash
git clone https://github.com/Ying1123/FlexGen.git
cd FlexGen
pip3 install -e .

# (Optional) Install openmpi for multi-gpu execution
# sudo apt install openmpi-bin
```

## Get Started with a Single GPU

### OPT-1.3B

To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU, so no offloading is required. FlexGen will automatically download the weights from Hugging Face.

```bash
python3 -m flexgen.flex_opt --model facebook/opt-1.3b
```

### OPT-30B

To run large models like OPT-30B, you will need to use CPU offloading. You can try the command below. The `--percent` arguments specify the offloading strategy for the parameters, attention cache, and hidden states separately.

```bash
python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
```

### OPT-175B

To run OPT-175B, you need to download the weights from metaseq and convert them into Alpa format. You can then try CPU/disk offloading with

```bash
python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 0 0 0 0 --offload-dir YOUR_SSD_FOLDER
```

### How to set the offloading strategy?

We will release an automatic policy optimizer later, but for now you have to manually try a few strategies. The idea of high-throughput generation is to offload the parameters and attention cache to the CPU as much as possible, and to disk if necessary. You can see the reference strategies in our benchmark here.
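To build intuition for the `--percent` flag, here is a small illustrative sketch (not FlexGen's implementation). It assumes, based on the example commands above, that the six numbers are GPU and CPU percentages for the weights, the attention cache, and the hidden states, with any remainder spilling to disk; treat that interpretation as an assumption rather than documented behavior:

```python
# Illustrative sketch only, not FlexGen's code. It interprets the six --percent
# values as (GPU %, CPU %) for weights, attention cache, and hidden states,
# with whatever is left over assumed to spill to disk.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Placement:
    gpu: int
    cpu: int

    @property
    def disk(self) -> int:
        # Assumed convention: anything not placed on GPU or CPU goes to disk.
        return 100 - self.gpu - self.cpu


def parse_percent(values: List[int]) -> Dict[str, Placement]:
    if len(values) != 6:
        raise ValueError("expected 6 numbers: w_gpu w_cpu cache_gpu cache_cpu hidden_gpu hidden_cpu")
    names = ["weights", "attention cache", "hidden states"]
    return {name: Placement(gpu, cpu)
            for name, (gpu, cpu) in zip(names, zip(values[0::2], values[1::2]))}


# OPT-30B example from above: weights fully on CPU, cache and hidden states on GPU.
for name, p in parse_percent([0, 100, 100, 0, 100, 0]).items():
    print(f"{name:16s} GPU {p.gpu:3d}%  CPU {p.cpu:3d}%  disk {p.disk:3d}%")

# OPT-175B example from above: everything spills to the --offload-dir SSD folder.
for name, p in parse_percent([0, 0, 0, 0, 0, 0]).items():
    print(f"{name:16s} GPU {p.gpu:3d}%  CPU {p.cpu:3d}%  disk {p.disk:3d}%")
```

Under this assumed reading, the OPT-30B command keeps the memory-hungry weights on the CPU while keeping the smaller per-token tensors on the GPU, and the OPT-175B command pushes everything to disk, which is why it also needs `--offload-dir`.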
## Scaling to Distributed GPUs

If you have more GPUs, FlexGen can combine offloading with pipeline parallelism to allow scaling. For example, if you have 2 GPUs but their aggregate GPU memory is smaller than the model size, you still need offloading. FlexGen allows you to use pipeline parallelism across these 2 GPUs to accelerate generation. See examples here.

## Run Chatbot with OPT models up to 175B on a Single GPU

chatbot.py shows how to build a chatbot with FlexGen and OPT models. While FlexGen is mainly optimized for large-batch, throughput-oriented scenarios like dataset evaluation and information extraction, it can also be used for interactive applications like chatbots, with better performance than other offloading-based systems. Note that FlexGen cannot achieve its best throughput in this single-batch case.

### Commands

```bash
# Chat with OPT-6.7B
python3 chatbot.py --model facebook/opt-6.7b

# Chat with OPT-30B
python3 chatbot.py --model facebook/opt-30b --percent 0 100 100 0 100 0
```

### Example output

```
A chat between a curious human and a knowledgeable artificial intelligence assistant.
Human: Hello! What can you do?
Assistant: As an AI assistant, I can answer questions and chat with you.
Human: What is the name of the tallest mountain in the world?
Assistant: Everest.
Human: I am planning a trip for our anniversary. What things can we do?
Assistant: Well, there are a number of things you can do for your anniversary. First, you can play cards. Second, you can go for a hike. Third, you can go to a museum.
```

## Roadmap

We plan to work on the following features. Community contributions are welcome.

- [ ] Support Apple silicon M1/M2 deployment
- [ ] Support Colab deployment
- [ ] Optimize the latency of the chatbot application
- [ ] Add a text summarization application
- [ ] Support more models (BLOOM, CodeGen, OPT-IML)
- [ ] Release the cost model and policy optimizer
- [ ] Release a pip-installable package

## Acknowledgement

This is a research project developed by HazyResearch@Stanford, DS3Lab@ETH Zurich, CRFM@Stanford, SkyComputing@UC Berkeley, and TogetherCompute.