https://github.com/S-LoRA/S-LoRA
# S-LoRA: Serving Thousands of Concurrent LoRA Adapters [[paper](https://arxiv.org/abs/2311.03285)]

*(figure: perf)*

## Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries to GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.

*(figure: overview)*

## Requirements

* CUDA 11.8 compatible GPU
  + Recommended: GPUs from the Ampere family, such as the A100, which support bfloat16 operations.
  + Note: Older GPUs from the Turing family, such as the T4, do not support bfloat16 and are not supported.
* 1.13 <= PyTorch <= 2.0.1

## Installation

```bash
conda create -n slora python=3.9
conda activate slora
# Optional: Install CUDA via conda for a smoother installation experience,
# but you may need to manually set the Anaconda path variables.
# conda install cuda -c nvidia/label/cuda-11.8.0
# Set environment variables:
export TORCH_CUDA_ARCH_LIST="8.0 8.6"
pip install torch==2.0.1
pip install -e .
```

Make sure `triton==2.1.0` is installed.

For more details on installing CUDA via conda, refer to the CUDA Installation Guide by NVIDIA.
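Before launching the server, it can help to confirm that the environment matches the requirements above. The snippet below is a minimal, optional sanity check and is not part of the repository; it only assumes that PyTorch (and optionally Triton) can be imported.

```python
# check_env.py -- quick sanity check for the requirements listed above.
# This helper is illustrative only and is not part of the S-LoRA repository.
import torch

def main():
    print(f"PyTorch version : {torch.__version__}")          # expected: 1.13 <= version <= 2.0.1
    print(f"CUDA available  : {torch.cuda.is_available()}")  # must be True
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"GPU             : {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
        print(f"CUDA (torch)    : {torch.version.cuda}")      # expected: 11.8
        # Ampere GPUs (sm_80 / sm_86) are recommended; Turing GPUs such as the T4
        # lack bfloat16 support and are not supported.
        print(f"bfloat16 support: {torch.cuda.is_bf16_supported()}")
    try:
        import triton
        print(f"Triton version  : {triton.__version__}")      # expected: 2.1.0
    except ImportError:
        print("Triton is not installed.")

if __name__ == "__main__":
    main()
```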
## Example Run

Real model weights:

```bash
cd benchmarks
python launch_server.py --num-adapter 100 --num-token 10000 --model-setting Real
python run_exp.py --debug --model-setting Real
```

Dummy weights:

```bash
cd benchmarks
python launch_server.py --num-adapter 100 --num-token 10000 --dummy
python run_exp.py --debug
```

Test:

```bash
cd test/test_e2e
python launch_server.py
python run_exp.py
```

## Methods

* **Unified Paging:** To reduce memory fragmentation and increase batch size, S-LoRA introduces a unified memory pool. This pool manages dynamic adapter weights and KV cache tensors through a unified paging mechanism. (A toy illustration of the idea appears after this list.)

  *(figure: unifiedpaging)*

* **Heterogeneous Batching:** To minimize the latency overhead when batching different adapters of varying ranks, S-LoRA employs highly optimized custom CUDA kernels. These kernels operate directly on non-contiguous memory and align with the memory pool design, facilitating efficient batched inference for the added LoRA computation. (A reference version of this computation is sketched after this list.)

* **S-LoRA TP:** To ensure effective parallelization across multiple GPUs, S-LoRA introduces a novel tensor parallelism strategy. This approach incurs minimal communication cost for the added LoRA computation compared to that of the base model. It is realized by scheduling communications on small intermediate tensors and fusing them with the communications of the base model.

  *(figure: slora_tp)*
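To make the Unified Paging idea above concrete, here is a toy sketch of a single pool of fixed-size pages shared by KV-cache blocks and adapter weights, so memory freed by one kind of tensor can be reused by the other. All names, shapes, and sizes are illustrative; this is not the real S-LoRA memory manager.

```python
# toy_unified_paging.py -- illustrative only; does not correspond to the actual
# S-LoRA implementation.
import torch

class UnifiedPool:
    """A single pool of fixed-size pages shared by KV-cache blocks and
    LoRA adapter weights; freeing either kind of tensor makes room for the other."""

    def __init__(self, num_pages: int, page_size: int, hidden: int, dtype=torch.float16):
        # One big buffer; each "page" holds `page_size` vectors of width `hidden`.
        self.buffer = torch.empty(num_pages, page_size, hidden, dtype=dtype)
        self.free_pages = list(range(num_pages))

    def alloc(self, n_pages: int) -> list:
        if n_pages > len(self.free_pages):
            raise MemoryError("pool exhausted")
        pages, self.free_pages = self.free_pages[:n_pages], self.free_pages[n_pages:]
        return pages

    def free(self, pages: list) -> None:
        self.free_pages.extend(pages)


pool = UnifiedPool(num_pages=1024, page_size=16, hidden=4096)

# KV cache for a request with a growing sequence: allocate pages as tokens arrive.
kv_pages = pool.alloc(n_pages=8)        # room for 8 * 16 = 128 tokens

# Adapter weights occupy a number of pages that depends on their rank:
# with page_size=16, a rank-64 adapter needs 4 pages, a rank-8 adapter just 1.
adapter_pages = pool.alloc(n_pages=4)

# When the request finishes or the adapter is evicted, its pages return to the
# same free list and can be reused for either KV cache or other adapters.
pool.free(kv_pages)
pool.free(adapter_pages)
```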
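The custom CUDA kernels mentioned above live in the repository; purely as a mental model, the naive PyTorch loop below shows the computation they batch: each request in a batch applies its own adapter's low-rank matrices, and per-request ranks may differ. This reference loop is deliberately simple and is not how S-LoRA implements it; all names are made up.

```python
# naive_hetero_lora.py -- slow reference version of the batched LoRA computation;
# the real system replaces this loop with custom CUDA kernels that gather adapter
# weights directly from non-contiguous pages.
import torch

def lora_batch_reference(x, adapters, adapter_ids, scaling=1.0):
    """x: (batch, hidden) activations for one layer.
    adapters: dict mapping adapter id -> (A, B), with A: (r, hidden) and B: (hidden, r);
    the rank r may differ across adapters.
    adapter_ids: list of length batch giving each request's adapter."""
    out = torch.zeros_like(x)
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        # Added LoRA term for this request: scaling * x_i A^T B^T.
        out[i] = scaling * (x[i] @ A.T) @ B.T
    return out

# Example: three requests sharing a batch, using two adapters of ranks 8 and 64.
hidden = 4096
adapters = {
    "adapter_rank8":  (torch.randn(8,  hidden), torch.randn(hidden, 8)),
    "adapter_rank64": (torch.randn(64, hidden), torch.randn(hidden, 64)),
}
x = torch.randn(3, hidden)
delta = lora_batch_reference(x, adapters, ["adapter_rank8", "adapter_rank64", "adapter_rank8"])
# The layer output would be base_output + delta; only the LoRA term is shown here.
```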
## Evaluation

### Settings

Model settings:

| Setting | Base model | Hidden size | Adapter ranks   |
|---------|------------|-------------|-----------------|
| S1      | Llama-7B   | 4096        | {8}             |
| S2      | Llama-7B   | 4096        | {64, 32, 16, 8} |
| S4      | Llama-13B  | 5120        | {64, 32, 16}    |
| S5      | Llama-30B  | 7168        | {32}            |
| S6      | Llama-70B  | 8192        | {64}            |

Baselines:

* **PEFT** stands for HuggingFace PEFT: we build a server on top of it that batches single-adapter requests and switches adapter weights between batches.
* **vLLM-packed**: Because vLLM does not support LoRA, we merge the LoRA weights into the base model and serve the multiple versions of the merged weights separately. To serve m LoRA adapters, we run m vLLM workers on a single GPU, where the workers are separate processes managed by NVIDIA MPS.
* **S-LoRA-no-unify-mem**: S-LoRA without Unified Paging.
* **S-LoRA-bmm**: S-LoRA without Unified Paging and the customized kernels. It copies the adapter weights to contiguous memory space and performs batched matrix multiplication with padding.

Please see our paper for the trace used in the synthetic workloads.

### Results

* We compare S-LoRA with both vLLM-packed and HuggingFace PEFT for serving many LoRA adapters.

  *(figure: vllm_and_peft)*

* We compare S-LoRA with its own variants.

  *(figure: synthetic)*

* We test the scalability of our tensor parallelism strategy.

  *(figure: tp)*

## Acknowledgment

S-LoRA is built on top of LightLLM. We also learned a lot from the following projects when developing S-LoRA:

* punica
* PEFT
* vLLM

## Roadmap

* [ ] Release tensor parallelism implementation
* [ ] Clean up reproducible scripts
* [ ] More user-friendly API/frontend
* [ ] More model support

## Citation

```bibtex
@misc{sheng2023slora,
  title={S-LoRA: Serving Thousands of Concurrent LoRA Adapters},
  author={Ying Sheng and Shiyi Cao and Dacheng Li and Coleman Hooper and Nicholas Lee and Shuo Yang and Christopher Chou and Banghua Zhu and Lianmin Zheng and Kurt Keutzer and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2311.03285},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```