https://github.com/S-LoRA/S-LoRA
# S-LoRA: Serving Thousands of Concurrent LoRA Adapters [[paper](https://arxiv.org/abs/2311.03285)]

*(figure: perf)*

## Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries to GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.

*(figure: overview)*

## Requirements

* CUDA 11.8 compatible GPU
  + Recommended: GPUs from the Ampere family, such as the A100, which support bfloat16 operations.
  + Note: Older GPUs from the Turing family, such as the T4, do not support bfloat16 and are not supported.
* 1.13 <= PyTorch <= 2.0.1

## Installation

```bash
conda create -n slora python=3.9
conda activate slora
# Optional: Install CUDA via conda for a smoother installation experience,
# but you may need to manually set the Anaconda path variables.
# conda install cuda -c nvidia/label/cuda-11.8.0
# Set environment variables:
export TORCH_CUDA_ARCH_LIST="8.0 8.6"
pip install torch==2.0.1
pip install -e .
```

Make sure `triton==2.1.0` is installed.

For more details on installing CUDA via conda, refer to the CUDA Installation Guide by NVIDIA.
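Before launching the server, it can help to confirm that the environment matches the requirements above. The snippet below is a minimal, optional sanity check and is not part of the repository; it only assumes that PyTorch (and optionally Triton) can be imported.

```python
# check_env.py -- quick sanity check for the requirements listed above.
# This helper is illustrative only and is not part of the S-LoRA repository.
import torch

def main():
    print(f"PyTorch version : {torch.__version__}")          # expected: 1.13 <= version <= 2.0.1
    print(f"CUDA available  : {torch.cuda.is_available()}")  # must be True
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"GPU             : {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
        print(f"CUDA (torch)    : {torch.version.cuda}")      # expected: 11.8
        # Ampere GPUs (sm_80 / sm_86) are recommended; Turing GPUs such as the T4
        # lack bfloat16 support and are not supported.
        print(f"bfloat16 support: {torch.cuda.is_bf16_supported()}")
    try:
        import triton
        print(f"Triton version  : {triton.__version__}")      # expected: 2.1.0
    except ImportError:
        print("Triton is not installed.")

if __name__ == "__main__":
    main()
```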
## Example Run

Real model weights:

```bash
cd benchmarks
python launch_server.py --num-adapter 100 --num-token 10000 --model-setting Real
python run_exp.py --debug --model-setting Real
```

Dummy weights:

```bash
cd benchmarks
python launch_server.py --num-adapter 100 --num-token 10000 --dummy
python run_exp.py --debug
```

Test:

```bash
cd test/test_e2e
python launch_server.py
python run_exp.py
```

## Methods

* **Unified Paging:** To reduce memory fragmentation and increase batch size, S-LoRA introduces a unified memory pool. This pool manages dynamic adapter weights and KV cache tensors through a unified paging mechanism. (A toy illustration of the idea appears after this list.)

  *(figure: unifiedpaging)*

* **Heterogeneous Batching:** To minimize the latency overhead when batching different adapters of varying ranks, S-LoRA employs highly optimized custom CUDA kernels. These kernels operate directly on non-contiguous memory and align with the memory pool design, facilitating efficient batched inference for the added LoRA computation. (A reference version of this computation is sketched after this list.)

* **S-LoRA TP:** To ensure effective parallelization across multiple GPUs, S-LoRA introduces a novel tensor parallelism strategy. This approach incurs minimal communication cost for the added LoRA computation compared to that of the base model. It is realized by scheduling communications on small intermediate tensors and fusing them with the communications of the base model.

  *(figure: slora_tp)*
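To make the Unified Paging idea above concrete, here is a toy sketch of a single pool of fixed-size pages shared by KV-cache blocks and adapter weights, so memory freed by one kind of tensor can be reused by the other. All names, shapes, and sizes are illustrative; this is not the real S-LoRA memory manager.

```python
# toy_unified_paging.py -- illustrative only; does not correspond to the actual
# S-LoRA implementation.
import torch

class UnifiedPool:
    """A single pool of fixed-size pages shared by KV-cache blocks and
    LoRA adapter weights; freeing either kind of tensor makes room for the other."""

    def __init__(self, num_pages: int, page_size: int, hidden: int, dtype=torch.float16):
        # One big buffer; each "page" holds `page_size` vectors of width `hidden`.
        self.buffer = torch.empty(num_pages, page_size, hidden, dtype=dtype)
        self.free_pages = list(range(num_pages))

    def alloc(self, n_pages: int) -> list:
        if n_pages > len(self.free_pages):
            raise MemoryError("pool exhausted")
        pages, self.free_pages = self.free_pages[:n_pages], self.free_pages[n_pages:]
        return pages

    def free(self, pages: list) -> None:
        self.free_pages.extend(pages)


pool = UnifiedPool(num_pages=1024, page_size=16, hidden=4096)

# KV cache for a request with a growing sequence: allocate pages as tokens arrive.
kv_pages = pool.alloc(n_pages=8)        # room for 8 * 16 = 128 tokens

# Adapter weights occupy a number of pages that depends on their rank:
# with page_size=16, a rank-64 adapter needs 4 pages, a rank-8 adapter just 1.
adapter_pages = pool.alloc(n_pages=4)

# When the request finishes or the adapter is evicted, its pages return to the
# same free list and can be reused for either KV cache or other adapters.
pool.free(kv_pages)
pool.free(adapter_pages)
```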
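The custom CUDA kernels mentioned above live in the repository; purely as a mental model, the naive PyTorch loop below shows the computation they batch: each request in a batch applies its own adapter's low-rank matrices, and per-request ranks may differ. This reference loop is deliberately simple and is not how S-LoRA implements it; all names are made up.

```python
# naive_hetero_lora.py -- slow reference version of the batched LoRA computation;
# the real system replaces this loop with custom CUDA kernels that gather adapter
# weights directly from non-contiguous pages.
import torch

def lora_batch_reference(x, adapters, adapter_ids, scaling=1.0):
    """x: (batch, hidden) activations for one layer.
    adapters: dict mapping adapter id -> (A, B), with A: (r, hidden) and B: (hidden, r);
    the rank r may differ across adapters.
    adapter_ids: list of length batch giving each request's adapter."""
    out = torch.zeros_like(x)
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        # Added LoRA term for this request: scaling * x_i A^T B^T.
        out[i] = scaling * (x[i] @ A.T) @ B.T
    return out

# Example: three requests sharing a batch, using two adapters of ranks 8 and 64.
hidden = 4096
adapters = {
    "adapter_rank8":  (torch.randn(8,  hidden), torch.randn(hidden, 8)),
    "adapter_rank64": (torch.randn(64, hidden), torch.randn(hidden, 64)),
}
x = torch.randn(3, hidden)
delta = lora_batch_reference(x, adapters, ["adapter_rank8", "adapter_rank64", "adapter_rank8"])
# The layer output would be base_output + delta; only the LoRA term is shown here.
```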
## Evaluation

### Settings

Model settings:

| Setting | Base model | Hidden size | Adapter ranks   |
|---------|------------|-------------|-----------------|
| S1      | Llama-7B   | 4096        | {8}             |
| S2      | Llama-7B   | 4096        | {64, 32, 16, 8} |
| S4      | Llama-13B  | 5120        | {64, 32, 16}    |
| S5      | Llama-30B  | 7168        | {32}            |
| S6      | Llama-70B  | 8192        | {64}            |

Baselines:

* **PEFT** stands for HuggingFace PEFT: we build a server on top of it that batches single-adapter requests and switches adapter weights between batches.
* **vLLM-packed**: Because vLLM does not support LoRA, we merge the LoRA weights into the base model and serve the multiple versions of the merged weights separately. To serve m LoRA adapters, we run m vLLM workers on a single GPU, where the workers are separate processes managed by NVIDIA MPS.
* **S-LoRA-no-unify-mem**: S-LoRA without Unified Paging.
* **S-LoRA-bmm**: S-LoRA without Unified Paging and the customized kernels. It copies the adapter weights to contiguous memory space and performs batched matrix multiplication with padding.

Please see our paper for the trace used in the synthetic workloads.

### Results

* We compare S-LoRA with both vLLM-packed and HuggingFace PEFT for serving many LoRA adapters.

  *(figure: vllm_and_peft)*

* We compare S-LoRA with its own variants.

  *(figure: synthetic)*

* We test the scalability of our tensor parallelism strategy.

  *(figure: tp)*

## Acknowledgment

S-LoRA is built on top of LightLLM. We also learned a lot from the following projects when developing S-LoRA:

* punica
* PEFT
* vLLM

## Roadmap

* [ ] Release tensor parallelism implementation
* [ ] Clean up reproducible scripts
* [ ] More user-friendly API/frontend
* [ ] More model support

## Citation

```bibtex
@misc{sheng2023slora,
  title={S-LoRA: Serving Thousands of Concurrent LoRA Adapters},
  author={Ying Sheng and Shiyi Cao and Dacheng Li and Coleman Hooper and Nicholas Lee and Shuo Yang and Christopher Chou and Banghua Zhu and Lianmin Zheng and Kurt Keutzer and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2311.03285},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```