https://github.com/SJTU-IPADS/PowerInfer

SJTU-IPADS/PowerInfer: High-speed Large Language Model Serving on PCs with Consumer-grade GPUs (MIT License)
# PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

## TL;DR

PowerInfer is a CPU/GPU LLM inference engine that leverages activation locality on your device.

## Demo

powerinfer-live-demo.mp4

PowerInfer vs. llama.cpp on a single RTX 4090 (24G) running Falcon(ReLU)-40B-FP16 with an 11x speedup! Both PowerInfer and llama.cpp were running on the same hardware and fully utilized the VRAM of the RTX 4090.

## Abstract

We introduce PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key idea underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.
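To make the hot/cold split concrete, here is a minimal, illustrative Python sketch (not PowerInfer's actual code) that partitions one FFN layer's neurons by a hypothetical profiled activation frequency: the most frequently activated neurons are pinned to the GPU, and the long tail stays on the CPU. All names and numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical profiled activation frequencies for one FFN layer:
# activation_freq[i] = fraction of tokens on which neuron i fired (ReLU output > 0).
rng = np.random.default_rng(0)
activation_freq = rng.power(0.3, size=11008)  # power-law-like: a few hot neurons, many cold ones

vram_neuron_budget = 2048  # how many neurons we can afford to keep resident on the GPU (example value)

# "Hot" neurons are the most frequently activated ones and are preloaded onto the GPU;
# the remaining "cold" neurons are computed on the CPU on demand.
order = np.argsort(-activation_freq)
hot_ids = order[:vram_neuron_budget]
cold_ids = order[vram_neuron_budget:]

hot_coverage = activation_freq[hot_ids].sum() / activation_freq.sum()
print(f"{len(hot_ids)} hot neurons cover ~{hot_coverage:.0%} of activation mass; "
      f"{len(cold_ids)} cold neurons stay on the CPU")
```

In PowerInfer itself this split is driven by the profiled activation statistics shipped with each model and by the GPU index generated at runtime; the sketch only illustrates the underlying locality argument.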
## Features

PowerInfer is a high-speed and easy-to-use inference engine for deploying LLMs locally.

PowerInfer is fast with:

* **Locality-centric design**: Uses sparse activation and the 'hot'/'cold' neuron concept for efficient LLM inference, ensuring high speed with lower resource demands.
* **Hybrid CPU/GPU utilization**: Seamlessly integrates the memory and computation capabilities of the CPU and GPU for a balanced workload and faster processing.

PowerInfer is flexible and easy to use with:

* **Easy integration**: Compatible with popular ReLU-sparse models.
* **Local deployment ease**: Designed and deeply optimized for local deployment on consumer-grade hardware, enabling low-latency LLM inference and serving on a single GPU.
* **Backward compatibility**: While distinct from llama.cpp, you can use most of `examples/` the same way as in llama.cpp, such as server and batched generation. PowerInfer also supports inference with llama.cpp's model weights for compatibility purposes, but there will be no performance gain.

You can use these models with PowerInfer today:

* Falcon-40B
* Llama2 family

We have tested PowerInfer on the following platforms:

* x86-64 CPU (with AVX2 instructions) on Linux
* x86-64 CPU and NVIDIA GPU on Linux
* Apple M chips on macOS (as we do not optimize for Mac, the performance improvement is not significant for now)

And new features coming soon:

* Mistral-7B model
* Metal backend for sparse inference on macOS

## Getting Started

* Installation
* Model Weights

## Setup and Installation

### Get the Code

```bash
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt  # install Python helpers' dependencies
```

### Build

To build PowerInfer you have two options. These commands should be run from the root directory of the project.

Using CMake (3.13+) on Linux or macOS:

* If you have an NVIDIA GPU:

```bash
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```

* If you have just a CPU:

```bash
cmake -S . -B build
cmake --build build --config Release
```

## Model Weights

PowerInfer models are stored in a special format called PowerInfer GGUF, based on the GGUF format, consisting of both LLM weights and predictor weights.

### Download PowerInfer GGUF via Hugging Face

You can obtain PowerInfer GGUF weights at `*.powerinfer.gguf`, as well as profiled model activation statistics for 'hot'-neuron offloading, from each Hugging Face repo below.

| Base Model | PowerInfer GGUF |
|---|---|
| LLaMA(ReLU)-2-7B | PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF |
| LLaMA(ReLU)-2-13B | PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF |
| Falcon(ReLU)-40B | PowerInfer/ReluFalcon-40B-PowerInfer-GGUF |
| LLaMA(ReLU)-2-70B | PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF |

We suggest downloading/cloning the whole repo so PowerInfer can automatically make use of this directory structure for feature-complete model offloading:

```
.
+-- *.powerinfer.gguf (Unquantized PowerInfer model)
+-- *.q4.powerinfer.gguf (INT4 quantized PowerInfer model, if available)
+-- activation (Profiled activation statistics for fine-grained FFN offloading)
|   +-- activation_x.pt (Profiled activation statistics for layer x)
|   +-- ...
+-- *.[q4].powerinfer.gguf.generated.gpuidx (Generated GPU index at runtime for the corresponding model)
```
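If it helps, the snippet below sketches one way to fetch a whole repo with `huggingface_hub`'s `snapshot_download` so the directory layout above is preserved (an illustrative example, not part of PowerInfer; the repo id and local directory are sample values, and a plain `git clone` of the Hugging Face repo works as well).

```python
from huggingface_hub import snapshot_download

# Download the entire ReluLLaMA-7B PowerInfer GGUF repo, including the
# activation/ statistics, into a local directory with the expected layout.
snapshot_download(
    repo_id="PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF",
    local_dir="./ReluLLaMA-7B-PowerInfer-GGUF",
)
```

Keeping the repo intact preserves the `activation/` directory that PowerInfer uses for fine-grained FFN offloading.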
### Convert from Original Model Weights + Predictor Weights

Hugging Face limits single model weight files to 50 GiB. For unquantized models >= 40B, you can convert PowerInfer GGUF from the original model weights and predictor weights obtained from Hugging Face.

| Base Model | Original Model | Predictor |
|---|---|---|
| LLaMA(ReLU)-2-7B | SparseLLM/ReluLLaMA-7B | PowerInfer/ReluLLaMA-7B-Predictor |
| LLaMA(ReLU)-2-13B | SparseLLM/ReluLLaMA-13B | PowerInfer/ReluLLaMA-13B-Predictor |
| Falcon(ReLU)-40B | SparseLLM/ReluFalcon-40B | PowerInfer/ReluFalcon-40B-Predictor |
| LLaMA(ReLU)-2-70B | SparseLLM/ReluLLaMA-70B | PowerInfer/ReluLLaMA-70B-Predictor |

You can use the following command to convert the original model weights and predictor weights to PowerInfer GGUF:

```bash
# make sure that you have done `pip install -r requirements.txt`
python convert.py --outfile /PATH/TO/POWERINFER/GGUF/REPO/MODELNAME.powerinfer.gguf /PATH/TO/ORIGINAL/MODEL /PATH/TO/PREDICTOR
# python convert.py --outfile ./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.powerinfer.gguf ./SparseLLM/ReluLLaMA-70B ./PowerInfer/ReluLLaMA-70B-Predictor
```

For the same reason, we suggest keeping the same directory structure as the PowerInfer GGUF repos after conversion.

## Inference

For CPU-only and CPU-GPU hybrid inference with all available VRAM, you can use the following instructions to run PowerInfer:

```bash
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
```

If you want to limit the VRAM usage of the GPU:

```bash
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
# ./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8
```

Under CPU-GPU hybrid inference, PowerInfer will automatically offload all dense activation blocks to the GPU and split the FFN onto the GPU if possible.
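For intuition about how far a VRAM budget goes, here is a rough, hypothetical back-of-envelope estimate (not PowerInfer's actual offloading logic): it counts only FP16 FFN weights, uses LLaMA-7B-like dimensions as example values, and ignores attention weights, the KV cache, and activations that also occupy VRAM.

```python
# Rough estimate of how many FP16 FFN neurons fit in a given VRAM budget.
# Purely illustrative; PowerInfer decides per layer from profiled activation
# statistics when it builds the GPU index.

vram_budget_gib = 8      # e.g. --vram-budget 8
hidden_dim      = 4096   # LLaMA-7B-like hidden size (example value)
n_layers        = 32     # LLaMA-7B-like layer count (example value)
bytes_per_param = 2      # FP16

# In a gated FFN (as in LLaMA), each FFN "neuron" owns roughly three weight
# vectors of length hidden_dim (rows of the up/gate projections, a column of down).
bytes_per_neuron = 3 * hidden_dim * bytes_per_param

budget_bytes = vram_budget_gib * 1024**3
neurons_total = budget_bytes // bytes_per_neuron
neurons_per_layer = neurons_total // n_layers

print(f"~{neurons_total:,} FFN neurons (~{neurons_per_layer:,} per layer) "
      f"fit in {vram_budget_gib} GiB at FP16")
```

In practice, simply tune `--vram-budget` empirically for your GPU; as noted in the FAQs below, lowering it slightly is also the first thing to try if you hit out-of-memory errors.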
## Quantization

PowerInfer has optimized quantization support for INT4 (Q4_0) models. You can use the following instructions to quantize a PowerInfer GGUF model:

```bash
./build/bin/quantize /PATH/TO/MODEL /PATH/TO/OUTPUT/QUANTIZED/MODEL Q4_0
# ./build/bin/quantize ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf Q4_0
```

Then you can use the quantized model for inference with PowerInfer, following the same instructions as above.

## Evaluation

*(Evaluation figures: github-eval-4090, github-eval-2080ti-q4)*

PowerInfer achieves up to 11x and 8x speedup for FP16 and INT4 models, respectively!

## FAQs

1. What if I encounter CUDA_ERROR_OUT_OF_MEMORY?
   * You can try to run with the `--reset-gpu-index` argument to rebuild the GPU index for this model and avoid any stale cache.
   * Due to our current implementation, model offloading might not be as accurate as expected. You can try `--vram-budget` with a slightly lower value, or `--disable-gpu-index` to disable FFN offloading.
2. What if...
   * Issues are welcome! Please feel free to open an issue and attach your running environment and running parameters. We will try our best to help you.

## TODOs

We will release the code and data in the following order, please stay tuned!

- [x] Release core code of PowerInfer, supporting Llama-2, Falcon-40B.
- [ ] Support Mistral-7B
- [ ] Support Windows
- [ ] Support text-generation-webui
- [ ] Release perplexity evaluation code
- [ ] Support Metal for Mac
- [ ] Release code for OPT models
- [ ] Release predictor training code
- [x] Support online split for FFN network
- [ ] Support Multi-GPU

## Paper and Citation

More technical details can be found in our paper.

If you find PowerInfer useful or relevant to your project and research, please kindly cite our paper:

```bibtex
@techreport{song2023powerinfer,
  author      = {Yixin Song and Zeyu Mi and Haotong Xie and Haibo Chen},
  title       = {PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU},
  institution = {Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University},
  year        = {2023}
}
```

## Acknowledgement

We are thankful for the easily modifiable operator library ggml and the execution runtime provided by llama.cpp. We also extend our gratitude to THUNLP for their support of ReLU-based sparse models. We also appreciate the research of Deja Vu, which inspires PowerInfer.