https://github.com/SJTU-IPADS/PowerInfer

SJTU-IPADS/PowerInfer: High-speed Large Language Model Serving on PCs with Consumer-grade GPUs (MIT License)
# PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

## TL;DR

PowerInfer is a CPU/GPU LLM inference engine that leverages activation locality on your device.

## Demo

powerinfer-live-demo.mp4

PowerInfer vs. llama.cpp on a single RTX 4090 (24G) running Falcon(ReLU)-40B-FP16 with an 11x speedup! Both PowerInfer and llama.cpp were running on the same hardware and fully utilized the VRAM of the RTX 4090.

## Abstract

We introduce PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key idea underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.
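To make the hot/cold split concrete, here is a minimal, illustrative Python sketch (not PowerInfer's actual code) that partitions one FFN layer's neurons by a hypothetical profiled activation frequency: the most frequently activated neurons are pinned to the GPU, and the long tail stays on the CPU. All names and numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical profiled activation frequencies for one FFN layer:
# activation_freq[i] = fraction of tokens on which neuron i fired (ReLU output > 0).
rng = np.random.default_rng(0)
activation_freq = rng.power(0.3, size=11008)  # power-law-like: a few hot neurons, many cold ones

vram_neuron_budget = 2048  # how many neurons we can afford to keep resident on the GPU (example value)

# "Hot" neurons are the most frequently activated ones and are preloaded onto the GPU;
# the remaining "cold" neurons are computed on the CPU on demand.
order = np.argsort(-activation_freq)
hot_ids = order[:vram_neuron_budget]
cold_ids = order[vram_neuron_budget:]

hot_coverage = activation_freq[hot_ids].sum() / activation_freq.sum()
print(f"{len(hot_ids)} hot neurons cover ~{hot_coverage:.0%} of activation mass; "
      f"{len(cold_ids)} cold neurons stay on the CPU")
```

In PowerInfer itself this split is driven by the profiled activation statistics shipped with each model and by the GPU index generated at runtime; the sketch only illustrates the underlying locality argument.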
## Features

PowerInfer is a high-speed and easy-to-use inference engine for deploying LLMs locally.

PowerInfer is fast with:

* **Locality-centric design**: Uses sparse activation and the 'hot'/'cold' neuron concept for efficient LLM inference, ensuring high speed with lower resource demands.
* **Hybrid CPU/GPU utilization**: Seamlessly integrates the memory and computation capabilities of the CPU and GPU for a balanced workload and faster processing.

PowerInfer is flexible and easy to use with:

* **Easy integration**: Compatible with popular ReLU-sparse models.
* **Local deployment ease**: Designed and deeply optimized for local deployment on consumer-grade hardware, enabling low-latency LLM inference and serving on a single GPU.
* **Backward compatibility**: While distinct from llama.cpp, you can use most of `examples/` the same way as in llama.cpp, such as server and batched generation. PowerInfer also supports inference with llama.cpp's model weights for compatibility purposes, but there will be no performance gain.

You can use these models with PowerInfer today:

* Falcon-40B
* Llama2 family

We have tested PowerInfer on the following platforms:

* x86-64 CPU (with AVX2 instructions) on Linux
* x86-64 CPU and NVIDIA GPU on Linux
* Apple M chips on macOS (as we do not optimize for Mac, the performance improvement is not significant for now)

And new features coming soon:

* Mistral-7B model
* Metal backend for sparse inference on macOS

## Getting Started

* Installation
* Model Weights

## Setup and Installation

### Get the Code

```bash
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt  # install Python helpers' dependencies
```

### Build

To build PowerInfer you have two options. These commands should be run from the root directory of the project.

Using CMake (3.13+) on Linux or macOS:

* If you have an NVIDIA GPU:

```bash
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```

* If you have just a CPU:

```bash
cmake -S . -B build
cmake --build build --config Release
```

## Model Weights

PowerInfer models are stored in a special format called PowerInfer GGUF, based on the GGUF format, consisting of both LLM weights and predictor weights.

### Download PowerInfer GGUF via Hugging Face

You can obtain PowerInfer GGUF weights at `*.powerinfer.gguf`, as well as profiled model activation statistics for 'hot'-neuron offloading, from each Hugging Face repo below.

| Base Model | PowerInfer GGUF |
|---|---|
| LLaMA(ReLU)-2-7B | PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF |
| LLaMA(ReLU)-2-13B | PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF |
| Falcon(ReLU)-40B | PowerInfer/ReluFalcon-40B-PowerInfer-GGUF |
| LLaMA(ReLU)-2-70B | PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF |

We suggest downloading/cloning the whole repo so PowerInfer can automatically make use of this directory structure for feature-complete model offloading:

```
.
+-- *.powerinfer.gguf (Unquantized PowerInfer model)
+-- *.q4.powerinfer.gguf (INT4 quantized PowerInfer model, if available)
+-- activation (Profiled activation statistics for fine-grained FFN offloading)
|   +-- activation_x.pt (Profiled activation statistics for layer x)
|   +-- ...
+-- *.[q4].powerinfer.gguf.generated.gpuidx (Generated GPU index at runtime for the corresponding model)
```
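If it helps, the snippet below sketches one way to fetch a whole repo with `huggingface_hub`'s `snapshot_download` so the directory layout above is preserved (an illustrative example, not part of PowerInfer; the repo id and local directory are sample values, and a plain `git clone` of the Hugging Face repo works as well).

```python
from huggingface_hub import snapshot_download

# Download the entire ReluLLaMA-7B PowerInfer GGUF repo, including the
# activation/ statistics, into a local directory with the expected layout.
snapshot_download(
    repo_id="PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF",
    local_dir="./ReluLLaMA-7B-PowerInfer-GGUF",
)
```

Keeping the repo intact preserves the `activation/` directory that PowerInfer uses for fine-grained FFN offloading.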
### Convert from Original Model Weights + Predictor Weights

Hugging Face limits single model weight files to 50 GiB. For unquantized models >= 40B, you can convert PowerInfer GGUF from the original model weights and predictor weights obtained from Hugging Face.

| Base Model | Original Model | Predictor |
|---|---|---|
| LLaMA(ReLU)-2-7B | SparseLLM/ReluLLaMA-7B | PowerInfer/ReluLLaMA-7B-Predictor |
| LLaMA(ReLU)-2-13B | SparseLLM/ReluLLaMA-13B | PowerInfer/ReluLLaMA-13B-Predictor |
| Falcon(ReLU)-40B | SparseLLM/ReluFalcon-40B | PowerInfer/ReluFalcon-40B-Predictor |
| LLaMA(ReLU)-2-70B | SparseLLM/ReluLLaMA-70B | PowerInfer/ReluLLaMA-70B-Predictor |

You can use the following command to convert the original model weights and predictor weights to PowerInfer GGUF:

```bash
# make sure that you have done `pip install -r requirements.txt`
python convert.py --outfile /PATH/TO/POWERINFER/GGUF/REPO/MODELNAME.powerinfer.gguf /PATH/TO/ORIGINAL/MODEL /PATH/TO/PREDICTOR
# python convert.py --outfile ./ReluLLaMA-70B-PowerInfer-GGUF/llama-70b-relu.powerinfer.gguf ./SparseLLM/ReluLLaMA-70B ./PowerInfer/ReluLLaMA-70B-Predictor
```

For the same reason, we suggest keeping the same directory structure as the PowerInfer GGUF repos after conversion.

## Inference

For CPU-only and CPU-GPU hybrid inference with all available VRAM, you can use the following instructions to run PowerInfer:

```bash
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt
# ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
```

If you want to limit the VRAM usage of the GPU:

```bash
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
# ./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 8
```

Under CPU-GPU hybrid inference, PowerInfer will automatically offload all dense activation blocks to the GPU and split the FFN onto the GPU if possible.
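For intuition about how far a VRAM budget goes, here is a rough, hypothetical back-of-envelope estimate (not PowerInfer's actual offloading logic): it counts only FP16 FFN weights, uses LLaMA-7B-like dimensions as example values, and ignores attention weights, the KV cache, and activations that also occupy VRAM.

```python
# Rough estimate of how many FP16 FFN neurons fit in a given VRAM budget.
# Purely illustrative; PowerInfer decides per layer from profiled activation
# statistics when it builds the GPU index.

vram_budget_gib = 8      # e.g. --vram-budget 8
hidden_dim      = 4096   # LLaMA-7B-like hidden size (example value)
n_layers        = 32     # LLaMA-7B-like layer count (example value)
bytes_per_param = 2      # FP16

# In a gated FFN (as in LLaMA), each FFN "neuron" owns roughly three weight
# vectors of length hidden_dim (rows of the up/gate projections, a column of down).
bytes_per_neuron = 3 * hidden_dim * bytes_per_param

budget_bytes = vram_budget_gib * 1024**3
neurons_total = budget_bytes // bytes_per_neuron
neurons_per_layer = neurons_total // n_layers

print(f"~{neurons_total:,} FFN neurons (~{neurons_per_layer:,} per layer) "
      f"fit in {vram_budget_gib} GiB at FP16")
```

In practice, simply tune `--vram-budget` empirically for your GPU; as noted in the FAQs below, lowering it slightly is also the first thing to try if you hit out-of-memory errors.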
## Quantization

PowerInfer has optimized quantization support for INT4 (Q4_0) models. You can use the following instructions to quantize a PowerInfer GGUF model:

```bash
./build/bin/quantize /PATH/TO/MODEL /PATH/TO/OUTPUT/QUANTIZED/MODEL Q4_0
# ./build/bin/quantize ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf Q4_0
```

Then you can use the quantized model for inference with PowerInfer, following the same instructions as above.

## Evaluation

*(Evaluation figures: github-eval-4090, github-eval-2080ti-q4)*

PowerInfer achieves up to 11x and 8x speedup for FP16 and INT4 models, respectively!

## FAQs

1. What if I encounter CUDA_ERROR_OUT_OF_MEMORY?
   * You can try to run with the `--reset-gpu-index` argument to rebuild the GPU index for this model and avoid any stale cache.
   * Due to our current implementation, model offloading might not be as accurate as expected. You can try `--vram-budget` with a slightly lower value, or `--disable-gpu-index` to disable FFN offloading.
2. What if...
   * Issues are welcome! Please feel free to open an issue and attach your running environment and running parameters. We will try our best to help you.

## TODOs

We will release the code and data in the following order, please stay tuned!

- [x] Release core code of PowerInfer, supporting Llama-2, Falcon-40B.
- [ ] Support Mistral-7B
- [ ] Support Windows
- [ ] Support text-generation-webui
- [ ] Release perplexity evaluation code
- [ ] Support Metal for Mac
- [ ] Release code for OPT models
- [ ] Release predictor training code
- [x] Support online split for FFN network
- [ ] Support Multi-GPU

## Paper and Citation

More technical details can be found in our paper.

If you find PowerInfer useful or relevant to your project and research, please kindly cite our paper:

```bibtex
@techreport{song2023powerinfer,
  author      = {Yixin Song and Zeyu Mi and Haotong Xie and Haibo Chen},
  title       = {PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU},
  institution = {Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University},
  year        = {2023}
}
```

## Acknowledgement

We are thankful for the easily modifiable operator library ggml and the execution runtime provided by llama.cpp. We also extend our gratitude to THUNLP for their support of ReLU-based sparse models. We also appreciate the research of Deja Vu, which inspires PowerInfer.