[HN Gopher] High-Speed Large Language Model Serving on PCs with ...
___________________________________________________________________
High-Speed Large Language Model Serving on PCs with Consumer-Grade
GPUs
Author : dataminer
Score : 253 points
Date : 2023-12-20 13:46 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| brucethemoose2 wrote:
| This is super cool.
|
| For all the love llama.cpp gets, its method of dGPU offloading
| (prompt processing on GPU and then just splitting the model down
| the middle) is relatively simple. But it's interesting that there
| even _is_ so much "activation sparsity" to take advantage of.
| The traditional thinking in ML is that memory access is very
| random.
|
| Hopefully the "cold" neurons eventually get offloaded to the IGP
| instead?
|
| Also, it's curious that they are considering a Metal kernel. I
| thought the performance advantage came from the hybrid memory
| pool... seems like that would only help old AMD Macs, unless I am
| missing something?
| sroussey wrote:
| The only thing I could think of on the question of Apple
| Silicon and Metal is that they think they could still split out
| the cold neurons to the CPU/Accelerate and the hot ones on the
| GPU and utilize both. The speedup is likely smaller when there is
| no copying of data between GPU and CPU to begin with, thanks to
| unified memory. Still, it would be great if you could use even
| more of the capabilities of the chip simultaneously. In order
| to avoid thermal throttling they should use the efficiency
| cores only (I think this is what game mode does).
| coder543 wrote:
| "Power*" made me think of Microsoft, so I was almost expecting
| this to be Windows-specific. (PowerShell, PowerPoint, Power BI,
| Power Apps, Power Automate... I'm probably forgetting some.)
| HPsquared wrote:
| PowerToys are probably the original (going back to PowerToys
| for Windows 95)
|
| Edit: https://socket3.wordpress.com/2016/10/22/using-
| windows-95-po...
| coder543 wrote:
| PowerPoint existed in the late 80s, I think, although
| Microsoft acquired it from what I understand.
| latchkey wrote:
| https://en.wikipedia.org/wiki/PowerPC
| EwanG wrote:
| The important stuff from the readme (if you're not looking to
| tinker with it directly):
|
| We have tested PowerInfer on the following platforms:
|
| x86-64 CPU (with AVX2 instructions) on Linux
|
| x86-64 CPU and NVIDIA GPU on Linux
|
| Apple M Chips on macOS (As we do not optimize for Mac, the
| performance improvement is not significant now.)
|
| And new features coming soon:
|
| Mistral-7B model
|
| Metal backend for sparse inference on macOS
| rahimnathwani wrote:
| Also worth mentioning: the downloadable Llama 2 models and the
| convert.py file.
| 127 wrote:
| Running uncensored Mixtral on this would be really nice. It would
| allow more than 3-bit quantization on a 4090.
| eurekin wrote:
| Downvoters care to comment? Uncensored LLM versions typically
| perform better (at least on benchmarks) than their "lobotomized"
| or aligned counterparts.
| infotainment wrote:
| Probably because the parent comment didn't contain much of
| substance. "Oh, I'd love to see this with [insert my favorite
| model here]" doesn't really add a lot to the discussion.
|
| For example, the parent commenter could have talked about the
| specific attributes of that model that make it superior. I
| personally am aware that Mixtral is one of the best
| performing models right now, but is everyone else? Also, does
| Mixtral need to be uncensored? I've used vanilla Mistral for
| some...interesting...prompts and had no issues with it
| moralizing at me.
| BriggyDwiggs42 wrote:
| Lol
| mirekrusin wrote:
| Dual GPUs should be considered a normal/consumer-grade setup;
| hopefully they'll add support for it soon. At 4-bit quantization
| it's enough, with plenty of space left for context.
|
| This whole thing is a fork of llama.cpp; I'm also hoping it'll
| all go upstream sooner or later.
| legel wrote:
| Yeah, so they demo a bigger model on an RTX 4090 with 24 GB
| VRAM. Granted, an implementation of sparse activations with a
| Mixture of Experts could be non-trivial, but I think it's a
| brilliant move that could potentially allow for, e.g., CPU-only
| processing and/or much cheaper GPU processing... Mixtral
| technically already has neural-network-controlled sparse
| activations, but as the Inception meme says: we must go
| deeper...
| ekianjo wrote:
| How much of a speed increase do we get on CPU-only
| configurations? Has anyone tested it in such cases?
| ComputerGuru wrote:
| This architecture is specifically aimed at optimizing GPU use.
| NavinF wrote:
| CPU-only is impractical for most use cases, and this will only
| become more true over time as models become larger. The mediocre
| perf/$ and perf/watt make it not worth the effort.
| jupp0r wrote:
| From my understanding, in this implementation some knowledge of
| the model itself is needed to determine which parts to place in
| system memory vs which parts to place in GPU memory. Can this
| ideally be computed automatically, or will future models have
| some sort of interface for placement algorithms like this to
| help automate it? If the algorithm needs to be adapted for each
| model architecture, it's going to be a lot of work to maintain
| this project.
| loudmax wrote:
| That sounds about right. They provide a script to combine their
| "Predictor" weights with the original models, but I don't see
| anything obvious on the front page of the GitHub repo about how
| to create those weights.
|
| A 10x speed improvement is really impressive. If this kind of
| improvement is reproducible across other models, then presumably
| identifying hot and cold neurons for inference optimization
| should become a normal part of the model development process.
| thelastparadise wrote:
| Like JVM "hot spots," or JIT optimization.
| jupp0r wrote:
| Or profile guided optimization.
| phh wrote:
| It took me a while to understand what their "hot" and "cold"
| neurons meant, since in most ML I do there is no such notion, and
| their paper doesn't directly define it (or I missed it).
|
| After some thought, with ReLU it does make sense, because half of
| the function is constant, so you can say a neuron is "cold" if
| its ReLU-ed output is often 0. So I checked whether ReLU was
| common in LLMs; the original Llama doesn't use ReLU. But after
| (re-)reading the GitHub page, it turns out this only works on
| ReLU models. There is a group of people "fine-tuning" (I would
| rather call that re-training, since you start by breaking the
| model?) models to use ReLU to allow for that sparsity:
| https://huggingface.co/SparseLLM
|
| So this is sadly not applicable to just any model you can find on
| the internet, but it sounds like great progress anyway. Possibly
| this might shift the compromises back toward bigger models with
| "less ideal" activations. Also, I'm curious what the legal impact
| would be (since the USA and EU refer to a model's FLOPs/number of
| parameters... How do you compute that with sparsity? Do you
| average?)
|
| I think a possible avenue for future research in this area is
| keeping the original activation (like Llama keeping SwiGLU), but
| using quantization to define "hot" and "cold" neurons via
| saturation areas. (For example, saying that this activation
| function, below -1.0 at 8 bits, is equivalent to -infinity, and
| thus this is a cold neuron.)
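|
| To make the "hot"/"cold" idea concrete, here is a toy sketch of
| the kind of bookkeeping I mean (this is not PowerInfer's actual
| predictor; the calibration data, layer size, and 0.3 threshold
| are all made up):
|
|     import numpy as np
|
|     # Stand-in calibration data: pre-activation inputs to one
|     # ReLU FFN layer, shape (num_tokens, num_neurons). The
|     # per-neuron bias just makes firing rates differ.
|     rng = np.random.default_rng(0)
|     bias = rng.uniform(-2.0, 1.0, size=4096)
|     pre_act = rng.standard_normal((10_000, 4096)) + bias
|
|     # ReLU zeroes out everything below 0, which is where the
|     # activation sparsity comes from.
|     relu_out = np.maximum(pre_act, 0.0)
|
|     # Fraction of tokens on which each neuron fires (output > 0).
|     fire_rate = (relu_out > 0).mean(axis=0)
|
|     # Arbitrary split: frequently firing neurons are "hot" (keep
|     # them on the GPU), rarely firing ones are "cold" (leave them
|     # on the CPU).
|     hot = np.flatnonzero(fire_rate >= 0.3)
|     cold = np.flatnonzero(fire_rate < 0.3)
|     print(f"hot: {hot.size}, cold: {cold.size}")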
| brucethemoose2 wrote:
| That is a huge caveat to leave out of a readme, especially one
| that claims llama compatibility.
| acqq wrote:
| Indeed
|
| https://huggingface.co/SparseLLM/ReluFalcon-40B
|
| "We utilize PowerInfer for inference"
| boredumb wrote:
| > Also, I'm curious what the legal impact would be (since the USA
| and EU refer to a model's FLOPs/number of parameters... How do
| you compute that with sparsity? Do you average?)
|
| How/when did these types of regulations come about? This feels
| like an insane thing to have to keep in mind while developing.
| radicalbyte wrote:
| The EU messed up with the GDPR - they should have implemented
| it at least a decade earlier and ignored the lobbying that led
| to the cookie banner instead of an outright ban on tracking for
| all but a tiny number of purposes. Such a ban would have had a
| negligible financial impact on the tech industry but would have
| had huge privacy rewards.
|
| They're trying to get in early on AI so as not to make the
| same mistake again. Which might result in them making the
| opposite mistake.
| quocanh wrote:
| A tiny, negligible impact on the industry (except cutting
| advertising revenue in half, but who cares? What do ads pay for
| anyway?)
| phh wrote:
| > How/when did these types of regulations come about?
|
| I can't say much about the US. As I see it, the EU pretty much
| copied the US on that part. There was nothing related to
| computation in the EU's AI Act drafts until a few months ago; it
| was purely about "what kind of data processing are you allowed
| to do?"
| alchemist1e9 wrote:
| Politely, what the hell are you talking about? Who is
| telling anyone what they can or cannot compute?
| iamjackg wrote:
| US:
|
| https://www.whitehouse.gov/briefing-room/presidential-
| action...
|
| "Until such technical conditions are defined, the
| Secretary shall require compliance with these reporting
| requirements for: (i) any model
| that was trained using a quantity of computing power
| greater than 1026 integer or floating-point operations,
| or using primarily biological sequence data and using a
| quantity of computing power greater than 1023 integer or
| floating-point operations[...]"
|
| EU:
|
| https://thefuturesociety.org/wp-
| content/uploads/2023/12/EU-A...
| geon wrote:
| > 1026
|
| > 1023
|
| Should be 10^26 and 10^23.
| alchemist1e9 wrote:
| Probably I did this wrong, but I'm getting an approximation of
| 300K H100s completing that in a month. At least they chose
| something fairly large, it seems. Not sure how LoRA or other
| incremental training is handled.
| two_in_one wrote:
| Passing it through ChatGPT-4 (actually nothing specific, mostly
| empty words):
|
| Summary:
|
| The Executive Order focuses on the safe, secure, and
| trustworthy development and use of Artificial
| Intelligence (AI). It outlines a government-wide approach
| to manage AI responsibly, addressing potential societal
| harms like fraud, bias, and security risks. The order
| establishes guiding principles and policies for AI
| development, emphasizing safety, innovation, workers'
| rights, equity, consumer protection, privacy, government
| use of AI, and global leadership. It includes detailed
| definitions and actions for government agencies to ensure
| AI is developed and used ethically and effectively.
|
| about power/size:
|
| The Executive Order does not specifically mention the
| size of AI models or compute power in terms of FLOPs
| (Floating Point Operations per Second). It focuses more
| broadly on the principles and policies for responsible AI
| development and use, without delving into technical
| specifics like model size or compute requirements.
|
| about what developers have to do after this order:
|
| New developers of AI models, after this Executive Order,
| are encouraged to align their AI development and use with
| the outlined principles and policies. These focus on
| ensuring AI is safe, secure, trustworthy, and ethically
| developed, while addressing societal harms such as bias
| and privacy concerns. Developers should consider how
| their AI impacts equity, innovation, consumer protection,
| and workers' rights, and adhere to guidelines for
| responsible government use of AI.
| cyanydeez wrote:
| anyone with a functional government.
| ComputerGuru wrote:
| It's not too much faster than exllama2 with flash attention, no?
| modeless wrote:
| Everyone compares against llama.cpp because it's easy mode.
| Llama.cpp is slow! Everyone should know this. They should compare
| against exllamav2 or other optimized implementations.
| nulld3v wrote:
| ExLlama is GPU-only, right? This speedup is for GPU + CPU split
| use cases.
| modeless wrote:
| Oh I see, they are running a 40B model unquantized, whereas
| exllamav2 would have to use 4-bit quantization to fit. Given
| the quality of 4-bit quantization these days and the speed
| boost it provides I question the utility of running
| unquantized for serving purposes.
|
| I see they have a 4-bit benchmark lower down on the page.
| That's where they ought to compare against exllamav2.
| sroussey wrote:
| What do you recommend that is faster that I can package into an
| app for distribution?
| modeless wrote:
| I have packaged exllamav2 (plus a lot of other stuff) into an
| app for distribution here:
| https://apps.microsoft.com/detail/9NC624PBFGB7
|
| I used pyinstaller. It was difficult because Python makes
| these things difficult. But it works. It does require an
| Nvidia GPU. MLC-LLM is another option that might be easier to
| package and potentially able to run on AMD.
| sroussey wrote:
| Oh yeah, I want it to work on AMD/Intel/NVIDIA and macOS, even
| iOS/Android.
|
| I've been following MLC-LLM as well. Right now I am just
| using JS/WASM from Huggingface, but later I will want
| something more performant.
| modeless wrote:
| Yeah if you want maximum performance on multiple
| platforms you'll probably have to package multiple
| frameworks. Llama.cpp might be a decently fast option on
| Apple Silicon, I'm not sure of the state of the art
| there.
| avereveard wrote:
| Yeah, but exllama doesn't do grammars, so I'm stuck with
| llama.cpp.
|
| Also, apparently exllama has a few side effects on coherence:
| https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_for...
| superkuh wrote:
| In this case they're comparing against llama.cpp because the
| code is literally a modification of llama.cpp. I'm not talking
| about merely using the ggml lib for matrix calculations; it's
| literally using the llama.cpp main.cpp and other normal
| llama.cpp code. It's a fork. It is _directly_ comparable.
|
| https://github.com/ggerganov/llama.cpp/pull/4543 [Review] Merge
| PowerInfer with llama.cpp mainline #4543
|
| https://github.com/ggerganov/llama.cpp/discussions/4534#disc...
| "The x11 speedup is kind of cherrypicked because the llama.cpp
| GPU code for Falcon 40b is just not well-optimized."
| nextaccountic wrote:
| > Hybrid CPU/GPU Utilization: Seamlessly integrates
| memory/computation capabilities of CPU and GPU for a balanced
| workload and faster processing.
|
| Does this mean that it runs on both the CPU and GPU at the same
| time, and is faster than a CPU-only or a GPU-only implementation
| on the same device?
|
| edit: when running on integrated GPUs, can this benefit from the
| improved communication between CPU and GPU?
| rahimnathwani wrote:
| GPU-only will be faster if you have enough VRAM.
|
| But if you want to run a model that requires more VRAM than you
| have, the current approach is to use llama.cpp and specify
| n_gpu_layers. That works, but is slower than GPU-only.
|
| OP claims to be 10x as fast as llama.cpp in the case where you
| can't fit the whole model in VRAM.
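|
| For reference, a minimal sketch of that current approach using
| the llama-cpp-python bindings (the model path and layer count
| here are placeholders; tune n_gpu_layers to whatever fits your
| VRAM):
|
|     from llama_cpp import Llama
|
|     # Partial offload: n_gpu_layers controls how many transformer
|     # layers go to the GPU; the rest run on the CPU from system
|     # RAM.
|     llm = Llama(
|         model_path="./models/llama-2-13b.Q4_K_M.gguf",  # placeholder
|         n_gpu_layers=30,  # tune to available VRAM
|         n_ctx=2048,
|     )
|
|     out = llm("Q: What is activation sparsity? A:", max_tokens=64)
|     print(out["choices"][0]["text"])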
| causality0 wrote:
| All the "consumer grade GPUs" terminology makes it seem like you
| could run it on a variety of models, but like _so many_ of these
| posts, is this a 4090 exclusive?
| superkuh wrote:
| This will be really cool once there's the ability to generate the
| sparse predictor files for arbitrary models rather than just the
| four they've done it with. Looking through the page and code, it
| doesn't seem like the tools to do that step are included. Guess
| I'll wait on this one a bit. Hopefully these features will be
| merged back into llama.cpp as options eventually, since this is
| based on the normal llama.cpp code (i.e., not just using the ggml
| matrix lib).
___________________________________________________________________
(page generated 2023-12-20 23:00 UTC)