[HN Gopher] Serving 70B-scale LLMs efficiently on low-resource e...
___________________________________________________________________
Serving 70B-scale LLMs efficiently on low-resource edge devices
[pdf]
Author : simonpure
Score : 192 points
Date : 2024-10-03 14:11 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| loufe wrote:
| It would be nice for the inference time to be paired with a measure
| of output quality. I'm not well versed in how the architecture
| works, but I have a hard time believing a 90% reduction in peak
| memory footprint comes cost-free.
| not_a_dane wrote:
| Nothing is free in this world.
| zackangelo wrote:
| I've only read the abstract but they don't mention quantizing
| the weights or otherwise trying to shrink the model in any way.
|
| They're claiming to be able to efficiently run larger models
| without loading the entire thing into GPU memory. If they're
| using the same weights and the same architecture, and just using
| tensor parallel operations to perform the forward pass, that
| would imply no loss in quality.
|
| I'm sure there are trade-offs but they're not clear by just
| looking at the abstract.
| tgtweak wrote:
| I read it like this too - no drop in weights or model quality,
| just optimizing the lower boundaries of performance when you
| are splitting from vram to ram to disk (or network).
| freehorse wrote:
| From what I get skimming through the article, the main cost is
| speed of token generation (token latency). You can always run a
| large model by reading weights directly from disk and not care
| much about ram, but it is very slow. They try to improve that
| aspect with some optimisations, but it is still definitely
| slower than using ram or vram.
| refulgentis wrote:
| Table 3 directly refutes this* and claims 0 tradeoffs.**
|
| Below that, they indicate that a key part of the
| implementation is loading weights from disk _before_ they're
| needed using a separate thread.***
|
| * maybe I'm missing something though, someone please triple
| check :)
|
| ** ttft (time to first token) and s/token (seconds per token)
| are both lower than any alternative in all cases.
|
| *** "daemon thread asynchronously preloads the weights"
| sgc wrote:
| I want to add that their chart shows s/token _per device_
| (edit: as per the heading on table 1 - it could also be
| confusing grammar), so it sounds like you are getting 4x
| the listed s/t on their 4-laptop cluster. Their laptops
| are not even hardwired - they are connecting over wifi.
|
| This comes at a very interesting time for me. I have an
| ancient dual xeon workstation with 64gb memory that I was
| researching how to convert to run an llm. To start, I can
| just run 4 instances on the same machine and see how it
| goes, without purchasing a better GPU. It sounds
| like this will allow you to run very large models with
| minimal quantization, on craigslist-quality devices.
|
| If it does what they say it does (and it seems to), it
| will be an absolute game changer for most users.
| woadwarrior01 wrote:
| It's not cost-free. It comes at the cost of greatly increased
| latency. 29.9 seconds per token with Llama 3.1-70B. This is
| from Table 1 (pg 8) of the paper.
| m3kw9 wrote:
| Ah the disk swap method
| thelastparadise wrote:
| Is there any predictability/patterns for neuron/layer
| activation? If so, would it be reasonable to have a second
| tiny model that specifically tries to predict activations
| and preemptively swap those into memory?
| tcdent wrote:
| Depends on the architecture, but generally you just move
| through the layers linearly. Simple iteration.
|
| The number of layers, and the amount of time spent in
| each of them, makes me think any benefit from pre-loading
| the layer ahead is negligible.
|
| You really need the entire model on device to consider it
| performant.
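|
| To put it concretely, naive layer streaming is roughly this
| (illustrative Python; read_layer and run_layer are placeholders),
| and the read step so dominates the compute step that hiding the
| compute behind a prefetch barely moves the total:
|
|     def forward_streaming(x, num_layers, read_layer, run_layer):
|         for lid in range(num_layers):
|             w = read_layer(lid)    # seconds per layer from SSD
|             x = run_layer(x, w)    # milliseconds once it's in memory
|             del w                  # keep only one layer resident
|         return x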
| miki123211 wrote:
| This isn't how neural networks work.
|
| For vanilla models, you always use all the weights. That
| isn't true for mixture-of-experts, though, and in that
| setting, your approach has merit.
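|
| For illustration, the MoE case could look something like this
| (toy Python; here the router's own scores stand in for the
| "tiny predictor", and load_expert is a placeholder for a
| disk/flash read):
|
|     def prefetch_experts(router_scores, k, load_expert):
|         # router_scores: {expert_id: score} for the upcoming layer, one token;
|         # only the top-k experts' weights actually need to be resident
|         top_k = sorted(router_scores, key=router_scores.get, reverse=True)[:k]
|         return {eid: load_expert(eid) for eid in top_k}  # skip the cold experts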
| _ache_ wrote:
| It's not disk swap; it's multi-device LLM serving.
| kridsdale3 wrote:
| That looked like an analogy. Back in the days of a
| mechanical arm moving magnetic fields around in our PCs,
| you could have the illusion of infinite RAM as long as
| you were ok with microsecond operations suddenly taking two
| million times longer. This is akin to that.
| wpietri wrote:
| I think the point is that it has the same sort of latency
| tradeoff that disk swap did: it's awful, but sometimes
| better than nothing.
| _ache_ wrote:
| That is s/token and not token/s. The cost is high.
|
| The actual goal of the article is to highlight that we can
| optimise the overall speed by decreasing link latency. Yes,
| link latency, because it's not one machine but several low-
| resource devices used together to serve the 70B LLM.
| teraflop wrote:
| Am I just misunderstanding, or is the paper using "latency"
| when what they really mean is "throughput"?
|
| In other words, if I want 100 tokens of output, do I have to
| wait 2990 seconds? If so, the terminology seems unnecessarily
| confusing.
| dvh wrote:
| So when will I be able to "sudo apt-get install llm" ?
| jsanders9 wrote:
| Ollama is close...
| mysterhawk wrote:
| You can already do it with llamafile - check out the project; it
| lets you convert a .gguf model into a portable executable
| mysterhawk wrote:
| And everything runs way faster on cpu like that
| mysterhawk wrote:
| I could run Qwen 2.5 72B q4_k_m at 1 token/s on my i7-
| 8750H + a lot of RAM XD
| mysterhawk wrote:
| With this model:
|
| https://huggingface.co/bartowski/Qwen2.5-72B-Instruct-GGUF
|
| Man, I need to test the q8 version with llamafile's
| optimizations. It would be so nice to host it locally
| with the new Ryzens; it could maybe fit in my 96GB of RAM
| mysterhawk wrote:
| https://justine.lol/matmul/
|
| https://www.youtube.com/watch?v=-mRi-B3t6fA
|
| Check out these articles
| yjftsjthsd-h wrote:
| I'm not aware of any Debian family distro that packages it, but
| NixOS has at least ollama and llama-cpp in its repos. Honestly
| even if the more stable distributions did have these things
| packaged, I would hesitate to use the packaged versions because
| all of this stuff is still moving so quickly that you'd be on
| an old version and it would hurt.
|
| Edit: Arch has ollama in official repos too. OpenSUSE has
| https://software.opensuse.org/package/ollama .
| paxys wrote:
| You already can with ollama
| o11c wrote:
| Realistically, you probably want to wait until Vulkan support
| trickles out. That way, you aren't at the whim of the various
| evil hardware drivers (everybody's suck), and the AI can give
| you a disappointingly confused answer much faster than running
| the LLM on a CPU can.
| vessenes wrote:
| This is not a memory reduction technique that's somehow magical.
| Well, it does manage memory with some clever scheduling. The core
| of this idea is that you can schedule out inference on edge nodes
| in a memory- and bandwidth-optimized way that's a bit different
| than just splitting layers.
|
| They propose that right now computation and latency dominate the
| costs for multi-node inference, and pick a network topology
| (star) that is savvy to that.
|
| That said, it's 26-29 seconds per token for llama2-70b with their
| 8 edge devices, each using 4 gigs of RAM. It's amazing that
| they can run it at all, but this isn't going to be viable at the
| edge with current hardware.
|
| I think the paper makes the case that you could probably recruit,
| say, your 30 graphics workstations to do much faster inference
| without just nailing your LAN bandwidth, though.
|
| Upshot: interesting paper - smart ideas. Large frontier models
| still need very exotic hardware and interconnect bandwidth;
| this may point a way forward to lowering the interconnect-
| bandwidth part of the story.
| tgtweak wrote:
| I think the main advantage here is you COULD run it, even if it
| takes a while. That is a step up from current model limitations
| which require ram or vram to hold the model.
|
| I think this lays some groundwork for running a 400B model on a
| 3090/4090 or even smaller GPU. If you can get a huge model like
| that running on a single GPU, even if the mean time per token is
| in the seconds, that's acceptable for many use cases.
|
| If this same technique can be used to extend context windows in
| addition to token autocomplete, that would be great in its own
| right.
|
| Hopefully work like this continues, as throwing a ton of vram at
| a model should be regarded as a performance optimization, not
| necessarily a requirement.
| michaelt wrote:
| It's already technically possible to run huge models locally
| when you don't have the RAM/VRAM needed - llama.cpp can
| 'mmap' the model from disk.
|
| Of course an nvidia 4090 has a memory bandwidth of about 1000 GB
| per second; a CPU like the i7-13700K has a memory bandwidth
| of 90 GB per second; and a high-end NVMe SSD might only have
| read bandwidth of 10 GB per second.
|
| So in approximate terms, an LLM and quantisation level that
| can produce 10 tokens per second on a 4090 will produce 1
| token per second from RAM and a token every 10 seconds from
| SSD.
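|
| That rule of thumb falls out of decode being memory-bound: each
| new token has to stream roughly all the weights once, so tokens/s
| is about bandwidth divided by bytes of weights read. Back-of-
| envelope with made-up round numbers:
|
|     weights_gb = 100   # illustrative model size x quantisation
|     for src, gbps in {"4090 VRAM": 1000, "DDR5 RAM": 90, "NVMe SSD": 10}.items():
|         print(f"{src}: ~{gbps / weights_gb:.1f} tokens/s")
|     # -> ~10.0, ~0.9, ~0.1 tokens/s, i.e. the 10x steps above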
| ignoramous wrote:
| > _That is a step up from current model limitations which
| require ram or vram to hold the model._
|
| Current? Apple recently published a neat paper on how they
| optimise for both inference (cpu/gpu) and memory use:
Our method involves constructing an inference cost model that
takes into account the characteristics of flash memory, guiding
us to optimize in two critical areas: reducing the volume of
data transferred from flash and reading data in larger, more
contiguous chunks. Within this hardware-informed framework, we
introduce two principal techniques. First, "windowing"
strategically reduces data transfer by reusing previously
activated neurons, and second, "row-column bundling", tailored
to the sequential data access strengths of flash memory,
increases the size of data chunks read from flash memory. These
methods collectively enable running models up to twice the size
of the available DRAM, with up to 4x and 20x increase in
inference speed compared to naive loading approaches in CPU and
GPU, respectively.
|
| https://news.ycombinator.com/item?id=38704982
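|
| My loose reading of the "windowing" part, as a toy sketch (not
| Apple's code; read_rows_from_flash is a placeholder): keep the
| rows for recently-activated neurons resident and only fetch the
| newly-needed ones.
|
|     def gather_rows(active_ids, resident, read_rows_from_flash):
|         # resident: {neuron_id: row} cached in DRAM from recent tokens
|         missing = [i for i in active_ids if i not in resident]
|         resident.update(read_rows_from_flash(missing))  # the only flash traffic
|         return [resident[i] for i in active_ids]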
| diggan wrote:
| > I think the main advantage here is you COULD run it, even
| if it takes a while.
|
| I mean, you COULD run it before as well, even if you don't
| have enough RAM or VRAM, by using something like `zram`. It'd
| probably be even slower (and borderline usable, depending on
| the use case), but it's not impossible to get things to
| _run_.
| alchemist1e9 wrote:
| > I think the paper makes the case that you could probably
| recruit say your 30 graphics workstations to do much faster
| inference without just nailing your LAN bandwidth, though.
|
| Could be a big deal if it allows a cluster of smaller GPUs to
| compete with a single large-VRAM GPU.
|
| Unfortunately I'm a few months out of date - which is an eternity
| in LLM inference techniques - so I'm not sure what the current
| state of distributed inference looks like.
| Palomides wrote:
| llama.cpp supports splitting work across multiple nodes on a
| network already
|
| it essentially just copies a chunk of the model to each one;
| this works well for situations where each machine has limited vram
| ay wrote:
| Any pointers to RTFM / llama repo for that? I could not
| find anything on a cursory look. Thanks in advance!
| Palomides wrote:
| https://github.com/ggerganov/llama.cpp/tree/master/examples/...
|
| you run that on the remote nodes
| vessenes wrote:
| Yeah, I think these methods could be baked into llama.cpp or
| some other python library higher up the toolchain, or what
| have you. They shard out each layer (ish?) to the edges, and
| recombine that layer's inference at the master node, while the
| outside edges load up their next bit if they need to; I would
| guess the devil is in the details for all the possible tensor
| types and architectures (for instance, how shall we implement
| skip layers?).
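|
| The recombination step itself is simple; a toy numpy version of
| a column-sharded layer on a star topology (EdgeNode here just
| fakes the remote call) looks like:
|
|     import numpy as np
|
|     class EdgeNode:
|         def __init__(self, weight_slice):   # one column slice of the layer
|             self.w = weight_slice
|         def matmul(self, x):                # in reality an RPC over wifi/LAN
|             return x @ self.w
|
|     def sharded_layer(x, nodes):
|         partials = [n.matmul(x) for n in nodes]   # master fans out the activation
|         return np.concatenate(partials, axis=-1)  # and stitches the outputs back
|
|     x = np.random.randn(1, 512).astype(np.float32)
|     W = np.random.randn(512, 2048).astype(np.float32)
|     nodes = [EdgeNode(s) for s in np.split(W, 4, axis=1)]  # 4 "edge devices"
|     assert np.allclose(sharded_layer(x, nodes), x @ W, atol=1e-3)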
| k1musab1 wrote:
| Do you think this allows distributed inference only, or does it
| open the door to distributed training of models as well?
| Democratization of models is in part hampered by the total
| compute a single person or small group can make use of, but if
| something like folding@home for training large models is
| possible, it could change the game somewhat.
| Zetaphor wrote:
| Is this different from (or related to) the work being done by the
| exo project?
|
| https://github.com/exo-explore/exo
| tgtweak wrote:
| Exo is for partitioning a model over the network across devices
| (implementing some bandwidth-reducing partitions) but still
| requires a minimum amount of ram/vram to load a model. This
| could, in theory, be combined to allow larger models to run on
| exo clusters with less gpu/ram than is required by the
| underlying model (at the cost of some performance no doubt, but
| still).
| alexandercheema wrote:
| exo maintainer here. tgtweak is correct.
|
| This looks like some potentially promising research that I'm
| looking into reproducing now. We want to lower the barrier to
| running large models as much as possible, so if this works, it
| would be a potential addition to the exo offering.
| tgtweak wrote:
| Is there a cuda implementation of this... asking for a friend
| adam_arthur wrote:
| While I do think there's going to be a huge market for cloud-
| based LLM serving, the fact that consumer hardware can run close-
| to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests
| to me that the provider market won't be as big as investors are
| betting on.
|
| Most of the rewards will be reaped by consumers rather than
| providers.
|
| We're also in an age where the current levels of RAM in consumer
| devices were optimized almost entirely for workloads that predate
| LLMs. I find it highly likely vendors will optimize for higher
| RAM capacity over other priorities in future hardware.
|
| How long until a 256GB RAM laptop (shared with GPU) is reasonably
| cheap/available? I give it a few years at most.
|
| It's possible that models grow orders of magnitude larger, but I
| find it more likely that the size of models will grow along the
| curve of decreasing training costs and hardware improvements.
| There will be a sweet spot where it's economical to train larger
| models, and private companies won't push much beyond that.
| lxgr wrote:
| Enterprises use LLMs too, and quite often there wouldn't be any
| client you could reasonably run the model on. (You wouldn't
| want to e.g. have an LLM summarize and categorize a user
| request on _their_ device, since that would require shipping
| your model and/or internal knowledge base to the
| client).
| adam_arthur wrote:
| Yes, but if you can run a sufficient LLM on a $2,000 laptop,
| then the cost to serve it from the cloud will be similarly
| cheap. (e.g. reserve an appropriately sized EC2 instance for
| pennies on the dollar)
|
| It's a highly competitive market. Companies aren't going to
| pay $100k/year to run a model on something that can run on a
| $2k consumer-grade device.
|
| 128GB of fast, GPU-accessible RAM can be had for $5000 on a
| MacBook Pro today. What will it be 3-4 years from now on
| linux/windows machines?
|
| And we still haven't seen any SoC providers try to optimize
| for RAM capacity over compute yet.
| lxgr wrote:
| Oh yes, I could definitely see the privacy-preserving
| consumer use case creating sufficient demand for efficiency
| that also bleeds over into the enterprise market.
|
| That's what's happened with power efficiency and ARM CPUs,
| after all!
| adunsulag wrote:
| This is where I want highly sensitive healthcare
| consumers of LLMs to be. Note summarization, suggested
| diagnoses (provider always in control), and other
| augmented abilities for the clinical staff, without the
| risk of health care data being sent outside the device
| or the very local network.
| adam_arthur wrote:
| Not sure what you mean:
| https://aws.amazon.com/ec2/graviton/
|
| Not to speak of managed cloud services that run on ARM
| under-the-hood/behind the scenes.
|
| Of course ARM isn't inherently cheaper; AMD+Intel could
| cut prices/margins significantly and probably be competitive on
| $/perf
| simsla wrote:
| Depends, shipping part of it (just an encoder or decoder)
| could still work.
| lxgr wrote:
| Even if bandwidth weren't an issue and all users had
| compatible hardware: you'd still be offloading a
| (semi-)trusted computation to user hardware, which is
| usually completely untrusted.
___________________________________________________________________
(page generated 2024-10-03 23:00 UTC)