[HN Gopher] Serving 70B-scale LLMs efficiently on low-resource e...
       ___________________________________________________________________
        
       Serving 70B-scale LLMs efficiently on low-resource edge devices
       [pdf]
        
       Author : simonpure
       Score  : 192 points
       Date   : 2024-10-03 14:11 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | loufe wrote:
        | It would be nice for the inference time to be paired with a
        | measure of output quality. I'm not well versed in how the
        | architecture works, but I have a hard time believing a 90%
        | reduction in peak memory footprint comes cost-free.
        
         | not_a_dane wrote:
         | Nothing is free in this world.
        
         | zackangelo wrote:
         | I've only read the abstract but they don't mention quantizing
         | the weights or otherwise trying to shrink the model in any way.
         | 
         | They're claiming to be able to efficiently run larger models
         | without loading the entire thing into GPU memory. If they're
          | using the same weights and the same architecture, and are
          | just using tensor-parallel operations to perform the
          | forward pass, that would imply no loss in quality.
         | 
         | I'm sure there are trade-offs but they're not clear by just
         | looking at the abstract.
        
           | tgtweak wrote:
            | I read it like this too: no drop in weights or model
            | quality, just optimizing the lower boundaries of
            | performance when you are splitting from VRAM to RAM to
            | disk (or network).
        
         | freehorse wrote:
          | From what I gather skimming through the article, the main
          | cost is the speed of token generation (token latency). You
          | can always run a large model by reading weights directly
          | from disk and not caring much about RAM, but it is very
          | slow. They try to improve that aspect with some
          | optimisations, but it is still definitely slower than using
          | RAM or VRAM.
        
           | refulgentis wrote:
           | Table 3 directly refutes this* and claims 0 tradeoffs.**
           | 
            | Below that, they indicate that a key part of the
            | implementation is loading weights from disk _before_
            | they're needed, using a separate thread.***
           | 
           | * maybe I'm missing something though, someone please triple
           | check :)
           | 
           | ** ttft (time to first token) and s/token (seconds per token)
           | are both lower than any alternative in all cases.
           | 
           | *** "daemon thread asynchronously preloads the weights"
        
             | sgc wrote:
              | I want to add that their chart shows s/token _per
              | device_ (edit: as per the heading on table 1 - it could
              | also be confusing grammar), so it sounds like you are
              | getting 4x the listed s/t on their 4-laptop cluster.
              | Their laptops are not even hardwired - they are
              | connecting over wifi.
              | 
              | This comes at a very interesting time for me. I have an
              | ancient dual-Xeon workstation with 64GB of memory that
              | I was researching how to convert to run an LLM. To
              | start, I can just run 4 instances on that same machine
              | and see how it goes, without purchasing a better GPU.
              | It sounds like this will allow you to run very large
              | models with minimal quants on craigslist-quality
              | devices.
              | 
              | If it does what they say it does (and it seems to), it
              | will be an absolute game changer for most users.
        
         | woadwarrior01 wrote:
          | It's not cost-free. It comes at the cost of greatly
          | increased latency: 29.9 seconds per token with Llama
          | 3.1-70B, per Table 1 (p. 8) of the paper.
        
           | m3kw9 wrote:
           | Ah the disk swap method
        
             | thelastparadise wrote:
             | Is there any predictability/patterns for neuron/layer
             | activation? If so, would it be reasonable to have a second
             | tiny model that specifically tries to predict activation
             | and preemptively swap those into memory?
        
               | tcdent wrote:
               | Depends on the architecture, but generally you just move
               | through the layers linearly. Simple iteration.
               | 
                | The number of layers, and the amount of time spent in
                | each of them, make me think any benefit from pre-
                | loading the next layer ahead of time is negligible.
               | 
               | You really need the entire model on device to consider it
               | performant.
        
               | miki123211 wrote:
               | This isn't how neural networks work.
               | 
               | For vanilla models, you always use all the weights. That
               | isn't true for mixture-of-experts, though, and in that
               | setting, your approach has merit.
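                | 
                | As a toy illustration of why that works for MoE (not
                | from the paper; the shapes and expert count are made
                | up): the router scores experts per token, and only
                | the top-k experts' weights are actually needed, so a
                | predictor could prefetch just those.
                | 
                |     import numpy as np
                | 
                |     def experts_to_prefetch(x, W_router, k=2):
                |         scores = x @ W_router  # one score per expert
                |         return np.argsort(scores)[-k:]
                | 
                |     x = np.random.randn(1024)
                |     W = np.random.randn(1024, 8)  # 8 experts
                |     print(experts_to_prefetch(x, W))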
        
             | _ache_ wrote:
             | It's not disk swap. It's multi-devices LLM.
        
               | kridsdale3 wrote:
                | That looked like an analogy. Back in the days of a
                | mechanical arm moving magnetic fields around in our
                | PCs, you could have the illusion of infinite RAM as
                | long as you were OK with microsecond operations
                | taking two million times longer. This is akin to
                | that.
        
               | wpietri wrote:
               | I think the point is that it has the same sort of latency
               | tradeoff that disk swap did: it's awful, but sometimes
               | better than nothing.
        
           | _ache_ wrote:
            | That is s/token and not token/s. The cost is high.
            | 
            | The actual goal of the article is to highlight that we
            | can optimise the overall speed by decreasing link
            | latency. Yes, link latency, because it's not one machine
            | but several low-resource devices used together to serve
            | the 70B LLM.
        
           | teraflop wrote:
           | Am I just misunderstanding, or is the paper using "latency"
           | when what they really mean is "throughput"?
           | 
           | In other words, if I want 100 tokens of output, do I have to
           | wait 2990 seconds? If so, the terminology seems unnecessarily
           | confusing.
        
       | dvh wrote:
       | So when will I be able to "sudo apt-get install llm" ?
        
         | jsanders9 wrote:
         | Ollama is close...
        
         | mysterhawk wrote:
          | You can already do it with llamafile; check out the
          | project. It lets you convert a .gguf model into a portable
          | executable.
        
           | mysterhawk wrote:
            | And everything runs way faster on CPU that way.
        
             | mysterhawk wrote:
              | I could run Qwen 2.5 72B q4_k_m at 1 token/s on my
              | i7-8750H + a lot of RAM XD
        
               | mysterhawk wrote:
               | With this model:
               | 
               | https://huggingface.co/bartowski/Qwen2.5-72B-Instruct-
               | GGUF
               | 
                | Man, I need to test the q8 version with llamafile's
                | optimizations. It would be so nice to host it locally
                | with the new Ryzens; it could maybe fit in my 96GB of
                | RAM.
        
           | mysterhawk wrote:
           | https://justine.lol/matmul/
           | 
           | https://www.youtube.com/watch?v=-mRi-B3t6fA
           | 
            | Check out these articles.
        
         | yjftsjthsd-h wrote:
          | I'm not aware of any Debian-family distro that packages it,
          | but NixOS has at least ollama and llama-cpp in its repos.
          | Honestly, even if the more stable distributions did have
          | these things packaged, I would hesitate to use the packaged
          | versions, because all of this stuff is still moving so
          | quickly that you'd be on an old version and it would hurt.
          | 
          | Edit: Arch has ollama in official repos too. OpenSUSE has
          | https://software.opensuse.org/package/ollama .
        
         | paxys wrote:
         | You already can with ollama
        
         | o11c wrote:
         | Realistically, you probably want to wait until Vulkan support
         | trickles out. That way, you aren't at the whim of the various
         | evil hardware drivers (everybody's suck), and the AI can give
         | you a disappointingly confused answer much faster than running
         | the LLM on a CPU can.
        
       | vessenes wrote:
        | This is not a memory reduction technique that's somehow
        | magical, though it does manage memory with some clever
        | scheduling. The core of the idea is that you can schedule
        | inference out to edge nodes in a memory- and bandwidth-
        | optimized way that's a bit different from just splitting
        | layers.
        | 
        | They propose that right now computation and latency dominate
        | the costs of multi-node inference, and they pick a network
        | topology (star) that is savvy to that.
        | 
        | That said, it's 26-29 seconds per token for llama2-70b with
        | their 8 edge devices, each using 4 GB of RAM. It's amazing
        | that they can run it at all, but this isn't going to be
        | viable at the edge with current hardware.
        | 
        | I think the paper makes the case that you could probably
        | recruit, say, 30 graphics workstations to do much faster
        | inference without just nailing your LAN bandwidth, though.
        | 
        | Upshot: interesting paper with smart ideas. Large frontier
        | models still need very exotic hardware and bandwidth
        | interconnects; this may point a way forward to lowering the
        | bandwidth-interconnect part of the story.
        
         | tgtweak wrote:
          | I think the main advantage here is you COULD run it, even
          | if it takes a while. That is a step up from current model
          | limitations, which require enough RAM or VRAM to hold the
          | model.
          | 
          | I think this lays some groundwork for running a 400B model
          | on a 3090/4090 or an even smaller GPU. If you can get a
          | huge model like that running on a single GPU, even if the
          | mean time per token is in the seconds, that's acceptable
          | for many use cases.
          | 
          | If this same technique can be used to extend context
          | windows in addition to token autocomplete, that would be
          | great in its own right.
          | 
          | Hopefully work like this continues, as throwing a ton of
          | VRAM at a model should be regarded as a performance
          | optimization, not necessarily a requirement.
        
           | michaelt wrote:
            | It's already technically possible to run huge models
            | locally when you don't have the RAM/VRAM needed -
            | llama.cpp can 'mmap' the model from disk.
            | 
            | Of course, an nvidia 4090 has a memory bandwidth of about
            | 1000 GB per second; a CPU like the i7-13700K has a memory
            | bandwidth of 90 GB per second; and a high-end NVMe SSD
            | might only have a read bandwidth of 10 GB per second.
            | 
            | So in approximate terms, an LLM and quantisation level
            | that can produce 10 tokens per second on a 4090 will
            | produce 1 token per second from RAM and a token every 10
            | seconds from SSD.
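            | 
            | As a back-of-envelope sketch of that arithmetic (assuming
            | decoding is memory-bandwidth bound and every weight is
            | read once per token; the ~35 GB model size is just
            | illustrative, roughly a 70B model at 4-bit):
            | 
            |     model_bytes = 70e9 * 0.5  # ~35 GB at 4-bit
            |     bw_gb_s = {
            |         "4090 VRAM": 1000,
            |         "DDR5 RAM": 90,
            |         "NVMe SSD": 10,
            |     }
            |     for tier, bw in bw_gb_s.items():
            |         tok_s = bw * 1e9 / model_bytes
            |         print(f"{tier}: ~{tok_s:.2f} tokens/s")
            | 
            | The absolute numbers depend on the model and quant, but
            | the roughly 10x steps between tiers are what matter.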
        
           | ignoramous wrote:
           | > _That is a step up from current model limitations which
           | require ram or vram to hold the model._
           | 
           | Current? Apple recently published a neat paper on how they
           | optimise for both inference (cpu/gpu) and memory use:
            |   Our method involves constructing an inference cost
            |   model that takes into account the characteristics of
            |   flash memory, guiding us to optimize in two critical
            |   areas: reducing the volume of data transferred from
            |   flash and reading data in larger, more contiguous
            |   chunks. Within this hardware-informed framework, we
            |   introduce two principal techniques. First, "windowing"
            |   strategically reduces data transfer by reusing
            |   previously activated neurons, and second, "row-column
            |   bundling", tailored to the sequential data access
            |   strengths of flash memory, increases the size of data
            |   chunks read from flash memory. These methods
            |   collectively enable running models up to twice the
            |   size of the available DRAM, with up to 4x and 20x
            |   increase in inference speed compared to naive loading
            |   approaches in CPU and GPU, respectively.
           | 
           | https://news.ycombinator.com/item?id=38704982
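            | 
            | A very loose sketch of that "windowing" idea (purely
            | illustrative, not Apple's code): keep the neurons used
            | for the last few tokens resident in DRAM and only fetch
            | the newly needed ones from flash.
            | 
            |     from collections import deque
            | 
            |     resident = set()          # neuron ids in DRAM
            |     window = deque(maxlen=5)  # last 5 tokens' usage
            | 
            |     def fetch_for(active_now):
            |         new = set(active_now) - resident
            |         window.append(set(active_now))
            |         keep = set().union(*window)
            |         resident.intersection_update(keep)
            |         resident.update(active_now)
            |         return new  # only these are read from flash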
        
           | diggan wrote:
            | > I think the main advantage here is you COULD run it,
            | even if it takes a while.
            | 
            | I mean, you COULD run it before as well, even if you
            | don't have enough RAM or VRAM, by using something like
            | `zram`. It'd probably be even slower (and borderline
            | usable, depending on the use case), but it's not
            | impossible to get things to _run_.
        
         | alchemist1e9 wrote:
         | > I think the paper makes the case that you could probably
         | recruit say your 30 graphics workstations to do much faster
         | inference without just nailing your LAN bandwidth, though.
         | 
         | Could be a big deal if it allows cluster of smaller GPUs to
         | compete with a single large VRAM GPU.
         | 
          | Unfortunately I'm a few months out of date - which is an
          | eternity in LLM inference techniques - so I'm not sure what
          | the current state of distributed inference looks like.
        
           | Palomides wrote:
            | llama.cpp already supports splitting work across multiple
            | nodes on a network.
            | 
            | It essentially just copies a chunk of the model to each
            | one; this works well for situations where each machine
            | has limited VRAM.
        
             | ay wrote:
              | Any pointers to RTFM / the llama.cpp repo for that? I
              | could not find anything on a cursory look. Thanks in
              | advance!
        
               | Palomides wrote:
                | https://github.com/ggerganov/llama.cpp/tree/master/examples/...
               | 
               | you run that on the remote nodes
        
           | vessenes wrote:
            | Yeah, I think these methods could be baked into llama.cpp
            | or some other Python library higher up the toolchain, or
            | what have you. They shard out each layer (ish?) to the
            | edges and recombine that layer's inference at the master
            | node, while the outside edges load up their next bit if
            | they need to; I would guess the devil is in the details
            | for all the possible tensor types and architectures (for
            | instance, how shall we implement skip layers?).
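            | 
            | A toy sketch of that shard-and-recombine step (plain
            | numpy, no networking; a real system would replace the
            | list comprehension with RPCs to the edge nodes):
            | 
            |     import numpy as np
            | 
            |     def host_forward(x, W, n_workers=4):
            |         # column-split one linear layer across workers
            |         shards = np.array_split(W, n_workers, axis=1)
            |         parts = [x @ s for s in shards]  # on the edges
            |         # recombine at the center of the star
            |         return np.concatenate(parts, axis=-1)
            | 
            |     x = np.random.randn(1, 512)
            |     W = np.random.randn(512, 2048)
            |     assert np.allclose(host_forward(x, W), x @ W)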
        
         | k1musab1 wrote:
          | Do you think this allows distributed inference only, or
          | does it open the door to distributed training as well?
          | Democratization of these models is in part hampered by the
          | total compute a single person or small group can make use
          | of, but if a folding@home-style project for training large
          | models were possible, it could change the game somewhat.
        
       | Zetaphor wrote:
       | Is this different from (or related to) the work being done by the
       | exo project?
       | 
       | https://github.com/exo-explore/exo
        
         | tgtweak wrote:
          | Exo is for partitioning a model over the network across
          | devices (implementing some bandwidth-reducing partitioning
          | schemes), but it still has a minimum RAM/VRAM requirement
          | to load a model. This could, in theory, be combined with it
          | to allow larger models to run on exo clusters with less
          | GPU/RAM than the underlying model requires (at the cost of
          | some performance, no doubt, but still).
        
           | alexandercheema wrote:
           | exo maintainer here. tgtweak is correct.
           | 
            | This looks like some potentially promising research that
            | I'm looking into reproducing now. We want to lower the
            | barrier to running large models as much as possible, so
            | if this works, it would be a potential addition to the
            | exo offering.
        
       | tgtweak wrote:
        | Is there a CUDA implementation of this... asking for a
        | friend.
        
       | adam_arthur wrote:
        | While I do think there's going to be a huge market for cloud-
        | based LLM serving, the fact that consumer hardware can fairly
        | easily run close-to-SOTA models (e.g. a high-RAM MBP config)
        | suggests to me that the provider market won't be as big as
        | investors are betting on.
        | 
        | Most of the rewards will be reaped by consumers rather than
        | providers.
        | 
        | We're also in an age where the current levels of RAM in
        | consumer devices were settled on before LLMs existed. I find
        | it highly likely vendors will optimize for higher RAM
        | capacity over other priorities in future hardware.
        | 
        | How long until a 256GB RAM laptop (shared with the GPU) is
        | reasonably cheap and available? I give it a few years at
        | most.
        | 
        | It's possible that models grow orders of magnitude larger,
        | but I find it more likely that model sizes will grow along
        | the curve of decreasing training costs and improving
        | hardware. There will be a sweet spot where it's economical
        | to train larger models, and private companies won't push
        | much beyond that.
        
         | lxgr wrote:
          | Enterprises use LLMs too, and quite often there wouldn't be
          | any client you could reasonably run the model on. (You
          | wouldn't want to e.g. have an LLM summarize and categorize
          | a user request on _their_ device, since that would require
          | shipping your model and/or internal knowledge base to the
          | client.)
        
           | adam_arthur wrote:
            | Yes, but if you can run a sufficient LLM on a $2,000
            | laptop, then the cost to serve it from the cloud will be
            | similarly cheap (e.g. reserve an appropriately sized EC2
            | instance for pennies on the dollar).
            | 
            | It's a highly competitive market. Companies aren't going
            | to pay $100k/year to run a model that can run on a $2k
            | consumer-grade device.
            | 
            | 128GB of fast, GPU-accessible RAM can be had for $5,000
            | in a MacBook Pro today. What will it be 3-4 years from
            | now on linux/windows machines?
            | 
            | And we still haven't seen any SoC providers try to
            | optimize for RAM capacity over compute yet.
        
             | lxgr wrote:
             | Oh yes, I could definitely see the privacy-preserving
             | consumer use case creating sufficient demand for efficiency
             | that also bleeds over into the enterprise market.
             | 
             | That's what's happened with power efficiency and ARM CPUs,
             | after all!
        
               | adunsulag wrote:
                | This is where I want highly sensitive healthcare
                | consumers of LLMs to end up: note summarization,
                | suggested diagnoses (with the provider always in
                | control), and other augmented abilities for the
                | clinical staff, without the risk of healthcare data
                | being sent outside the device or the very local
                | network.
        
               | adam_arthur wrote:
                | Not sure what you mean:
                | https://aws.amazon.com/ec2/graviton/
                | 
                | Not to speak of managed cloud services that run on
                | ARM under the hood/behind the scenes.
                | 
                | Of course, ARM isn't inherently cheaper; AMD and
                | Intel could cut prices/margins significantly and
                | probably be competitive on $/perf.
        
           | simsla wrote:
           | Depends, shipping part of it (just an encoder or decoder)
           | could still work.
        
             | lxgr wrote:
             | Even if bandwidth weren't an issue and all users had
             | compatible hardware: You'd still be offloading a
             | (semi-)trusted computation to user hardware, which is
             | usually completely untrusted.
        
       ___________________________________________________________________
       (page generated 2024-10-03 23:00 UTC)