[HN Gopher] Lossless LLM compression for efficient GPU inference...
       ___________________________________________________________________
        
       Lossless LLM compression for efficient GPU inference via dynamic-
       length float
        
       Author : CharlesW
       Score  : 217 points
       Date   : 2025-04-25 18:20 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | loufe wrote:
        | I'm so grateful to live through such exciting times. I can open
        | HN every two and find some exciting news about ML/transformer
        | models. I really should read more into it, but does llama.cpp
        | use a "custom kernel" per se, with cuBLAS, or is it just making
        | good use of the cuBLAS kernel?
        
         | jonplackett wrote:
         | It's funny that you're missing the time frame from your
         | sentence.
         | 
         | 2 weeks? Two months? Two days? Two minutes?
         | 
         | All of the above are true sometimes! Exciting times indeed.
        
       | iamnotagenius wrote:
       | Interesting, but not exactly practical for a local LLM user, as
        | 4-bit is how LLMs are run locally.
        
         | sroussey wrote:
          | True, but their research did include running on a local 5080.
          | 
          | The big takeaway, in my opinion, is that their technique for
          | LUTs etc. could also be applied to lossy quants as well. Say
          | maybe you get 5-bit accuracy in the size of 4-bit?
          | 
          | I don't know, but maybe? Also, their two-stage design might
          | make current quantized kernel designs better.
        
           | spindump8930 wrote:
            | Yes, it could be stacked on quants. It might be that
            | quantized activations are already more "dense" and so can't
            | be compressed as much (compared to bf16's 16 -> ~11 bits),
            | but it's certainly possible.
        
             | jasonjmcghee wrote:
             | I read it similarly - that this is a specific attribute of
             | bfloat16, so the quants folks tend to run on local hardware
             | don't have the same inefficiency to exploit
        
         | gojomo wrote:
         | Some might prefer the fidelity of this method's 70% savings
          | over the lossiness of 4-bit quantization's 75%.
         | 
         | And, maybe the methods stack for those willing to trade both
         | costs for the smallest representation.
        
           | svachalek wrote:
           | This is only a 30% savings, which is a cool technical feat
           | but hard to see a use case for.
        
       | Havoc wrote:
       | I'm guessing by lossless they mean something other than what the
       | word usually means in compression context?
       | 
       | >achieving near information-optimal compression without any loss
       | of precision
       | 
       | So perhaps more lossless as in didn't lose perplexity/benchmarks?
       | 
       | In my mind lossless is precisely zero bits lost along the way.
        
         | Vendan wrote:
         | information-optimal compression is "the theoretical minimum
         | number of bits needed to represent data without losing any
         | information, based on the data's entropy", so I think they mean
         | the same thing you do
        
           | brokencode wrote:
           | Yeah, they're saying that this compression is almost as good
           | as is theoretically possible without losing any information.
        
         | 8ytecoder wrote:
         | Think Morse code, where frequently used letters have shorter
         | codes than less frequent ones. This ensures zero loss of
         | information.
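          | 
          | As a toy illustration of that idea (a hand-picked prefix code
          | over a four-letter alphabet, not the paper's actual Huffman
          | tables), the round trip below is bit-exact:
          | 
          |   # Frequent symbols get shorter codes; the prefix property
          |   # (no code is a prefix of another) makes decoding unambiguous.
          |   code = {'e': '0', 't': '10', 'a': '110', 'z': '111'}
          |   decode = {v: k for k, v in code.items()}
          | 
          |   def compress(text):
          |       return ''.join(code[c] for c in text)
          | 
          |   def decompress(bits):
          |       out, cur = [], ''
          |       for b in bits:
          |           cur += b
          |           if cur in decode:
          |               out.append(decode[cur])
          |               cur = ''
          |       return ''.join(out)
          | 
          |   msg = 'ateztea'
          |   assert decompress(compress(msg)) == msg  # lossless round trip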
        
         | artemisart wrote:
         | The first sentence of the introduction ends with "we introduce
         | Dynamic-Length Float (DFloat11), a lossless compression
         | framework that reduces LLM size by 30% while preserving outputs
         | that are bit-for-bit identical to the original model" so yes
         | it's lossless.
        
         | ziddoap wrote:
         | The part you quote is a few sentences _past_ the sentence that
         | says _" preserving outputs that are bit-for-bit identical to
         | the original model"_.
        
         | vintermann wrote:
         | A good example that information, i.e. bits, are only meaningful
         | with respect to an end. If you don't know what the bits in a
          | float will be used for, you can't throw them away, but if the
         | floats are in a function, and you know that what some bits are
         | can't affect the output of the function regardless of input,
         | then you can throw those bits away and still have a lossless
         | compression of _the function_.
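          | 
          | A contrived sketch of that distinction (the function below
          | only ever reads the sign bit, so the other 63 bits of its
          | input can be discarded without changing any output; the names
          | are mine, purely for illustration):
          | 
          |   import math, struct
          | 
          |   def f(x: float) -> float:
          |       return math.copysign(1.0, x)   # depends only on the sign bit
          | 
          |   def keep_sign_only(x: float) -> float:
          |       # Drop all 63 non-sign bits: lossy for x itself,
          |       # but lossless with respect to f.
          |       bits = struct.unpack('<Q', struct.pack('<d', x))[0]
          |       sign = bits & (1 << 63)
          |       return struct.unpack('<d', struct.pack('<Q', sign))[0]
          | 
          |   for x in (3.5, -0.25, 1e300, -0.0):
          |       assert f(x) == f(keep_sign_only(x))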
        
       | wills_forward wrote:
        | So this could universally decrease the memory requirements of
        | un-quantized LLMs by 30%? Seems big if true.
        
         | moffkalast wrote:
          | Not as big when Q8 quantization is already considered overkill
          | and cuts it down to 50% (and a flat 2x speed boost without any
          | additional compute overhead, mind you), and the more common
          | Q4KM is more like 30%. Definitely interesting if it can be
          | added to existing quantization, but K quants already use
          | different precision levels for different layers depending on
          | general perplexity impact, which is similar to this entropy
          | metric they use, e.g. Q6 using a mix of 4 bits and 8 bits. And
          | that's not even considering calibrated imatrix, which does
          | something conceptually similar to FFT to compress even
          | further.
        
           | janalsncm wrote:
           | Quantization is not lossless.
        
             | danielmarkbruce wrote:
             | Nobody really cares if it meets a strict definition of
             | lossless.
        
               | moffkalast wrote:
               | And when you consider that the usual final step in the
               | pipeline is that a sampler goes ham on the probabilities
               | and just picks some random nonsense, the tolerance for
               | lossy compression is fairly high.
               | 
               | In fact, there's this funny occurrence where Q4 models on
               | occasion perform better than their fp16 counterparts on
                | benchmarks run with top_k=1, since the outputs are
               | slightly more random and they can less deterministically
               | blunder past the local maximum into a more correct
               | solution.
        
               | kridsdale3 wrote:
                | That's not true if there are measurable performance
                | differences.
        
               | kadushka wrote:
               | If you get any accuracy degradation with full 8 bits of
               | precision you're doing it wrong.
        
               | omneity wrote:
               | Or your model wasn't trained so well (weights are too
               | spiky)
        
               | danielmarkbruce wrote:
               | "strict" means something. People, including yourself,
               | only care if there is a practical difference in
               | performance. "this is lossless and that isn't lossless"
               | is a completely useless statement in this realm. In many
               | domains lossy compression is either not tolerated, not
               | legal or not practical.
        
               | throwaway314155 wrote:
               | Seems reductive.
        
               | BoorishBears wrote:
               | I do? I spend a ton of time post-training models for
               | creative tasks.
               | 
                | The effects of model quantization are usually quantified
                | in terms of performance on benchmaxxed tasks with strong
                | logit probabilities, temp 0, and a "right" answer the
                | model has to pick. Or even worse, they'll be measured on
                | metrics that don't map to anything except themselves,
                | like perplexity (https://arxiv.org/pdf/2407.09141).
               | 
               | I agree Q8 is strong but I also think the effects of
               | quantization are constantly being underappreciated.
               | People are often talking about how these models perform
               | while fundamentally using 10+ variants of a single model
               | with distinct performance profiles.
               | 
               | Even knowing the bits per weight used isn't enough to
               | know how exactly a given quant method is affecting the
                | model:
                | https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs
        
               | danielmarkbruce wrote:
               | "Nobody really cares if it meets a strict definition of
               | lossless" != "quantization can be done haphazardly."
        
               | BoorishBears wrote:
                | If you're trying to really snarkily refer to the article
                | on Dynamic Quants 2.0 and how carefully developed they
                | were, they're comparing their quants to the methodology
                | that 99.99% of quants out there use.
                | 
                | The problem is not that people are making quants
                | "haphazardly", it's that people keep parroting that
                | various quants are "practically lossless" when they
                | actually have absolutely no clue how lossy they are,
                | given how application-specific the concept is for
                | something as multidimensional as an LLM.
                | 
                | The moment anyone tries a little harder to quantify how
                | lossy they are, we repeatedly find that the answer is
                | "not by any reasonable definition of lossless". Even
                | their example where Q4 is <1% away on MMLU 5-shot is
                | probably massively helped by a calibration dataset that
                | maps to MMLU-style tasks really well, just like
                | constantly using WikiText massively helps models that
                | were trained on... tons of text from Wikipedia.
                | 
                | So unless you're doing your own calibrated quantization
                | with your own dataset (which is not impossible, but also
                | nowhere near common), even their "non-haphazard" method
                | could have a noticeable impact on performance.
        
               | danielmarkbruce wrote:
               | Wasn't referring to that.
               | 
               | You are saying that people are using quantized models
               | haphazardly and talking about them haphazardly. I'll
               | grant it's not the exact same thing as making them
               | haphazardly, but I think you took the point.
               | 
               | The terms shouldn't be used here. They aren't helpful.
               | You are either getting good results or you are not. It
               | shouldn't be treated differently from further training on
               | dataset d. The weights changed - how much better or worse
               | at task Y did it just get?
        
               | BoorishBears wrote:
               | The term is perfectly fine to use here because choosing a
               | quantization strategy to deploy already has enough
               | variables:
               | 
               | - quality for your specific application
               | 
               | - time to first token
               | 
               | - inter-token latency
               | 
               | - memory usage (varies even for a given bits per weight)
               | 
               | - generation of hardware required to run
               | 
               | Of those the hardest to measure is consistently "quality
               | for your specific application".
               | 
               | It's _so_ hard to measure robustly that many will take
               | significantly worse performance on the other fronts just
               | to not have to try to measure it... which is how you end
               | up with full precision deployments of a 405b parameter
               | model: https://openrouter.ai/meta-
               | llama/llama-3.1-405b-instruct/pro...
               | 
               | When people are paying multiples more for compute to
               | side-step a problem, language and technology that allows
               | you to erase it from the equation is valid.
        
               | danielmarkbruce wrote:
               | You say that as though people know these things for the
               | full precision deployment and their use case.
               | 
                | Some have the capability to figure it out and can do so
                | for both full precision and quantized. Most don't and
                | cannot.
        
       | badmonster wrote:
       | What stands out most is the practical implication: enabling
       | lossless inference of a 405B-parameter model on a single node
       | with 8x80GB GPUs is wild. That's a huge unlock for research labs
       | and startups alike that want to run frontier models without
       | massive infrastructure costs.
        
         | danielmarkbruce wrote:
          | It's... useful right now... it's not a huge unlock in a world
          | where model size, GPU memory size, and precision support are
          | all changing quickly.
        
           | striking wrote:
           | Is GPU memory size really changing that quickly? For that
           | matter, is model size?
        
             | kadushka wrote:
             | What's rapidly changing are quantization algorithms, and
             | hardware features to support those algorithms. For example,
             | Blackwell GPUs support dynamic FP4 quantization with group
             | size 16. At that group size it's close to lossless (in
             | terms of accuracy metrics).
        
             | danielmarkbruce wrote:
             | Yes, yes.
             | 
              | Nvidia is about to release Blackwell Ultra with 288GB. Go
              | back to maybe 2018 and the max was 16GB, if memory serves.
              | 
              | DeepSeek recently released a 670GB model. A couple of
              | years ago, Falcon's 180GB seemed huge.
        
               | spoaceman7777 wrote:
               | I'd assume that, in the context of LLM inference,
               | "recent" generally refers to the Ampere generation and
               | later of GPUs, when the demand for on board memory went
                | through the roof (as the first truly usable LLMs were
                | trained on A100s).
               | 
               | We've been stuck with the same general caps on standard
               | GPU memory since then though. Perhaps limited in part
               | because of the generational upgrades happening in the
               | bandwidth of the memory, rather than the capacity.
        
               | danielmarkbruce wrote:
               | Bandwidth is going up too. "It's not doubling every 18
               | months and hence it's not moving" isn't a sensible way to
               | view change.
               | 
               | A one time effective 30% reduction in model size simply
               | isn't going to be some massive unlocker, in theory or in
               | practice.
        
             | latchkey wrote:
             | Both AMD and Nvidia are dumping more and more memory into
             | their GPUs.
             | 
              | MI300x is 192GB HBM3, MI325x is 256GB HBM3e, and MI355x
              | should be 288GB HBM3e (and support FP4/FP6).
        
               | NBJack wrote:
                | The professional side of things, yes. For consumer-grade
                | GPUs, despite gaming market trends that would otherwise
                | call for it, the values have stagnated a bit.
        
               | latchkey wrote:
                | I'm under NDA with AMD and sadly can't mention details,
                | but I can say the future is promising.
        
         | miohtama wrote:
          | I am not an expert here, so I want to ask: what's magical
          | about the 405B number?
        
           | daveguy wrote:
           | That's the size of the largest, most capable, open source
           | models. Specifically Llama 3.1 has 405B parameters.
           | Deepseek's largest model is 671B parameters.
        
             | mhitza wrote:
              | Small corrections: Llama 3.1 is not an Open Source model,
              | but a Llama 3.1 Licensed model. Neither is DeepSeek,
              | apparently (https://huggingface.co/deepseek-
              | ai/DeepSeek-V3/blob/main/LIC...), which I was under the
              | false impression it was. Though I never considered using
              | it, so I hadn't checked the license before.
        
         | latchkey wrote:
         | > That's a huge unlock for research labs and startups alike
         | that want to run frontier models without massive infrastructure
         | costs.
         | 
         | Or let one of the neoclouds take care of the infrastructure
         | costs and rent it out from them. Disclosure: I run one of them.
        
           | airstrike wrote:
           | Keep up the great work! We need more of you and other
           | players.
           | 
           | Some unsolicited feedback: I would suggest reworking your
           | landing page so that the language is always from your
           | customers' perspective. Your customers want to solve a real
           | internal problem that they have. Talking about how great your
           | company is will always have less impact than talking about
           | how you know what that problem is and how you intend to solve
           | it.
           | 
           | Your mission is relative to you and your investors, not to
           | your customers. They care about themselves.
           | 
           | Your "quick start" should be an interactive form. I shouldn't
           | have to remember what to put in an email to reach out to you.
           | Make it easy for me. Also move that to the front page,
           | provide a few "standard" packages and a custom one. Reduce
           | the friction to clicking the CTA.
           | 
           | Since your pricing is transparent, you should be able to tell
           | me what that price will be before I even submit a request. I
           | assume you're cheaper than the competition (otherwise why
           | would I not go with them?) so make that obvious. Check out
           | Backblaze's website for an example page:
           | https://www.backblaze.com/cloud-storage/pricing
           | 
           | Shell out a few grand and hire a designer to make your page
           | look more professional. Something like
           | https://oxide.computer/ but with the points above, as they
           | also make the same mistake of making their home page read
           | like a pitch deck.
        
             | latchkey wrote:
             | Fantastic unsolicited feedback, I'm definitely taking this
             | to heart!
             | 
              | The website is intended to be more like documentation than
              | a pitch deck or a useless splash page with a contact-us
              | form. I dislike sites like Oxide's; I scroll past and
              | don't read or ingest any of the fancy parts. Of course,
              | you're right, this probably needs to be less about me. =)
             | 
             | Friction definitely needs to be improved. That part is
             | being worked on right now. Our intention is to be fully
             | self-service, so that you don't have to talk to us at all,
             | unless you want to. Credit card and go.
             | 
             | We recently lowered our prices to be competitive with the
             | rest of the market vs. focusing on people who care more
             | about what we offer. We weren't trying to be cheaper than
             | everyone else, we were trying to offer a better service.
             | Lesson learned and pricing adjusted. Streisand effect, I
             | don't like to mention the other players much.
             | 
             | Again, thanks!
        
       | ein0p wrote:
        | Note that this is _way_ slower at the small batch sizes you'd
        | need for interactive use. At batch size 1 this seems to run at
        | 1/3rd the speed of bf16 (so about 1/6th the speed of the fp8
        | you'd realistically be using), if figure 5 is to be believed.
        | This is actually a pretty impressive feat in itself if you know
        | anything about GPU kernel programming, but it is much slower
        | nevertheless. For this to work at "wire speed" it'd need
        | hardware support, which takes years. Their "baseline" elsewhere
        | in the paper is CPU offloading, which is dog slow and can't be
        | made fast due to the PCIe bottleneck.
        
         | timschmidt wrote:
         | It's perfectly possible to run LLMs quickly on CPUs. An Epyc or
         | Xeon with 12 memory channels achieves similar memory bandwidth
         | to a 4090, which is the limiting factor. Engineering sample
         | Epycs in kits with motherboard and RAM are available on
         | Aliexpress for reasonable prices even.
        
           | ein0p wrote:
           | Did I say it wasn't? If your context is short and your model
           | is small, it is possible to run LLMs on high-end CPUs able to
           | support 12 channels of high-spec DDR5 RDIMMs. It's not
           | possible to run them as fast as they'd run on a GPU equipped
           | with HBM though. Nor would it be even remotely as energy
           | efficient. Also, it's not possible to run LLMs quickly on CPU
           | if your context is long, because CPUs do not have the
           | requisite FLOPS to process long context quickly. And before
           | you bring MoE into the conversation, MoE only affects the
           | feedforward part of each transformer block, and full memory
           | bandwidth and compute savings are only realized at batch size
           | 1, sequence length 1, AKA the most inefficient mode that
            | nobody other than Ollama users uses in practice. Sequence
           | length 8 (common for speculative decoding) could be using up
           | to 8x37B parameters (assuming you want to run DeepSeek - the
           | strongest available open weights model). Batch size of even 2
           | with sequence length 8 could use almost all parameters if
           | you're particularly unlucky. Prompt will almost certainly use
           | all parameters, and will slam into the FLOPS wall of your
           | EPYC's ALUs. So can LLMs (with an emphasis on "Large") be run
           | on CPUs? Yes. Are you going to have a good time running them
           | this way? No.
        
             | timschmidt wrote:
             | llamafile contains specific optimizations for prompt
             | processing using AVX512 for dealing with just this issue:
             | https://justine.lol/matmul/ (about a 10x speedup over
             | llama.cpp)
             | 
             | Somewhere between 8 and 192 cores I'm sure there's enough
             | AVX512 to get the job done. And we've managed to reinvent
             | Intel's Larrabee / Knights concept.
             | 
             | Sadly, the highly optimized AVX512 kernels of llamafile
             | don't support these exotic floats yet as far as I know.
             | 
             | Yes, energy efficiency per query will be terrible compared
             | to a hyperscaler. However privacy will be perfect.
             | Flexibility will be higher than other options - as running
             | on the CPU is almost always possible. Even with new
             | algorithms and experimental models.
        
               | ein0p wrote:
               | At 192 cores you're way better off buying a Mac Studio,
               | though.
        
         | ow5 wrote:
          | Hi! One of the contributors to the paper here -- we have
          | kernels, not yet released, that can shave decoding latency
          | down by >20%.
          | 
          | Also, when we ran experiments for streaming with the current
          | kernels, we were a median of ~1.3x slower at inference.
        
           | ein0p wrote:
           | Thanks for chiming in! How do you explain the top-most graph
           | in Figure 5? Am I misreading it?
        
       | mountainriver wrote:
        | Is it possible to run this on new models? It seems like the
        | code is only for inference, unless I'm misunderstanding.
        
       | marksimi wrote:
       | Time to (dynamically) float
        
       | luotuoshangdui wrote:
       | Does it affect speed?
        
       | hchja wrote:
       | This is pretty useless in any case that doesn't involve BFloat16
       | models
        
         | throwaway314155 wrote:
         | So an increasingly smaller number of cases?
        
         | spindump8930 wrote:
          | bf16 is the de facto default datatype and distribution type
          | for LLMs, which are then often eagerly quantized by users with
          | more limited hardware. See the recent Llama releases and e.g.
          | the H100 spec sheet (advertised flops and metrics target
          | bf16).
        
       | yjftsjthsd-h wrote:
       | > Compared to a potential alternative of offloading parts of an
       | uncompressed model to the CPU to meet memory constraints,
       | DFloat11 achieves 1.9-38.8x higher throughput in token
       | generation. With a fixed GPU memory budget, DFloat11 enables
       | 5.3-13.17x longer context lengths than uncompressed models.
       | 
        | The context length alone probably makes it worthwhile even if
        | your models fit in memory, but I'm curious if it improves
        | tokens/sec even when running entirely on GPU, since _in my very
        | amateur understanding_ LLMs tend to be constrained by memory
        | bandwidth?
        
         | philjohn wrote:
          | My mental model says it might, much like how DoubleSpace in
          | DOS slightly sped up loading data from slow hard drives.
        
         | hnuser123456 wrote:
         | If the model is 70% the size, it will be 1/0.7 = 1.43x the
         | speed.
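          | 
          | (That back-of-the-envelope figure assumes decoding is purely
          | memory-bandwidth-bound and that decompression overhead is
          | negligible; a hypothetical worked example, with made-up
          | numbers:)
          | 
          |   bandwidth_bytes_s = 1000e9          # example: 1 TB/s GPU memory
          |   bytes_per_token_bf16 = 2 * 70e9     # 70B params, 2 bytes each
          |   bytes_per_token_df11 = 0.7 * bytes_per_token_bf16
          |   print(bandwidth_bytes_s / bytes_per_token_bf16)  # ~7.1 tok/s
          |   print(bandwidth_bytes_s / bytes_per_token_df11)  # ~10.2 tok/s, 1.43x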
        
       | anticensor wrote:
       | This is just a VBR mode for neural networks. Not quite useful
       | when inference is already quite slow.
        
       | jhj wrote:
        | This is just a consequence of the fact that bfloat16 has a very
        | high dynamic range which is not all used. People like
        | hyperparameters that look like 0.01, not 10^10, even though the
        | same fractional precision is available at each exponent, and if
        | you multiplied everything in a network (hyperparameters,
        | initialized weights, training data, etc.) by 10^6, things would
        | still work more or less the same, since the upper range is
        | hardly used (with the possible exception of some small number
        | of special functions).
       | 
        | The typical entropy of bfloat16 values seen in weights (and
        | activations) is about 10-12 bits (only 65-75% or so of the
        | value range is used in practice). Sign and mantissa bits tend
        | to be incompressible noise.
       | 
        | This has been exploited several times before in the context of
        | both classical HPC and AI, with lossless compression work from
        | Martin Burtscher's lab
        | (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL
        | (https://computing.llnl.gov/projects/fpzip), and my library
        | dietgpu from 2021 (https://github.com/facebookresearch/dietgpu),
        | which we used to speed up training on a large GPU cluster by
        | about 10% wall clock time overall by losslessly compressing all
        | data prior to sending and decompressing upon receipt (e.g.,
        | gradients, weights from backup, etc.), which still computes the
        | same thing as before since it is lossless.
       | 
        | Also, rANS is more efficient and easier to implement in
        | SIMD-like instruction sets than Huffman coding. It would also
        | reduce the latency/throughput penalties with DFloat11 (since we
        | have to decompress before we do the arithmetic).
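        | 
        | A rough back-of-the-envelope way to see the exponent
        | compressibility described above, using numpy and synthetic
        | Gaussian "weights" rather than a real checkpoint (real models
        | reportedly land around 10-12 bits per value):
        | 
        |   import numpy as np
        | 
        |   def bits_per_bf16_weight(w):
        |       # bfloat16 is the top 16 bits of float32 (truncation is
        |       # close enough for this estimate).
        |       u32 = w.astype(np.float32).view(np.uint32)
        |       bf16 = (u32 >> 16).astype(np.uint16)
        |       exponent = (bf16 >> 7) & 0xFF          # 8-bit exponent field
        |       counts = np.bincount(exponent, minlength=256)
        |       p = counts[counts > 0] / exponent.size
        |       h_exp = -(p * np.log2(p)).sum()        # Shannon entropy, in bits
        |       # Keep sign (1) and mantissa (7) verbatim, entropy-code
        |       # only the exponent.
        |       return 1 + 7 + h_exp
        | 
        |   w = np.random.normal(0.0, 0.02, size=1_000_000)
        |   print(bits_per_bf16_weight(w), "bits/weight vs. 16 raw")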
        
         | iandanforth wrote:
          | For those who don't bother to click through profiles, Jeff
          | _really_ knows what he's talking about. Much of Meta/FAIR +
          | community benefits from his code.
        
       | Animats wrote:
       | Once this weight format war settles down, hardware can be built
       | to support it. Presumably you want matrix multiply hardware
       | optimized for whatever weight format turns out to be reasonably
       | optimal.
        
         | eoerl wrote:
          | Optimization is post hoc here: you have to train first to be
          | able to Huffman encode, so it's not a pure format question.
        
       | aazo11 wrote:
       | This is a huge unlock for on-device inference. The download time
       | of larger models makes local inference unusable for non-technical
       | users.
        
       | aseligman wrote:
        | Some additional context: many real-world agent use cases
        | struggle to balance quality, cost, and performance. This
        | technique can help avoid the tradeoffs that quantization
        | techniques introduce, including unpredictable results while you
        | try to cost-optimize an agent. In some cases the cost savings
        | can be significant using DFloat11 as you squeeze into more
        | affordable GPUs.
        | 
        | * I work with xmad.ai
        
       ___________________________________________________________________
       (page generated 2025-04-25 23:00 UTC)