[HN Gopher] Lossless LLM compression for efficient GPU inference...
___________________________________________________________________
Lossless LLM compression for efficient GPU inference via dynamic-
length float
Author : CharlesW
Score : 217 points
Date : 2025-04-25 18:20 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| loufe wrote:
| I'm so grateful to live through such exciting times. I can open
| HN every two to some exciting new news about ML/transformer
| models. I really should read more into it, but does llama.cpp use
| a "custom kernel" per se, with cublas, or is it just making good
| use of the cublas kernel?
| jonplackett wrote:
| It's funny that you're missing the time frame from your
| sentence.
|
| 2 weeks? Two months? Two days? Two minutes?
|
| All of the above are true sometimes! Exciting times indeed.
| iamnotagenius wrote:
| Interesting, but not exactly practical for a local LLM user, as
| 4-bit is how LLMs are run locally.
| sroussey wrote:
| True, but their research did include running on 5080 local.
|
| The big takeaway, in my opinion, is that their technique for
| LUTs etc. could also be applied to lossy quants as well. Say
| maybe you get 5-bit accuracy in the size of 4-bit?
|
| I don't know, but maybe? Also their two-stage design might make
| current quantized kernel designs better.
| spindump8930 wrote:
| Yes, it could be stacked on quants. It might be that
| quantized activations already are more "dense" and so they
| can't be compressed as much (from 16 -> ~11 bits), but
| certainly possible.
| jasonjmcghee wrote:
| I read it similarly - that this is a specific attribute of
| bfloat16, so the quants folks tend to run on local hardware
| don't have the same inefficiency to exploit
| gojomo wrote:
| Some might prefer the fidelity of this method's 70% savings
| over the lossiness of 4-bit quantization's 75%.
|
| And, maybe the methods stack for those willing to trade both
| costs for the smallest representation.
| svachalek wrote:
| This is only a 30% savings, which is a cool technical feat
| but hard to see a use case for.
| Havoc wrote:
| I'm guessing by lossless they mean something other than what the
| word usually means in compression context?
|
| >achieving near information-optimal compression without any loss
| of precision
|
| So perhaps more lossless as in didn't lose perplexity/benchmarks?
|
| In my mind lossless is precisely zero bits lost along the way.
| Vendan wrote:
| information-optimal compression is "the theoretical minimum
| number of bits needed to represent data without losing any
| information, based on the data's entropy", so I think they mean
| the same thing you do
| brokencode wrote:
| Yeah, they're saying that this compression is almost as good
| as is theoretically possible without losing any information.
| 8ytecoder wrote:
| Think Morse code, where frequently used letters have shorter
| codes than less frequent ones. This ensures zero loss of
| information.
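|
| (For illustration only -- a minimal Python sketch of the same
| prefix-coding idea, where frequent symbols get shorter codes and
| the input is always recoverable bit-for-bit; the sample string is
| arbitrary:)
|
|     import heapq
|     from collections import Counter
|
|     def huffman_codes(symbols):
|         # Build a prefix code: frequent symbols get shorter codes,
|         # and decoding recovers the original sequence exactly.
|         freq = Counter(symbols)
|         heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
|         heapq.heapify(heap)
|         while len(heap) > 1:
|             f1, _, c1 = heapq.heappop(heap)
|             f2, i, c2 = heapq.heappop(heap)
|             merged = {s: "0" + c for s, c in c1.items()}
|             merged.update({s: "1" + c for s, c in c2.items()})
|             heapq.heappush(heap, (f1 + f2, i, merged))
|         return heap[0][2]
|
|     codes = huffman_codes("the quick brown fox jumps over the lazy dog")
|     for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
|         print(repr(sym), code)  # common symbols like ' ' and 'o' get the shortest codes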
| artemisart wrote:
| The first sentence of the introduction ends with "we introduce
| Dynamic-Length Float (DFloat11), a lossless compression
| framework that reduces LLM size by 30% while preserving outputs
| that are bit-for-bit identical to the original model" so yes
| it's lossless.
| ziddoap wrote:
| The part you quote is a few sentences _past_ the sentence that
| says _" preserving outputs that are bit-for-bit identical to
| the original model"_.
| vintermann wrote:
| A good example that information, i.e. bits, are only meaningful
| with respect to an end. If you don't know what the bits in a
| float will be used for, you can't throw them away, but if the
| floats are in a function, and you know that certain bits can't
| affect the output of the function regardless of input,
| then you can throw those bits away and still have a lossless
| compression of _the function_.
| wills_forward wrote:
| So this could universally decrease the memory requirements of
| un-quantized LLMs by 30%? Seems big if true.
| moffkalast wrote:
| Not as big when Q8 quantization is already considered overkill
| and cuts it down to 50% (and a flat 2x speed boost without any
| additional compute overhead mind you) and the more common Q4KM
| is more like 30%. Definitely interesting if it can be added to
| existing quantization, but K quants do already use different
| precision levels for different layers depending on general
| perplexity impact which is similar to this entropy metric they
| use, e.g. Q6 using a mix of 4 bits and 8 bits. And that's not
| even considering calibrated imatrix which does something
| conceptually similar to FFT to compress even higher.
| janalsncm wrote:
| Quantization is not lossless.
| danielmarkbruce wrote:
| Nobody really cares if it meets a strict definition of
| lossless.
| moffkalast wrote:
| And when you consider that the usual final step in the
| pipeline is that a sampler goes ham on the probabilities
| and just picks some random nonsense, the tolerance for
| lossy compression is fairly high.
|
| In fact, there's this funny occurrence where Q4 models on
| occasion perform better than their fp16 counterparts on
| benchmarks run with top_k=1 since the outputs are
| slightly more random and they can less deterministically
| blunder past the local maximum into a more correct
| solution.
| kridsdale3 wrote:
| That's not true if there are measurable performance
| differences.
| kadushka wrote:
| If you get any accuracy degradation with full 8 bits of
| precision you're doing it wrong.
| omneity wrote:
| Or your model wasn't trained so well (weights are too
| spiky)
| danielmarkbruce wrote:
| "strict" means something. People, including yourself,
| only care if there is a practical difference in
| performance. "this is lossless and that isn't lossless"
| is a completely useless statement in this realm. In many
| domains lossy compression is either not tolerated, not
| legal or not practical.
| throwaway314155 wrote:
| Seems reductive.
| BoorishBears wrote:
| I do? I spend a ton of time post-training models for
| creative tasks.
|
| The effects of model quantization are usually qualified
| in terms of performance on benchmaxxed tasks with strong
| logit probabilities, temp 0, and a "right" answer the
| model has to pick. Or even worse they'll be measured on
| metrics that don't map to anything except themselves like
| perplexity (https://arxiv.org/pdf/2407.09141)
|
| I agree Q8 is strong but I also think the effects of
| quantization are constantly being underappreciated.
| People are often talking about how these models perform
| while fundamentally using 10+ variants of a single model
| with distinct performance profiles.
|
| Even knowing the bits per weight used isn't enough to
| know how exactly a given quant method is affecting the
| model: https://docs.unsloth.ai/basics/unsloth-
| dynamic-v2.0-ggufs
| danielmarkbruce wrote:
| "Nobody really cares if it meets a strict definition of
| lossless" != "quantization can be done haphazardly."
| BoorishBears wrote:
| If you're trying to really snarkily refer to the article
| on Dynamic Quants 2.0 and how carefully developed they
| were, they're comparing their quants to the methodology that
| 99.99% of quants out there use.
|
| The problem is not that people are making quants
| "haphazardly", it's that people keep parroting that
| various quants are "practically lossless" when they
| actually have absolutely no clue how lossy they are given
| how application specific the concept is for something as
| multidimensional as an LLM.
|
| The moment anyone tries a little harder to quantify how
| lossy they are, we repeatedly find that the answer is
| "not any reasonably definition of lossless". Even in
| their example where Q4 is <1% away in MMLU 5-shot is
| probably massively helped by a calibration dataset that
| maps to MMLU-style tasks really well, just like
| constantly using WikiText massively helps models that
| were trained on... tons of text from Wikipedia.
|
| So unless you're doing your own calibrated quantization
| with your own dataset (which is not impossible, but also
| nowhere near common), even their "non-haphazard" method could
| have a noticeable impact on performance.
| danielmarkbruce wrote:
| Wasn't referring to that.
|
| You are saying that people are using quantized models
| haphazardly and talking about them haphazardly. I'll
| grant it's not the exact same thing as making them
| haphazardly, but I think you took the point.
|
| The terms shouldn't be used here. They aren't helpful.
| You are either getting good results or you are not. It
| shouldn't be treated differently from further training on
| dataset d. The weights changed - how much better or worse
| at task Y did it just get?
| BoorishBears wrote:
| The term is perfectly fine to use here because choosing a
| quantization strategy to deploy already has enough
| variables:
|
| - quality for your specific application
|
| - time to first token
|
| - inter-token latency
|
| - memory usage (varies even for a given bits per weight)
|
| - generation of hardware required to run
|
| Of those the hardest to measure is consistently "quality
| for your specific application".
|
| It's _so_ hard to measure robustly that many will take
| significantly worse performance on the other fronts just
| to not have to try to measure it... which is how you end
| up with full precision deployments of a 405b parameter
| model: https://openrouter.ai/meta-
| llama/llama-3.1-405b-instruct/pro...
|
| When people are paying multiples more for compute to
| side-step a problem, language and technology that allows
| you to erase it from the equation is valid.
| danielmarkbruce wrote:
| You say that as though people know these things for the
| full precision deployment and their use case.
|
| Some have the capability to figure it out and can do it for
| both full precision and quantized. Most don't and cannot.
| badmonster wrote:
| What stands out most is the practical implication: enabling
| lossless inference of a 405B-parameter model on a single node
| with 8x80GB GPUs is wild. That's a huge unlock for research labs
| and startups alike that want to run frontier models without
| massive infrastructure costs.
| danielmarkbruce wrote:
| It's... useful right now... it's not a huge unlock in a world
| where model size, GPU memory size, different precision support
| are changing quickly.
| striking wrote:
| Is GPU memory size really changing that quickly? For that
| matter, is model size?
| kadushka wrote:
| What's rapidly changing are quantization algorithms, and
| hardware features to support those algorithms. For example,
| Blackwell GPUs support dynamic FP4 quantization with group
| size 16. At that group size it's close to lossless (in
| terms of accuracy metrics).
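|
| (For intuition, a minimal sketch of group-wise quantization --
| simplified symmetric int4 with a per-group scale, not the actual
| Blackwell FP4 format:)
|
|     import numpy as np
|
|     def quantize_groupwise(w, group_size=16, bits=4):
|         # Each group of `group_size` weights shares one scale, so an
|         # outlier only degrades its own small group.
|         w = w.reshape(-1, group_size)
|         qmax = 2 ** (bits - 1) - 1                # 7 for signed 4-bit
|         scale = np.abs(w).max(axis=1, keepdims=True) / qmax
|         scale[scale == 0] = 1.0                   # guard all-zero groups
|         q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
|         return q, scale
|
|     def dequantize(q, scale):
|         return (q * scale).reshape(-1)
|
|     w = (np.random.randn(4096) * 0.02).astype(np.float32)
|     q, s = quantize_groupwise(w)
|     print("max abs error:", float(np.abs(dequantize(q, s) - w).max()))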
| danielmarkbruce wrote:
| Yes, yes.
|
| Nvidia is about to release Blackwell Ultra with 288GB. Go
| back to maybe 2018 and the max was 16GB if memory serves.
|
| DeepSeek recently released a 670GB model. A couple of years
| ago Falcon's 180GB seemed huge.
| spoaceman7777 wrote:
| I'd assume that, in the context of LLM inference,
| "recent" generally refers to the Ampere generation and
| later of GPUs, when the demand for on-board memory went
| through the roof (as the first truly usable LLMs were
| trained on A100s).
|
| We've been stuck with the same general caps on standard
| GPU memory since then though. Perhaps limited in part
| because of the generational upgrades happening in the
| bandwidth of the memory, rather than the capacity.
| danielmarkbruce wrote:
| Bandwidth is going up too. "It's not doubling every 18
| months and hence it's not moving" isn't a sensible way to
| view change.
|
| A one time effective 30% reduction in model size simply
| isn't going to be some massive unlocker, in theory or in
| practice.
| latchkey wrote:
| Both AMD and Nvidia are dumping more and more memory into
| their GPUs.
|
| MI300x is 192GB HBM3, MI325x is 256GB HBM3e, MI355x should be
| 288GB HBM3e (and support FP4/FP6).
| NBJack wrote:
| The professional side of things, yes. For consumer-grade
| GPUs, despite gaming market trends that would otherwise call
| for more, the values have stagnated a bit.
| latchkey wrote:
| I'm under NDA with AMD and sadly can't mention details, but I
| can say the future is promising.
| miohtama wrote:
| I am not an expert here, so I want to ask: what's magical
| about the 405B number?
| daveguy wrote:
| That's the size of the largest, most capable, open source
| models. Specifically Llama 3.1 has 405B parameters.
| Deepseek's largest model is 671B parameters.
| mhitza wrote:
| Small correction: Llama 3.1 is not an Open Source model,
| but a Llama 3.1 Licensed model. Neither is DeepSeek,
| apparently (https://huggingface.co/deepseek-
| ai/DeepSeek-V3/blob/main/LIC...), which I was under the
| false impression that it was. Though I never considered
| using it, so I hadn't checked the license before.
| latchkey wrote:
| > That's a huge unlock for research labs and startups alike
| that want to run frontier models without massive infrastructure
| costs.
|
| Or let one of the neoclouds take care of the infrastructure
| costs and rent it out from them. Disclosure: I run one of them.
| airstrike wrote:
| Keep up the great work! We need more of you and other
| players.
|
| Some unsolicited feedback: I would suggest reworking your
| landing page so that the language is always from your
| customers' perspective. Your customers want to solve a real
| internal problem that they have. Talking about how great your
| company is will always have less impact than talking about
| how you know what that problem is and how you intend to solve
| it.
|
| Your mission is relative to you and your investors, not to
| your customers. They care about themselves.
|
| Your "quick start" should be an interactive form. I shouldn't
| have to remember what to put in an email to reach out to you.
| Make it easy for me. Also move that to the front page,
| provide a few "standard" packages and a custom one. Reduce
| the friction to clicking the CTA.
|
| Since your pricing is transparent, you should be able to tell
| me what that price will be before I even submit a request. I
| assume you're cheaper than the competition (otherwise why
| would I not go with them?) so make that obvious. Check out
| Backblaze's website for an example page:
| https://www.backblaze.com/cloud-storage/pricing
|
| Shell out a few grand and hire a designer to make your page
| look more professional. Something like
| https://oxide.computer/ but with the points above, as they
| also make the same mistake of making their home page read
| like a pitch deck.
| latchkey wrote:
| Fantastic unsolicited feedback, I'm definitely taking this
| to heart!
|
| Website is intended to be more like documentation instead
| of a pitch deck or useless splash with a contact us form. I
| dislike sites like Oxide, I scroll past and don't read or
| ingest any of the fancy parts. Of course, you're right,
| this probably needs to be less about me. =)
|
| Friction definitely needs to be improved. That part is
| being worked on right now. Our intention is to be fully
| self-service, so that you don't have to talk to us at all,
| unless you want to. Credit card and go.
|
| We recently lowered our prices to be competitive with the
| rest of the market vs. focusing on people who care more
| about what we offer. We weren't trying to be cheaper than
| everyone else, we were trying to offer a better service.
| Lesson learned and pricing adjusted. Streisand effect, I
| don't like to mention the other players much.
|
| Again, thanks!
| ein0p wrote:
| Note that this is _way_ slower at small batch sizes you'd need
| for interactive use. At batch size 1 this seems to run at 1/3rd
| the speed of bf16 (so about 1/6th the speed of fp8 you'd
| realistically be using) if figure 5 is to be believed. This is
| actually a pretty impressive feat in itself if you know anything
| about GPU kernel programming, but it is much slower nevertheless.
| For this to work at "wire speed" it'd need hardware support,
| which takes years. Their "baseline" elsewhere in the paper is CPU
| offloading, which is dog slow and can't be made fast due to PCIe
| bottleneck.
| timschmidt wrote:
| It's perfectly possible to run LLMs quickly on CPUs. An Epyc or
| Xeon with 12 memory channels achieves similar memory bandwidth
| to a 4090, which is the limiting factor. Engineering sample
| Epycs in kits with motherboard and RAM are available on
| Aliexpress for reasonable prices even.
| ein0p wrote:
| Did I say it wasn't? If your context is short and your model
| is small, it is possible to run LLMs on high-end CPUs able to
| support 12 channels of high-spec DDR5 RDIMMs. It's not
| possible to run them as fast as they'd run on a GPU equipped
| with HBM though. Nor would it be even remotely as energy
| efficient. Also, it's not possible to run LLMs quickly on CPU
| if your context is long, because CPUs do not have the
| requisite FLOPS to process long context quickly. And before
| you bring MoE into the conversation, MoE only affects the
| feedforward part of each transformer block, and full memory
| bandwidth and compute savings are only realized at batch size
| 1, sequence length 1, AKA the most inefficient mode that
| nobody other than Ollama users uses in practice. Sequence
| length 8 (common for speculative decoding) could be using up
| to 8x37B parameters (assuming you want to run DeepSeek - the
| strongest available open weights model). Batch size of even 2
| with sequence length 8 could use almost all parameters if
| you're particularly unlucky. Prompt will almost certainly use
| all parameters, and will slam into the FLOPS wall of your
| EPYC's ALUs. So can LLMs (with an emphasis on "Large") be run
| on CPUs? Yes. Are you going to have a good time running them
| this way? No.
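|
| (To put a rough number on the MoE point -- assuming a DeepSeek-
| V3-style layout of 256 routed experts with 8 active per token,
| which is an assumption about the config, not a measurement:)
|
|     E, k = 256, 8                  # routed experts, experts per token
|     for n_tokens in (1, 8, 64):
|         # Expected number of distinct experts touched per layer if
|         # per-token choices were independent and uniform (rough model).
|         expected = E * (1 - (1 - k / E) ** n_tokens)
|         print(n_tokens, "token(s): ~", round(100 * expected / E), "% of experts")
|
|     # Even a handful of tokens in flight touches a large fraction of
|     # the weights, so the "only k/E of parameters" saving mostly holds
|     # at batch size 1, sequence length 1.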
| timschmidt wrote:
| llamafile contains specific optimizations for prompt
| processing using AVX512 for dealing with just this issue:
| https://justine.lol/matmul/ (about a 10x speedup over
| llama.cpp)
|
| Somewhere between 8 and 192 cores I'm sure there's enough
| AVX512 to get the job done. And we've managed to reinvent
| Intel's Larrabee / Knights concept.
|
| Sadly, the highly optimized AVX512 kernels of llamafile
| don't support these exotic floats yet as far as I know.
|
| Yes, energy efficiency per query will be terrible compared
| to a hyperscaler. However privacy will be perfect.
| Flexibility will be higher than other options - as running
| on the CPU is almost always possible. Even with new
| algorithms and experimental models.
| ein0p wrote:
| At 192 cores you're way better off buying a Mac Studio,
| though.
| ow5 wrote:
| Hi! One of the contributors to the paper -- we have kernels not
| released yet that can shave down decoding latency by >20%.
|
| Also when we ran experiments for streaming with the current
| kernels, we were median ~1.3x slower at inference
| ein0p wrote:
| Thanks for chiming in! How do you explain the top-most graph
| in Figure 5? Am I misreading it?
| mountainriver wrote:
| Is it possible to run this on new models? It seems like the code
| is only for inference, unless I'm misunderstanding
| marksimi wrote:
| Time to (dynamically) float
| luotuoshangdui wrote:
| Does it affect speed?
| hchja wrote:
| This is pretty useless in any case that doesn't involve BFloat16
| models
| throwaway314155 wrote:
| So an increasingly smaller number of cases?
| spindump8930 wrote:
| bf16 is the de facto default datatype and distribution type for
| LLMs, which are then often eagerly quantized by users with more
| limited hardware. See the recent Llama releases and e.g. the
| H100 spec sheet (advertised flops and metrics target bf16).
| yjftsjthsd-h wrote:
| > Compared to a potential alternative of offloading parts of an
| uncompressed model to the CPU to meet memory constraints,
| DFloat11 achieves 1.9-38.8x higher throughput in token
| generation. With a fixed GPU memory budget, DFloat11 enables
| 5.3-13.17x longer context lengths than uncompressed models.
|
| The context length alone probably makes it worthwhile even if
| your models fit in memory, but I'm curious if it improves
| tokens/sec even all on GPU, since _in my very amateur
| understanding_ LLMs tend to be constrained by memory bandwidth?
| philjohn wrote:
| My mental model is saying it might do, much like on slow hard
| drives DoubleSpace in DOS slightly sped up loading data from
| disk.
| hnuser123456 wrote:
| If the model is 70% the size, it will be 1/0.7 = 1.43x the
| speed.
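|
| (Back-of-envelope, assuming decode is purely memory-bandwidth
| bound; the parameter count and bandwidth below are made-up
| illustrative numbers:)
|
|     params = 8e9          # hypothetical 8B-parameter model
|     bandwidth = 1.0e12    # ~1 TB/s of GPU memory bandwidth (assumed)
|
|     for name, bits_per_weight in [("bf16", 16), ("DFloat11 ~11 bits", 11)]:
|         bytes_per_token = params * bits_per_weight / 8
|         print(name, "~", round(bandwidth / bytes_per_token, 1), "tok/s")
|
|     # The 16/11 ratio is roughly the 1/0.7 ceiling above; real kernels
|     # also pay a decompression cost, so measured speedups differ.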
| anticensor wrote:
| This is just a VBR mode for neural networks. Not quite useful
| when inference is already quite slow.
| jhj wrote:
| This is just a consequence of the fact that bfloat16 has a very
| high dynamic range which is not all used. People like
| hyperparameters that look like 0.01 not 10^10, even though there
| is the same fractional precision available at each exponent and
| if you multiplied everything - hyperparameters, initialized
| weights, training data, etc. in a network by 10^6, things will
| still work more or less the same since the upper range is hardly
| used (with the possible exception of some small number of special
| functions).
|
| Typical entropy of bfloat16 values seen in weights (and
| activations) is about 10-12 bits (only 65-75% or so of the value
| range is used in practice). Sign and mantissa bits tend to be
| incompressible noise.
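|
| (A minimal sketch to check that on your own weights -- random data
| stands in for a real checkpoint tensor here:)
|
|     import numpy as np
|
|     def entropy_bits(values):
|         # Empirical Shannon entropy, in bits per symbol.
|         _, counts = np.unique(values, return_counts=True)
|         p = counts / counts.sum()
|         return float(-(p * np.log2(p)).sum())
|
|     w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
|
|     # bfloat16 is the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa.
|     bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)
|     exponent = (bf16 >> 7) & 0xFF
|     mantissa = bf16 & 0x7F
|     sign = bf16 >> 15
|
|     print("exponent:", round(entropy_bits(exponent), 2), "of 8 bits")
|     print("mantissa:", round(entropy_bits(mantissa), 2), "of 7 bits")
|     print("sign:    ", round(entropy_bits(sign), 2), "of 1 bit")
|     # Real weights typically show a highly compressible exponent and
|     # near-incompressible sign/mantissa bits, as described above.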
|
| This has been exploited several times before in the context of
| both classical HPC and AI, with lossless compression work from
| Martin Burtscher's lab
| (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL
| (https://computing.llnl.gov/projects/fpzip) and my library
| dietgpu from 2021 (https://github.com/facebookresearch/dietgpu)
| which we used to speed training on a large GPU cluster by about
| 10% wall clock time overall by losslessly compressing all data
| prior to send and decompressing upon receive (e.g., gradients,
| weights from backup, etc), which is still computing the same
| thing as it did before as it is lossless.
|
| Also, rANS is more efficient and easier to implement in SIMD-like
| instruction sets than Huffman coding. It would reduce the
| latency/throughput penalties with DFloat11 as well
| (since we have to decompress before we do the arithmetic).
| iandanforth wrote:
| For those who don't bother to click through profiles, Jeff
| _really_ knows what he's talking about. Much of Meta/FAIR +
| community benefits from his code.
| Animats wrote:
| Once this weight format war settles down, hardware can be built
| to support it. Presumably you want matrix multiply hardware
| optimized for whatever weight format turns out to be reasonably
| optimal.
| eoerl wrote:
| Optimization is post hoc here: you have to train first to be
| able to Huffman encode, so it's not a pure format question.
| aazo11 wrote:
| This is a huge unlock for on-device inference. The download time
| of larger models makes local inference unusable for non-technical
| users.
| aseligman wrote:
| Some additional context: many real world agent use cases struggle
| to balance quality, cost, and performance. This technique can
| help avoid the tradeoffs that quantization techniques introduce,
| including unpredictable results while you try to cost-optimize an
| agent. In some cases the cost savings can be significant using
| DFloat11 as you squeeze into more affordable GPUs.
|
| * I work with xmad.ai
___________________________________________________________________
(page generated 2025-04-25 23:00 UTC)