arXiv:2504.11651 [cs.LG]
https://arxiv.org/abs/2504.11651

Title: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
Authors: Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, Anshumali Shrivastava
[Submitted on 15 Apr 2025]

Abstract: Large Language Models (LLMs) have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy of the BFloat16 weight representation in LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel that coordinates thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at this https URL.

Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2504.11651 [cs.LG] (or arXiv:2504.11651v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2504.11651
Submission history: [v1] Tue, 15 Apr 2025 22:38:38 UTC (242 KB), from Tianyi Zhang
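The abstract attributes the roughly 30% saving to low entropy in the BFloat16 weight representation, exploited via frequency-based entropy coding. The sketch below is a rough illustration of that observation, not the authors' code: it uses a synthetic Gaussian weight tensor as a stand-in for real LLM weights, extracts the 8-bit exponent field of each BFloat16 value, and builds a textbook Huffman code over the exponents. All variable names are hypothetical. On such a toy input, the coded size lands in the vicinity of 11 bits per weight (1 sign + compressed exponent + 7 mantissa bits), consistent in spirit with the ~30% reduction the abstract reports.

    import heapq
    from collections import Counter

    import numpy as np

    def bf16_exponents(weights_f32):
        """Truncate float32 weights to BFloat16 and return the 8-bit exponent field."""
        bits = np.ascontiguousarray(weights_f32, dtype=np.float32).view(np.uint32)
        bf16 = (bits >> 16).astype(np.uint16)          # upper 16 bits of float32 = BFloat16
        return ((bf16 >> 7) & 0xFF).astype(np.uint8)   # layout: 1 sign | 8 exponent | 7 mantissa

    def entropy_bits(symbols):
        """Empirical Shannon entropy in bits per symbol."""
        counts = np.bincount(symbols, minlength=256).astype(np.float64)
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def huffman_code_lengths(symbols):
        """Per-symbol code lengths from a textbook Huffman construction."""
        heap = [(c, i, (s,)) for i, (s, c) in enumerate(Counter(symbols.tolist()).items())]
        heapq.heapify(heap)
        lengths = Counter()
        uid = len(heap)
        while len(heap) > 1:
            c1, _, g1 = heapq.heappop(heap)
            c2, _, g2 = heapq.heappop(heap)
            for s in g1 + g2:              # each merge adds one bit to every member's code
                lengths[s] += 1
            heapq.heappush(heap, (c1 + c2, uid, g1 + g2))
            uid += 1
        return lengths

    # Toy stand-in for LLM weights: roughly Gaussian values occupy few exponent bins.
    w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
    exp = bf16_exponents(w)
    freq = Counter(exp.tolist())
    lengths = huffman_code_lengths(exp)
    avg = sum(lengths[s] * c for s, c in freq.items()) / exp.size
    print(f"exponent entropy ~ {entropy_bits(exp):.2f} bits")
    print(f"Huffman average  ~ {avg:.2f} bits")
    print(f"per-weight size  ~ {1 + avg + 7:.1f} bits (sign + coded exponent + mantissa) vs 16")

Because Huffman coding is exact and invertible, decoding reproduces the original BFloat16 bits unchanged, which is the sense in which the abstract's "bit-for-bit identical" claim is possible at all.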
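The second kernel design point, a two-phase scheme that coordinates thread read/write positions via lightweight auxiliary variables, can also be illustrated on the CPU. The sketch below is an assumption-laden serial emulation, not the paper's CUDA kernel: `chunk_bit_starts` and `chunk_symbol_counts` are hypothetical stand-ins for the stored auxiliary variables. Phase 1 turns the per-chunk counts into disjoint write offsets with a prefix sum; phase 2 then decodes every chunk independently, which is what lets GPU threads proceed in parallel despite the variable-length encoding.

    import numpy as np

    def two_phase_decode(bits, chunk_bit_starts, chunk_symbol_counts, decode_one):
        """Decode a concatenated variable-length bitstream in two phases.

        bits                : list of 0/1 ints holding concatenated prefix codes
        chunk_bit_starts    : bit offset where each chunk begins (stored auxiliary data)
        chunk_symbol_counts : symbols per chunk (stored auxiliary data)
        decode_one(bits, i) : reads one codeword at bit i, returns (symbol, next_i)
        """
        # Phase 1: exclusive prefix sum of counts -> each chunk's private write offset.
        offsets = np.concatenate(([0], np.cumsum(chunk_symbol_counts)[:-1]))
        out = np.empty(int(np.sum(chunk_symbol_counts)), dtype=np.int64)

        # Phase 2: chunks touch disjoint output slices, so on a GPU each loop
        # iteration could run as an independent thread (here, a serial loop).
        for start, n, off in zip(chunk_bit_starts, chunk_symbol_counts, offsets):
            i = start
            for k in range(n):
                out[off + k], i = decode_one(bits, i)
        return out

    # Toy prefix code: 0 -> "0", 1 -> "10", 2 -> "11".
    def decode_one(bits, i):
        if bits[i] == 0:
            return 0, i + 1
        return (1, i + 2) if bits[i + 1] == 0 else (2, i + 2)

    bits = [0, 1, 0, 1, 1, 0, 1, 1]                            # encodes 0 1 2 | 0 2
    print(two_phase_decode(bits, [0, 5], [3, 2], decode_one))  # -> [0 1 2 0 2]

The design choice mirrors a standard pattern for parallelizing variable-length decoding: a small amount of precomputed metadata (chunk starts and counts) buys fully independent per-chunk work, at the cost of storing those auxiliary variables alongside the compressed weights.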