https://arxiv.org/abs/2510.02361

Computer Science > Computation and Language

arXiv:2510.02361 (cs) [Submitted on 28 Sep 2025]

Title: ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Authors: Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

Abstract: Transformer-based large models excel at natural language processing and computer vision, but they suffer severe computational inefficiency because self-attention scales quadratically with the number of input tokens. Recent methods based on block selection and compression alleviate this problem, but they suffer either from semantic incompleteness or from poor training-and-inference efficiency. To address these challenges comprehensively, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: a QK Adapter (Q-Adapter and K-Adapter) and a Chunk Adapter. The former is attached to each Transformer layer and serves the dual purposes of feature compression and chunk-attention acquisition. The latter operates at the bottommost layer of the model and detects chunk boundaries from contextual semantic information. During training, the backbone's parameters remain frozen; only the QK Adapter and Chunk Adapter are trained. Notably, we design an attention-distillation method for training the QK Adapter, which improves the recall of key chunks. During inference, chunk selection is triggered only when the current token is detected as a chunk boundary, thereby accelerating model inference. Experiments are conducted on a diverse set of long-text and short-text benchmarks spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while retaining only 48.58% of the key-value cache. In particular, ChunkLLM achieves a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token texts.
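To make the inference-time control flow in the abstract concrete, here is a minimal Python sketch (not the authors' code). It assumes a boundary classifier standing in for the Chunk Adapter, linear projections standing in for the Q/K Adapters, and a top-k scoring rule for chunk selection; all module names, dimensions, and the keep ratio are illustrative assumptions.

    # Sketch only: stand-ins for ChunkLLM's Chunk Adapter and QK Adapter.
    # Module names, shapes, and the top-k rule are assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    class ChunkBoundaryDetector(nn.Module):
        """Stand-in for the Chunk Adapter: predicts whether the current token ends a chunk."""
        def __init__(self, d_model: int):
            super().__init__()
            self.classifier = nn.Linear(d_model, 2)  # boundary / non-boundary

        def forward(self, hidden: torch.Tensor) -> bool:
            logits = self.classifier(hidden)         # (d_model,) -> (2,)
            return bool(logits.argmax(dim=-1).item() == 1)

    class QKAdapter(nn.Module):
        """Stand-in for the QK Adapter: compresses Q/K features for chunk-level attention."""
        def __init__(self, d_model: int, d_compressed: int):
            super().__init__()
            self.q_adapter = nn.Linear(d_model, d_compressed)
            self.k_adapter = nn.Linear(d_model, d_compressed)

        def chunk_scores(self, query: torch.Tensor, chunk_keys: torch.Tensor) -> torch.Tensor:
            # One compressed key per past chunk; dot products give per-chunk relevance.
            q = self.q_adapter(query)                # (d_compressed,)
            k = self.k_adapter(chunk_keys)           # (num_chunks, d_compressed)
            return k @ q                             # (num_chunks,)

    def select_chunks(adapter, query, chunk_keys, keep_ratio=0.5):
        """Keep the top-scoring fraction of chunks; the rest can be evicted from the KV cache."""
        scores = adapter.chunk_scores(query, chunk_keys)
        k = max(1, int(keep_ratio * scores.numel()))
        return torch.topk(scores, k).indices

    # Toy decoding loop: chunk selection fires only at detected boundaries.
    d_model, d_comp, num_chunks = 64, 16, 10
    detector, adapter = ChunkBoundaryDetector(d_model), QKAdapter(d_model, d_comp)
    chunk_keys = torch.randn(num_chunks, d_model)    # one pooled key per past chunk (assumption)
    kept = torch.arange(num_chunks)                  # initially attend to all chunks
    for step in range(5):
        hidden = torch.randn(d_model)                # current token's bottom-layer hidden state
        if detector(hidden):                         # boundary -> refresh the chunk selection
            kept = select_chunks(adapter, hidden, chunk_keys, keep_ratio=0.5)
        # ... attend only over tokens in `kept` chunks; other KV entries remain evicted ...

Gating selection on detected boundaries means the top-k step runs once per chunk rather than once per token, which is consistent with the speedup the abstract reports on long inputs.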
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2510.02361 [cs.CL] (or arXiv:2510.02361v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2510.02361 (arXiv-issued DOI via DataCite)

Submission history
From: Jianwei Lv
[v1] Sun, 28 Sep 2025 11:04:00 UTC (427 KB)