https://arxiv.org/abs/2410.00531

Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2410.00531 (cs) [Submitted on 1 Oct 2024]

Title: TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Authors: Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu

Abstract: Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline parallelism on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local on users' devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, emerges as the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate, and over 90% less compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.

Comments: This paper is currently under review. Find the code at this https URL
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
MSC classes: 68T50
ACM classes: I.2.11
Cite as: arXiv:2410.00531 [cs.DC] (or arXiv:2410.00531v1 [cs.DC] for this version)
DOI: https://doi.org/10.48550/arXiv.2410.00531

Submission history:
From: Zonghang Li
[v1] Tue, 1 Oct 2024 09:18:56 UTC (3,886 KB)
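The abstract describes the sliding window memory scheduler only at a high level: a bounded set of layer weights stays resident while the next layers are prefetched from disk, so I/O overlaps with computation. Below is a minimal sketch of that idea, assuming a bounded prefetch queue; all names (load_layer_from_disk, compute_layer, WINDOW_SIZE) and the simulated latencies are hypothetical stand-ins, not the paper's actual implementation.

import queue
import threading
import time

NUM_LAYERS = 80     # e.g. Llama 2-70B has 80 decoder layers
WINDOW_SIZE = 4     # layers resident in memory at once (hypothetical knob)

def load_layer_from_disk(idx):
    # Stand-in for reading one layer's weight shard from disk.
    time.sleep(0.01)                 # simulated disk I/O latency
    return {"layer": idx}            # placeholder for real tensors

def compute_layer(weights, hidden):
    # Stand-in for this device's forward pass through one layer.
    time.sleep(0.01)                 # simulated compute + communication
    return hidden

def prefetch_worker(window):
    # Producer: load layers in execution order. The bounded queue acts as
    # the sliding window: put() blocks once WINDOW_SIZE layers are buffered,
    # so peak memory stays capped while disk I/O overlaps downstream compute.
    for idx in range(NUM_LAYERS):
        window.put(load_layer_from_disk(idx))

def forward_pass(hidden):
    window = queue.Queue(maxsize=WINDOW_SIZE)
    threading.Thread(target=prefetch_worker, args=(window,), daemon=True).start()
    for _ in range(NUM_LAYERS):
        weights = window.get()       # usually already prefetched
        hidden = compute_layer(weights, hidden)
        # dropping the reference evicts the layer and frees a window slot
    return hidden

if __name__ == "__main__":
    forward_pass(hidden="<activations>")

The bounded queue is what makes peak memory independent of model size: only WINDOW_SIZE layers ever coexist in RAM, which is consistent with the reported 3.1 GB footprint for a 70B-parameter model.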
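The abstract attributes the communication bottleneck to link latency rather than bandwidth, which is the usual argument for a star topology: a star allreduce finishes in two latency terms (gather, then broadcast) regardless of device count, while a ring allreduce pays roughly 2(N-1) sequential steps. A toy sketch of that cost comparison follows, again with hypothetical names and no claim to match the paper's code.

LINK_LATENCY = 0.03   # seconds per hop; links assumed latency-dominated

def star_allreduce(partials):
    # Gather phase: every non-hub device sends its partial tensor to the
    # hub over its own link, so the sends happen concurrently (1 hop).
    total = [sum(vals) for vals in zip(*partials)]
    # Broadcast phase: the hub returns the reduced tensor to all (1 hop).
    return [list(total) for _ in partials]

def latency_terms_star(n):
    return 2 * LINK_LATENCY             # gather + broadcast, independent of n

def latency_terms_ring(n):
    return 2 * (n - 1) * LINK_LATENCY   # reduce-scatter + allgather steps

if __name__ == "__main__":
    partials = [[d, 2 * d] for d in range(8)]   # one shard output per device
    print(star_allreduce(partials)[0])          # [28, 56] on every device
    print(latency_terms_star(8), latency_terms_ring(8))  # 0.06 vs 0.42

The trade-off is bandwidth: the hub receives N-1 full tensors, which is only acceptable when, as the authors find for edge networks, latency rather than bandwidth dominates.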