[HN Gopher] Efficient Memory Management for Large Language Model...
___________________________________________________________________
Efficient Memory Management for Large Language Model Serving with
PagedAttention
Author : jmorgan
Score : 76 points
Date : 2023-09-14 14:42 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| maccam912 wrote:
| Without understanding most of that paper, here's a question for
| someone who might know more: can PagedAttention work to make
| CPU inference faster too?
| kirill5pol wrote:
| Just from the abstract, this is primarily for batched
| inference. Batched inference on GPUs gives an order-of-magnitude
| speed increase, so it's probably not something that usually
| makes sense to do on CPUs...
| liuliu wrote:
| Not only batching. It also helps when serving different
| requests, as long as they share a prefix.
|
| This basically enables KV cache reuse when there is a prefix
| match (from my shallow understanding of how the KV cache
| works); see the rough sketch at the end of this comment.
|
| I fail to see how this helps for a locally deployed LLM,
| unless the odds that you ask the same question, or questions
| with the same prefix, are high (like always starting with
| "please help me ...")?
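|
| As a rough sketch (hypothetical names, not vLLM's actual
| API), prefix-matched KV reuse amounts to keying cache blocks
| by the prompt prefix they cover:
|
|     BLOCK_SIZE = 16  # tokens per KV-cache block
|
|     class PrefixSharingCache:
|         def __init__(self):
|             self.physical = {}  # prefix -> block id
|             self.next_free = 0
|
|         def map_request(self, prompt):
|             # Build this request's block table; a full
|             # block whose prefix was seen before reuses
|             # the same physical block.
|             table = []
|             for end in range(BLOCK_SIZE,
|                              len(prompt) + 1, BLOCK_SIZE):
|                 # KV entries depend on the whole prefix,
|                 # so key on the prefix, not the chunk
|                 prefix = tuple(prompt[:end])
|                 if prefix not in self.physical:
|                     self.physical[prefix] = self.next_free
|                     self.next_free += 1
|                 table.append(self.physical[prefix])
|             if len(prompt) % BLOCK_SIZE:
|                 # a trailing partial block is never shared
|                 table.append(self.next_free)
|                 self.next_free += 1
|             return table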
| Tostino wrote:
| You also have fine-tuned models for specific tasks that may
| see very similar inputs for a variety of outputs. Think of an
| LLM trained on pulling out specific types of information, no
| matter where it was stored within the file. E.g. "find the
| date of the shipment for product# 5432" and then you pass in
| 10k JSON documents with a similar shape.
| liuliu wrote:
| Yeah, but I was under the impression that for the same
| prompt, implementations already share the KV cache. This
| area is so new that these obvious ideas might not be
| implemented as widely as I thought.
| bestcoder69 wrote:
| Maybe if you have a model with a large context window, you
| stuff a document in the prompt as a prefix, then ask a
| bunch of different questions about the document?
| rdedev wrote:
| That would be pretty useful. I'm working on getting
| ChatGPT to classify a dataset. So basically I use the
| same big prompt for a bunch of different small texts and
| ask ChatGPT to generate the class label. Something like
| initializing the prompt state sounds good. Basically
| trade more memory usage for less processing time. Who
| knows, maybe OpenAI is doing such an optimization on
| their side.
| fredliu wrote:
| I might be wrong, but it looks like this could help with
| speculative decoding, which already vastly improves
| inference speed?
| brucethemoose2 wrote:
| Only if the CPU is serving multiple users, maybe.
|
| LLMs can't batch token generation for a single user. It's
| sequential: each token depends on the previous ones. In fact
| that's part of the paper: "dumb" batching will leave the GPU
| underutilized because responses aren't all the same length,
| and the stragglers end up being processed one token at a time
| at the end.
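|
| A tiny sketch of the decode loop shows why (model and
| sample here are placeholders, not a real API):
|
|     def generate(model, sample, prompt, max_new, eos_id):
|         tokens = list(prompt)
|         for _ in range(max_new):
|             # each forward pass conditions on every token
|             # produced so far, so decoding steps cannot be
|             # parallelized for a single sequence
|             logits = model.forward(tokens)
|             nxt = sample(logits[-1])  # one token per step
|             tokens.append(nxt)
|             # sequences stop at different lengths, which
|             # is what leaves a naive static batch grinding
|             # out straggler tokens one at a time
|             if nxt == eos_id:
|                 break
|         return tokens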
| CodeL wrote:
| [flagged]
| regularfry wrote:
| Was this written by an LLM?
| Philpax wrote:
| Is any post that is unusually helpful just assumed to be the
| product of an LLM now?
| brucethemoose2 wrote:
| I think it's more about ChatGPT's style of dry, technical,
| grammatically eloquent summarization without any "hot takes"
| or questions like one would expect from human commenters on
| a forum.
|
| Also the intro _feels_ like an out-of-the-blue textbook
| entry, not the start of a comment.
|
| > Think of LLMs as massive libraries of information.
| Accessing and using this library efficiently, without
| wasting space or time, is crucial.
|
| Of course, plenty of human users do this, and I don't think
| OP is actually an LLM, but some suspicion is reasonable
| these days.
| arxiv_papers wrote:
| no
| [deleted]
| notpublic wrote:
| source code: https://github.com/vllm-project/vllm
| heliophobicdude wrote:
| Ah, I see. This isn't necessarily virtualizing the static
| weights but the variable-sized, data-dependent key-value
| caches. These caches are built up as you go through the
| sequence of tokens. Makes sense.
|
| How does paging not worsen speed, though? If you are making
| more trips to memory, then are you really just saving VRAM?
|
| Also, I see that vLLM, which implements PagedAttention, also
| uses better scheduling? Wouldn't the speed improvements be
| coming from that instead? Don't put an expected short input and
| output in the same batch as a big input and big output?
|
| What are the results of using the sequence-length scheduling
| alone, without the virtualization?
| yelite wrote:
| > How does paging not worsen speed, though?
|
| It does worsen the performance of the attention kernel,
| compared to kernels that take keys and values in a contiguous
| memory layout.
|
| > Wouldn't the speed improvements be coming from that instead?
| Don't put an expected short input and output in the same batch
| as a big input and big output?
|
| Actually it puts everything in the same batch. The reason for
| its high throughput is that sequences are removed from the
| batch as soon as they finish, and new sequences can be added
| to the batch on the fly if there is enough space in the KV
| cache.
| This is called continuous batching
| (https://www.anyscale.com/blog/continuous-batching-llm-
| infere...).
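|
| A minimal sketch of that scheduling loop (illustrative
| names, not vLLM's actual API):
|
|     from collections import deque
|
|     def serve(engine, waiting: deque):
|         running = []
|         while running or waiting:
|             # admit waiting requests while enough KV-cache
|             # blocks are free for their prompts
|             while (waiting and engine.free_blocks()
|                    >= waiting[0].blocks_needed()):
|                 running.append(waiting.popleft())
|             # one decode step for every running sequence
|             engine.step(running)
|             still_running = []
|             for seq in running:
|                 if seq.finished():
|                     # finished sequences leave right away,
|                     # returning their blocks to the pool
|                     engine.release(seq)
|                 else:
|                     still_running.append(seq)
|             running = still_running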
|
| Paged attention and the "virtualized" KV cache play an
| important role in an efficient implementation of continuous
| batching. Text generation in LLMs is a dynamic process, and
| it's not possible to predict how long the output will be when
| scheduling incoming requests. Therefore a dynamic approach to
| KV cache allocation is needed, even though it hurts the
| performance of attention.
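|
| Concretely, the contrast is with reserving a contiguous,
| max-length KV region per request up front; with paging,
| fixed-size blocks are handed out on demand, roughly like
| this (illustrative only, not vLLM's implementation):
|
|     BLOCK = 16  # tokens per physical KV-cache block
|
|     class PagedKVCache:
|         def __init__(self, num_blocks):
|             self.free = list(range(num_blocks))
|             self.tables = {}  # seq id -> block ids
|
|         def on_new_token(self, seq_id, seq_len):
|             # allocate only when the sequence crosses a
|             # block boundary; no length guess is needed
|             if seq_len % BLOCK == 0:
|                 if not self.free:
|                     # cache is full: the scheduler must
|                     # make this request wait or preempt
|                     raise MemoryError("KV cache full")
|                 table = self.tables.setdefault(seq_id, [])
|                 table.append(self.free.pop())
|
|         def release(self, seq_id):
|             # a finished sequence's blocks immediately go
|             # back to the pool for new requests
|             self.free.extend(self.tables.pop(seq_id, []))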
| arxiv_papers wrote:
| https://youtu.be/glyu_nQH0yw
___________________________________________________________________
(page generated 2023-09-14 23:00 UTC)