[HN Gopher] Efficient Memory Management for Large Language Model...
       ___________________________________________________________________
        
       Efficient Memory Management for Large Language Model Serving with
       PagedAttention
        
       Author : jmorgan
       Score  : 76 points
       Date   : 2023-09-14 14:42 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | maccam912 wrote:
        | Without understanding most of that paper, here's a question for
        | someone who might know more: can PagedAttention work to make
        | CPU inference faster too?
        
         | kirill5pol wrote:
          | Just from the abstract, this is primarily for batched
          | inference. Batched inference on GPUs gives an order-of-
          | magnitude speed increase, so it's probably not something
          | that usually makes sense to do on CPUs...
        
           | liuliu wrote:
            | Not only batching. It also works across different requests
            | as long as they share a prefix.
            | 
            | This basically enables KV cache reuse when there is a
            | prefix match (from my shallow understanding of how the KV
            | cache works).
            | 
            | I fail to see how this helps a locally deployed LLM, unless
            | you consider the odds that you ask the same question, or
            | one with the same prefix, to be high (like always starting
            | with "please help me ...").
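            | 
            | A rough sketch of the prefix-reuse idea (toy Python, made-up
            | names, not vLLM's actual API):
            | 
            |   # Toy KV-cache reuse across requests that share a prompt
            |   # prefix. compute_kv stands in for a real prefill pass;
            |   # the cache maps a prefix (tuple of token ids) to its
            |   # already-computed KV entries.
            |   prefix_cache = {}
            | 
            |   def compute_kv(tokens):
            |       # placeholder for running the model over `tokens`
            |       return [("kv", t) for t in tokens]
            | 
            |   def kv_for_prompt(tokens):
            |       # reuse the longest cached prefix of this prompt
            |       for cut in range(len(tokens), 0, -1):
            |           hit = prefix_cache.get(tuple(tokens[:cut]))
            |           if hit is not None:
            |               # only the uncached suffix needs a prefill
            |               return hit + compute_kv(tokens[cut:])
            |       kv = compute_kv(tokens)
            |       prefix_cache[tuple(tokens)] = kv
            |       return kv
            | 
            | The win is that the prefill cost of the shared prefix is
            | paid once instead of once per request.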
        
             | Tostino wrote:
              | You also have fine-tuned models for specific tasks that
              | may see very similar inputs for a variety of outputs.
              | Think an LLM trained on pulling out specific types of
              | information, no matter where it is stored within the
              | file. E.g. "find the date of the shipment for product#
              | 5432" and then you pass in 10k JSON documents with a
              | similar shape.
        
               | liuliu wrote:
                | Yeah, but I was under the impression that for the same
                | prompt, implementations already share the KV cache.
                | This area is so new that these obvious ideas might not
                | be implemented as widely as I thought.
        
             | bestcoder69 wrote:
             | Maybe if you have a model with a large context window, you
             | stuff a document in the prompt as a prefix, then ask a
             | bunch of different questions about the document?
        
               | rdedev wrote:
                | That would be pretty useful. I'm working on getting
                | ChatGPT to classify a dataset. So basically I use the
                | same big prompt for a bunch of different small texts
                | and ask ChatGPT to generate the class label. Something
                | like initializing the prompt state sounds good:
                | basically trade more memory usage for less processing
                | time. Who knows, maybe OpenAI is already doing such an
                | optimization on their side.
        
           | fredliu wrote:
            | I might be wrong, but it looks like this could help with
            | speculative decoding, which already vastly improves
            | inference speed?
        
         | brucethemoose2 wrote:
         | Only if the CPU is serving multiple users, maybe.
         | 
          | LLMs can't batch token generation for single users. It's
          | sequential: each token depends on the previous ones. In fact
          | that's part of the paper: "dumb" batching will leave the GPU
          | underutilized because responses aren't all the same length,
          | and the batch ends up processing one token at a time at the
          | end.
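          | 
          | Roughly what that "dumb" (static) batching looks like, as a
          | toy Python sketch (not from the paper):
          | 
          |   # Static batching: the batch is locked until the *longest*
          |   # response finishes, so slots of finished sequences sit
          |   # idle and the tail of the loop does little useful work.
          |   def static_batch(prompts, target_lengths, decode_step):
          |       seqs = [list(p) for p in prompts]
          |       done = [False] * len(seqs)
          |       while not all(done):
          |           for i, seq in enumerate(seqs):
          |               if done[i]:
          |                   continue          # wasted batch slot
          |               seq.append(decode_step(seq))
          |               if len(seq) >= target_lengths[i]:
          |                   done[i] = True
          |       return seqs
          | 
          | If one response needs 500 tokens and the rest need 20, most
          | of the batch is dead weight for 480 steps.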
        
       | CodeL wrote:
       | [flagged]
        
         | regularfry wrote:
         | Was this written by an LLM?
        
           | Philpax wrote:
            | Is any post that is unusually helpful just assumed to be
            | the product of an LLM now?
        
             | brucethemoose2 wrote:
              | I think it's more about ChatGPT's style of dry,
              | technical, grammatically eloquent summarization without
              | any "hot takes" or questions like one would expect from
              | human commenters on a forum.
              | 
              | Also the intro _feels_ like an out-of-the-blue textbook
              | entry, not the start of a comment.
             | 
             | > Think of LLMs as massive libraries of information.
             | Accessing and using this library efficiently, without
             | wasting space or time, is crucial.
             | 
             | Of course, plenty of human users do this, and I don't think
             | OP is actually an LLM, but some suspicion is reasonable
             | these days.
        
           | arxiv_papers wrote:
           | no
        
         | [deleted]
        
       | notpublic wrote:
       | source code: https://github.com/vllm-project/vllm
        
       | heliophobicdude wrote:
        | Ah, I see. This isn't necessarily virtualizing the static
        | weights but the variable-sized and data-dependent key-value
        | caches. These caches are built up as you go through the
        | sequence of tokens. Makes sense.
        | 
        | How does paging not worsen speed, though? If you are making
        | more trips to memory, then are you really just saving VRAM?
        | 
        | Also, I see that vLLM, which implements PagedAttention, also
        | uses better scheduling? Wouldn't the speed improvements be
        | coming from that instead? Don't put an expected short input
        | and output in the same batch as a big input and big output?
        | 
        | What are the results of using the sequence-length scheduling
        | alone, without virtualization?
        
         | yelite wrote:
          | > How does paging not worsen speed, though?
          | 
          | It does worsen the performance of the attention kernel,
          | compared to kernels that take keys and values in a
          | contiguous memory layout.
         | 
         | > Wouldn't the speed improvements be coming from that instead?
         | Don't put an expected short input and output in the same batch
         | as a big input and big output?
         | 
          | Actually it puts everything in the same batch. The reason for
          | its high throughput is that sequences are removed from the
          | batch as soon as they finish, and new sequences can be added
          | to the batch on the fly if there is enough space in the KV
          | cache. This is called continuous batching
          | (https://www.anyscale.com/blog/continuous-batching-llm-
          | infere...).
          | 
          | Paged attention and the "virtualized" KV cache play an
          | important role in an efficient implementation of continuous
          | batching. Text generation in an LLM is a dynamic process,
          | and it's not possible to predict how long the output will be
          | when scheduling incoming requests. Therefore a dynamic
          | approach is needed for KV cache allocation, even though it
          | hurts the performance of attention.
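          | 
          | A very rough sketch of how the two pieces fit together (toy
          | Python, hypothetical names, nothing like vLLM's real code):
          | the KV cache is a fixed pool of small blocks, a per-sequence
          | block table maps logical positions to physical blocks, and
          | the scheduler admits and retires sequences based on how many
          | free blocks remain.
          | 
          |   from collections import deque
          | 
          |   BLOCK_SIZE = 16      # tokens per KV-cache block
          |   NUM_BLOCKS = 1024    # physical blocks that fit on the GPU
          | 
          |   free_blocks = deque(range(NUM_BLOCKS))
          |   block_tables = {}    # seq_id -> list of physical block ids
          | 
          |   def blocks_needed(num_tokens):
          |       return -(-num_tokens // BLOCK_SIZE)   # ceil division
          | 
          |   def try_admit(seq_id, prompt_len):
          |       need = blocks_needed(prompt_len)
          |       if len(free_blocks) < need:
          |           return False                      # wait for space
          |       block_tables[seq_id] = [free_blocks.popleft()
          |                               for _ in range(need)]
          |       return True
          | 
          |   def grow(seq_id, new_len):
          |       # allocate another block only when the sequence spills
          |       # over; a real scheduler would preempt/swap if the
          |       # pool ran dry.
          |       table = block_tables[seq_id]
          |       need = blocks_needed(new_len)
          |       while len(table) < need and free_blocks:
          |           table.append(free_blocks.popleft())
          | 
          |   def release(seq_id):
          |       # finished sequences give their blocks back immediately
          |       free_blocks.extend(block_tables.pop(seq_id))
          | 
          |   def serve(waiting, decode_step):
          |       # waiting: deque of (seq_id, prompt_len); decode_step
          |       # generates one token for each running seq_id and
          |       # returns the ids that just finished.
          |       running = {}
          |       while waiting or running:
          |           while waiting and try_admit(*waiting[0]):
          |               seq_id, length = waiting.popleft()
          |               running[seq_id] = length
          |           for seq_id in list(running):
          |               running[seq_id] += 1
          |               grow(seq_id, running[seq_id])
          |           for seq_id in decode_step(list(running)):
          |               release(seq_id)
          |               del running[seq_id]
          | 
          | Because allocation happens one block at a time, the only
          | wasted memory is the unfilled tail of each sequence's last
          | block, and a finished sequence's blocks can be handed to a
          | waiting request on the very next step.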
        
       | arxiv_papers wrote:
       | https://youtu.be/glyu_nQH0yw
        
       ___________________________________________________________________
       (page generated 2023-09-14 23:00 UTC)