[HN Gopher] Life of an inference request (vLLM V1): How LLMs are...
       ___________________________________________________________________
        
       Life of an inference request (vLLM V1): How LLMs are served
       efficiently at scale
        
       Author : samaysharma
       Score  : 167 points
        Date   : 2025-06-28 18:42 UTC (1 day ago)
        
 (HTM) web link (www.ubicloud.com)
 (TXT) w3m dump (www.ubicloud.com)
        
       | 0xjunhao wrote:
       | Hi, I'm the author of this post. Writing it was a great learning
       | experience. I gained a lot of insight into vLLM. If you have any
       | feedback or questions, feel free to drop a comment below!
        
         | criemen wrote:
         | Thanks for writing the article!
         | 
         | I didn't quite get
         | 
         |  _Note that during the prefill phase, all prompt tokens from a
         | request can be processed in one batch. This is possible because
         | the query (Q) tensors, calculated from the tokens immediately
         | before them, are available for each prompt token position._
         | 
         | I know that in practice prefill is much faster than inference.
         | Would watching the 2h video from Karpathy help me understand
         | why?
        
           | criemen wrote:
            | And on the topic of prefill: do you know what the role of
            | GPUs is in prefill vs. in inference?
        
             | animan wrote:
             | Prefill is part of Inference. It's the first major step
             | where you calculate all the keys and values for the input
             | tokens.
             | 
             | Decode is the next major step where you start generating
             | output tokens one at a time.
             | 
              | Both run on GPUs but have slightly different workloads:
              | 
              | 1. Prefill does relatively little I/O from VRAM (HBM) per
              | unit of compute, so it is mostly compute-bound.
              | 
              | 2. Decode is light on compute but has to read the keys
              | and values computed in the prefill stage from VRAM for
              | every output token.
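              | 
              | As a rough illustration of that decode-side I/O (toy,
              | assumed Llama-7B-like shapes, not numbers from the post),
              | the K/V bytes re-read for every output token grow with
              | context length:
              | 
              |   # back-of-envelope K/V read per decode step
              |   # assumed: 32 layers, d_model 4096, fp16, no GQA
              |   layers, d_model, ctx = 32, 4096, 2048
              |   kv_bytes = 2 * layers * d_model * 2 * ctx  # K and V
              |   print(kv_bytes / 1e9)  # ~1.1 GB read per output token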
        
               | dist-epoch wrote:
                | Doesn't decode also need to stream in the whole of the
                | model weights, making it very I/O heavy?
        
           | animan wrote:
           | That snippet is trying to say that you can calculate KV for
           | all the input tokens at once, and you don't need to loop over
           | them since you have them all available.
           | 
            | For decode, on the other hand, you need to generate each
            | token sequentially.
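            | 
            | A toy sketch of the difference (plain PyTorch, toy
            | dimensions, not vLLM's kernels): prefill projects and
            | attends over all prompt positions in one shot, while
            | decode appends one position per step and attends against
            | the cached K/V.
            | 
            |   import torch
            | 
            |   d = 64
            |   Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
            | 
            |   # prefill: all 16 prompt tokens in one batch
            |   x = torch.randn(16, d)
            |   Q, K, V = x @ Wq, x @ Wk, x @ Wv
            |   mask = torch.tril(torch.ones(16, 16)).log()  # causal
            |   out = torch.softmax(Q @ K.T / d**0.5 + mask, -1) @ V
            | 
            |   # decode: one token per step, reusing the KV cache
            |   for _ in range(4):
            |       x_new = torch.randn(1, d)
            |       q = x_new @ Wq
            |       K = torch.cat([K, x_new @ Wk])
            |       V = torch.cat([V, x_new @ Wv])
            |       out = torch.softmax(q @ K.T / d**0.5, -1) @ V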
        
         | longbeachbass wrote:
         | Thanks for this! Learnt a lot.
         | 
          | Curious to understand how we ensure that the same model
          | instance gets requests from the same client/user? Since
         | conversations are stateful and the model needs context from
         | previous turns of the conversation.
         | 
         | Is this happening at the load balancer layer?
        
           | cyanf wrote:
            | It's either sticky sessions or an LB that keeps track of
            | prior sequences and routes to the instance with the
            | largest prefix match.
            | https://docs.sglang.ai/router/router.html
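            | 
            | A sketch of the second approach (hypothetical routing
            | logic, not sglang's actual implementation): track which
            | prefixes each backend has cached and pick the backend
            | with the longest match against the incoming prompt.
            | 
            |   def pick_backend(prompt, cached):
            |       # cached: backend url -> list of cached prefixes
            |       def best(prefixes):
            |           return max((len(p) for p in prefixes
            |                       if prompt.startswith(p)), default=0)
            |       return max(cached, key=lambda u: best(cached[u]))
            | 
            | A real router would also weigh current load and evict
            | stale prefixes.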
        
           | hhh wrote:
            | They're not stateful; you submit the entire history with
            | every call. Caching of prompts etc. makes it important for
            | performance to have sticky sessions or something similar
            | at the load balancer layer.
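            | 
            | Concretely, against an OpenAI-compatible endpoint like the
            | one vLLM exposes (the URL and model name below are just
            | placeholders), every turn resends the full message list,
            | so the "memory" lives entirely client-side:
            | 
            |   import requests
            | 
            |   history = [{"role": "user", "content": "Hi!"}]
            |   resp = requests.post(
            |       "http://localhost:8000/v1/chat/completions",
            |       json={"model": "my-model", "messages": history},
            |   ).json()
            |   history.append(resp["choices"][0]["message"])
            |   history.append({"role": "user", "content": "More?"})
            |   # the next request sends all of `history` again
            | 
            | Prompt caching only helps if that growing prefix lands on
            | an instance that already has it cached, hence the sticky
            | sessions.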
        
         | 3abiton wrote:
          | Great write-up! It would be interesting to see how the
          | features covered here compare to other frameworks.
        
         | zackangelo wrote:
         | In your forward pass section you give a lot of emphasis to
          | FlashAttention, but it might be worth mentioning
          | PagedAttention as well (the paper written by the vLLM authors,
          | which I believe was the genesis of the project). PA-style
         | block tables are now supported in most fused attention kernels,
         | but vLLM originally came up with it and it's the main reason
         | why vLLM has such high throughput!
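          | 
          | For readers who haven't seen it: the idea is that each
          | sequence's KV cache is split into fixed-size blocks drawn
          | from a shared GPU-wide pool, and a per-sequence block table
          | maps logical block -> physical block, much like a page
          | table. A toy sketch of the bookkeeping (illustrative only,
          | not vLLM's block manager):
          | 
          |   BLOCK = 16                  # tokens per KV block
          |   free = list(range(1024))    # free physical block ids
          |   block_table = {}            # seq_id -> [physical ids]
          | 
          |   def append_token(seq_id, pos):
          |       if pos % BLOCK == 0:    # crossed a block boundary
          |           blocks = block_table.setdefault(seq_id, [])
          |           blocks.append(free.pop())
          | 
          |   def slot(seq_id, pos):
          |       # where token `pos`'s K/V live in the pool
          |       return block_table[seq_id][pos // BLOCK], pos % BLOCK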
        
       | mhlakhani wrote:
       | Thanks for writing this up! I learnt a bunch from it. I noticed
       | this didn't discuss additional layers of caching - I can see how
        | it would fit in, but is prompt caching out of scope for this
        | system?
        
       | gdiamos wrote:
        | Great write-up. We use vLLM's KV cache and continuous
        | batching as a foundation for requests in ScalarLM, and we add
        | further batching optimizations via a centralized queue and
        | explicit batching support in our client.
       | 
       | https://www.scalarlm.com
       | 
        | There is more perf you can squeeze out of vLLM.
        
       | r0b05 wrote:
       | Great write up!
       | 
       | Does batching add data from multiple requests into the same
        | context, potentially increasing perplexity? If so, are we trading
       | off perplexity for lower operating costs?
        
         | ethan_smith wrote:
         | Batching in vLLM doesn't combine prompts into the same context
         | - it processes separate requests in parallel while sharing
         | compute resources, so there's no perplexity tradeoff, just
         | efficiency gains.
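          | 
          | Mechanically (toy sketch, not vLLM's kernels): the requests
          | occupy separate entries in the batch dimension and attention
          | is computed per sequence, so one request's queries never see
          | another request's keys or values.
          | 
          |   import torch
          | 
          |   B, T, d = 2, 8, 64   # two independent requests
          |   q, k, v = (torch.randn(B, T, d) for _ in range(3))
          |   scores = q @ k.transpose(1, 2) / d**0.5  # (B, T, T)
          |   out = torch.softmax(scores, -1) @ v
          |   # scores[0] only mixes request 0's tokens; same for [1]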
        
           | zettabomb wrote:
            | It's worth noting that the reason this works is that
            | basically every LLM architecture currently in use is
            | severely limited by memory bandwidth, not by compute. So
            | it's trivial to run several requests at a time while
            | waiting for the next weights to arrive from VRAM.
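            | 
            | Back-of-envelope version of that (assumed numbers, not
            | from the article): the weights are streamed from VRAM once
            | per decode step no matter how many sequences are in the
            | batch, so extra requests are close to free until compute
            | or KV-cache space runs out.
            | 
            |   # assumed: 7B params, fp16, ~2 TB/s memory bandwidth
            |   weight_bytes = 7e9 * 2
            |   step = weight_bytes / 2e12  # ~7 ms just to read weights
            |   print(1 / step)             # ~140 tok/s at batch 1
            |   print(64 / step)            # ~9,000 tok/s at batch 64
            |   # ignores KV reads and compute, so only a rough bound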
        
         | StochasticLi wrote:
        | I would like to know what inference speeds they are achieving
        | exactly, and on what hardware. I skimmed and searched the article
         | and didn't find that info.
        
       | geoffbp wrote:
       | Thanks, good read!
        
       ___________________________________________________________________
       (page generated 2025-06-29 23:01 UTC)