[HN Gopher] Life of an inference request (vLLM V1): How LLMs are...
___________________________________________________________________
Life of an inference request (vLLM V1): How LLMs are served
efficiently at scale
Author : samaysharma
Score : 167 points
Date : 2025-06-28 18:42 UTC (1 day ago)
(HTM) web link (www.ubicloud.com)
(TXT) w3m dump (www.ubicloud.com)
| 0xjunhao wrote:
| Hi, I'm the author of this post. Writing it was a great learning
| experience. I gained a lot of insight into vLLM. If you have any
| feedback or questions, feel free to drop a comment below!
| criemen wrote:
| Thanks for writing the article!
|
| I didn't quite get
|
| _Note that during the prefill phase, all prompt tokens from a
| request can be processed in one batch. This is possible because
| the query (Q) tensors, calculated from the tokens immediately
| before them, are available for each prompt token position._
|
| I know that in practice prefill is much faster than inference.
| Would watching the 2h video from Karpathy help me understand
| why?
| criemen wrote:
| And on the topic of prefill: do you know what the role of
| GPUs is in prefill vs. in inference?
| animan wrote:
| Prefill is part of Inference. It's the first major step
| where you calculate all the keys and values for the input
| tokens.
|
| Decode is the next major step where you start generating
| output tokens one at a time.
|
| Both run on GPUs but have slightly different workloads:
|
| 1. Prefill does relatively little memory I/O from HBM/VRAM
| per unit of compute, so it is compute-heavy.
| 2. Decode is light on compute but has to read the keys and
| values computed in the prefill stage from memory for every
| output token.
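A rough back-of-the-envelope sketch of the workload difference described above. The model size, precision, and prompt length are assumed, illustrative numbers, not figures from the article:

```python
# Arithmetic intensity of prefill vs decode for a hypothetical ~7B-parameter
# model in fp16 (illustrative numbers; KV-cache and activation traffic ignored).
params = 7e9                        # assumed model size
weight_bytes = params * 2           # fp16: 2 bytes per parameter
flops_per_token = 2 * params        # ~2 FLOPs per parameter per token (matmuls)

prompt_tokens = 2048                # prefill processes all of these in one batch

# Prefill: one pass over the weights serves every prompt token at once.
prefill_intensity = (flops_per_token * prompt_tokens) / weight_bytes

# Decode (batch size 1): one pass over the weights yields a single new token.
decode_intensity = flops_per_token / weight_bytes

print(f"prefill: ~{prefill_intensity:.0f} FLOPs per byte read (compute-bound)")
print(f"decode:  ~{decode_intensity:.0f} FLOPs per byte read (bandwidth-bound)")
```

With numbers like these, prefill performs thousands of FLOPs per byte of weights read, while decode performs roughly one, which is why prefill can saturate the GPU's compute units and decode mostly waits on memory.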
| dist-epoch wrote:
| Doesn't decode also need to stream in all of the model
| weights, making it very I/O heavy?
| animan wrote:
| That snippet is trying to say that you can calculate KV for
| all the input tokens at once, and you don't need to loop over
| them since you have them all available.
|
| For decode, by contrast, you need to generate each token
| sequentially.
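A toy illustration of that difference in plain PyTorch: a single attention head with no causal mask or batching, assumed shapes only, not vLLM's actual kernels:

```python
import torch

d, T = 64, 8                              # head dimension, prompt length
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # q: (n, d); K, V: (S, d) -> (n, d)
    return torch.softmax(q @ K.T / d**0.5, dim=-1) @ V

# Prefill: every prompt position is already known, so Q, K, V for the whole
# prompt are computed in one batched matmul and attention runs in one shot
# (a real implementation also applies a causal mask here).
prompt = torch.randn(T, d)
Q, K, V = prompt @ Wq, prompt @ Wk, prompt @ Wv
hidden = attend(Q, K, V)                  # (T, d), all prompt tokens at once

# Decode: each new token depends on the previous output, so we loop,
# computing a single query per step and reading the growing KV cache.
x = hidden[-1:]                           # stand-in for the newest token's hidden state
for _ in range(4):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K, V = torch.cat([K, k]), torch.cat([V, v])   # append to the KV cache
    x = attend(q, K, V)                   # (1, d): one token per step
```

The prefill half touches all T prompt positions in a single pass, while the decode loop re-reads the whole cache for every generated token.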
| longbeachbass wrote:
| Thanks for this! Learnt a lot.
|
| Curious to understand how we ensure that the same model
| instance gets requests from the same client/user, since
| conversations are stateful and the model needs context from
| previous turns of the conversation.
|
| Is this happening at the load balancer layer?
| cyanf wrote:
| It's either sticky sessions or an LB that keeps track of
| prior sequences and routes to the instance with the largest
| match. https://docs.sglang.ai/router/router.html
| hhh wrote:
| They're not stateful; you submit the entire history with
| every call. Prompt caching makes it important for
| performance to have sticky sessions or something similar at
| the load balancer layer.
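A hypothetical sketch of the "route to the instance with the largest match" idea mentioned above. The worker names, the `cached_prefixes` table, and matching on raw strings are all invented for illustration; real routers such as the SGLang router track cached KV blocks rather than strings:

```python
# Prefix-cache-aware routing at the load balancer: send a request to the
# worker that already holds the longest matching cached prefix, so its KV
# cache can be reused and that part of prefill can be skipped.
cached_prefixes = {
    "worker-0": ["You are a helpful assistant.\nUser: hi"],
    "worker-1": ["Translate the following text to French:"],
}

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str) -> str:
    best_worker, best_overlap = "worker-0", 0
    for worker, prefixes in cached_prefixes.items():
        overlap = max(common_prefix_len(prompt, p) for p in prefixes)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker

# A follow-up turn shares a long prefix with worker-0's cached conversation.
print(route("You are a helpful assistant.\nUser: hi\nAssistant: hello!\nUser: ..."))
# -> worker-0
```

Plain sticky sessions (hashing on a session or user ID) achieve much the same effect with less bookkeeping, at the cost of less even load balancing.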
| 3abiton wrote:
| Great write-up! It would be interesting to see how the
| features covered here compare to other frameworks.
| zackangelo wrote:
| In your forward pass section you put a lot of emphasis on
| FlashAttention, but it might be worth mentioning
| PagedAttention as well (the paper written by the vLLM
| authors, and I believe the genesis of the project). PA-style
| block tables are now supported in most fused attention
| kernels, but vLLM originally came up with the idea, and it's
| the main reason vLLM has such high throughput!
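A simplified sketch of the PagedAttention bookkeeping being referred to. The block size, pool size, and dict-based block table are illustrative; vLLM's actual block manager is more involved:

```python
# Each request's KV cache lives in fixed-size blocks drawn from a shared pool,
# and a per-request block table maps logical block numbers to physical blocks.
# This avoids reserving one large contiguous region per request and keeps
# fragmentation low, which is what enables large batches and high throughput.
BLOCK_SIZE = 16                      # tokens per KV-cache block

free_blocks = list(range(1024))      # indices into one big KV tensor on the GPU
block_tables: dict[str, list[int]] = {}

def slot_for_token(request_id: str, token_index: int) -> tuple[int, int]:
    """Return (physical_block, offset) where this token's K/V should be written."""
    table = block_tables.setdefault(request_id, [])
    logical_block, offset = divmod(token_index, BLOCK_SIZE)
    if logical_block == len(table):          # request needs one more block
        table.append(free_blocks.pop())
    return table[logical_block], offset

# Two requests grow independently, each grabbing blocks only as needed.
for t in range(20):
    slot_for_token("req-A", t)
for t in range(5):
    slot_for_token("req-B", t)
print(block_tables)    # e.g. {'req-A': [1023, 1022], 'req-B': [1021]}
```

The attention kernel then gathers keys and values through the block table instead of assuming contiguous memory, which is what "PA-style block tables in fused attention kernels" refers to.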
| mhlakhani wrote:
| Thanks for writing this up! I learnt a bunch from it. I noticed
| this didn't discuss additional layers of caching - I can see
| how it would fit in, but is prompt caching out of scope for
| this system?
| gdiamos wrote:
| Great write-up. We use vLLM's KV cache and continuous
| batching as a foundation for requests in ScalarLM, and we
| add further batching optimizations via a centralized queue
| and explicit batching support in our client.
|
| https://www.scalarlm.com
|
| There is more perf you can squeeze out of vLLM.
| r0b05 wrote:
| Great write up!
|
| Does batching add data from multiple requests into the same
| context, potentially increasing perplexity? If so, are we
| trading off perplexity for lower operating costs?
| ethan_smith wrote:
| Batching in vLLM doesn't combine prompts into the same context
| - it processes separate requests in parallel while sharing
| compute resources, so there's no perplexity tradeoff, just
| efficiency gains.
| zettabomb wrote:
| It's worth noting that the reason this works is that
| basically every LLM architecture currently in use is
| severely limited by memory bandwidth, not by compute. So
| it's trivial to run several requests at a time while waiting
| for the next weights to arrive from VRAM.
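A quick illustration of that point with assumed hardware numbers; the model size, bandwidth, and peak FLOPs below are rough placeholders, not measurements:

```python
# Why batching decode is nearly free when memory bandwidth is the bottleneck:
# each decode step reads the full set of weights once, whether it is serving
# 1 request or 32, so the extra compute rides along almost for free.
# (KV-cache reads, which grow with batch size and context length, are ignored.)
weight_bytes = 7e9 * 2        # ~7B params in fp16
bandwidth = 2e12              # ~2 TB/s HBM bandwidth (assumed)
peak_flops = 300e12           # ~300 TFLOP/s fp16 (assumed)

for batch in (1, 8, 32):
    t_mem = weight_bytes / bandwidth              # one pass over the weights
    t_compute = (2 * 7e9 * batch) / peak_flops    # ~2 FLOPs/param per token per sequence
    step = max(t_mem, t_compute)                  # the slower side dominates
    print(f"batch {batch:2d}: ~{step * 1e3:.1f} ms per step, "
          f"~{batch / step:,.0f} output tokens/s overall")
```

With these numbers every step is pinned at roughly 7 ms by the weight reads, so going from batch 1 to batch 32 multiplies token throughput about 32x without slowing any individual request much.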
| StochasticLi wrote:
| I would like to know exactly what inference speeds they are
| achieving, and on what hardware. I skimmed and searched the
| article and didn't find that info.
| geoffbp wrote:
| Thanks, good read!
___________________________________________________________________
(page generated 2025-06-29 23:01 UTC)