[HN Gopher] LLM Inference Handbook
___________________________________________________________________
LLM Inference Handbook
Author : djhu9
Score : 283 points
Date : 2025-07-11 02:40 UTC (20 hours ago)
(HTM) web link (bentoml.com)
(TXT) w3m dump (bentoml.com)
| sherlockxu wrote:
| Hi everyone. I'm one of the maintainers of this project. We're
| both excited and humbled to see it on Hacker News!
|
| We created this handbook to make LLM inference concepts more
| accessible, especially for developers building real-world LLM
| applications. The goal is to pull together scattered knowledge
| into something clear, practical, and easy to build on.
|
| We're continuing to improve it, so feedback is very welcome!
|
| GitHub repo: https://github.com/bentoml/llm-inference-in-
| production
| armcat wrote:
| Amazing work on this, beautifully put together and very useful!
| DiabloD3 wrote:
| I'm not going to open an issue on this, but you should
| consider expanding on the self-hosting part of the handbook
| and explicitly recommending llama.cpp for local self-hosted
| inference.
| leopoldj wrote:
| The self-hosting section covers the corporate use case with
| vLLM and SGLang, as well as personal desktop use with Ollama,
| which is a wrapper over llama.cpp.
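|
| For the desktop case the surface area is tiny. A minimal
| sketch with the Ollama Python client (assuming the client is
| installed and a model tag like "llama3" has been pulled; both
| names are just examples):
|
|     import ollama
|
|     # Ollama runs llama.cpp behind a simple local HTTP API;
|     # the Python client wraps that API.
|     reply = ollama.chat(
|         model="llama3",
|         messages=[{"role": "user", "content": "Hello!"}],
|     )
|     print(reply["message"]["content"])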
| criemen wrote:
| Thanks a lot for putting this together!
|
| I have a question. In https://github.com/bentoml/llm-inference-
| in-production/blob/..., you have a single picture that defines
| TTFT and ITL. That does not match my understanding (but you
| guys probably know more than I do): in the graphic, it looks
| like the model generates 4 tokens, T0 to T3, before outputting
| a single output token.
|
| I'd have expected that picture for ITL (except that the
| labeling of the last box would then be off). For TTFT, I'd
| have expected only a single token T0 from the decode step,
| which is then immediately handed to detokenization and arrives
| as the first output token (assuming a streaming setup;
| otherwise measuring TTFT makes little sense).
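|
| For context, here's how I'd measure both in a streaming setup
| (a minimal Python sketch; "stream" stands in for any iterator
| that yields detokenized output chunks as they arrive):
|
|     import time
|
|     def measure(stream):
|         # The request is sent just before iteration starts.
|         start = time.perf_counter()
|         arrivals = []
|         for _ in stream:
|             arrivals.append(time.perf_counter())
|         # TTFT: delay until the first output token arrives.
|         ttft = arrivals[0] - start
|         # ITL: gaps between consecutive output tokens.
|         itl = [b - a for a, b in zip(arrivals, arrivals[1:])]
|         return ttft, itl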
| sethherr wrote:
| This seems useful and well put together, but splitting it into
| many small pages instead of a single page that can be scrolled
| through is frustrating - particularly on mobile where the table
| of contents isn't shown by default. I stopped reading after a
| few pages because it annoyed me.
|
| At the very least, the sections should be a single page each.
| aligundogdu wrote:
| It's a really beautiful project, and I'd like to ask something
| purely out of curiosity and with the best intentions. What's the
| name of the design trend you used for your website? I really
| loved the website too.
| Jimmc414 wrote:
| It appears to be using Infima (Docusaurus's default CSS
| framework) plus a standard system font stack [0].
|
| [0] font-family: -apple-system, BlinkMacSystemFont, "Segoe UI",
| Roboto, Helvetica, Arial, sans-serif;
| holografix wrote:
| Very good reference thanks for collating this!
| subset wrote:
| Ooh this looks really neat! I'd love to see more content in the
| future on Structured outputs/Guided generation and sampling.
| Another great reference on inference-time algorithms for sampling
| is here: https://rentry.co/samplers
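|
| For anyone new to the topic: samplers are small transforms
| applied to the model's next-token distribution at each decode
| step. A toy top-p (nucleus) sampler in Python, just to
| illustrate the idea (not how any particular engine does it):
|
|     import numpy as np
|
|     def top_p_sample(logits, p=0.9, rng=None):
|         rng = rng or np.random.default_rng()
|         # Softmax over the raw logits.
|         probs = np.exp(logits - logits.max())
|         probs /= probs.sum()
|         # Keep the smallest set of tokens whose cumulative
|         # probability reaches p, renormalize, then sample.
|         order = np.argsort(probs)[::-1]
|         csum = np.cumsum(probs[order])
|         keep = order[:np.searchsorted(csum, p) + 1]
|         kept = probs[keep] / probs[keep].sum()
|         return rng.choice(keep, p=kept)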
| larme wrote:
| Wow, that's really thorough.
| qrios wrote:
| Thanks for putting this together! From now on I only need one
| link to point interested people to.
|
| Only one suggestion: on the "OpenAI-compatible API" page, it
| would be great to also have a simple example of a raw REST
| call, without needing to import the OpenAI package.
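|
| Something like this would cover it (a sketch against the
| standard chat completions route; the endpoint, key, and model
| name here are placeholders):
|
|     import requests
|
|     resp = requests.post(
|         "http://localhost:3000/v1/chat/completions",
|         headers={"Authorization": "Bearer sk-placeholder"},
|         json={
|             "model": "my-model",
|             "messages": [
|                 {"role": "user", "content": "Hello!"},
|             ],
|         },
|     )
|     print(resp.json()["choices"][0]["message"]["content"])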
| srameshc wrote:
| If I remember correctly, BentoML was about MLOps; I tried it
| about a year back. Did the company pivot?
| fsjayess wrote:
| There is a big pie in the market around LLM serving. It makes
| sense for a serving framework to extend into that space.
| gchadwick wrote:
| Very glad to see this. There is (understandably) much excitement
| and focus on training models in publicly available material.
|
| Running them well is very important too. As we get to grips
| with everything models can do and look to deploy them widely,
| knowledge of how best to run them becomes ever more important.
___________________________________________________________________
(page generated 2025-07-11 23:00 UTC)