[HN Gopher] LLM Inference Handbook
___________________________________________________________________
LLM Inference Handbook
Author : djhu9
Score : 283 points
Date : 2025-07-11 02:40 UTC (20 hours ago)
(HTM) web link (bentoml.com)
(TXT) w3m dump (bentoml.com)
| sherlockxu wrote:
| Hi everyone. I'm one of the maintainers of this project. We're
| both excited and humbled to see it on Hacker News!
|
| We created this handbook to make LLM inference concepts more
| accessible, especially for developers building real-world LLM
| applications. The goal is to pull together scattered knowledge
| into something clear, practical, and easy to build on.
|
| We're continuing to improve it, so feedback is very welcome!
|
| GitHub repo: https://github.com/bentoml/llm-inference-in-
| production
| armcat wrote:
| Amazing work on this, beautifully put together and very useful!
| DiabloD3 wrote:
| I'm not going to open an issue on this, but you should
| consider expanding on the self-hosting part of the handbook
| and explicitly recommending llama.cpp for local self-hosted
| inference.
| leopoldj wrote:
| The self-hosting section covers the corporate use case with
| vLLM and SGLang, as well as personal desktop use with Ollama,
| which is a wrapper over llama.cpp.
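|
| For the desktop case the surface area is tiny. A minimal
| sketch with the Ollama Python client (assuming the client is
| installed and a model tag like "llama3" has been pulled; both
| names are just examples):
|
|     import ollama
|
|     # Ollama runs llama.cpp behind a simple local HTTP API;
|     # the Python client wraps that API.
|     reply = ollama.chat(
|         model="llama3",
|         messages=[{"role": "user", "content": "Hello!"}],
|     )
|     print(reply["message"]["content"])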
| criemen wrote:
| Thanks a lot for putting this together!
|
| I have a question. In https://github.com/bentoml/llm-inference-
| in-production/blob/..., you have a single picture that defines
| TTFT and ITL. That does not match my understanding (but you
| guys probably know more than I do): in the graphic, it looks
| like the model generates 4 tokens, T0 to T3, before outputting
| a single output token.
|
| I'd have expected that picture for ITL (except that the
| labeling of the last box would then be off). For TTFT, I'd
| have expected only a single token T0 from the decode step,
| which is then immediately handed to detokenization and arrives
| as the first output token (assuming a streaming setup;
| otherwise measuring TTFT makes little sense).
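|
| For context, here's how I'd measure both in a streaming setup
| (a minimal Python sketch; "stream" stands in for any iterator
| that yields detokenized output chunks as they arrive):
|
|     import time
|
|     def measure(stream):
|         # The request is sent just before iteration starts.
|         start = time.perf_counter()
|         arrivals = []
|         for _ in stream:
|             arrivals.append(time.perf_counter())
|         # TTFT: delay until the first output token arrives.
|         ttft = arrivals[0] - start
|         # ITL: gaps between consecutive output tokens.
|         itl = [b - a for a, b in zip(arrivals, arrivals[1:])]
|         return ttft, itl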
| sethherr wrote:
| This seems useful and well put together, but splitting it into
| many small pages instead of a single page that can be scrolled
| through is frustrating - particularly on mobile where the table
| of contents isn't shown by default. I stopped reading after a
| few pages because it annoyed me.
|
| At the very least, the sections should be a single page each.
| aligundogdu wrote:
| It's a really beautiful project, and I'd like to ask something
| purely out of curiosity and with the best intentions. What's the
| name of the design trend you used for your website? I really
| loved the website too.
| Jimmc414 wrote:
| It appears to be using Infima (Docusaurus's default CSS
| framework) plus a standard system font stack [0].
|
| [0] font-family: -apple-system, BlinkMacSystemFont, "Segoe UI",
| Roboto, Helvetica, Arial, sans-serif;
| holografix wrote:
| Very good reference thanks for collating this!
| subset wrote:
| Ooh this looks really neat! I'd love to see more content in the
| future on Structured outputs/Guided generation and sampling.
| Another great reference on inference-time algorithms for sampling
| is here: https://rentry.co/samplers
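|
| For anyone new to the topic: samplers are small transforms
| applied to the model's next-token distribution at each decode
| step. A toy top-p (nucleus) sampler in Python, just to
| illustrate the idea (not how any particular engine does it):
|
|     import numpy as np
|
|     def top_p_sample(logits, p=0.9, rng=None):
|         rng = rng or np.random.default_rng()
|         # Softmax over the raw logits.
|         probs = np.exp(logits - logits.max())
|         probs /= probs.sum()
|         # Keep the smallest set of tokens whose cumulative
|         # probability reaches p, renormalize, then sample.
|         order = np.argsort(probs)[::-1]
|         csum = np.cumsum(probs[order])
|         keep = order[:np.searchsorted(csum, p) + 1]
|         kept = probs[keep] / probs[keep].sum()
|         return rng.choice(keep, p=kept)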
| larme wrote:
| Wow, that's really thorough.
| qrios wrote:
| Thanks for putting this together! From now on I only need one
| link to point interested people to.
|
| Only one suggestion: on the "OpenAI-compatible API" page, it
| would be great to also have a simple example of a raw REST
| call, without needing to import the OpenAI package.
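|
| Something like this would cover it (a sketch against the
| standard chat completions route; the endpoint, key, and model
| name here are placeholders):
|
|     import requests
|
|     resp = requests.post(
|         "http://localhost:3000/v1/chat/completions",
|         headers={"Authorization": "Bearer sk-placeholder"},
|         json={
|             "model": "my-model",
|             "messages": [
|                 {"role": "user", "content": "Hello!"},
|             ],
|         },
|     )
|     print(resp.json()["choices"][0]["message"]["content"])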
| srameshc wrote:
| If I remember correctly, BentoML was about MLOps; I tried it
| about a year back. Did the company pivot?
| fsjayess wrote:
| There is a big pie in the market around LLM serving. It makes
| sense for a serving framework to extend into that space.
| gchadwick wrote:
| Very glad to see this. There is (understandably) much excitement
| and focus on training models in publicly available material.
|
| Running them well is very important too. As we get to grips
| with everything models can do and look to deploy them widely,
| knowledge of how best to run them becomes ever more important.
___________________________________________________________________
(page generated 2025-07-11 23:00 UTC)