[HN Gopher] Tokasaurus: An LLM Inference Engine for High-Throughput Workloads
       ___________________________________________________________________
        
       Tokasaurus: An LLM Inference Engine for High-Throughput Workloads
        
       Author : rsehrlich
       Score  : 43 points
       Date   : 2025-06-05 21:27 UTC (1 hour ago)
        
 (HTM) web link (scalingintelligence.stanford.edu)
 (TXT) w3m dump (scalingintelligence.stanford.edu)
        
       | behnamoh wrote:
       | While Tokasaurus's Async-TP shows impressive throughput gains, it
       | seems over-engineered for common use cases. The extra CPU
       | overhead of async tensor parallelism only pays off at batches of
       | 6k+ tokens, and you need NVLink-connected GPUs to see real
       | benefits. Most prod deployments don't need this complexity --
       | you're better off with simpler approaches unless you're
       | specifically optimizing for massive batch throughput. The
       | adaptive manager skipping "optional" tasks under load also feels
       | concerning from a reliability perspective.
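       | 
       | As a rough illustration, it's a threshold trade-off. A minimal
       | sketch in Python (the names and exact cutoff are hypothetical,
       | not Tokasaurus's actual code; the post cites ~6k tokens):
       | 
       |   # Hypothetical sketch: async TP spends extra CPU time
       |   # coordinating overlapped compute and communication, so it
       |   # only wins once the batch is big enough to amortize it.
       |   ASYNC_TP_MIN_TOKENS = 6144  # assumed ~6k-token cutoff
       | 
       |   def use_async_tp(batch_tokens, has_nvlink):
       |       # Without NVLink, the overlap can't hide transfer
       |       # latency, so plain synchronous TP is the better bet.
       |       return has_nvlink and batch_tokens >= ASYNC_TP_MIN_TOKENS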
        
         | bjt12345 wrote:
         | But surely next year's production deployments will be very
         | different from today's, with different use cases, etc.
        
         | YetAnotherNick wrote:
         | Depends on what production means for you. This is useful for
         | batch production jobs.
         | 
         | Also, this seems very useful for generating synthetic data or
         | labelling a bunch of data. A 6k batch size is small for data
         | labelling.
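         | 
         | Client-side, a labelling job like that is embarrassingly
         | parallel. A minimal sketch, assuming the server exposes an
         | OpenAI-compatible /v1/chat/completions endpoint (vLLM and
         | SGLang do; the URL and model name are placeholders):
         | 
         |   import requests
         |   from concurrent.futures import ThreadPoolExecutor
         | 
         |   URL = "http://localhost:8000/v1/chat/completions"
         | 
         |   def label_one(text):
         |       # Each request is independent; the server batches
         |       # them, which is where a throughput-oriented engine
         |       # earns its keep.
         |       prompt = "Label the sentiment: " + text
         |       r = requests.post(URL, json={
         |           "model": "my-model",  # placeholder
         |           "messages": [{"role": "user", "content": prompt}],
         |           "max_tokens": 8,
         |       })
         |       return r.json()["choices"][0]["message"]["content"]
         | 
         |   def label_all(texts, workers=64):
         |       # Keep many requests in flight so the server can form
         |       # large batches instead of serving one at a time.
         |       with ThreadPoolExecutor(max_workers=workers) as ex:
         |           return list(ex.map(label_one, texts))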
        
       | nabakin wrote:
       | > On throughput-focused benchmarks, Tokasaurus can outperform
       | vLLM and SGLang by up to 3x+.
       | 
       | Looks like they don't compare to TensorRT-LLM throughput numbers
       | which, last I checked, are SOTA in open source.
        
       | symbolicAGI wrote:
       | Given the low-latency needs of chat and API serving, llama.cpp
       | is probably still the best choice for self-hosted models, with
       | or without GPU support. And Ollama is the leading wrapper around
       | llama.cpp.
       | 
       | Because Tokasaurus was mentioned as better than Ollama for
       | running Darwinian Gödel machine workloads (self-improvement), I
       | looked for the linked repo on GitHub and it was 404. Glad it is
       | back: https://github.com/ScalingIntelligence/tokasaurus.
        
       ___________________________________________________________________
       (page generated 2025-06-05 23:00 UTC)