[HN Gopher] Tokasaurus: An LLM Inference Engine for High-Through...
___________________________________________________________________
Tokasaurus: An LLM Inference Engine for High-Throughput Workloads
Author : rsehrlich
Score : 43 points
Date : 2025-06-05 21:27 UTC (1 hour ago)
(HTM) web link (scalingintelligence.stanford.edu)
(TXT) w3m dump (scalingintelligence.stanford.edu)
| behnamoh wrote:
| While Tokasaurus's Async-TP shows impressive throughput gains, it
| seems over-engineered for common use cases. Async tensor
| parallelism adds CPU overhead that only pays off at batches of
| 6k+ tokens, and you need NVLink-connected GPUs to see real
| benefits. Most production deployments don't need this complexity
| -- you're better off with simpler approaches unless you're
| specifically optimizing for massive batch throughput. The
| adaptive manager skipping "optional" tasks under load also feels
| concerning from a reliability perspective.
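|
| For context on why it only pays off at large batches: the core
| trick is to split the work into chunks and overlap the collective
| for the next chunk with the matmul on the current one, so NVLink
| traffic hides behind compute. A hypothetical sketch of that
| overlap (not Tokasaurus's actual code; assumes torch.distributed
| is already initialized with NCCL):
|
|     import torch
|     import torch.distributed as dist
|
|     def overlapped_matmul(chunks, weight):
|         # chunks: list of [tokens, hidden] activation shards
|         # weight: local [hidden, out] weight shard
|         world = dist.get_world_size()
|         bufs = [torch.empty_like(chunks[0]) for _ in range(world)]
|         work = dist.all_gather(bufs, chunks[0], async_op=True)
|         outs = []
|         for i in range(len(chunks)):
|             work.wait()                  # comm for chunk i is done
|             ready = torch.cat(bufs, dim=0)
|             if i + 1 < len(chunks):      # pre-post comm for i + 1
|                 bufs = [torch.empty_like(chunks[i + 1])
|                         for _ in range(world)]
|                 work = dist.all_gather(bufs, chunks[i + 1],
|                                        async_op=True)
|             outs.append(ready @ weight)  # overlaps the next comm
|         return torch.cat(outs, dim=0)
|
| Until the batch is big enough, the extra CPU-side bookkeeping per
| chunk costs more than the hidden communication saves.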
| bjt12345 wrote:
| But surely next year's production deployments will look very
| different from today's, with different use cases, etc.
| YetAnotherNick wrote:
| Depends on what production means for you. This is useful for
| batch production jobs.
|
| Also, this seems very useful for generating synthetic data or
| labelling a large dataset; a 6k-token batch is small for data
| labelling.
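|
| For that kind of job you can fire every item at the server
| concurrently and let the engine's batcher pack the GPU --
| hypothetical sketch, assuming an OpenAI-compatible
| /v1/completions endpoint (URL and model name are placeholders):
|
|     import requests
|     from concurrent.futures import ThreadPoolExecutor
|
|     URL = "http://localhost:8000/v1/completions"  # placeholder
|
|     def label(text):
|         # One request per item; the server batches them together.
|         r = requests.post(URL, json={
|             "model": "my-model",  # placeholder model name
|             "prompt": f"Sentiment of: {text}\nLabel:",
|             "max_tokens": 4,
|             "temperature": 0.0,
|         })
|         return r.json()["choices"][0]["text"].strip()
|
|     texts = ["great product", "arrived broken"]  # toy examples
|     with ThreadPoolExecutor(max_workers=256) as pool:
|         labels = list(pool.map(label, texts))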
| nabakin wrote:
| > On throughput-focused benchmarks, Tokasaurus can outperform
| vLLM and SGLang by up to 3x+.
|
| Looks like they don't compare to TensorRT-LLM throughput numbers,
| which, last I checked, were SOTA among open-source engines.
| symbolicAGI wrote:
| Given chat and API needs for low latency, llama.cpp is probably
| still the best choice for self-hosted models, with or without GPU
| support. And Ollama is the leader in wrapping llama.cpp.
|
| Because Tokasaurus was mentioned as better than Ollama for
| Darwinian Gödel Machine operations (self-improvement), I looked
| for the linked repo on GitHub and it was a 404. So glad it is
| back: https://github.com/ScalingIntelligence/tokasaurus.
___________________________________________________________________
(page generated 2025-06-05 23:00 UTC)