[HN Gopher] Nano-Vllm: Lightweight vLLM implementation built fro...
___________________________________________________________________
Nano-Vllm: Lightweight vLLM implementation built from scratch
Author : simonpure
Score : 95 points
Date : 2025-06-23 05:10 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| unwind wrote:
| Meta: the Title Casing in the title is pretty obnoxious, "Vllm"
| is exactly the inverse, casing-wise, of how the project spells
| its name.
| futurecliff wrote:
| How did you do it? Which part of the vLLM refactoring allowed you
| to get such gains?
| zackify wrote:
| Will this end up getting an OpenAI-compatible web server, or is
| that out of scope?
| jimmySixDOF wrote:
| A little sparse on the documentation side; I can't tell at a
| glance whether there is 1:1 hyperparameter tunability or whether
| this is an opinionated, single-path, locked-down, soft-FPGA,
| eval-hacking kind of thing.
|
| EDIT: OK, it's legit. Here is an example of it put to use by the
| makers of the Dolphin open-source series of fine-tunes:
|
| > Here I implement, in nano-vllm, efficient sample-K logit
| extraction, as described in "Sparse Logit Sampling: Accelerating
| Knowledge Distillation in LLMs" by Anshumann et al. Sampling
| occurs on the GPU; the non-sampled logits never get copied out
| of GPU memory. I tried to implement this in @vllm_project, but
| it was a bit too heavy for me to figure out.
|
| https://github.com/GeeeekExplorer/nano-vllm/pull/34
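The quoted PR describes the idea at a high level. A minimal PyTorch
sketch of that idea (not the PR's actual code; the function name and
the K=64 default are illustrative) would sample K token ids per
position on the GPU and copy only that sparse slice to the host:

    import torch

    def sample_k_logits(logits: torch.Tensor, k: int = 64):
        """Sketch of sample-K logit extraction for distillation.

        logits: [batch, vocab] tensor already on the GPU.
        Returns K sampled token ids and their logits on the CPU;
        the full [batch, vocab] tensor never leaves GPU memory.
        """
        probs = torch.softmax(logits.float(), dim=-1)
        # Sample K distinct token ids per row, on-device.
        ids = torch.multinomial(probs, k, replacement=False)
        vals = torch.gather(logits, dim=-1, index=ids)
        # Only this small [batch, k] slice is copied off the GPU.
        return ids.cpu(), vals.cpu()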
| baalimago wrote:
| So... it's a language model...? As in, not "large"? I'm a bit
| unsure of the magnitudes here, but surely "nano" and "large"
| cancel out.
| IanCal wrote:
| No, vLLM is a thing for serving language models:
| https://github.com/vllm-project/vllm
| barrenko wrote:
| Is it more like llama.cpp then? I don't have access to the
| good hardware.
| fractorial wrote:
| Did anyone else click in excitedly after misreading 'Vllm' as
| 'LLVM'?
| omneity wrote:
| This is an incredible achievement for a solo developer. The dev
| is from the DeepSeek team, by the way.
| Imustaskforhelp wrote:
| That is crazy! This is so cool ngl.
| tt726259 wrote:
| After seeing the Docker image for vLLM grow by 5 GB (to 10 GB!)
| over the past five months, I grew suspicious of vLLM's
| development practices [1]. It's not easy, for sure, to deal with
| all those flaky Python modules [2].
|
| But having the CUDA packages four times in different layers is
| questionable! [3]
|
| Yet again, as a college mate of mine used to say, "Don't change
| it. It works."
|
| --
|
| [1]: https://hub.docker.com/r/vllm/vllm-openai/tags
|
| [2]: https://github.com/vllm-project/vllm/issues/13306
|
| [3]: These kinds of workarounds tend to accumulate and never get
| revisited:
|
| - https://github.com/vllm-project/vllm/commit/b07d741661570ef1...
|
| - https://github.com/vllm-project/vllm/commit/68d37809b9b52f4d...
| (this one in particular probably accounts for about 3 GB of the
| increase)
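Not from the thread, but one quick way to check the layer claim is
to dump `docker history` and look at which steps carry the bulk of
the size (the image tag below is only an example; the Size and
CreatedBy fields are standard docker output):

    import json
    import subprocess

    IMAGE = "vllm/vllm-openai:latest"  # example tag from [1]

    cmd = ["docker", "history", "--no-trunc",
           "--format", "{{json .}}", IMAGE]
    out = subprocess.run(cmd, capture_output=True, text=True,
                         check=True).stdout

    for line in out.splitlines():
        layer = json.loads(line)
        # Size is human-readable (e.g. "3.2GB"); CreatedBy is the
        # Dockerfile step that produced the layer.
        print(f'{layer["Size"]:>10}  {layer["CreatedBy"][:80]}')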
| b0a04gl wrote:
| I was skimming through this and was kind of surprised by how
| tight the whole thing is. It does about 90% of what vLLM does,
| but the code is readable end to end: no extra infra, no
| orchestration layers yelling at you. I got it running locally in
| minutes, and throughput actually beat vLLM on my 4070. I wasn't
| expecting that.
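For context on the "running locally in minutes" claim, a minimal
sketch of what that usually looks like, assuming nano-vllm mirrors
vLLM's offline LLM/SamplingParams interface (the model path,
arguments, and output structure below are placeholders and may
differ from the project's actual API):

    from nanovllm import LLM, SamplingParams  # assumed API surface

    # Placeholder path to a locally downloaded Hugging Face model.
    llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True)
    params = SamplingParams(temperature=0.6, max_tokens=256)

    outputs = llm.generate(["Hello, Nano-vLLM."], params)
    print(outputs[0]["text"])  # output structure assumed; see README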
|
| If we can get this level of performance in 1.2k lines, what if
| we go the other way: split the model across devices or even
| machines, stream token by token, but keep the prefix cache
| consistent across hops? Can we design inference engines that
| think in terms of modular attention scopes instead of monolithic
| graphs? Is that even possible?
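On the "prefix cache consistent across hops" question: one common
scheme (a toy sketch, not something nano-vllm or vLLM is claimed to
do across machines) is to content-address KV blocks by chaining a
hash of each fixed-size token block with its parent's hash, so any
worker that processed the same prefix derives the same keys and a
shared store can serve the cached blocks:

    import hashlib
    from typing import Optional

    BLOCK_SIZE = 16  # tokens per KV block; illustrative

    def prefix_block_keys(token_ids: list[int],
                          parent: Optional[str] = None) -> list[str]:
        """Content-addressed keys for the full KV blocks of a prefix."""
        keys = []
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for i in range(0, full, BLOCK_SIZE):
            chunk = token_ids[i:i + BLOCK_SIZE]
            parent = hashlib.sha256(
                f"{parent}:{chunk}".encode()).hexdigest()
            keys.append(parent)
        return keys

    # Two workers that tokenized the same prompt compute identical
    # keys, so a shared KV store (in-process dict, Redis, etc.) can
    # hand either of them the cached blocks instead of recomputing.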
| mountainriver wrote:
| Love this project; we need more simplifications like this in the
| current ML environment.
___________________________________________________________________
(page generated 2025-06-23 23:01 UTC)