[HN Gopher] Nano-Vllm: Lightweight vLLM implementation built from scratch
       ___________________________________________________________________
        
       Nano-Vllm: Lightweight vLLM implementation built from scratch
        
       Author : simonpure
       Score  : 95 points
       Date   : 2025-06-23 05:10 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | unwind wrote:
        | Meta: the Title Casing in the title is pretty obnoxious;
        | "Vllm" is exactly the inverse, casing-wise, of how the
        | project spells its name.
        
       | futurecliff wrote:
        | How did you do it? Which portion of the vLLM refactoring
        | allowed you to get such gains?
        
       | zackify wrote:
        | Will this end up getting an OpenAI-compatible web server, or
        | is that out of scope?
        
       | jimmySixDOF wrote:
        | A little sparse on the documentation side; I can't tell at a
        | glance whether there is 1:1 hyperparameter tunability or
        | whether this is an opinionated, single-path, locked-down,
        | soft-FPGA, eval-hacking kind of thing.
       | 
        | EDIT: OK, it's legit. Here is an example of it put to use by
        | the makers of the Dolphin open-source series of fine-tunes:
        | 
        | > Here I implement efficient sample-K logit extraction in
        | nano-vllm, as described in "Sparse Logit Sampling:
        | Accelerating Knowledge Distillation in LLMs" by Anshumann et
        | al. Sampling occurs on the GPU; the non-sampled logits do not
        | get copied out of GPU space. I tried to implement this in
        | @vllm_project, but it was a bit too heavy for me to figure
        | out.
       | 
       | https://github.com/GeeeekExplorer/nano-vllm/pull/34
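        | 
        | A minimal sketch of the sample-K idea (not the PR's actual
        | code), assuming PyTorch-style [batch, vocab] logits already
        | on the GPU: draw K token ids per position from the model's
        | distribution, gather only those K logits, and copy just that
        | small tensor to the host, so the full logits never leave GPU
        | memory.
        | 
        |     import torch
        | 
        |     def sample_k_logits(logits: torch.Tensor, k: int = 64):
        |         """logits: [batch, vocab], already on the GPU."""
        |         probs = torch.softmax(logits, dim=-1)
        |         # Draw K token ids per row, on device.
        |         ids = torch.multinomial(probs, num_samples=k,
        |                                 replacement=True)
        |         # Gather the matching logits: shape [batch, k].
        |         vals = torch.gather(logits, dim=-1, index=ids)
        |         # Only the K sampled (id, logit) pairs leave the GPU.
        |         return ids.cpu(), vals.cpu()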
        
       | baalimago wrote:
        | So... it's a language model...? As in, not "large"? I'm a bit
        | unsure of the magnitudes here, but surely "nano" and "large"
        | cancel out.
        
         | IanCal wrote:
          | No, vLLM is an engine for serving large language models:
         | https://github.com/vllm-project/vllm
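          | 
          | Roughly the shape of its offline-inference quickstart (from
          | memory, so check the docs): you hand it a model plus
          | sampling params, and it handles batching, the paged KV
          | cache, and scheduling for you.
          | 
          |     from vllm import LLM, SamplingParams
          | 
          |     llm = LLM(model="facebook/opt-125m")
          |     params = SamplingParams(temperature=0.8, top_p=0.95)
          |     outputs = llm.generate(["Hello, my name is"], params)
          |     print(outputs[0].outputs[0].text)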
        
           | barrenko wrote:
           | Is it more like llama.cpp then? I don't have access to the
           | good hardware.
        
       | fractorial wrote:
        | Did anyone else click in excitedly after misreading 'Vllm' as
        | 'LLVM'?
        
       | omneity wrote:
        | This is an incredible achievement for a solo developer. The
        | dev is from the DeepSeek team, by the way.
        
         | Imustaskforhelp wrote:
         | That is crazy! This is so cool ngl.
        
       | tt726259 wrote:
        | After seeing the Docker image for vLLM jump by 5 GB (to
        | 10 GB!) over the past five months, I grew suspicious of
        | vLLM's development practices [1]. It's not easy, for sure, to
        | deal with all those flaky Python modules [2].
        | 
        | But having the CUDA packages four times in different layers
        | is questionable! [3]
        | 
        | Then again, as a college mate of mine used to say, "Don't
        | change it. It works."
       | 
       | --
       | 
       | [1]: https://hub.docker.com/r/vllm/vllm-openai/tags
       | 
       | [2]: https://github.com/vllm-project/vllm/issues/13306
       | 
        | [3]: These kinds of workarounds tend to accumulate and never
        | get revisited:
       | 
       | - https://github.com/vllm-project/vllm/commit/b07d741661570ef1...
       | 
       | - https://github.com/vllm-project/vllm/commit/68d37809b9b52f4d...
        | (this one in particular probably accounts for +3 GB)
        
       | b0a04gl wrote:
        | I was skimming through this and was kind of surprised by how
        | tight the whole thing is. It does 90% of what vLLM does, but
        | the code is readable end to end. No extra infra, no
        | orchestration layers yelling at you. I got it running locally
        | in minutes, and throughput actually beat vLLM on my 4070. I
        | wasn't expecting that.
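        | 
        | For reference, "running it locally" is only a few lines; the
        | API mirrors vLLM's. Roughly (the exact argument names and
        | output shape here are my guess, check the repo's README):
        | 
        |     from nanovllm import LLM, SamplingParams
        | 
        |     llm = LLM("/path/to/a/local/hf/model")
        |     params = SamplingParams(temperature=0.6, max_tokens=256)
        |     out = llm.generate(["Hello, Nano-vLLM."], params)
        |     print(out[0]["text"])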
       | 
        | If we can get this level of performance in 1.2k lines, what
        | if we go the other way: split the model across devices or
        | even machines, stream token by token, but keep the prefix
        | cache consistent across hops? Can we design inference engines
        | that think in terms of modular attention scopes instead of
        | monolithic graphs? Is it even possible?
        
       | mountainriver wrote:
        | Love this project; we need more simplifications like this in
        | the current ML environment.
        
       ___________________________________________________________________
       (page generated 2025-06-23 23:01 UTC)