[HN Gopher] Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe...
___________________________________________________________________
Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU
bypassing the CPU
Hi everyone, I'm somewhat involved in retrogaming, and during some
experiments I ran into the following question: "Would it be possible
to run transformer models by bypassing the CPU/RAM and connecting
the GPU directly to the NVMe?" This is the result of that question
and some weekend vibecoding (the linked library repository is in the
README as well). It seems to work, even on consumer GPUs, though it
should work better on professional ones.
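The repository itself isn't quoted here, but the core idea of streaming a model that doesn't fit in VRAM layer by layer from storage into a single reusable buffer can be sketched as follows. This is a minimal illustration under assumptions: ordinary buffered file I/O stands in for an NVMe-to-GPU path such as GPUDirect Storage/cuFile, and all names (`stream_layers`, the fake checkpoint) are illustrative, not the author's API.

```python
import io

def stream_layers(f, layer_sizes, buf):
    """Yield each layer's raw bytes, reusing one preallocated buffer.

    In the real NVMe-to-GPU setup, `buf` would be a pinned or
    GPU-resident buffer and the read would be a direct-storage call;
    here a plain readinto() stands in for that transfer.
    """
    for size in layer_sizes:
        view = memoryview(buf)[:size]
        n = f.readinto(view)
        if n != size:
            raise IOError("short read while streaming a layer")
        yield view  # caller must consume/copy before the next iteration

# Demo with an in-memory "checkpoint" of three fake layers.
layer_sizes = [16, 32, 8]
blob = bytes(range(sum(layer_sizes)))        # 56 bytes of fake weights
buf = bytearray(max(layer_sizes))            # one buffer, reused per layer
f = io.BytesIO(blob)
chunks = [bytes(v) for v in stream_layers(f, layer_sizes, buf)]
```

Because only one layer's weights are resident at a time, peak memory is bounded by the largest layer rather than the whole model, which is what makes a 70B model feasible on a 24 GB card at the cost of re-reading weights every token.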
Author : xaskasdf
Score : 8 points
Date : 2026-02-21 20:57 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| randomtoast wrote:
| 0.2 tok/s is fine for experimentation, but it is not interactive
| in any meaningful sense. For many use cases, a well-quantized 8B
| or 13B that stays resident will simply deliver a better
| latency-quality tradeoff.
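The 0.2 tok/s figure is roughly what NVMe bandwidth predicts if every generated token has to re-stream the full weight set. A quick back-of-envelope check, using assumed numbers (4-bit quantization, a ~7 GB/s PCIe 4.0 NVMe) that are not stated in the post:

```python
# Assumption: each token re-reads all weights from NVMe, so decoding is
# storage-bandwidth-bound rather than compute-bound.
params = 70e9              # Llama 3.1 70B
bytes_per_param = 0.5      # assumed ~4-bit quantization
nvme_bytes_per_s = 7e9     # assumed PCIe 4.0 NVMe sequential read

weights_bytes = params * bytes_per_param          # ~35 GB per token pass
tok_per_s = nvme_bytes_per_s / weights_bytes      # ~0.2 tok/s
```

Under these assumptions the drive's sequential read speed, not the GPU, is the ceiling, which is consistent with the reported 0.2 tok/s.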
___________________________________________________________________
(page generated 2026-02-21 23:00 UTC)