[HN Gopher] Fast LLM Inference From Scratch (using CUDA)
       ___________________________________________________________________
        
       Fast LLM Inference From Scratch (using CUDA)
        
       Author : homarp
       Score  : 119 points
       Date   : 2024-12-14 16:02 UTC (1 day ago)
        
 (HTM) web link (andrewkchan.dev)
 (TXT) w3m dump (andrewkchan.dev)
        
       | fancyfredbot wrote:
       | I don't think this code can make use of the tensor cores, or the
       | wgmma instructions that you typically need to get peak
       | performance out of them.
       | 
       | Programming these is a nightmare as you need to have several in
       | flight concurrently for peak performance.
       | 
       | Perhaps you don't need the extra flops as you end up bandwidth
       | bound?
       | 
       | Regardless, the good thing about the code in the blog is that
       | it'll probably work pretty well on other accelerators if you
       | port it to HIP or similar. If you use wgmma, I'm not sure it'll
       | even be portable across Nvidia generations.
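
        A rough sketch of the "bandwidth bound" question above: assuming
        ~7B parameters held in fp16 and ~1 TB/s of device memory bandwidth
        (illustrative numbers, not figures from the article), each decoded
        token has to stream essentially every weight once, so memory
        traffic alone caps throughput and the arithmetic intensity is far
        too low for tensor cores to be the limiting factor.

          // Back-of-envelope roofline for batch-1 decode; the model size
          // and bandwidth below are assumptions, not measurements.
          #include <cstdio>

          int main() {
              const double params       = 7.0e9;    // assumed parameter count
              const double bytes_per_w  = 2.0;      // fp16 weights
              const double mem_bw_Bps   = 1.0e12;   // assumed ~1 TB/s
              const double weight_bytes = params * bytes_per_w;

              // One pass over all weights per generated token caps tokens/s.
              printf("bandwidth ceiling: ~%.0f tokens/s\n",
                     mem_bw_Bps / weight_bytes);

              // ~2 FLOPs (one multiply-add) per 2-byte weight moved.
              printf("arithmetic intensity: ~%.1f FLOP/byte\n",
                     2.0 * params / weight_bytes);
              return 0;
          }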
        
         | chillee wrote:
         | For latency-bound inference (i.e. one request) you don't
         | need tensor cores, since all your operations are just
         | matrix-vector multiplications.
        
           | fancyfredbot wrote:
           | Good point, yes. That explains why he's getting performance
           | similar to the leading frameworks. Those tensor operations
           | are helpful for training or for throughput-optimised batched
           | inference but not really for a batch size of one.
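
        To make "just matrix-vector multiplications" concrete, below is a
        minimal CUDA sketch of the kind of kernel batch-1 decode reduces
        to: a plain fp32 y = W*x with one warp per output row. It is
        illustrative only; the article's actual kernels, data types, and
        layouts may differ.

          #include <cuda_runtime.h>

          // y = W * x for a [rows x cols] row-major W, one warp per row.
          __global__ void matvec(const float* __restrict__ W,
                                 const float* __restrict__ x,
                                 float* __restrict__ y,
                                 int rows, int cols) {
              int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
              int lane = threadIdx.x % 32;
              if (row >= rows) return;

              // Consecutive lanes read consecutive weights, so loads of W
              // are coalesced; every weight is touched exactly once per
              // token, which is why bandwidth (not FLOPs) is the limit.
              float acc = 0.0f;
              for (int j = lane; j < cols; j += 32)
                  acc += W[(size_t)row * cols + j] * x[j];

              // Warp-level tree reduction of the 32 partial sums.
              for (int offset = 16; offset > 0; offset >>= 1)
                  acc += __shfl_down_sync(0xffffffffu, acc, offset);

              if (lane == 0) y[row] = acc;
          }

          // Example launch: 128 threads = 4 warps per block, one warp per
          // output row.
          //   matvec<<<(rows + 3) / 4, 128>>>(d_W, d_x, d_y, rows, cols);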
        
       | shihab wrote:
       | Excellent, amazing article.
       | 
       | To the author, if you're lurking here, I have a tangential
       | question: how long did it take you to write this article, from
       | the first line of code to the last line of this post?
       | 
       | As someone who works in the GPGPU space, I can imagine myself
       | writing an article of this sort, but the huge uncertainty around
       | the time needed has deterred me so far.
        
       | Const-me wrote:
       | I wonder how the perf in tokens/second compares to my
       | version of Mistral: https://github.com/Const-
       | me/Cgml/tree/master/Mistral/Mistral...
       | 
       | BTW, see that section of the readme about quantization:
       | https://github.com/Const-me/Cgml/tree/master?tab=readme-ov-f...
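
        For a tokens/second comparison like the one asked about above, the
        number usually comes from timing a fixed run of decode steps after
        a short warm-up. A minimal harness is sketched below;
        generate_one_token() is a hypothetical placeholder for whichever
        engine is being measured, not an API from either codebase.

          #include <chrono>
          #include <cstdio>

          // Hypothetical decode step; replace with the engine under test.
          static int generate_one_token() { return 0; }

          int main() {
              const int warmup = 8, steps = 128;
              // Exclude model load / prefill / first-launch overheads.
              for (int i = 0; i < warmup; ++i) generate_one_token();

              auto t0 = std::chrono::steady_clock::now();
              for (int i = 0; i < steps; ++i) generate_one_token();
              auto t1 = std::chrono::steady_clock::now();

              double secs = std::chrono::duration<double>(t1 - t0).count();
              printf("%.1f tokens/s\n", steps / secs);
              return 0;
          }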
        
       | diego898 wrote:
       | This is great, thank you!
       | 
       | Does anyone know of something similar in Python? I want to share
       | with my team something similar to this that goes into (almost)
       | everything (at least conceptually) needed to efficiently serve an
       | LLM.
       | 
       | It doesn't actually need to be performant, mind you (it's in
       | Python); I just need something "conceptually complete" while
       | being more "tutorial style" and concise than the vLLM codebase.
        
       ___________________________________________________________________
       (page generated 2024-12-15 23:00 UTC)