[HN Gopher] Fast LLM Inference From Scratch (using CUDA)
___________________________________________________________________
Fast LLM Inference From Scratch (using CUDA)
Author : homarp
Score : 119 points
Date : 2024-12-14 16:02 UTC (1 day ago)
(HTM) web link (andrewkchan.dev)
(TXT) w3m dump (andrewkchan.dev)
| fancyfredbot wrote:
| I don't think this code can make use of the tensor cores, or the
| wgmma instructions that you typically need to get peak
| performance out of them.
|
| Programming these is a nightmare as you need to have several in
| flight concurrently for peak performance.
|
| Perhaps you don't need the extra flops as you end up bandwidth
| bound?
|
| Regardless, the good thing about the code in the blog is that it'll
| probably work pretty well for other accelerators if you port it to
| HIP or similar. If you use wgmma, I'm not sure it'll even be
| portable across Nvidia generations.
| chillee wrote:
| For latency-bound inference (i.e. one request) you don't need
| tensor cores, since all your operations are just matrix-vector
| multiplications.
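|
| To make that concrete: a single decode step multiplies each weight
| matrix by one activation vector, y = W x. A minimal CUDA sketch of
| such a kernel (names, layout, and launch config are illustrative,
| not taken from the article) might look like:
|
|     // y = W * x for one token; W is d_out x d_in, row-major
|     __global__ void matvec(const float* __restrict__ W,
|                            const float* __restrict__ x,
|                            float* __restrict__ y,
|                            int d_in, int d_out) {
|       int row = blockIdx.x * blockDim.x + threadIdx.x;
|       if (row >= d_out) return;
|       float acc = 0.0f;
|       for (int j = 0; j < d_in; ++j) {
|         // each element of W is read exactly once per token, so the
|         // kernel is limited by DRAM bandwidth, not by FLOPs
|         acc += W[(long long)row * d_in + j] * x[j];
|       }
|       y[row] = acc;
|     }
|
|     // illustrative launch: one thread per output row
|     // matvec<<<(d_out + 255) / 256, 256>>>(d_W, d_x, d_y, d_in, d_out);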
| fancyfredbot wrote:
| Good point, yes. That explains why he's getting performance
| similar to the leading frameworks. Those tensor operations are
| helpful for training or for throughput-optimised batched
| inference, but not really for a batch size of one.
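|
| A back-of-the-envelope roofline makes the same point. The numbers
| below are hypothetical (not measurements from the post), but the
| shape of the argument holds for any batch-1 decode:
|
|     #include <cstdio>
|
|     int main() {
|       // hypothetical figures: a 7B-parameter model in fp16 on a
|       // GPU with ~900 GB/s of memory bandwidth
|       double bytes_per_token = 7e9 * 2.0;  // every weight byte is
|       double bandwidth_Bps   = 900e9;      // streamed once per token
|       printf("ceiling: %.1f tok/s\n",
|              bandwidth_Bps / bytes_per_token);  // ~64 tok/s
|       return 0;
|     }
|
| Extra tensor-core throughput can't raise that ceiling; quantization
| (fewer bytes per weight) or batching (reusing each weight across
| requests) can.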
| shihab wrote:
| Excellent, amazing article.
|
| To the author, if you're lurking here, I have a tangential
| question: how long did it take you to write this article, from the
| first line of code to the last line of this post?
|
| As someone who works in the GPGPU space, I can imagine myself
| writing an article of this sort, but the huge uncertainty around
| the time needed has deterred me so far.
| Const-me wrote:
| I wonder how the perf in tokens/second compares to my version of
| Mistral: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...
|
| BTW, see the section of the readme about quantization:
| https://github.com/Const-me/Cgml/tree/master?tab=readme-ov-f...
| diego898 wrote:
| This is great, thank you!
|
| Does anyone know of something similar in Python? I want to share
| with my team something like this that goes into (almost)
| everything (at least conceptually) needed to efficiently serve an
| LLM.
|
| It doesn't actually need to be performant, mind you (it's in
| Python); I just need something "conceptually complete" that is
| more tutorial-style and concise than the vLLM codebase.
___________________________________________________________________
(page generated 2024-12-15 23:00 UTC)