[HN Gopher] Compiling LLMs into a MegaKernel: A path to low-late...
       ___________________________________________________________________
        
       Compiling LLMs into a MegaKernel: A path to low-latency inference
        
       Author : matt_d
       Score  : 95 points
       Date   : 2025-06-19 19:20 UTC (3 hours ago)
        
 (HTM) web link (zhihaojia.medium.com)
 (TXT) w3m dump (zhihaojia.medium.com)
        
       | NitroPython wrote:
       | Ollama integration?
        
       | baq wrote:
       | Next step - compile straight to verilog so I can buy some LLMs on
       | aliexpress
        
         | bigcat12345678 wrote:
         | https://riscv.org/blog/2021/02/hardware-description-language...
          | That was one of the promising ideas before AI and GPUs came
          | on the scene. With CPUs stagnating, people naturally wanted
          | to further optimize the middle layers between software and
          | hardware.
          | 
          | But I suspect GPU-style parallel computing is going to
          | dominate accelerated computing.
          | 
          | General-purpose CPUs are going to stick around as the little
          | brain that orchestrates the GPUs.
          | 
          | The idea of compiling software directly to hardware might
          | never become mainstream.
        
           | baq wrote:
            | I'm thinking more of a pseudointellect over serial that you
            | attach a $3 ESP32 to. Since it's basically tokens in, tokens
            | out, let's just cut the unnecessary parts out. It's like
            | querying the cloud models, except it's your own silicon,
            | personally soldered to the ESP32, so nobody will break your
            | home assistant with a system prompt update or a fine-tuning
            | run.
        
       | scotty79 wrote:
       | > Traditional LLM systems often rely on sequences of GPU kernel
       | launches and external communication calls, resulting in
       | underutilized hardware.
       | 
       | What? Why? This seems like an obvious optimization if it's
       | possible.
        
         | shawntan wrote:
          | Systems might want to anticipate changes in LLM architectures
          | (even small changes can make a big difference kernel-wise), so
          | it's good not to "bake in" too much ahead of time.
          | 
          | That said, at some point it just depends on where the costs
          | lie, and it might make sense to hire some GPU engineers to do
          | what they did here for whatever architecture you're optimising
          | for.
         | 
         | Not as low-hanging as you might imagine.
        
         | catlifeonmars wrote:
         | From the article
         | 
         | > Despite these advantages, compiling an LLM into a megakernel
         | is highly challenging. Existing high-level ML frameworks --
         | such as PyTorch, Triton, and TVM -- do not natively support
         | end-to-end megakernel generation. Additionally, modern LLM
         | systems are built from a diverse collection of specialized
         | kernel libraries: NCCL or NVSHMEM for communication, FlashInfer
         | or FlashAttention for efficient attention, and CUDA or Triton
         | for custom computation. This fragmentation makes it difficult
         | to consolidate the entire inference pipeline into a single,
         | unified kernel.
         | 
         | So my naive assumption is that yes it is obvious, but
         | nontrivial.
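          | 
          | To make the fragmentation concrete: a single decode step today
          | is roughly a chain of separately launched kernels coming from
          | different libraries. A toy CUDA sketch of that shape (the
          | kernels below are empty stand-ins I made up, not the real
          | library entry points):
          | 
          |   #include <cuda_runtime.h>
          | 
          |   // Stand-ins for kernels that normally live in separate
          |   // libraries (attention, GEMM, communication, ...).
          |   __global__ void rmsnorm_stub(float* x, int n) {}
          |   __global__ void attention_stub(float* x, int n) {}
          |   __global__ void matmul_stub(float* x, int n) {}
          |   __global__ void allreduce_stub(float* x, int n) {}
          | 
          |   int main() {
          |     int n = 4096;
          |     float* x;
          |     cudaMalloc(&x, n * sizeof(float));
          |     cudaStream_t s;
          |     cudaStreamCreate(&s);
          |     // One decode step = many small, separately launched
          |     // kernels; each boundary is a full dependency barrier
          |     // and each launch pays CPU-side overhead.
          |     for (int step = 0; step < 8; ++step) {
          |       rmsnorm_stub<<<32, 256, 0, s>>>(x, n);
          |       attention_stub<<<32, 256, 0, s>>>(x, n);
          |       matmul_stub<<<32, 256, 0, s>>>(x, n);
          |       allreduce_stub<<<32, 256, 0, s>>>(x, n);  // NCCL
          |     }
          |     cudaStreamSynchronize(s);
          |     cudaFree(x);
          |     return 0;
          |   }
          | 
          | Collapsing all of that into one kernel means re-expressing
          | every one of those pieces inside a single launch, which is why
          | a compiler that does it automatically is the interesting part.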
        
           | saagarjha wrote:
           | Your naive assumption is the right one. It's quite hard to do
           | this. Even doing it automatically like it's done here runs
           | into problems with trying to figure out data dependencies and
           | synchronization across nontrivial computation.
        
         | liuliu wrote:
          | It really is not obvious. These launches are asynchronous, and
          | data movement and computation are already overlapped properly
          | through the CUDA APIs. Even the per-kernel launch cost was
          | reduced with the introduction of CUDA graphs.
          | 
          | The CUDA programming model relies on each kernel being
          | computationally expensive enough to make sense, and that is
          | not true for LLM token generation. We are also talking about
          | network evaluation at more than 1000 times per second, whereas
          | previously, outside of recommendation systems, the networks we
          | looked at were evaluated ~100 times per second at most.
          | 
          | Also, nobody remembers Alex Krizhevsky's "One Weird Trick"
          | paper, which slices the matmul into pieces to overlap
          | device-to-device transfers with computation. That was 10 years
          | ago.
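          | 
          | The slicing idea itself is simple to sketch with plain streams
          | and events (illustrative only; host-to-device copies here to
          | keep it to one GPU, and the kernel is an empty stand-in):
          | 
          |   #include <cuda_runtime.h>
          | 
          |   __global__ void gemm_slice_stub(const float* a, float* c,
          |                                   int n) {}
          | 
          |   int main() {
          |     const int kSlices = 4, n = 1 << 20;
          |     float *h_a, *d_a, *d_c;
          |     cudaMallocHost(&h_a, kSlices * n * sizeof(float));
          |     cudaMalloc(&d_a, kSlices * n * sizeof(float));
          |     cudaMalloc(&d_c, kSlices * n * sizeof(float));
          | 
          |     cudaStream_t copy_s, comp_s;
          |     cudaStreamCreate(&copy_s);
          |     cudaStreamCreate(&comp_s);
          |     cudaEvent_t ready[kSlices];
          | 
          |     for (int i = 0; i < kSlices; ++i) {
          |       cudaEventCreate(&ready[i]);
          |       // Copy slice i while earlier slices compute.
          |       cudaMemcpyAsync(d_a + i * n, h_a + i * n,
          |                       n * sizeof(float),
          |                       cudaMemcpyHostToDevice, copy_s);
          |       cudaEventRecord(ready[i], copy_s);
          |       // Start computing on slice i as soon as its copy
          |       // lands, without waiting for the other slices.
          |       cudaStreamWaitEvent(comp_s, ready[i], 0);
          |       gemm_slice_stub<<<64, 256, 0, comp_s>>>(
          |           d_a + i * n, d_c + i * n, n);
          |     }
          |     cudaDeviceSynchronize();
          |     return 0;
          |   }
          | 
          | The megakernel work pushes the same idea down to task
          | granularity inside a single launch.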
        
         | delusional wrote:
         | In the common case where the processor dispatching those kernel
         | calls is much faster than the kernel calls themselves, you're
         | not likely to see a meaningful increase in throughput.
         | 
         | What you need to do first is get really optimized kernels
         | (since that makes the dispatching relatively more expensive)
          | and THEN this becomes worth doing. People who are really good
          | at writing optimized GPU kernels are just not that easy to get
          | a hold of right now.
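          | 
          | A quick way to see when it matters: time an empty kernel
          | launched back to back (rough sketch; the number varies by GPU
          | and driver, but it's on the order of microseconds):
          | 
          |   #include <cstdio>
          |   #include <cuda_runtime.h>
          | 
          |   __global__ void empty_kernel() {}
          | 
          |   int main() {
          |     const int iters = 10000;
          |     empty_kernel<<<1, 1>>>();     // warm up
          |     cudaDeviceSynchronize();
          | 
          |     cudaEvent_t start, stop;
          |     cudaEventCreate(&start);
          |     cudaEventCreate(&stop);
          |     cudaEventRecord(start);
          |     for (int i = 0; i < iters; ++i)
          |       empty_kernel<<<1, 1>>>();
          |     cudaEventRecord(stop);
          |     cudaEventSynchronize(stop);
          | 
          |     float ms = 0.f;
          |     cudaEventElapsedTime(&ms, start, stop);
          |     // A few microseconds per launch: irrelevant next to a
          |     // 10 ms kernel, dominant next to a 10 us one.
          |     printf("%.2f us per launch\n", 1000.f * ms / iters);
          |     return 0;
          |   }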
        
       | bytepoet wrote:
       | This is very cool. I enjoyed going through the writeup and GitHub
       | README.
       | 
       | I was wondering if these same optimizations can be brought to
       | bear on training as well, rather than only inference. I guess the
       | challenge here is fusing backward computations with gradient
       | communication.
       | 
       | I also saw that this currently does not handle dynamic workloads
       | such as MoE. I recently came across this paper that does exactly
       | this:
       | 
       | FlashDMoE: Fast Distributed MoE in a Single Kernel -
       | https://arxiv.org/pdf/2506.04667
        
         | zhihaojia wrote:
          | Thanks for reading the post and GitHub README. Supporting
          | training is definitely feasible, but the benefit may not be as
          | significant as for low-latency inference, since training
          | generally involves much larger kernels, which makes kernel
          | launch overhead less of a bottleneck.
         | 
         | Thanks for sharing the FlashDMoE work. Our next step is to
         | support MoE models. Stay tuned!
        
       | liuliu wrote:
       | The Qwen 8B number, if verified, is very impressive. Much more
       | practical than the previous megakernel one.
       | 
        | That being said, this one-persistent-kernel-per-SM design reminds
        | me of Larrabee, and now I'm wondering what the world would look
        | like if we had just taken the traditional process-thread-SIMD
        | path rather than the CUDA path.
        
       | kp1197 wrote:
       | After working pretty closely with vLLM and SGLang over the past
        | few months, this is EXACTLY what I had envisioned a successor
        | project would look like - analyzing an operation dependency
        | graph and then fusing (or, at a minimum, scheduling tasks
        | smarter). Congrats to the team.
        
         | zhihaojia wrote:
         | Thanks a lot for your positive feedback! We believe that MPK
         | can enhance existing LLM serving systems, especially for low-
         | latency LLM serving. We are very excited about the opportunity
          | to collaborate with others in this direction.
        
       | skavi wrote:
       | Does anyone have an intuition on why this offers significant
        | gains over CUDA Graphs? The CPU launch cost of a graph is tiny,
       | which implies most of the work has been offloaded to the GPU's
       | own scheduler. I'd expect that some I/O marshalling at kernel
       | boundaries could be avoided with megakernels. Maybe some loop
       | fusion? Are there any more interesting optimizations they enable?
        
         | refulgentis wrote:
         | You've hit the nail on the head. The CPU launch cost of a pre-
         | compiled CUDA graph is _tiny._
         | 
         | CUDA Graphs are a huge step up from manually launching kernels,
         | but they still treat kernels as monolithic, black-box
         | operations. A megakernel erases the boundaries between those
         | operations.
         | 
         | With CUDA Graphs, as in the example in the article, if you have
         | Matmul -> AllReduce, the AllReduce kernel cannot start until
         | the entire Matmul kernel has finished. The dependency is at the
         | kernel level. With a megakernel, they break these ops into
         | fine-grained "tasks" scheduled across SMs. An AllReduce task
         | that needs data from the first slice of the Matmul can begin as
         | soon as that slice is computed by a few SMs, while other SMs
         | are still working on the rest of the Matmul. This fine-grained
         | software pipelining and compute/communication overlap is simply
         | not possible when the dependency unit is the entire kernel.
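          | 
          | In toy form, the dependency unit becomes a task plus a counter
          | rather than a kernel boundary. A very rough sketch of the idea
          | (not MPK's actual runtime; the scheduling and the task bodies
          | are elided):
          | 
          |   #include <cuda_runtime.h>
          | 
          |   // A task is one slice of one op, e.g. one tile of a
          |   // matmul or one chunk of an all-reduce.
          |   struct Task {
          |     int op;
          |     int slice;
          |     int num_deps;  // producer slices this task waits on
          |   };
          | 
          |   // Persistent "megakernel": every block loops, pulling
          |   // tasks and spin-waiting on per-task dependency
          |   // counters instead of on whole upstream kernels.
          |   __global__ void megakernel(const Task* tasks,
          |                              int num_tasks,
          |                              int* deps_done,
          |                              int* next_task) {
          |     __shared__ int t;
          |     for (;;) {
          |       if (threadIdx.x == 0) t = atomicAdd(next_task, 1);
          |       __syncthreads();
          |       if (t >= num_tasks) return;
          |       // Fine-grained wait: only this task's producer
          |       // slices, not the whole upstream kernel.
          |       if (threadIdx.x == 0)
          |         while (atomicAdd(&deps_done[t], 0) <
          |                tasks[t].num_deps) {}
          |       __syncthreads();
          |       // ... run tasks[t] here (matmul tile, all-reduce
          |       // chunk, ...), then bump deps_done for each
          |       // consumer of this slice (consumer lists elided).
          |       __syncthreads();
          |     }
          |   }
          | 
          |   int main() {
          |     // Trivial demo: 8 independent tasks, no deps.
          |     const int n = 8;
          |     Task h_tasks[n] = {};
          |     Task* d_tasks; int *d_deps, *d_next;
          |     cudaMalloc(&d_tasks, sizeof(h_tasks));
          |     cudaMalloc(&d_deps, n * sizeof(int));
          |     cudaMalloc(&d_next, sizeof(int));
          |     cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks),
          |                cudaMemcpyHostToDevice);
          |     cudaMemset(d_deps, 0, n * sizeof(int));
          |     cudaMemset(d_next, 0, sizeof(int));
          |     // Keep the grid small enough that all blocks stay
          |     // resident and can cooperate.
          |     megakernel<<<4, 128>>>(d_tasks, n, d_deps, d_next);
          |     return cudaDeviceSynchronize() != cudaSuccess;
          |   }
          | 
          | The point being that "AllReduce waits for Matmul" turns into
          | "this chunk of the all-reduce waits for these tiles of the
          | matmul", which is what lets the two overlap on different SMs.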
        
         | saagarjha wrote:
         | > The CPU launch cost of a graph is tiny
         | 
         | Absolutely not; it's comparable to the launch overhead of a
         | kernel.
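          | 
          | For what it's worth, this is easy to measure (rough sketch; it
          | compares the CPU-side enqueue cost of a plain launch against
          | cudaGraphLaunch, and backpressure from the launch queue will
          | blur the numbers a bit):
          | 
          |   #include <chrono>
          |   #include <cstdio>
          |   #include <cuda_runtime.h>
          | 
          |   __global__ void empty_kernel() {}
          | 
          |   int main() {
          |     cudaStream_t s;
          |     cudaStreamCreate(&s);
          |     empty_kernel<<<1, 1, 0, s>>>();   // warm up
          |     cudaStreamSynchronize(s);
          | 
          |     // Capture one empty kernel into a graph.
          |     cudaGraph_t g;
          |     cudaGraphExec_t exec;
          |     cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
          |     empty_kernel<<<1, 1, 0, s>>>();
          |     cudaStreamEndCapture(s, &g);
          |     cudaGraphInstantiateWithFlags(&exec, g, 0);
          |     cudaGraphLaunch(exec, s);         // warm up
          |     cudaStreamSynchronize(s);
          | 
          |     const int iters = 10000;
          |     using clk = std::chrono::steady_clock;
          | 
          |     auto t0 = clk::now();
          |     for (int i = 0; i < iters; ++i)
          |       empty_kernel<<<1, 1, 0, s>>>();
          |     auto t1 = clk::now();
          |     cudaStreamSynchronize(s);
          | 
          |     auto t2 = clk::now();
          |     for (int i = 0; i < iters; ++i)
          |       cudaGraphLaunch(exec, s);
          |     auto t3 = clk::now();
          |     cudaStreamSynchronize(s);
          | 
          |     auto us = [](clk::time_point a, clk::time_point b) {
          |       return std::chrono::duration<double,
          |                  std::micro>(b - a).count();
          |     };
          |     printf("kernel: %.2f us, graph: %.2f us per launch\n",
          |            us(t0, t1) / iters, us(t2, t3) / iters);
          |     return 0;
          |   }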
        
       | flakiness wrote:
       | This project is from CMU. Hazy Research at Stanford talked about
       | the megakernel too:
       | https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles
       | 
       | Good to see the competition in this area.
       | 
       | (Edited): Related paper covering the larger "mirage" project, but
       | this doesn't cover the "megakernel" approach:
       | https://arxiv.org/abs/2405.05751
        
         | zhihaojia wrote:
         | This is the writer of the blog post. You are right that
         | Stanford's work is a parallel effort. The main difference is
         | that our focus is on compilation: making it easier to generate
         | megakernels automatically.
        
       | olivia111 wrote:
       | really cool. would love to try it for our 3b model.
        
       | olivia111 wrote:
       | any detailed tutorial about how to use it?
        
         | zhihaojia wrote:
         | The github repo includes a tutorial for using MPK:
         | https://github.com/mirage-project/mirage/tree/mpk
        
       | fxtentacle wrote:
       | Isn't fusing ops at a fine-grained level also the core benefit of
       | JAX over TensorFlow? How does this work compare to JAX?
        
       | bdbenton5255 wrote:
       | Certainly an important discovery for utilizing these models on
       | scaled hardware. This approach could certainly be applied beyond
       | LLMs to other types of neural networks. That would be an
       | interesting space to explore.
        
       ___________________________________________________________________
       (page generated 2025-06-19 23:00 UTC)