[HN Gopher] Compiling LLMs into a MegaKernel: A path to low-late...
___________________________________________________________________
Compiling LLMs into a MegaKernel: A path to low-latency inference
Author : matt_d
Score : 95 points
Date : 2025-06-19 19:20 UTC (3 hours ago)
(HTM) web link (zhihaojia.medium.com)
(TXT) w3m dump (zhihaojia.medium.com)
| NitroPython wrote:
| Ollama integration?
| baq wrote:
| Next step - compile straight to verilog so I can buy some LLMs on
| aliexpress
| bigcat12345678 wrote:
| https://riscv.org/blog/2021/02/hardware-description-language...
| That was one of the promising ideas before AI & GPUs came onto
| the scene. With CPUs stagnant, people naturally wanted to
| further optimize the middle layers between software and
| hardware.
|
| But I suspect GPU-style parallel computing is going to dominate
| accelerated computing.
|
| General-purpose CPUs are going to stick around as the little
| brain that orchestrates GPUs.
|
| The idea of compiling software directly to hardware might never
| become mainstream.
| baq wrote:
| I'm thinking more like pseudointellect over serial to attach
| a $3 esp32 to. Since it's basically tokens in, tokens out,
| let's just cut the unnecessary parts out. It's like querying
| the cloud models, except it's your silicon you personally
| soldered to the esp so nobody will break your home assistant
| with a system prompt update or a fine tuning run.
| scotty79 wrote:
| > Traditional LLM systems often rely on sequences of GPU kernel
| launches and external communication calls, resulting in
| underutilized hardware.
|
| What? Why? This seems like an obvious optimization if it's
| possible.
| shawntan wrote:
| Systems might want to anticipate changes in LLM architectures
| (even small changes can make a big difference kernel-wise), so
| it's good not to "bake" too much in ahead of time.
|
| That said, at some point it just depends on where the costs
| lie, and it might make sense to hire some GPU engineers to do
| what they did here for whatever architecture you're optimising
| for.
|
| Not as low-hanging as you might imagine.
| catlifeonmars wrote:
| From the article
|
| > Despite these advantages, compiling an LLM into a megakernel
| is highly challenging. Existing high-level ML frameworks --
| such as PyTorch, Triton, and TVM -- do not natively support
| end-to-end megakernel generation. Additionally, modern LLM
| systems are built from a diverse collection of specialized
| kernel libraries: NCCL or NVSHMEM for communication, FlashInfer
| or FlashAttention for efficient attention, and CUDA or Triton
| for custom computation. This fragmentation makes it difficult
| to consolidate the entire inference pipeline into a single,
| unified kernel.
|
| So my naive assumption is that yes it is obvious, but
| nontrivial.
| saagarjha wrote:
| Your naive assumption is the right one. It's quite hard to do
| this. Even doing it automatically like it's done here runs
| into problems with trying to figure out data dependencies and
| synchronization across nontrivial computation.
| liuliu wrote:
| It really is not obvious. These launches are asynchronous, and
| data movement and computation are already overlapped properly
| through the CUDA APIs. Even the per-kernel launch cost has been
| reduced with the introduction of CUDA graphs.
|
| The CUDA programming model relies on each kernel being
| computationally expensive enough to make sense, and that is not
| true for LLM token generation. And we are talking about network
| evaluation at more than 1000 times per second, whereas
| previously, outside of recommendation systems, the network
| evaluation rates we were looking at were ~100 per second at
| most.
|
| Also, nobody remembers Alex's "One Weird Trick" paper, which
| slices the matmul into pieces to overlap device-to-device
| transfers with computation. That was 10 years ago.
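|
| The pattern from that paper looks roughly like this (a sketch
| with made-up names, matmul_slice / w_local / w_remote /
| num_slices, not the paper's actual code): slice the weights and
| let the copy of the next slice overlap the compute on the
| current one via two streams.
|
|     // Copy stream stages weight slices from a peer device while
|     // the compute stream works on slices that have already arrived.
|     cudaStream_t copy_stream, compute_stream;
|     cudaStreamCreate(&copy_stream);
|     cudaStreamCreate(&compute_stream);
|
|     for (int i = 0; i < num_slices; ++i) {
|         // Stage slice i from the remote device, asynchronously.
|         cudaMemcpyPeerAsync(w_local[i], local_dev,
|                             w_remote[i], remote_dev,
|                             slice_bytes, copy_stream);
|
|         // Compute on slice i may only start once its copy is done.
|         cudaEvent_t ready;
|         cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
|         cudaEventRecord(ready, copy_stream);
|         cudaStreamWaitEvent(compute_stream, ready, 0);
|         cudaEventDestroy(ready);  // freed once the event completes
|
|         matmul_slice<<<grid, block, 0, compute_stream>>>(
|             w_local[i], x, y, i);
|     }
|     cudaStreamSynchronize(compute_stream);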
| delusional wrote:
| In the common case where the processor dispatching those kernel
| calls is much faster than the kernel calls themselves, you're
| not likely to see a meaningful increase in throughput.
|
| What you need to do first is get really optimized kernels
| (since that makes the dispatching relatively more expensive),
| and THEN this becomes worth doing. People who are really good
| at writing optimized GPU kernels are just not that easy to get
| hold of right now.
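|
| A quick way to check which regime you are in is to time the
| host-side launch loop separately from the device-side execution
| (a sketch; tiny_kernel is just a stand-in for one of your real
| ops):
|
|     #include <chrono>
|     #include <cstdio>
|
|     __global__ void tiny_kernel(float* x) { x[threadIdx.x] += 1.0f; }
|
|     int main() {
|         float* x;
|         cudaMalloc(&x, 256 * sizeof(float));
|         cudaEvent_t start, stop;
|         cudaEventCreate(&start);
|         cudaEventCreate(&stop);
|
|         auto t0 = std::chrono::steady_clock::now();
|         cudaEventRecord(start);
|         for (int i = 0; i < 1000; ++i)
|             tiny_kernel<<<1, 256>>>(x);          // async launches
|         auto t1 = std::chrono::steady_clock::now();  // CPU-side cost only
|         cudaEventRecord(stop);
|         cudaEventSynchronize(stop);
|
|         float gpu_ms = 0.0f;
|         cudaEventElapsedTime(&gpu_ms, start, stop);
|         double cpu_ms =
|             std::chrono::duration<double, std::milli>(t1 - t0).count();
|         printf("launch (CPU): %.3f ms, execute (GPU): %.3f ms\n",
|                cpu_ms, gpu_ms);
|         return 0;
|     }
|
| If the CPU number is a small fraction of the GPU number, fusing
| launches buys you little; if they are comparable, it can buy a
| lot.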
| bytepoet wrote:
| This is very cool. I enjoyed going through the writeup and GitHub
| README.
|
| I was wondering if these same optimizations can be brought to
| bear on training as well, rather than only inference. I guess the
| challenge here is fusing backward computations with gradient
| communication.
|
| I also saw that this currently does not handle dynamic workloads
| such as MoE. I recently came across this paper that does exactly
| this:
|
| FlashDMoE: Fast Distributed MoE in a Single Kernel -
| https://arxiv.org/pdf/2506.04667
| zhihaojia wrote:
| Thanks for reading the post and the GitHub README. Supporting
| training is definitely feasible, but the benefit may not be as
| significant as for low-latency inference, since training
| generally involves much larger kernels, making kernel launch
| overhead less significant.
|
| Thanks for sharing the FlashDMoE work. Our next step is to
| support MoE models. Stay tuned!
| liuliu wrote:
| The Qwen 8B number, if verified, is very impressive. Much more
| practical than the previous megakernel one.
|
| That being said, this one persistent kernel per SM reminds me
| of Larrabee, and now I'm wondering what the world would look
| like if we had taken the traditional process-thread-SIMD path
| rather than the CUDA path.
| kp1197 wrote:
| After working pretty closely with vLLM and SGLang over the past
| few months, this is EXACTLY what I had envisioned a successor
| project would look like: analyzing an operation dependency
| graph and then fusing ops (or, at a minimum, scheduling tasks
| more intelligently). Congrats to the team.
| zhihaojia wrote:
| Thanks a lot for your positive feedback! We believe that MPK
| can enhance existing LLM serving systems, especially for low-
| latency LLM serving. We are very excited about the opportunity
| to collaborate with others in this direction.
| skavi wrote:
| Does anyone have an intuition on why this offers significant
| gains over CUDA Graphs? The CPU launch cost of a graph is tiny,
| which implies most of the work has been offloaded to the GPU's
| own scheduler. I'd expect that some I/O marshalling at kernel
| boundaries could be avoided with megakernels. Maybe some loop
| fusion? Are there any more interesting optimizations they enable?
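|
| For reference, the cheap replay I mean is something like this
| (a sketch; kernel_a / kernel_b stand in for the per-layer ops):
| capture the launch sequence once, then re-issue the whole thing
| with a single call per decoding step.
|
|     cudaGraph_t graph;
|     cudaGraphExec_t exec;
|     cudaStream_t s;
|     cudaStreamCreate(&s);
|
|     // Record the fixed sequence of launches once.
|     cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
|     kernel_a<<<grid, block, 0, s>>>(buf);
|     kernel_b<<<grid, block, 0, s>>>(buf);
|     cudaStreamEndCapture(s, &graph);
|     cudaGraphInstantiateWithFlags(&exec, graph, 0);
|
|     // Replay: one cheap CPU call per step instead of N launches.
|     for (int step = 0; step < num_tokens; ++step) {
|         cudaGraphLaunch(exec, s);
|         cudaStreamSynchronize(s);
|     }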
| refulgentis wrote:
| You've hit the nail on the head. The CPU launch cost of a pre-
| compiled CUDA graph is _tiny._
|
| CUDA Graphs are a huge step up from manually launching kernels,
| but they still treat kernels as monolithic, black-box
| operations. A megakernel erases the boundaries between those
| operations.
|
| With CUDA Graphs, as in the example in the article, if you have
| Matmul -> AllReduce, the AllReduce kernel cannot start until
| the entire Matmul kernel has finished. The dependency is at the
| kernel level. With a megakernel, they break these ops into
| fine-grained "tasks" scheduled across SMs. An AllReduce task
| that needs data from the first slice of the Matmul can begin as
| soon as that slice is computed by a few SMs, while other SMs
| are still working on the rest of the Matmul. This fine-grained
| software pipelining and compute/communication overlap is simply
| not possible when the dependency unit is the entire kernel.
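|
| The general shape of that scheduling, as a sketch (this is the
| generic persistent-kernel idea, not MPK's actual implementation;
| Task, run_op, and the counters are made up here):
|
|     struct Task {
|         int  op;            // e.g. MATMUL_TILE, ALLREDUCE_CHUNK, ...
|         int  num_deps;      // producer tasks that must finish first
|         int* done_counter;  // bumped by producers as they complete
|         void* args;
|     };
|
|     __device__ void run_op(const Task& t) { /* dispatch on t.op */ }
|
|     // Launched with one persistent block per SM, e.g.
|     // megakernel<<<num_sms, 256>>>(tasks, num_tasks, next);
|     __global__ void megakernel(Task* tasks, int num_tasks, int* next) {
|         __shared__ int my_task;
|         while (true) {
|             if (threadIdx.x == 0)
|                 my_task = atomicAdd(next, 1);   // grab the next task
|             __syncthreads();
|             if (my_task >= num_tasks) return;
|
|             Task t = tasks[my_task];
|             // Wait until every producer of this task has finished.
|             if (threadIdx.x == 0)
|                 while (atomicAdd(t.done_counter, 0) < t.num_deps) { }
|             __syncthreads();
|
|             run_op(t);          // e.g. one tile of a matmul
|             __threadfence();    // make results visible to other SMs
|             // Omitted: bump the done_counter of each consumer task.
|         }
|     }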
| saagarjha wrote:
| > The CPU launch cost of a graph is tiny
|
| Absolutely not; it's comparable to the launch overhead of a
| kernel.
| flakiness wrote:
| This project is from CMU. Hazy Research at Stanford talked about
| the megakernel too:
| https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles
|
| Good to see the competition in this area.
|
| (Edited) Related paper covering the larger "Mirage" project,
| though it doesn't cover the "megakernel" approach:
| https://arxiv.org/abs/2405.05751
| zhihaojia wrote:
| This is the writer of the blog post. You are right that
| Stanford's work is a parallel effort. The main difference is
| that our focus is on compilation: making it easier to generate
| megakernels automatically.
| olivia111 wrote:
| really cool. would love to try it for our 3b model.
| olivia111 wrote:
| any detailed tutorial about how to use it?
| zhihaojia wrote:
| The github repo includes a tutorial for using MPK:
| https://github.com/mirage-project/mirage/tree/mpk
| fxtentacle wrote:
| Isn't fusing ops at a fine-grained level also the core benefit of
| JAX over TensorFlow? How does this work compare to JAX?
| bdbenton5255 wrote:
| Certainly an important development for utilizing these models
| on scaled hardware. This approach could be applied beyond LLMs
| to other types of neural networks. That would be an interesting
| space to explore.
___________________________________________________________________
(page generated 2025-06-19 23:00 UTC)