[HN Gopher] PyTorch Helion
___________________________________________________________________
PyTorch Helion
Author : jarbus
Score : 126 points
Date : 2025-11-02 06:21 UTC (5 days ago)
(HTM) web link (pytorch.org)
(TXT) w3m dump (pytorch.org)
| dachworker wrote:
| I'm super excited to give this one a spin. It seems like a neat
| idea: Triton, but simpler and with automatic autotuning. My head
| is spinning with options right now. I love how everyone was
| hyping up CUDA this and CUDA that a couple of years ago, and now
| CUDA is all but irrelevant. There are now so many different and
| opinionated takes on how you should write high-performance
| accelerator cluster code. I love it.
|
| It's also kind of ironic that right now, in 2025, we have all
| this diversity in tooling, but at the same time, the ML
| architecture space has collapsed entirely and everyone is just
| using transformers.
| almostgotcaught wrote:
| > and now CUDA is all but irrelevant.
|
| Lol this is so wrong it's cringe.
|
| > There are now so many different and opinionated takes on how
| you should write high-performance accelerator cluster code. I
| love it.
|
| There are literally only 2: SIMT (ie the same as it always was)
| and tiles (ie Triton). That's it. Helion is just Triton with
| _more_ auto-tuning (Triton already has auto-tuning).
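|
| For anyone who hasn't seen the tile style, here is a minimal
| sketch of a Triton vector-add (following the public tutorial
| API). Note there is no per-thread indexing; each program
| instance owns a whole tile and the compiler decides how it
| maps onto threads:
|
|     import torch
|     import triton
|     import triton.language as tl
|
|     @triton.jit
|     def add_kernel(x_ptr, y_ptr, out_ptr, n_elements,
|                    BLOCK_SIZE: tl.constexpr):
|         # One program instance per tile of BLOCK_SIZE elements.
|         pid = tl.program_id(axis=0)
|         offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
|         mask = offsets < n_elements  # guard the ragged last tile
|         x = tl.load(x_ptr + offsets, mask=mask)
|         y = tl.load(y_ptr + offsets, mask=mask)
|         tl.store(out_ptr + offsets, x + y, mask=mask)
|
|     def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
|         out = torch.empty_like(x)
|         n = x.numel()
|         grid = (triton.cdiv(n, 1024),)
|         add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
|         return out
|
| In the SIMT model you would instead write the body from the
| point of view of a single thread.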
| the__alchemist wrote:
| Even for non-ML things like chem simulations: CUDA (and
| cuFFT) are more pleasant to use than Vulkan Compute and
| vkFFT.
| ozgrakkurt wrote:
| I just learned the graphics API of Vulkan; I can't imagine
| anything being less pleasant than Vulkan.
| porridgeraisin wrote:
| Yeah it's quite something. If anyone wants a preview, here's
| the triangle hello world in Vulkan:
| https://gist.github.com/Overv/7ac07356037592a121225172d7d78f...
|
| But then again, I've heard that it's this low-level because
| it's meant for engine developers.
| simlevesque wrote:
| Oh wow that's horrible.
| anvuong wrote:
| Really? How low-level is it actually? Because I remember my
| OpenGL class's professor doing this in less than 50 lines.
| porridgeraisin wrote:
| Imagine writing GlCreateContext yourself, for starters,
| as has been done in the link I posted.
| embedding-shape wrote:
| > CUDA that a couple of years ago, and now CUDA is all but
| irrelevant
|
| What? CUDA won't be irrelevant for years even if all the
| competitors figure out the holy grail; the ecosystem doesn't
| suddenly migrate overnight. People learning CUDA today will
| continue to find jobs and opportunities across the sector
| for the near future without any worries.
|
| > but at the same time, the ML architecture space has collapsed
| entirely and everyone is just using transformers.
|
| That's also not true; the ML space is still growing, and there
| is plenty happening outside of transformers, but it requires
| you to actually look and pay attention, not just browse the HN
| and r/localllama frontpages.
|
| Overall, these do not seem to be the sentiments of someone
| inside the ML space, but rather an onlooker's perspective.
| pjmlp wrote:
| In what alternative reality is that the case?
| brap wrote:
| Asking as someone who is really out of the loop: how much of ML
| development these days touches these "lower level" parts of the
| stack? I'd expect that by now most of the work would be high
| level, and the infra would be mostly commoditized.
| brrrrrm wrote:
| a recent wave of interest in bitwise-equivalent execution got a
| lot of kernels at this level pumped out.
|
| new attention mechanisms also often need new kernels to run at
| any reasonable rate
|
| there's definitely a breed of frontend-only ML dev that
| dominates the space, but a lot of novel exploration needs new
| kernels
| embedding-shape wrote:
| > how much of ML development these days touches these "lower
| level" parts of the stack? I'd expect that by now most of the
| work would be high level
|
| Every time the high-level architecture of models changes, there
| are new lower-level optimizations to be done. Even recent
| releases like GPT-OSS add new areas for improvement, like
| MXFP4, which require the lower-level parts to be created and
| optimized.
| westurner wrote:
| How often do hardware optimizations get created for lower
| level optimization of LLMs and Tensor physics? How
| reconfigurable are TPUs? Are there any standardized feature
| flags for TPUs yet?
|
| Is TOPS/Whr a good efficiency metric for TPUs and for LLM
| model hosting operations?
|
| From https://news.ycombinator.com/item?id=45775181 re:
| current TPUs in 2025; "AI accelerators" :
|
| > _How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM
| compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU,
| Tenstorrent's, and AMD's NPU designs?_
| anvuong wrote:
| There are some not-so-niche communities, like the
| FlashAttention and LinearFlashAttention repos, where new
| code/optimizations get committed on a weekly basis. They find a
| couple of percent here and there all the time. How useful their
| kernels actually are in terms of producing good results remains
| to be seen, but their implementations are often much better (in
| FLOPS) than what was proposed in the original papers.
|
| It's just like game optimization: cache-friendliness and
| memory-hierarchy awareness are huge in attention mechanisms.
| But programming the backward pass in these lower-level stacks
| is definitely not fun; tensor calculus breaks my brain.
| markush_ wrote:
| Interesting choice from PyTorch to release yet another DSL. On
| the positive side, it's one more point in the design space; on
| the other hand, it makes it even more difficult to choose the
| right technology among Triton, Gluon, CuTe, ThunderKittens and
| a few others.
| bobajeff wrote:
| It's good to see more effort toward making things not device-
| specific, but I only see benchmarks for NVIDIA B200 and AMD
| MI350X. Also, what's the experience of using one of these
| Python DSLs like? Are the tools good enough to make code
| completion, jump-to-definition, setting breakpoints, watching
| variables, copying as an expression, etc. nice?
| saagarjha wrote:
| Generally you are unlikely to get Python-level debugging for
| code that is going to run on GPUs.
| giovannibonetti wrote:
| That's Mojo's selling point.
|
| https://www.modular.com/mojo
| doctorpangloss wrote:
| Is contributing to Triton so bad? It looks like the blocker is
| usually LLVM.
| saagarjha wrote:
| It's not that bad, but I'm not sure why this is relevant?
| uoaei wrote:
| Tangential question related to the example kernel: in GPU
| programming, is it idiomatic/standard to initialize the out
| array as zeros rather than empty? Are the performance savings
| negligible?
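|
| (For reference, the example kernel in question looks roughly
| like this; a sketch reconstructed from the two lines quoted
| downthread and the Helion announcement, so treat the details as
| approximate:)
|
|     import torch
|     import helion
|     import helion.language as hl
|
|     @helion.kernel()
|     def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
|         m, k = x.size()
|         k2, n = y.size()
|         out = torch.empty([m, n], dtype=x.dtype, device=x.device)
|         # tile sizes are left to Helion's autotuner
|         for tile_m, tile_n in hl.tile([m, n]):
|             acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
|             for tile_k in hl.tile(k):
|                 acc = torch.addmm(acc, x[tile_m, tile_k],
|                                   y[tile_k, tile_n])
|             out[tile_m, tile_n] = acc
|         return out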
| saagarjha wrote:
| It saves a kernel launch and memory bandwidth for a fill
| kernel. If you're going to overwrite the data anyway, why
| bother?
| porridgeraisin wrote:
| They made it empty only:
|
| >> out = torch.empty([m, n], dtype=x.dtype, device=x.device)
|
| The accumulator has been initialized to zero since, well, they
| have to add stuff into it:
|
| >> acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
|
| > idiomatic
|
| No; as far as I have seen, they generally try not to
| initialize when it's not necessary.
|
| > overhead
|
| There is the memory bandwidth point, as you might expect. But
| additionally, when using high-level interfaces like PyTorch,
| writing torch.zeros(512, 512) launches a whole kernel (tens of
| microseconds) just for that line. So that's CPU -> GPU -> back
| to CPU, and then it does the next line, where it goes to the
| GPU again and uses that memory. So in these cases you make sure
| to avoid it if it's in a hot path. Ideally you want the second
| kernel to do the initialization itself; when you write CUDA C++
| yourself, this is how you typically do it. Helion, being a
| compiler, might be doing this optimization, but runtime-based
| torch clearly can't.
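|
| A minimal sketch of the point being made (assuming a CUDA
| device; exact latencies vary):
|
|     import torch
|
|     a = torch.randn(512, 512, device="cuda")
|     b = torch.randn(512, 512, device="cuda")
|
|     # zeros allocates AND launches a small fill kernel:
|     out = torch.zeros(512, 512, device="cuda")
|     # empty only allocates; contents are garbage, so this is
|     # safe only because the matmul below overwrites every
|     # element, making the fill kernel (and its launch latency)
|     # wasted work:
|     out = torch.empty(512, 512, device="cuda")
|     torch.matmul(a, b, out=out)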
| darknoon wrote:
| The developers also gave a talk about Helion on GPU Mode:
| https://www.youtube.com/watch?v=1zKvCLuvUYc
| bwfan123 wrote:
| I don't get the point of Helion as compared to its alternatives
| like Gluon.
|
| For best performance, I would presume one needs low-level
| access to hardware knobs. And these kernel primitives are
| written once and reused. So what is the point of a DSL that
| dumbs things down as a wrapper around Triton?
| krapht wrote:
| Funny, I feel the same way about Triton. Performant Triton
| looks like CUDA (but with tiles!) except it's ten times harder
| to debug since it doesn't have the tooling NVIDIA provides.
|
| If I had to run on AMD I'd rather deal with their hipify
| tooling.
| saagarjha wrote:
| Performant Triton programs are usually simpler and shorter
| than their CUDA equivalents. This alone makes it easier to
| write, and I would argue that it helps with debugging too
| because the model provides a lot more guarantees on how your
| code executes. That said, some of the tooling is notably poor
| (such as cuda-gdb support).
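|
| A concrete instance of "simpler and shorter": a row softmax in
| Triton (a sketch in the style of the official fused-softmax
| tutorial). The two reductions below would be shared-memory or
| warp-shuffle code in CUDA; here they are one-liners:
|
|     import triton
|     import triton.language as tl
|
|     @triton.jit
|     def softmax_kernel(out_ptr, in_ptr, stride, n_cols,
|                        BLOCK_SIZE: tl.constexpr):
|         # one program instance per row
|         row = tl.program_id(0)
|         offs = tl.arange(0, BLOCK_SIZE)
|         mask = offs < n_cols
|         x = tl.load(in_ptr + row * stride + offs, mask=mask,
|                     other=-float("inf"))
|         x = x - tl.max(x, axis=0)   # block-wide max reduction
|         num = tl.exp(x)
|         den = tl.sum(num, axis=0)   # block-wide sum reduction
|         tl.store(out_ptr + row * stride + offs, num / den,
|                  mask=mask)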
| krapht wrote:
| Agree on shorter, disagree on simpler. The hard part of
| understanding GPU code is knowing the reasons why
| algorithms are the way they are. For example, why we do a
| split-k decomposition when doing a matrix multiplication,
| or why are we loading this particular data into shared
| memory at this particular time, with some overlapping
| subset into registers.
|
| Getting rid of the for loop over an array index doesn't
| make it easier to understand the hard parts. Losing the
| developer perf and debug tooling is absolutely not worth
| the tradeoff.
|
| For me, I'd rather deal with JAX or Numba, and if that still
| wasn't enough, I would jump straight to CUDA.
|
| It's possible I'm an old fogey with bias, though. It's true
| that I've spent a lot more time with CUDA than with the new
| DSLs on the block.
| saagarjha wrote:
| I don't think it is possible to write high performance
| code without understanding how the hardware works. I just
| think staring at code that coalesces your loads or
| swizzles your layouts for the hundredth time is a waste
| of screen space, though. Just let the compiler do it and
| when it gets it wrong then you can bust out the explicit
| code you were going to write in CUDA, anyway.
| singularity2001 wrote:
| Anything, as long as I don't have to touch proprietary CUDA and
| mpx
| saagarjha wrote:
| You'll need an execution backend.
| jarbus wrote:
| I posted this 5 days ago, how did this resurface?
| koolba wrote:
| https://news.ycombinator.com/item?id=26998308
| mshockwave wrote:
| Is it normal to spend 10 minutes on tuning nowadays? Do we need
| to spend another 10 minutes upon changing the code?
| anvuong wrote:
| You mean autotune? I think 10 minutes is pretty normal,
| torch.compile('max-autotune') can be much slower than that for
| large models.
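|
| For example (real API; the cost figures are illustrative only):
|
|     import torch
|
|     def f(x, w):
|         return torch.relu(x @ w)
|
|     # The first call triggers compilation plus autotuning over
|     # many candidate kernel configs, which can take minutes on
|     # big graphs; subsequent calls reuse the tuned result.
|     compiled = torch.compile(f, mode="max-autotune")
|     x = torch.randn(4096, 4096, device="cuda")
|     w = torch.randn(4096, 4096, device="cuda")
|     y = compiled(x, w)  # slow first call, fast afterwards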
| Mars008 wrote:
| Add to that, it can be done only once by developers, before
| distribution, for each major hardware target. Configs are
| saved, then selected on the client side.
| ballpug wrote:
| Compiling a kernel after assemblage in low-level object oriented
| languages either uses stable kernel or the cargo fuzzed
| raw_spinlock code.
|
| Helion abstracts syntax and design for calculating l-functions,
| which converts language in a kernel config.
| a-dub wrote:
| numba for gpu kernels... cool!
___________________________________________________________________
(page generated 2025-11-07 23:00 UTC)