[HN Gopher] PyTorch Helion
       ___________________________________________________________________
        
       PyTorch Helion
        
       Author : jarbus
       Score  : 126 points
       Date   : 2025-11-02 06:21 UTC (5 days ago)
        
 (HTM) web link (pytorch.org)
 (TXT) w3m dump (pytorch.org)
        
       | dachworker wrote:
        | I'm super excited to give this one a spin. It seems like a
        | neat idea: Triton, but simpler and with automatic autotuning.
        | My head is spinning with options right now. I love how
        | everyone was hyping up CUDA this and CUDA that a couple of
        | years ago, and now CUDA is all but irrelevant. There are now
        | so many different and opinionated takes on how you should
        | write high-performance accelerator cluster code. I love it.
       | 
        | It's also kind of ironic that right now, in 2025, we have all
        | this diversity in tooling, but at the same time the ML
        | architecture space has collapsed entirely and everyone is
        | just using transformers.
        
         | almostgotcaught wrote:
         | > and now CUDA is all but irrelevant.
         | 
         | Lol this is so wrong it's cringe.
         | 
         | > There's now so many different and opinionated takes on how
         | you should write high performant accelerator cluster code. I
         | love it.
         | 
         | There are literally only 2: SIMT (ie the same as it always was)
         | and tiles (ie Triton). That's it. Helion is just Triton with
         | _more_ auto-tuning (Triton already has auto-tuning).
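          | 
          | For context, Triton's built-in autotuner already looks
          | roughly like this (a sketch based on the standard
          | vector-add tutorial; the configs and the key below are
          | illustrative, not tuned for any real hardware):
          | 
          |     import torch
          |     import triton
          |     import triton.language as tl
          | 
          |     # Benchmark each config on first call and cache the
          |     # winner per value of the "key" arguments.
          |     @triton.autotune(
          |         configs=[
          |             triton.Config({"BLOCK": 256}, num_warps=4),
          |             triton.Config({"BLOCK": 1024}, num_warps=8),
          |         ],
          |         key=["n"],
          |     )
          |     @triton.jit
          |     def add_kernel(x_ptr, y_ptr, out_ptr, n,
          |                    BLOCK: tl.constexpr):
          |         pid = tl.program_id(axis=0)
          |         offs = pid * BLOCK + tl.arange(0, BLOCK)
          |         mask = offs < n
          |         x = tl.load(x_ptr + offs, mask=mask)
          |         y = tl.load(y_ptr + offs, mask=mask)
          |         tl.store(out_ptr + offs, x + y, mask=mask)
          | 
          |     def add(x, y):
          |         out = torch.empty_like(x)
          |         n = out.numel()
          |         grid = lambda m: (triton.cdiv(n, m["BLOCK"]),)
          |         add_kernel[grid](x, y, out, n)
          |         return out
          | 
          | The difference, per the post and the parent, is that Helion
          | searches a much larger config space automatically instead
          | of having you enumerate the configs by hand.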
        
           | the__alchemist wrote:
           | Even for non-ML things like chem simulations: CUDA (and
           | cuFFT) are more pleasant to use than Vulkan Compute and
           | vkFFT.
        
             | ozgrakkurt wrote:
              | I just learned the graphics API of Vulkan; I can't
              | imagine anything being less pleasant than Vulkan.
        
               | porridgeraisin wrote:
                | Yeah, it's quite something. If anyone wants a preview,
                | here's the triangle hello world in Vulkan:
                | https://gist.github.com/Overv/7ac07356037592a121225172d7d78f...
                | 
                | But then again, I've heard that it's this low-level
                | because it's meant for engine developers.
        
               | simlevesque wrote:
               | Oh wow that's horrible.
        
               | anvuong wrote:
                | Really? How low-level is this, actually? I remember my
                | OpenGL class's professor doing this in less than 50
                | lines.
        
               | porridgeraisin wrote:
               | Imagine writing GlCreateContext yourself, for starters,
               | as has been done in the link I posted.
        
         | embedding-shape wrote:
         | > CUDA that a couple of years ago, and now CUDA is all but
         | irrelevant
         | 
          | What? CUDA won't be irrelevant for years, even if all the
          | competitors figure out the holy grail; the ecosystem
          | doesn't suddenly migrate overnight. People learning CUDA
          | today will continue to find jobs and opportunities across
          | the sector for the near future without any worries.
         | 
         | > but at the same time, the ML architecture space has collapsed
         | entirely and everyone is just using transformers.
         | 
          | That's also not true. The ML space is still growing, and
          | there is plenty happening outside of transformers, but it
          | requires you to actually look and pay attention, not just
          | browse the HN and r/localllama front pages.
         | 
          | Overall, these don't seem like the sentiments of someone
          | inside the ML space, but rather an onlooker's perspective.
        
         | pjmlp wrote:
         | In what alternative reality is that the case?
        
       | brap wrote:
       | Asking as someone who is really out of the loop: how much of ML
       | development these days touches these "lower level" parts of the
       | stack? I'd expect that by now most of the work would be high
       | level, and the infra would be mostly commoditized.
        
         | brrrrrm wrote:
          | a recent wave of interest in bitwise-equivalent execution
          | had a lot of kernels at this level getting pumped out.
         | 
         | new attention mechanisms also often need new kernels to run at
         | any reasonable rate
         | 
          | there's definitely a breed of frontend-only ML dev that
          | dominates the space, but a lot of novel exploration needs
          | new kernels
        
         | embedding-shape wrote:
         | > how much of ML development these days touches these "lower
         | level" parts of the stack? I'd expect that by now most of the
         | work would be high level
         | 
          | Every time the high-level architecture of models changes,
          | there are new lower-level optimizations to be done. Even
          | recent releases like GPT-OSS add new areas for improvement,
          | like MXFP4, which requires the lower-level parts to be
          | created and optimized.
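          | 
          | As a rough illustration of why: something MXFP4-shaped can
          | be emulated in eager PyTorch (a sketch below, assuming
          | 32-element blocks with a shared power-of-two scale and FP4
          | E2M1 element values), but a naive emulation like this is
          | nowhere near fast enough, which is why new fused low-level
          | kernels get written for each new format.
          | 
          |     import torch
          | 
          |     # FP4 (E2M1) magnitudes and their negatives.
          |     FP4 = torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.])
          |     FP4 = torch.cat([FP4, -FP4])
          | 
          |     def mxfp4_roundtrip(w, block=32):
          |         # Quantize + dequantize; numel must divide by block.
          |         shape = w.shape
          |         w = w.reshape(-1, block)
          |         # Shared per-block power-of-two scale so the largest
          |         # element lands inside FP4's +/-6 range.
          |         amax = w.abs().amax(dim=1, keepdim=True)
          |         amax = amax.clamp(min=1e-12)
          |         scale = torch.exp2(torch.ceil(torch.log2(amax / 6)))
          |         # Snap each scaled element to the nearest FP4 value.
          |         scaled = (w / scale).unsqueeze(-1)
          |         idx = (scaled - FP4).abs().argmin(dim=-1)
          |         return (FP4[idx] * scale).reshape(shape)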
        
           | westurner wrote:
           | How often do hardware optimizations get created for lower
           | level optimization of LLMs and Tensor physics? How
           | reconfigurable are TPUs? Are there any standardized feature
           | flags for TPUs yet?
           | 
           | Is TOPS/Whr a good efficiency metric for TPUs and for LLM
           | model hosting operations?
           | 
           | From https://news.ycombinator.com/item?id=45775181 re:
           | current TPUs in 2025; "AI accelerators" :
           | 
           | > _How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM
           | compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU,
           | Tenstorrent's, and AMD's NPU designs?_
        
         | anvuong wrote:
          | There are some not-so-niche communities, like the
          | FlashAttention and LinearFlashAttention repos. New
          | code/optimizations get committed on a weekly basis. They
          | find a couple of percent here and there all the time. How
          | useful their kernels actually are in terms of producing
          | good results remains to be seen, but their implementations
          | are often much better (in FLOPS) than what was proposed in
          | the original papers.
         | 
          | It's just like game optimization: cache-friendliness and
          | memory-hierarchy awareness are huge in attention
          | mechanisms. But programming the backward pass in these
          | lower-level stacks is definitely not fun; tensor calculus
          | breaks my brain.
        
       | markush_ wrote:
        | Interesting choice from PyTorch to release yet another DSL.
        | On the positive side, it's one more point in the design
        | space; on the other hand, it makes it even more difficult to
        | choose the right technology among Triton, Gluon, CuTe,
        | ThunderKittens, and a few others.
        
       | bobajeff wrote:
        | It's good to see more effort toward making things not
        | device-specific, but I only see benchmarks for NVIDIA B200
        | and AMD MI350X. Also, what is the experience of using one of
        | these Python DSLs like? Are the tools good enough to make
        | code completion, jump to definition, setting breakpoints,
        | watching variables, copying as expression, etc. nice?
        
         | saagarjha wrote:
         | Generally you are unlikely to get Python-level debugging for
         | code that is going to run on GPUs.
        
           | giovannibonetti wrote:
           | That's Mojo's selling point.
           | 
           | https://www.modular.com/mojo
        
       | doctorpangloss wrote:
       | Is contributing to Triton so bad? It looks like the blocker is
       | usually LLVM.
        
         | saagarjha wrote:
         | It's not that bad, but I'm not sure why this is relevant?
        
       | uoaei wrote:
        | Tangential question related to the example kernel: in GPU
        | programming, is it idiomatic/standard to initialize the out
        | array as zeros rather than empty? Are the performance savings
        | negligible?
        
         | saagarjha wrote:
         | It saves a kernel launch and memory bandwidth for a fill
         | kernel. If you're going to overwrite the data anyway, why
         | bother?
        
         | porridgeraisin wrote:
          | They made the output empty, actually:
         | 
         | >> out = torch.empty([m, n], dtype=x.dtype, device=x.device)
         | 
          | The accumulator is initialized to zero, since, well, they
          | have to add stuff into it.
         | 
         | >> acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
         | 
         | > idiomatic
         | 
          | No; as far as I have seen, they generally try not to
          | initialize if it's not necessary.
         | 
         | > overhead
         | 
          | There is the memory-bandwidth point, as you might expect.
          | But additionally, when using a high-level interface like
          | PyTorch, writing torch.zeros(512, 512) launches a whole
          | kernel (tens of microseconds) just for that line. So that's
          | CPU -> GPU -> back to CPU, and then it does the next line,
          | where it goes to the GPU again and uses that memory. So in
          | these cases you make sure to avoid it if it's in a hot
          | path. Ideally you want the second kernel to do the
          | initialization itself; when you write CUDA C++ yourself,
          | this is how you typically do it. Helion, being a compiler,
          | might be doing this optimization, but eager-mode torch
          | clearly can't.
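          | 
          | A minimal eager-mode illustration of that overhead point
          | (sizes arbitrary; assumes a CUDA device is available):
          | 
          |     import torch
          | 
          |     a = torch.randn(512, 512, device="cuda")
          |     b = torch.randn(512, 512, device="cuda")
          | 
          |     # Launches a separate fill kernel just to write zeros
          |     # that the matmul below immediately overwrites.
          |     out = torch.zeros(512, 512, device="cuda")
          |     torch.mm(a, b, out=out)
          | 
          |     # Skips the fill: memory is reserved but left
          |     # uninitialized, which is fine because mm writes
          |     # every element.
          |     out = torch.empty(512, 512, device="cuda")
          |     torch.mm(a, b, out=out)
          | 
          | A fused kernel only has to zero its on-chip accumulator,
          | which is what the hl.zeros line above amounts to.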
        
       | darknoon wrote:
       | The developers also gave a talk about Helion on GPU Mode:
       | https://www.youtube.com/watch?v=1zKvCLuvUYc
        
       | bwfan123 wrote:
        | I don't get the point of Helion as compared to its
        | alternatives like Gluon.
        | 
        | For best performance, I would presume one needs low-level
        | access to hardware knobs. And these kernel primitives are
        | written once and reused. So what is the point of a DSL that
        | dumbs things down as a wrapper around Triton?
        
         | krapht wrote:
         | Funny, I feel the same way about Triton. Performant Triton
         | looks like CUDA (but with tiles!) except it's ten times harder
         | to debug since it doesn't have the tooling NVIDIA provides.
         | 
         | If I had to run on AMD I'd rather deal with their hipify
         | tooling.
        
           | saagarjha wrote:
           | Performant Triton programs are usually simpler and shorter
           | than their CUDA equivalents. This alone makes it easier to
           | write, and I would argue that it helps with debugging too
           | because the model provides a lot more guarantees on how your
           | code executes. That said, some of the tooling is notably poor
           | (such as cuda-gdb support).
        
             | krapht wrote:
             | Agree on shorter, disagree on simpler. The hard part of
             | understanding GPU code is knowing the reasons why
             | algorithms are the way they are. For example, why we do a
             | split-k decomposition when doing a matrix multiplication,
             | or why are we loading this particular data into shared
             | memory at this particular time, with some overlapping
             | subset into registers.
             | 
             | Getting rid of the for loop over an array index doesn't
             | make it easier to understand the hard parts. Losing the
             | developer perf and debug tooling is absolutely not worth
             | the tradeoff.
             | 
             | For me I'd rather deal with Jax or Numba, and if that still
             | wasn't enough, I would jump straight to CUDA.
             | 
             | It's possible I'm an old fogey with bias, though. It's true
             | that I've spent a lot more time with CUDA than with the new
             | DSLs on the block.
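                | 
                | (For anyone who hasn't met it, split-k just
                | partitions the reduction dimension so more blocks
                | can work in parallel when M and N are small; a toy
                | eager-mode sketch, with the number of splits picked
                | arbitrarily:)
                | 
                |     import torch
                | 
                |     def splitk_mm(a, b, splits=4):
                |         # Partition K, compute partial products,
                |         # then reduce. Real kernels do this to keep
                |         # more SMs busy when M*N is small but K is
                |         # large.
                |         k = a.shape[1]
                |         assert k % splits == 0
                |         step = k // splits
                |         parts = [
                |             a[:, i*step:(i+1)*step]
                |             @ b[i*step:(i+1)*step, :]
                |             for i in range(splits)
                |         ]
                |         return torch.stack(parts).sum(dim=0)
                | 
                |     a = torch.randn(64, 1024)
                |     b = torch.randn(1024, 64)
                |     ref = a @ b
                |     assert torch.allclose(splitk_mm(a, b), ref,
                |                           atol=1e-3)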
        
               | saagarjha wrote:
               | I don't think it is possible to write high performance
               | code without understanding how the hardware works. I just
               | think staring at code that coalesces your loads or
               | swizzles your layouts for the hundredth time is a waste
               | of screen space, though. Just let the compiler do it and
               | when it gets it wrong then you can bust out the explicit
               | code you were going to write in CUDA, anyway.
        
       | singularity2001 wrote:
        | Anything, as long as I don't have to touch proprietary CUDA
        | and mpx.
        
         | saagarjha wrote:
         | You'll need an execution backend.
        
       | jarbus wrote:
       | I posted this 5 days ago, how did this resurface?
        
         | koolba wrote:
         | https://news.ycombinator.com/item?id=26998308
        
       | mshockwave wrote:
        | Is it normal to spend 10 minutes on tuning nowadays? Do we
        | need to spend another 10 minutes after changing the code?
        
         | anvuong wrote:
          | You mean autotune? I think 10 minutes is pretty normal;
          | torch.compile(mode="max-autotune") can be much slower than
          | that for large models.
        
           | Mars008 wrote:
            | Add to that, it can be done just once by developers
            | before distribution, for each major hardware target. The
            | configs are saved and then selected on the client side.
        
       | ballpug wrote:
       | Compiling a kernel after assemblage in low-level object oriented
       | languages either uses stable kernel or the cargo fuzzed
       | raw_spinlock code.
       | 
       | Helion abstracts syntax and design for calculating l-functions,
       | which converts language in a kernel config.
        
       | a-dub wrote:
       | numba for gpu kernels... cool!
        
       ___________________________________________________________________
       (page generated 2025-11-07 23:00 UTC)