[HN Gopher] We Made CUDA Optimization Suck Less
___________________________________________________________________
We Made CUDA Optimization Suck Less
Author : jaberjaber23
Score : 26 points
Date   : 2025-05-13 14:43 UTC (1 day ago)
(HTM) web link (www.rightnowai.co)
(TXT) w3m dump (www.rightnowai.co)
| jaberjaber23 wrote:
| We're RightNow AI. We built a tool that automatically profiles,
| detects bottlenecks, and generates optimized CUDA kernels using
| AI.
|
| If you've written CUDA before, you know how it goes. You spend
| hours tweaking memory access, digging through profiler dumps,
| swapping out intrinsics, and praying it'll run faster. Most of
| the time, you're guessing.
|
| We got tired of it. So we built something that just works.
|
  | What RightNow AI Actually Does
  |
  | Prompt-based CUDA Kernel Generation: Describe what you want in
  | plain English. Get fast, optimized CUDA code back. No need to
  | know the difference between global and shared memory layouts.
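  | For anyone curious what that difference looks like in practice,
  | here's an illustrative sketch (hand-written, not output from the
  | tool): the same matrix transpose done naively and with
  | shared-memory tiling. The naive version's strided write defeats
  | coalescing; the tiled version stages a block in shared memory so
  | both the load and the store touch consecutive addresses.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Naive transpose: the write pattern is strided by n, so threads in a
// warp hit addresses n floats apart and the stores are uncoalesced.
__global__ void transpose_naive(float *out, const float *in, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        out[x * n + y] = in[y * n + x];  // strided write: uncoalesced
}

// Tiled transpose: stage a TILE x TILE block in shared memory, then
// write it back transposed so both global accesses are coalesced.
__global__ void transpose_tiled(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 pad avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced load
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices for output
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}
```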
|
  | Serverless GPU Profiling: Run your code on real GPUs without
  | having local hardware. Get detailed reports about where it's
  | slow and why.
|
  | Performance Optimizations That Deliver: Not vague advice like
  | "try more threads." We return rewritten code. Our users are
  | seeing 2x to 4x improvements out of the box. Some hit 20x.
|
  | Why We Built It: We needed it for our own work. Our ML stack
  | was bottlenecked by GPU code we didn't have time to optimize.
  | Existing tools felt ancient. The workflow was slow, clunky,
  | and filled with trial and error.
|
  | We thought: what if we could just say "optimize this kernel
  | for A100" and get something useful?
|
| So we built it.
|
  | RightNow AI is live. You can try it for free:
  | https://www.rightnowai.co/
|
| If you use it and hit something rough, tell us. We'll fix it.
| paulirish wrote:
| What does one of the GPU profiling reports look like?
|
| Edit: oh is it this? https://youtu.be/b-yh3FFpSX8?t=28
| PontifexCipher wrote:
| No examples of before/after? Maybe I missed something.
| godelski wrote:
| I was expecting something like TensorRT or Triton, but found
| "Vibe Coding"
|
| The project seems very naive. CUDA programming sucks because
| there's a lot of little gotchas and nuances that dramatically
| change performance. These optimizations can also significantly
| change between GPU architectures: you'll get different
| performances out of Volta, Ampere, or Blackwell. Parallel
| programming is hard in the first place, and it gets harder on
  | GPUs because of all these little intricacies. People who have
  | been doing CUDA programming for years are still learning new
  | techniques. It takes a _very_ different type of programming
  | skill, like actually understanding that Knuth's "premature
  | optimization is the root of all evil" means "get a profiler,"
  | not "don't optimize." All this is what makes writing good
  | kernels take so long. That's even after Nvidia engineers are
  | spending tons of time trying to simplify it.
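  | One concrete example of the kind of nuance I mean (my own
  | sketch, nothing to do with the product): a grid-stride loop
  | plus an occupancy query lets a single kernel adapt its launch
  | shape to whatever GPU it lands on, instead of hard-coding a
  | block count tuned for one architecture.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: correct for any grid size on any device, because
// each thread walks the array in steps of the total thread count.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

// Host-side launch helper (illustrative): ask the runtime for an
// occupancy-friendly block size on *this* GPU rather than baking in
// a number that only suits one architecture.
void launch_saxpy(float a, const float *x, float *y, int n) {
    int min_grid = 0, block = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, saxpy, 0, 0);
    int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(a, x, y, n);
}
```

  | That's table stakes, not optimization; the real per-architecture
  | work (tensor cores, async copies, L2 persistence) goes well
  | beyond it.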
|
| So I'm not surprised people are getting 2x or 4x out of the box.
| I'd expect that much if a person grabbed a profiler. I'd honestly
| expect more if they spent a week or two with the documentation
| and serious effort. But nothing in the landing page is convincing
| me the LLM can actually significantly help. Maybe I'm wrong! But
| it is unclear if the lead dev has significant CUDA experience.
| And I don't want something that optimizes a kernel for an A100, I
| want kernelS that are optimized for multiple architectures.
| That's the hard part and all those little nuances are exactly
| what LLM coding tends to be really bad at.
| germanjoey wrote:
| TBH, the 2x-4x improvement over a naive implementation that
| they're bragging about sounded kinda pathetic to me! I mean, it
| depends greatly on the kernel itself and the target arch, but
| I'm also assuming that the 2x-4x number is their best case
| scenario. Whereas the best case for hand-optimized could be in
| the tens or even hundreds of X.
| cjbgkagh wrote:
| The website appears vibe coded, as do the product-hunt reviews
| with "RightNow AI is an impressive..." appearing more than would
| be expected by random chance.
|
  | Either someone is good at writing CUDA kernels, in which case a
  | 1-10% perf improvement would be impressive, or they're bad at
  | writing CUDA kernels, in which case 2x-4x over naive very often
  | isn't.
|
| What percentage of people who do write custom CUDA kernels are
| bad at it? How many are so bad at it that they leave 20x on the
| table as claimed on the website?
|
| What could have helped sell it to me as a concept is an example
| of a before and after.
|
| EDIT: One of the reviews states "RightNow AI is an innovative
| tool designed to help developers profile and optimize CUDA code
| efficiently. Users have praised its ability to identify
| bottlenecks and enhance GPU performance. For example, one user
| stated, "RightNow AI is a game-changer for GPU optimization."" I
| think some of the AI prompt has leaked into the output.
___________________________________________________________________
(page generated 2025-05-14 23:00 UTC)