[HN Gopher] We Made CUDA Optimization Suck Less
       ___________________________________________________________________
        
       We Made CUDA Optimization Suck Less
        
       Author : jaberjaber23
       Score  : 26 points
       Date   : 2025-05-13 14:43 UTC (1 day ago)
        
 (HTM) web link (www.rightnowai.co)
 (TXT) w3m dump (www.rightnowai.co)
        
       | jaberjaber23 wrote:
       | We're RightNow AI. We built a tool that automatically profiles,
       | detects bottlenecks, and generates optimized CUDA kernels using
       | AI.
       | 
       | If you've written CUDA before, you know how it goes. You spend
       | hours tweaking memory access, digging through profiler dumps,
       | swapping out intrinsics, and praying it'll run faster. Most of
       | the time, you're guessing.
       | 
       | We got tired of it. So we built something that just works.
       | 
        | What RightNow AI Actually Does
        | 
        | Prompt-based CUDA Kernel Generation: Describe what you want
        | in plain English. Get fast, optimized CUDA code back. No need
        | to know the difference between global and shared memory
        | layouts.
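        | For readers who haven't hit that distinction before, here is a
        | minimal generic sketch (not RightNow output): the same matrix
        | multiply written naively against global memory, then staged
        | through a shared-memory tile. TILE, the square-matrix shape,
        | and the divisibility assumption are all illustrative.

```cuda
#define TILE 16  // illustrative tile size; assumes blockDim = TILE x TILE

// Naive version: every inner-loop iteration reads A and B straight
// from global memory, so the same values are refetched repeatedly.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Tiled version: each block loads a TILE x TILE tile into on-chip
// shared memory once, and every thread in the block reuses it.
// Assumes N is a multiple of TILE for brevity.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // wait until the tile is loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // wait before overwriting the tile
    }
    C[row * N + col] = acc;
}
```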
       | 
        | Serverless GPU Profiling: Run your code on real GPUs without
        | needing local hardware. Get detailed reports on where it's
        | slow and why.
       | 
        | Performance Optimizations That Deliver: Not vague advice like
        | "try more threads." We return rewritten code. Our users are
        | seeing 2x to 4x improvements out of the box. Some hit 20x.
       | 
        | Why We Built It: We needed it for our own work. Our ML stack was
       | bottlenecked by GPU code we didn't have time to optimize.
       | Existing tools felt ancient. The workflow was slow, clunky, and
       | filled with trial and error.
       | 
        | We thought: what if we could just say "optimize this kernel
        | for A100" and get something useful?
       | 
       | So we built it.
       | 
        | RightNow AI is live. You can try it for free:
       | https://www.rightnowai.co/
       | 
       | If you use it and hit something rough, tell us. We'll fix it.
        
         | paulirish wrote:
         | What does one of the GPU profiling reports look like?
         | 
         | Edit: oh is it this? https://youtu.be/b-yh3FFpSX8?t=28
        
       | PontifexCipher wrote:
       | No examples of before/after? Maybe I missed something.
        
       | godelski wrote:
       | I was expecting something like TensorRT or Triton, but found
       | "Vibe Coding"
       | 
       | The project seems very naive. CUDA programming sucks because
       | there's a lot of little gotchas and nuances that dramatically
       | change performance. These optimizations can also significantly
       | change between GPU architectures: you'll get different
       | performances out of Volta, Ampere, or Blackwell. Parallel
       | programming is hard in the first place, and it gets harder on
       | GPUs because of all these little intricacies. People that have
       | been doing CUDA programming for years are still learning new
       | techniques. It takes a _very_ different type of programming
        | skill. Like actually understanding that Knuth's "premature
        | optimization is the root of all evil" means "get a profiler"
        | not "don't optimize". All this is what makes writing good kernels
       | take so long. That's even after Nvidia engineers are spending
       | tons of time trying to simplify it.
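        | The "get a profiler" point in practice: before reaching for
        | Nsight Compute, even coarse cudaEvent timing beats guessing.
        | A minimal sketch with a stand-in kernel (names and launch
        | config are illustrative):

```cuda
#include <cstdio>

// Stand-in kernel just to have something to time.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // block until the kernel is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("scale: %.3f ms\n", ms);

    cudaFree(x);
    return 0;
}
```

        | For the real bottleneck hunt (memory throughput, occupancy,
        | warp stalls), the Nsight Compute CLI (`ncu ./app`) is the
        | heavier tool this timing loop graduates into.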
       | 
       | So I'm not surprised people are getting 2x or 4x out of the box.
       | I'd expect that much if a person grabbed a profiler. I'd honestly
       | expect more if they spent a week or two with the documentation
       | and serious effort. But nothing in the landing page is convincing
       | me the LLM can actually significantly help. Maybe I'm wrong! But
       | it is unclear if the lead dev has significant CUDA experience.
       | And I don't want something that optimizes a kernel for an A100, I
       | want kernelS that are optimized for multiple architectures.
       | That's the hard part and all those little nuances are exactly
       | what LLM coding tends to be really bad at.
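        | Concretely, "kernels for multiple architectures" usually means
        | compile-time specialization plus a fat binary. A hedged sketch;
        | the tile sizes here are illustrative placeholders, not tuned
        | values, and real tuning needs profiling per GPU:

```cuda
// Per-architecture specialization: pick a tile size at compile time
// via __CUDA_ARCH__ (defined only in device compilation passes).
#if __CUDA_ARCH__ >= 900        // Hopper (sm_90)
  #define TILE 64
#elif __CUDA_ARCH__ >= 800      // Ampere (sm_80+)
  #define TILE 32
#else                           // Volta/Turing and older
  #define TILE 16
#endif

// Toy kernel exercising TILE: each block stages TILE elements through
// shared memory before writing them back scaled. Assumes blockDim.x == TILE.
__global__ void scale_staged(const float* in, float* out, int n) {
    __shared__ float buf[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n) buf[threadIdx.x] = in[i];
    __syncthreads();            // whole block syncs; no divergent barrier
    if (i < n) out[i] = 2.0f * buf[threadIdx.x];
}

// Built as a fat binary so each GPU gets its own specialized code:
//   nvcc -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_80,code=sm_80 \
//        -gencode arch=compute_90,code=sm_90 kernel.cu
```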
        
         | germanjoey wrote:
         | TBH, the 2x-4x improvement over a naive implementation that
         | they're bragging about sounded kinda pathetic to me! I mean, it
         | depends greatly on the kernel itself and the target arch, but
         | I'm also assuming that the 2x-4x number is their best case
         | scenario. Whereas the best case for hand-optimized could be in
         | the tens or even hundreds of X.
        
       | cjbgkagh wrote:
        | The website appears vibe coded, as do the Product Hunt reviews,
        | with "RightNow AI is an impressive..." appearing more often
        | than random chance would suggest.
       | 
        | Either someone is good at writing CUDA kernels, in which case
        | a 1-10% perf improvement is impressive, or they're bad at
        | writing CUDA kernels, in which case 2x-4x over naive very
        | often isn't impressive.
       | 
       | What percentage of people who do write custom CUDA kernels are
       | bad at it? How many are so bad at it that they leave 20x on the
       | table as claimed on the website?
       | 
       | What could have helped sell it to me as a concept is an example
       | of a before and after.
       | 
       | EDIT: One of the reviews states "RightNow AI is an innovative
       | tool designed to help developers profile and optimize CUDA code
       | efficiently. Users have praised its ability to identify
       | bottlenecks and enhance GPU performance. For example, one user
       | stated, "RightNow AI is a game-changer for GPU optimization."" I
       | think some of the AI prompt has leaked into the output.
        
       ___________________________________________________________________
       (page generated 2025-05-14 23:00 UTC)