[HN Gopher] Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish (Yet)
       ___________________________________________________________________
        
       Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish
       (Yet)
        
       Author : mfiguiere
       Score  : 124 points
       Date   : 2025-05-30 20:03 UTC (2 hours ago)
        
 (HTM) web link (crfm.stanford.edu)
 (TXT) w3m dump (crfm.stanford.edu)
        
       | yahoozoo wrote:
       | Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they
       | don't mention which one produced the better kernels.
        
       | reliabilityguy wrote:
       | Is my understanding correct that they assumed a fixed size of the
       | input?
       | 
       | If so, why is it surprising that generic implementations in
       | PyTorch are worse?
        
         | GaggiX wrote:
          | PyTorch uses different kernels depending on the input size.
          | There's a reason it's so massive to download.
        
           | reliabilityguy wrote:
           | Sure, some degree of customization is expected. However, I
           | doubt that PyTorch implements _every_ input size separately.
        
       | Workaccount2 wrote:
       | Very fascinating result, and it seems they wrote this blog post
       | out of pure excitement to share their findings, and maybe to have
       | someone throw cold water on it before publishing, ha.
       | 
       | Who knows if this is the actual fabled path of "self
       | improvement", but results like this are what we expect to find on
       | such a path.
        
         | suddenlybananas wrote:
         | > Who knows if this is the actual fabled path of "self
         | improvement"
         | 
         | Seems doubtful as this works only on an extremely well-defined
         | evaluation function.
        
           | EMIRELADERO wrote:
           | That may be true, but this is the first example I've seen
           | where the concept is successfully implemented in a noticeable
           | way.
           | 
           | It's just like image generation: the first iteration is the
           | worst it will ever be.
        
           | observationist wrote:
           | Each time you define another task well enough for the system
           | to work, you generalize the system just a little bit - repeat
           | enough times and you can start to expand, develop taxonomies
           | of functions, precisely define function spaces and metrics
           | for improvement. This might not be a bootstrap for recursive
            | self-improvement generally, but it could definitely inform
            | the theory or design of a system that does bootstrap RSI.
        
             | suddenlybananas wrote:
             | That's an entirely different idea that may or may not work.
             | This is not evidence of that.
        
               | observationist wrote:
               | The structure of their research - the process, the
               | specific task, and the data they generate - will help
               | inform how other research gets performed. Instead of GPU
               | kernels, maybe the next task is something like neuron
               | modules, looking for structures that improve on attention
               | blocks, or things like that - each time you run through
               | an experiment like this, you're creating foundational
               | data upon which other experiments can be run and
               | improved. Once you've done enough of them, you can
               | generalize.
               | 
               | It could be that the end result is the knowledge of
               | strict boundaries of LLM capabilities, that they can only
               | operate in specific domains, or only improve to a certain
               | extent, and some currently unspecified defect limits the
               | level of improvement.
               | 
               | The underlying idea of specifying a domain and task
               | conditions, then letting an LLM run thousands of
               | experiments, is a great search technique. The hope is
               | that there is no implicit defect and that the methodology
               | will extend and generalize - it's not too complex a
               | notion to think that you could have an LLM create a broad
               | range of individual tasks, with a meta-goal of
               | identifying better and more general recursive improvement
               | processes and algorithms.
        
               | suddenlybananas wrote:
               | >The hope is that there is no implicit defect and that
               | the methodology will extend and generalize - it's not too
               | complex a notion to think that you could have an LLM
               | create a broad range of individual tasks, with a meta-
               | goal of identifying better and more general recursive
               | improvement processes and algorithms
               | 
               | Again, entirely different idea that doesn't have a
               | straightforward evaluation function. As it stands, this
               | is more akin to genetic programming with a very good
               | mutation function.
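The "genetic programming with a very good mutation function" framing can be sketched as a toy search loop. In the blog's setting the mutation operator would be an LLM proposing kernel rewrites and the fitness function a correctness-plus-speed check; the names and numeric toy problem below are purely illustrative assumptions, not the blog's actual code:

```python
import random

def optimize(seed_candidate, mutate, evaluate, generations=50, pool_size=4):
    """Toy evolutionary search loop: keep a small pool of candidates,
    mutate each one, and retain the fittest. Parents survive alongside
    children, so the best score never regresses."""
    pool = [seed_candidate]
    for _ in range(generations):
        children = [mutate(c) for c in pool]
        # sort descending by fitness and truncate to the pool size
        pool = sorted(pool + children, key=evaluate, reverse=True)[:pool_size]
    return pool[0]

# Toy problem standing in for "make the kernel fast": climb toward 10.0
# using random perturbations as the "mutation function".
random.seed(0)
best = optimize(
    0.0,
    mutate=lambda x: x + random.uniform(-1.0, 1.0),
    evaluate=lambda x: -abs(x - 10.0),
)
```

The search works precisely because `evaluate` is cheap and unambiguous, which is the commenter's point: the hard part of generalizing this is that most tasks lack such a clean fitness function.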
        
       | thorum wrote:
       | My takeaway - from this, Google's AlphaEvolve, and the recent
       | announcement about o3 finding a zero day in the Linux kernel - is
        | that Gemini Pro 2.5 and o3 in particular have reached a new
        | level of capability where ideas that were tried unsuccessfully
        | with other models suddenly just work.
        
         | jiggawatts wrote:
         | Gemini Pro 2.5 is the first AI that I can productively use for
         | anything other than human language translation, but it's just
          | _barely_ crossed that threshold. Sometimes my success rate
          | is below 20%.
         | 
         | When 3.0 comes out, that... that's going to start getting a
         | little scary.
        
         | zozbot234 wrote:
         | Wait, what are you saying? These have nothing to do with the
         | Linux kernel whatsoever, they are "kernels" in the GPU
         | programming sense. Did you just hallucinate this whole comment
         | or what?
        
           | None4U wrote:
            | There was a post on HN a bit ago from someone who used o3 to
            | find a vulnerability in the Linux kernel's SMB server. This
            | person is just saying the idea should've been tried earlier
            | and probably only recently became possible.
        
         | therealpygon wrote:
         | In my opinion, I wouldn't say so much that they are suddenly
         | working. Rather we've reached a point where they can iterate
         | and test significantly faster than humans are capable of doing
         | and have the ability to call on significantly more immediately
         | available information that it can make sense of, and as a
         | result, the combination information, advancement and
         | intelligently applied brute force seems to be having success in
         | certain applications.
        
       | brrrrrm wrote:
       | what's going to be interesting is to see the large space of fused
       | kernels being tackled by AI generated code. that might include
       | gemm + relu + gemm + a norm of some kind - which would be
       | annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite
       | as a human
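The chain named in the comment above can be prototyped unfused in a few lines; a fused kernel would compute the same values in one pass without materializing the intermediates in global memory. This is an illustrative reference only, with assumed shapes and the layer norm's affine parameters omitted:

```python
import numpy as np

def gemm_relu_gemm_norm(x, w1, w2, eps=1e-5):
    """Unfused reference for the gemm + relu + gemm + norm chain.
    A fused GPU kernel would produce the same result in one pass,
    avoiding round trips to global memory for h and y."""
    h = np.maximum(x @ w1, 0.0)             # gemm + relu
    y = h @ w2                              # second gemm
    mu = y.mean(axis=-1, keepdims=True)     # layer-norm statistics
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps)    # norm (no affine params)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w1 = rng.standard_normal((8, 16)).astype(np.float32)
w2 = rng.standard_normal((16, 8)).astype(np.float32)
out = gemm_relu_gemm_norm(x, w1, w2)
```

Each intermediate (`h`, `y`) is a tensor a tuner would otherwise have to sweep tile sizes for, which is what makes the fused search space so tedious by hand.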
        
       | ekelsen wrote:
       | "FP32 is less common in modern ML workloads and often less
       | optimized on recent hardware compared to FP16 or BF16, which may
       | partly explain why it's easier to achieve performance gains over
       | PyTorch with FP32 kernels."
       | 
       | People haven't spent time optimizing the fp32 versions of these
       | kernels in years. This will be much more interesting if they can
       | improve the kernels where developer effort has gone and that are
       | actually used.
        
         | suddenlybananas wrote:
         | I wonder if it's using known improvements from the fp16/bf16
         | kernels that are transferable to fp32?
        
         | moralestapia wrote:
         | >People haven't spent time optimizing the fp32 versions of
         | these kernels in years.
         | 
         | Wow, so, you're basically saying the AI created new algos in a
         | domain with no pre-existing solutions? Awesome!
        
       | adityamwagh wrote:
       | Sometimes I think of LLMs as kind of a hive mind. It's trained on
       | thought processes of so many humans. I think that's why it's able
       | to do these kinds of things given the fact that it has so much
       | information and context compressed in weights.
        
         | MangoToupe wrote:
         | The market itself is also kind of a hive-mind metaphor. Worth
         | thinking about.
        
           | suddenlybananas wrote:
            | Maybe we could replace it with central planning now that we
            | can distill information.
        
             | MangoToupe wrote:
             | Whoops you just did a communism
        
               | gpm wrote:
               | A "vertical integration" in the capitalist world ;)
        
               | MangoToupe wrote:
               | This got a legitimate chortle out of me
        
       | constantcrying wrote:
       | >and test for correctness by checking the numerical equality of
       | the two outputs over many random inputs.
       | 
        | This is fundamentally different from how any human would
        | approach this problem, and also different from how some recent
        | advances in this area were made, where AI actually came up with
        | superior and correct algorithms.
        | 
        | This approach also seems quite unfortunate and makes many of
        | these results somewhat doubtful.
        
         | gotoeleven wrote:
         | How else would you do the verification?
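The check quoted above (numerical equality over many random inputs) amounts to fuzz-testing a candidate kernel against a reference implementation. A minimal sketch of that harness follows; the reference/candidate pair here is a stand-in (an algebraically equivalent ReLU), not the blog's actual kernels, and the shapes, trial count, and tolerance are assumptions:

```python
import numpy as np

def reference(x):
    # stands in for the trusted implementation (e.g. PyTorch eager)
    return np.maximum(x, 0.0)

def candidate(x):
    # stands in for a generated "kernel": an algebraically
    # equivalent rewrite of ReLU, (|x| + x) / 2
    return (np.abs(x) + x) * 0.5

def check_equivalence(ref, cand, shape=(64, 64), trials=100,
                      tol=1e-2, seed=0):
    """Declare the candidate correct if it matches the reference on
    many random inputs, within an absolute tolerance."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.standard_normal(shape).astype(np.float32)
        if not np.allclose(ref(x), cand(x), atol=tol):
            return False
    return True
```

The weakness the parent comment points at is visible here: the harness can only ever sample the input space, so it certifies "agrees on the inputs we tried", not a proof of equivalence.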
        
       | ekelsen wrote:
       | "the reference code is in the default FP32, and given a tolerance
       | threshold (1e-02)"
       | 
       | that's a huge tolerance and allows them to use fp16 operations to
       | replace the "fp32" kernel.
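A quick way to see the point: for values of roughly unit magnitude, a round trip through fp16 stays well inside a 1e-02 absolute tolerance, so a nominally fp32 kernel that does its work in fp16 can still pass such a check. A minimal NumPy sketch (not the blog's harness) under that assumption:

```python
import numpy as np

# fp32 inputs of order 1, as in a normalized activation tensor
rng = np.random.default_rng(0)
x = rng.random(10_000).astype(np.float32)

# simulate computing in fp16 and casting back to fp32
x_roundtrip = x.astype(np.float16).astype(np.float32)

# fp16 spacing near 1.0 is 2**-10 ~ 9.8e-4, so the round-trip
# error for values in [0, 1) is bounded by ~4.9e-4 --
# comfortably below a 1e-02 tolerance
err = float(np.max(np.abs(x - x_roundtrip)))
```

So a tolerance of 1e-02 is roughly 20x looser than the precision fp16 itself provides at this magnitude, which is why it can't distinguish a true fp32 kernel from an fp16 shortcut.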
        
       ___________________________________________________________________
       (page generated 2025-05-30 23:00 UTC)