[HN Gopher] Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish (Yet)
___________________________________________________________________
Surprisingly Fast AI-Generated Kernels We Didn't Mean to Publish
(Yet)
Author : mfiguiere
Score : 124 points
Date : 2025-05-30 20:03 UTC (2 hours ago)
(HTM) web link (crfm.stanford.edu)
(TXT) w3m dump (crfm.stanford.edu)
| yahoozoo wrote:
| Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they
| don't mention which one produced the better kernels.
| reliabilityguy wrote:
| Is my understanding correct that they assumed a fixed size of the
| input?
|
| If so, why is it surprising that generic implementations in
| PyTorch are worse?
| GaggiX wrote:
| PyTorch uses different kernels depending on the input size.
| There is a reason why it's so massive to download.
| reliabilityguy wrote:
| Sure, some degree of customization is expected. However, I
| doubt that PyTorch implements _every_ input size separately.
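|
| One way to check what PyTorch actually dispatches is to profile
| the same op at different sizes (a minimal sketch; it assumes a
| CUDA build, and the kernel names it reports vary by GPU, cuBLAS
| version, and PyTorch release):
|
|   import torch
|   from torch.profiler import profile, ProfilerActivity
|
|   # Profile the same matmul at two sizes and compare which
|   # CUDA kernels end up being launched under the hood.
|   for n in (128, 4096):
|       a = torch.randn(n, n, device="cuda")
|       b = torch.randn(n, n, device="cuda")
|       torch.matmul(a, b)  # warm-up so kernel selection settles
|       with profile(activities=[ProfilerActivity.CUDA]) as prof:
|           torch.matmul(a, b)
|       print(f"--- n={n} ---")
|       print(prof.key_averages().table(
|           sort_by="cuda_time_total", row_limit=3))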
| Workaccount2 wrote:
| A fascinating result, and it seems they wrote this blog post
| out of pure excitement to share their findings, and maybe to have
| someone throw cold water on it before publishing, ha.
|
| Who knows if this is the actual fabled path of "self
| improvement", but results like this are what we expect to find on
| such a path.
| suddenlybananas wrote:
| > Who knows if this is the actual fabled path of "self
| improvement"
|
| Seems doubtful, as this only works with an extremely
| well-defined evaluation function.
| EMIRELADERO wrote:
| That may be true, but this is the first example I've seen
| where the concept is successfully implemented in a noticeable
| way.
|
| It's just like image generation: the first iteration is the
| worst it will ever be.
| observationist wrote:
| Each time you define another task well enough for the system
| to work, you generalize the system just a little bit - repeat
| enough times and you can start to expand, develop taxonomies
| of functions, precisely define function spaces and metrics
| for improvement. This might not be a bootstrap for recursive
| self-improvement generally, but it could definitely inform the
| theory or design of a system that does bootstrap RSI.
| suddenlybananas wrote:
| That's an entirely different idea that may or may not work.
| This is not evidence of that.
| observationist wrote:
| The structure of their research - the process, the
| specific task, and the data they generate - will help
| inform how other research gets performed. Instead of GPU
| kernels, maybe the next task is something like neuron
| modules, looking for structures that improve on attention
| blocks, or things like that - each time you run through
| an experiment like this, you're creating foundational
| data upon which other experiments can be run and
| improved. Once you've done enough of them, you can
| generalize.
|
| It could be that the end result is the knowledge of
| strict boundaries of LLM capabilities, that they can only
| operate in specific domains, or only improve to a certain
| extent, and some currently unspecified defect limits the
| level of improvement.
|
| The underlying idea of specifying a domain and task
| conditions, then letting an LLM run thousands of
| experiments, is a great search technique. The hope is
| that there is no implicit defect and that the methodology
| will extend and generalize - it's not too complex a
| notion to think that you could have an LLM create a broad
| range of individual tasks, with a meta-goal of
| identifying better and more general recursive improvement
| processes and algorithms.
| suddenlybananas wrote:
| >The hope is that there is no implicit defect and that
| the methodology will extend and generalize - it's not too
| complex a notion to think that you could have an LLM
| create a broad range of individual tasks, with a meta-
| goal of identifying better and more general recursive
| improvement processes and algorithms
|
| Again, entirely different idea that doesn't have a
| straightforward evaluation function. As it stands, this
| is more akin to genetic programming with a very good
| mutation function.
| thorum wrote:
| My takeaway - from this, Google's AlphaEvolve, and the recent
| announcement about o3 finding a zero-day in the Linux kernel -
| is that Gemini 2.5 Pro and o3 in particular have reached a new
| level of capability where ideas that were tried unsuccessfully
| with other models suddenly just work.
| jiggawatts wrote:
| Gemini 2.5 Pro is the first AI that I can productively use for
| anything other than human language translation, but it's just
| _barely_ crossed that threshold. Sometimes my success rate is
| below 20%.
|
| When 3.0 comes out, that... that's going to start getting a
| little scary.
| zozbot234 wrote:
| Wait, what are you saying? These have nothing to do with the
| Linux kernel whatsoever; they are "kernels" in the GPU
| programming sense. Did you just hallucinate this whole comment
| or what?
| None4U wrote:
| There was a post on HN a bit ago from someone who used o3 to
| find a vulnerability in the Linux kernel's SMB server. This
| person is just saying it should've been tried earlier and
| probably only recently became possible.
| therealpygon wrote:
| In my opinion, it's not so much that they suddenly work.
| Rather, we've reached a point where they can iterate and test
| significantly faster than humans can, and they can draw on
| significantly more immediately available information and make
| sense of it. As a result, the combination of information,
| advancement, and intelligently applied brute force seems to be
| having success in certain applications.
| brrrrrm wrote:
| What's going to be interesting is seeing the large space of
| fused kernels tackled by AI-generated code. That might include
| GEMM + ReLU + GEMM + a norm of some kind - which would be
| annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite
| as a human.
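|
| For concreteness, the chain described above in eager PyTorch (a
| minimal sketch; the function and weight names are made up, and
| torch.compile is one existing way to get partial fusion
| automatically):
|
|   import torch
|
|   def mlp_block(x, w1, w2, ln_w, ln_b):
|       h = torch.relu(x @ w1)  # gemm + relu
|       y = h @ w2              # second gemm
|       # "a norm of some kind": layer norm over the last dim
|       return torch.nn.functional.layer_norm(
|           y, y.shape[-1:], ln_w, ln_b)
|
|   # Inductor fuses the elementwise/norm pieces around the two
|   # GEMMs; a single handwritten or AI-searched kernel could go
|   # further and fuse across them.
|   fused = torch.compile(mlp_block)
|   out = fused(torch.randn(32, 512), torch.randn(512, 512),
|               torch.randn(512, 512), torch.ones(512),
|               torch.zeros(512))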
| ekelsen wrote:
| "FP32 is less common in modern ML workloads and often less
| optimized on recent hardware compared to FP16 or BF16, which may
| partly explain why it's easier to achieve performance gains over
| PyTorch with FP32 kernels."
|
| People haven't spent time optimizing the fp32 versions of these
| kernels in years. This will be much more interesting if they can
| improve the kernels where developer effort has gone and that are
| actually used.
| suddenlybananas wrote:
| I wonder if it's using known improvements from the fp16/bf16
| kernels that are transferable to fp32?
| moralestapia wrote:
| >People haven't spent time optimizing the fp32 versions of
| these kernels in years.
|
| Wow, so, you're basically saying the AI created new algos in a
| domain with no pre-existing solutions? Awesome!
| adityamwagh wrote:
| Sometimes I think of LLMs as a kind of hive mind. They're
| trained on the thought processes of so many humans. I think
| that's why they're able to do these kinds of things, given how
| much information and context is compressed into their weights.
| MangoToupe wrote:
| The market itself is also kind of a hive-mind metaphor. Worth
| thinking about.
| suddenlybananas wrote:
| Maybe we could replace it with central planning now that we
| can distill information.
| MangoToupe wrote:
| Whoops you just did a communism
| gpm wrote:
| A "vertical integration" in the capitalist world ;)
| MangoToupe wrote:
| This got a legitimate chortle out of me
| constantcrying wrote:
| >and test for correctness by checking the numerical equality of
| the two outputs over many random inputs.
|
| This is fundamentally different from how any human would
| approach this problem, and also different from how some recent
| advances in this area were made, where AI actually came up with
| superior and correct algorithms.
|
| This approach also seems quite unfortunate and makes many of
| these results somewhat doubtful.
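|
| For concreteness, the quoted procedure amounts to something
| like this (a minimal sketch; the names, shapes, and trial count
| are illustrative):
|
|   import torch
|
|   def looks_correct(candidate, reference,
|                     trials=100, atol=1e-2):
|       # Sample random inputs and compare outputs within a
|       # tolerance. Passing says nothing about inputs the
|       # sampler never draws, which is the weakness above.
|       for _ in range(trials):
|           x = torch.randn(256, 256)
|           if not torch.allclose(candidate(x), reference(x),
|                                 atol=atol):
|               return False
|       return True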
| gotoeleven wrote:
| How else would you do the verification?
| ekelsen wrote:
| "the reference code is in the default FP32, and given a tolerance
| threshold (1e-02)"
|
| That's a huge tolerance and allows them to use fp16 operations
| replace the "fp32" kernel.
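|
| A rough illustration (not the paper's harness): round GEMM
| inputs to FP16 but keep FP32 accumulation, as tensor cores do,
| and the error typically lands an order of magnitude inside a
| 1e-02 budget:
|
|   import torch
|
|   x, w = torch.randn(256, 256), torch.randn(256, 256)
|
|   ref = x @ w  # FP32 reference
|   # Simulate an FP16 tensor-core GEMM: inputs rounded to half
|   # precision, accumulation kept in FP32.
|   approx = x.half().float() @ w.half().float()
|
|   rel_err = (ref - approx).abs().max() / ref.abs().max()
|   print(rel_err)  # typically ~1e-3, well under 1e-02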
___________________________________________________________________
(page generated 2025-05-30 23:00 UTC)