[HN Gopher] Surprisingly fast AI-generated kernels we didn't mea...
___________________________________________________________________
Surprisingly fast AI-generated kernels we didn't mean to publish
yet
Author : mfiguiere
Score : 371 points
Date : 2025-05-30 20:03 UTC (1 days ago)
(HTM) web link (crfm.stanford.edu)
(TXT) w3m dump (crfm.stanford.edu)
| yahoozoo wrote:
| Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they
| don't mention which one produced the better kernels.
| reliabilityguy wrote:
| Is my understanding correct that they assumed a fixed size of the
| input?
|
| If so, why is it surprising that generic implementations in
| PyTorch are worse?
| GaggiX wrote:
| Pytorch uses different kernels depending on the input size.
| There is a reason why it's so massive to download.
| reliabilityguy wrote:
| Sure, some degree of customization is expected. However, I
| doubt that PyTorch implements _every_ input size separately.
| Workaccount2 wrote:
| Very fascinating result, and it seems they wrote this blog post
| out of pure excitement to share their findings, and maybe to have
| someone throw cold water on it before publishing, ha.
|
| Who knows if this is the actual fabled path of "self
| improvement", but results like this are what we expect to find on
| such a path.
| suddenlybananas wrote:
| > Who knows if this is the actual fabled path of "self
| improvement"
|
| Seems doubtful as this works only on an extremely well-defined
| evaluation function.
| EMIRELADERO wrote:
| That may be true, but this is the first example I've seen
| where the concept is successfully implemented in a noticeable
| way.
|
| It's just like image generation: the first iteration is the
| worst it will ever be.
| observationist wrote:
| Each time you define another task well enough for the system
| to work, you generalize the system just a little bit - repeat
| enough times and you can start to expand, develop taxonomies
| of functions, precisely define function spaces and metrics
| for improvement. This might not be a bootstrap for recursive
| self improvement generally, but it could definitely inform
| the theory or design of a system that does bootstrap rsi.
| suddenlybananas wrote:
| That's an entirely different idea that may or may not work.
| This is not evidence of that.
| observationist wrote:
| The structure of their research - the process, the
| specific task, and the data they generate - will help
| inform how other research gets performed. Instead of GPU
| kernels, maybe the next task is something like neuron
| modules, looking for structures that improve on attention
| blocks, or things like that - each time you run through
| an experiment like this, you're creating foundational
| data upon which other experiments can be run and
| improved. Once you've done enough of them, you can
| generalize.
|
| It could be that the end result is the knowledge of
| strict boundaries of LLM capabilities, that they can only
| operate in specific domains, or only improve to a certain
| extent, and some currently unspecified defect limits the
| level of improvement.
|
| The underlying idea of specifying a domain and task
| conditions, then letting an LLM run thousands of
| experiments, is a great search technique. The hope is
| that there is no implicit defect and that the methodology
| will extend and generalize - it's not too complex a
| notion to think that you could have an LLM create a broad
| range of individual tasks, with a meta-goal of
| identifying better and more general recursive improvement
| processes and algorithms.
| suddenlybananas wrote:
| >The hope is that there is no implicit defect and that
| the methodology will extend and generalize - it's not too
| complex a notion to think that you could have an LLM
| create a broad range of individual tasks, with a meta-
| goal of identifying better and more general recursive
| improvement processes and algorithms
|
| Again, entirely different idea that doesn't have a
| straightforward evaluation function. As it stands, this
| is more akin to genetic programming with a very good
| mutation function.
| thorum wrote:
| My takeaway - from this article, from Google's AlphaEvolve [1],
| and the recent announcement about o3 finding a zero day in the
| Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular
| have reached a new level of capability where these ideas that
| were tried unsuccessfully with other models, suddenly just work.
|
| [1] https://deepmind.google/discover/blog/alphaevolve-a-
| gemini-p...
|
| [2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-
| cve-...
| jiggawatts wrote:
| Gemini Pro 2.5 is the first AI that I can productively use for
| anything other than human language translation, but it's just
| _barely_ crossed that threshold. Sometimes I get success hit
| rates below 20%.
|
| When 3.0 comes out, that... that's going to start getting a
| little scary.
| jacob019 wrote:
| What domain?
| jiggawatts wrote:
| SRE / DevOps / coding mostly in the Azure and .NET
| ecosystems.
|
| The problems I have to solve tend to be the horrible ones
| that nobody has answers to, anywhere on the Internet, so
| unsurprisingly the AIs aren't good at it either.
|
| The trick has been to use the AIs for what they _are_ good
| at, which used to be "nothing" for me at least, but now
| I can use them productively for certain "spot" tasks.
|
| Random examples:
|
| - Cross-language and cross-platform benchmarking of a bunch
| of different database clients to see how they stack up. I
| gave the AI a working example in one language and got it to
| whip up a series of equivalents with other DB drivers and
| languages. Sure, it's trivial, but it's way faster than
| doing it myself!
|
| - Crash dump analysis using WinDbg. I read somewhere that
| "vibe debugging" of kernel dumps totally works, so when I
| had an actual crash I gave it a go for laughs. With AI help
| I managed to extract the name of the specific file that had
| NTFS corruption and was crashing the server. Deleted the
| file, restored it from backups, and the server was good to
| go again!
|
| - If you ever watch the top mechanical engineers on
| YouTube, they _all make their own tools_ instead of just
| buying them. Jigs, extenders, unusual sizes, etc... IT work
| is the same. As a recent example, I got Gemini to make me a
| code-AST rewriter for a specific issue I wanted to clean up
| in bulk across a huge code base. Using the Roslyn compiler
| SDK is a bit fiddly, but it spat out a working tool for me
| in under an hour. (This is not something you can solve with
| a script full of regex, it needed a proper parser to handle
| commented-out blocks and the like.)
| jacob019 wrote:
| Sounds like interesting work, thanks for sharing! "Vibe
| debugging", hah, I like that one. The latest crop of
| models is definitely unlocking new capabilities, and I
| totally get the desire to make your own tools. I do that
| to a fault sometimes, but it's nice to have a simple tool
| that does exactly one thing, exactly the way you want it.
|
| I've been pair programming with the models for a while,
| and wrote some "agents" before I knew to call it that
| back in the dark days of GPT-3.5, but only recently with
| the latest models unlocking capabilities beyond what I
| could achieve with handwritten code.
| mholm wrote:
| > Sure, it's trivial, but it's way faster than doing it
| myself
|
| That's the clincher for me. So much software work is just
| executing on a design, not inventing anything new. Being
| able to do 5x the trivial work in an hour is life
| changing, and it lets me pull my head out of that work to
| see how I can make larger process improvements. AI
| doesn't need to rewrite the linux kernel in Rust to be
| extremely valuable to the average developer
| manmal wrote:
| o3 is in my experience often even better, but too slow and
| too rate limited to use it all the time.
| zozbot234 wrote:
| Wait, what are you saying? These have nothing to do with the
| Linux kernel whatsoever, they are "kernels" in the GPU
| programming sense. Did you just hallucinate this whole comment
| or what?
| None4U wrote:
| There was a post on HN a bit ago from someone who used o3 to
| find a vulnerability in the Linux kernel's SMB server, which
| this person is just saying should've been tried earlier and
| probably recently became possible
| thorum wrote:
| Sorry, I added links! Just a week ago someone built a system
| that used o3 to find novel zero days in the Linux kernel's
| SMB implementation.
| stefan_ wrote:
| There are zero-days in obscure parts of the kernel nobody uses
| every other day. (It also of course found 100 other things
| that were not zero days or vulnerabilities, yet professed
| they were, which is why this trash even on Gemini 9000 Pro
| keeps spamming security mails)
| therealpygon wrote:
| In my opinion, I wouldn't say so much that they are suddenly
| working. Rather we've reached a point where they can iterate
| and test significantly faster than humans are capable of doing
| and have the ability to call on significantly more immediately
| available information that it can make sense of, and as a
| result, the combination of information, advancement, and
| intelligently applied brute force seems to be having success in
| certain applications.
| thorum wrote:
| Good points. I suspect that o3 is able to reason more deeply
| about different paths through a codebase than earlier models,
| though, which might make it better at this kind of work in
| particular.
| therealpygon wrote:
| Very likely. Larger context is significantly beneficial to
| the LLMs when they can maintain attention, which was part
| of my point. Imagine being able to hold the word for word
| text of your required reading book while you are taking a
| test, while older models were more like a couple chapters
| worth of text. _Two_ years ago.
| westoncb wrote:
| I was blown away by some debugging results I got from o3
| early on and have been using it heavily since. The early
| results that caught my attention were from a couple cases
| where it tracked down some problematic cause through
| several indirect layers of effects in a way where you'd
| typically be tediously tracing step-by-step through a
| debugger. I think whatever's behind this capability has
| some overlap with really solid work it'll do in abstract
| system design, particularly in having it think through
| distant implications of design choices.
| notyourwork wrote:
| I'm interested in learning more about how you use o3 for
| debugging.
| westoncb wrote:
| The main trick is in how you build up its context for
| the problem. What I do is think of it like a colleague
| I'm trying to explain the bug to: the overall structure
| is conversational, but I interleave both relevant source
| chunks and _detailed_ / _complete_ observational info
| from what I've observed about anomalous program
| behavior. I typically will send a first message building
| up context about the program/source, and then build up
| the narrative context for the particular bug in a second
| message. This sets it up with basically perfect context
| to infer the problem, and sets you up for easy reuse: you
| can back up, clear that second message and ask something
| else, reusing detailed program context given by the first
| message.
|
| Using it on the architectural side you can follow a
| similar procedure but instead of describing a bug you're
| describing architectural revisions you've gone through,
| what your experience with each was, what your objectives
| with a potential refactor are, where your thinking's at
| as far as candidate reformulations, and so on. Then
| finish with a question that doesn't overly constrain the
| model; you might retry from that conversation/context
| point with a few variants, e.g.: "what are your thoughts
| on all this?" or "can you think of better primitives to
| express the system through?"
|
| I think there are two key points to doing this
| effectively:
|
| 1) Give it full, detailed context with nothing
| superfluous, and express it within the narrative of your
| real world situation.
|
| 2) Be careful not to "over-prescribe" what it says back
| to you. They are very "genie-like" where it'll often give
| exactly what you ask for in a rather literal sense, in
| incredibly dumb-seeming ways if you're not careful.
| MangoToupe wrote:
| In the context of LLMs, what do you mean by "reason"? What
| does reasoning look like in LLMs and how do you recognize
| it, and more importantly, how do you invoke it? I haven't
| had much success in getting LLMs to solve, well, basically
| any problem that involves logic.
|
| Chain of thought at least introduces some skepticism, but
| that's not exactly reasoning. It makes me wonder what
| people refer to when they say "reason".
| suddenlybananas wrote:
| People think an approximation of a thing is the thing.
| therealpygon wrote:
| As best as I have understood, the LLM's output is directly
| related to the state of the network as a result of the
| context. Thinking is the way we use intermediate
| predictions to help steer the network toward what is
| expected to be a better result through learned patterns.
| Reasoning is a set of strategies for shaping that process
| to produce even more accurate output, generally having a
| cumulative effect on the accuracy of predictions.
| MangoToupe wrote:
| > Reasoning are strategies for shaping that process to
| produce even more accurate output
|
| How can it evaluate accuracy if it can't even detect
| contradictions reliably?
| therealpygon wrote:
| It doesn't? Reasoning is not an analysis; it is the
| application of learned patterns for a given set of
| parameters that results in higher accuracy.
|
| Permit my likely inaccurate illustration: You're pretty
| sure 2 + 2 is 4, but there are several questions you
| could ask: are any of the numbers negative, are they
| decimals, were any numbers left out? Most of those
| questions are things you've learned to ask automatically,
| without thinking about it, because you know they're
| important. But because the answer matters, you check your
| work by writing out the equation. Then, maybe you verify
| it with more math; 4 / 2 = 2. Now you're more confident
| the answer is right.
|
| An LLM doesn't understand math per se. If you type "2 + 2
| =", the model isn't doing math... it's predicting that
| "4" is the next most likely token based on patterns in
| its training data.
|
| "Thinking" in an LLM is like the model shifting mode and
| it starts generating a list of question-and-answer pairs.
| These are again the next most likely tokens based on the
| whole context so far. "Reasoning" is above that: a
| controlling pattern that steers those question-and-answer
| sequences, injecting logic to help guide the model toward
| a hopefully more correct next token.
| geraneum wrote:
| It's true that there are similarities between what you
| mentioned and what's happening in this case. From the article:
|
| > The result is a test-time loop that looks less like "chat
| with a compiler" in the case of sequential revision, and more
| like structured exploratory search, guided by explicit
| optimization hypotheses and aggressively parallel evaluation.
|
| My conclusion would be that we've now learned to apply LLMs'
| capabilities to shrink the solution space where we have a clear
| evaluation function and existing solutions to problems that
| follow similar patterns. This applies in this case as well.
|
| IMO, It's not about model X gaining on other models or model Y
| being able to reason about the solutions, etc. in a way that
| other models couldn't.
| MangoToupe wrote:
| Interesting. Do you have stronger evidence to support your
| claim? A sample size of one is pretty unconvincing.
| brrrrrm wrote:
| what's going to be interesting is to see the large space of fused
| kernels being tackled by AI generated code. that might include
| gemm + relu + gemm + a norm of some kind - which would be
| annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite
| as a human
| AtlasBarfed wrote:
| Uh, what is a "kernel" in the sense of AI? Because it sure
| looks like this isn't an OS kernel.
| philipkglass wrote:
| This is GPU terminology:
|
| https://cvw.cac.cornell.edu/gpu-architecture/gpu-
| characteris...
|
| _A function that is meant to be executed in parallel on an
| attached GPU is called a kernel. In CUDA, a kernel is usually
| identified by the presence of the __global__ specifier in
| front of an otherwise normal-looking C++ function
| declaration._
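|
| To make that concrete, here is a minimal sketch of the same idea
| written in Python via Numba's CUDA support (assumes a CUDA-capable
| GPU and the numba package; the @cuda.jit function plays the role
| of a __global__ function in CUDA C++):
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit                      # analogous to __global__
|     def add_kernel(x, y, out):
|         i = cuda.grid(1)           # global thread index
|         if i < out.size:           # guard out-of-range threads
|             out[i] = x[i] + y[i]
|
|     n = 1 << 20
|     x = np.ones(n, dtype=np.float32)
|     y = np.ones(n, dtype=np.float32)
|     out = np.empty_like(x)
|
|     threads = 256
|     blocks = (n + threads - 1) // threads
|     add_kernel[blocks, threads](x, y, out)   # launch configuration
|
| PyTorch operations like matmul or softmax ultimately dispatch to
| kernels of this kind (usually hand-written CUDA or cuDNN/cuBLAS
| code), which is what the article's generated kernels replace.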
| ekelsen wrote:
| "FP32 is less common in modern ML workloads and often less
| optimized on recent hardware compared to FP16 or BF16, which may
| partly explain why it's easier to achieve performance gains over
| PyTorch with FP32 kernels."
|
| People haven't spent time optimizing the fp32 versions of these
| kernels in years. This will be much more interesting if they can
| improve the kernels where developer effort has gone and that are
| actually used.
| suddenlybananas wrote:
| I wonder if it's using known improvements from the fp16/bf16
| kernels that are transferable to fp32?
| moralestapia wrote:
| >People haven't spent time optimizing the fp32 versions of
| these kernels in years.
|
| Wow, so, you're basically saying the AI created new algos in a
| domain with no pre-existing solutions? Awesome!
| Aurornis wrote:
| No one said the AI created new algorithms nor that there
| weren't pre-existing solutions.
|
| The implication was that the FP32 versions of these kernels
| have lagged behind the more popular versions. There was
| opportunity to translate the advancements from other kernels
| into these. Someone would need to look closely to see exactly
| what was done, but it's premature to suggest anything like
| "new algos" or "no pre-existing solutions"
|
| This is a great use case for LLMs, though. I often do
| something similar where I make improvements to something I
| use most frequently and ask an LLM to translate that pattern
| to other similar parts of the code.
| moralestapia wrote:
| >The implication was that the FP32 versions of these
| kernels have lagged behind the more popular versions.
|
| Help me understand this 'cause I'm a bit slow these days
| ...
|
| Does that mean optimized FP32 versions of these kernels
| were already there or not?
| almostgotcaught wrote:
| > Help me understand this 'cause I'm a bit slow these
| days ...
|
| If I do `sed 's/f32/f16/g' kernel.cu` does this count as
| AI? Help me understand because I'm a little slow when it
| comes to all the dumb shit people attribute to LLMs these
| days...
| moralestapia wrote:
| Indeed, you're slow on this news.
|
| >sed 's/f32/f16/g' kernel.cu
|
| This is not what's happening here, it's a completely
| different thing, read TFA.
| imtringued wrote:
| You are a blatant troll and you know that.
| Dylan16807 wrote:
| > Does that mean optimized FP32 versions of these kernels
| were already there or not?
|
| If you're trying to support your original point with that
| argument, then you're using some pretty awful definitions
| of the terms "new algos" and "no pre-existing solutions".
| uoaei wrote:
| The hype cycle in action, folks. Pay heed.
| vlovich123 wrote:
| The solution not existing in PyTorch does not mean the
| solution doesn't exist elsewhere on the internet. Remember -
| PyTorch is largely maintained by employees of companies that
| have their own priorities for the SW and those priorities may
| not include hyper optimizing fp32 kernels.
|
| That being said, it is cool if AI is enabling lower cost
| adoption of better more optimized kernels with less effort.
| imtringued wrote:
| Read the article before spouting lies. Actually never mind
| that.
|
| Read the damn comment you're responding to. There have been
| human written kernels for both fp16 and fp32 for a long time.
|
| Here is the corrected version of your comment:
|
| "Wow, so, you're basically saying the AI created the same but
| faster algos in a well known domain with established pre-
| existing solutions, whose overall impact on the runtime of
| practical workloads is insignificant? Awesome!"
| adrian_b wrote:
| I believe that these good results are explained at least in
| part by the fact that NVIDIA does not provide detailed enough
| documentation for their GPUs.
|
| For a processor with well-documented microarchitecture, for
| which a programmer or a compiler can deterministically write an
| optimal program, it is much less likely that applying ML/AI can
| be successful, except as a substitute for searching already
| known solutions.
|
| On the other hand, for less documented microarchitectures, like
| those of the NVIDIA GPUs, finding an optimal program may be
| impossible other than by doing a random search guided by
| examples of previous optimized programs, and possibly doing
| some reverse-engineering work to determine the real behavior of
| the GPU in some circumstances.
|
| Improving over something like this is likely to be feasible for
| ML/AI, where training over known good programs may be able to
| extract some of the undocumented behavior that may be non-
| obvious for humans reading those examples.
| fulafel wrote:
| Even with full information, we generally (or practically)
| aren't able to write optimal programs.
| pca006132 wrote:
| While it is decidable, people typically never produce optimal
| programs even for the hot path. It is just intractable and
| too slow to do right now.
|
| For register allocation and instruction selection, there is
| hope because it is FPT and there are algorithms to do it
| optimally in polynomial time, albeit with a large constant
| factor (FPT), making it impractical to apply to compilers as
| of today. For instruction scheduling, it is just too hard. If
| you read literature on scheduling algorithms, it is NP-hard
| for apparently simple instances, e.g., 2 parallel identical
| machines with no preemption and bounding completion time
| (https://www2.informatik.uni-osnabrueck.de/knust/class/),
| while actual microarchitecture is much more complicated than
| this...
|
| Needless to say, these are already the simpler problems. The
| longer the program or the more profiling data you can
| optimize for, the more tricks you can throw at it, and most
| of them are NP-hard to optimize optimally.
|
| Being NP-hard doesn't imply that you can't obtain the optimal
| result, but compilers that I know of do not implement them,
| because most users are not willing to wait for days for such
| a compilation to complete. Ideally, one should make something
| that can run on clusters of CPUs or GPUs to optimize this,
| and people having those clusters will typically be willing to
| do this because they want to optimize the program they later
| run on the clusters. However, to my knowledge, no one is
| working on this at the moment.
| david-gpu wrote:
| _> For a processor with well-documented microarchitecture,
| for which a programmer or a compiler can deterministically
| write an optimal program_
|
| You severely underestimate the landscape of possible
| implementations for these kernels. There are _many_ ways of
| performing a matrix multiplication and predicting which one
| will perform best without running them all is nontrivial,
| even with perfect knowledge of the underlying system.
|
| This is just a completely incorrect take, speaking as a
| former insider.
| mjlee wrote:
| > For a processor with well-documented microarchitecture, for
| which a programmer or a compiler can deterministically write
| an optimal program
|
| We don't even know the optimal algorithms! AlphaEvolve
| recently found "an algorithm to multiply 4x4 complex-valued
| matrices using 48 scalar multiplications, improving upon
| Strassen's 1969 algorithm that was previously known as the
| best in this setting." -
| https://www.nature.com/articles/s41586-022-05172-4
| hmry wrote:
| For those who don't want to read the article: The previous
| best was 49 scalar multiplications.
| mattkrause wrote:
| And I believe some humans subsequently knocked it down to
| 47?
| adityamwagh wrote:
| Sometimes I think of LLMs as kind of a hive mind. It's trained on
| thought processes of so many humans. I think that's why it's able
| to do these kinds of things given the fact that it has so much
| information and context compressed in weights.
| MangoToupe wrote:
| The market itself is also kind of a hive-mind metaphor. Worth
| thinking about.
| suddenlybananas wrote:
| Maybe we could replace it with central planning now that we
| can distill information.
| MangoToupe wrote:
| Whoops you just did a communism
| gpm wrote:
| A "vertical integration" in the capitalist world ;)
| MangoToupe wrote:
| This got a legitimate chortle out of me
| yieldcrv wrote:
| a non-human standing committee following the directives
| of a trust could work
| MangoToupe wrote:
| What like you want to govern by divining patterns of
| snake coils or bird guts?
| constantcrying wrote:
| >and test for correctness by checking the numerical equality of
| the two outputs over many random inputs.
|
| This is fundamentally different to how any human would approach
| this problem. And also different to how some recent advances in
| this area were made, where AI actually came up with superior and
| correct algorithms.
|
| This approach also seems quite unfortunate and makes many of
| these results somewhat doubtful.
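|
| For concreteness, the quoted check amounts to something like the
| following sketch (the blog's actual harness, shapes and tolerance
| may differ; candidate_matmul is a stand-in for a generated kernel):
|
|     import torch
|
|     def candidate_matmul(a, b):
|         return a @ b   # placeholder for the AI-generated kernel
|
|     def empirically_equal(candidate, reference, shapes,
|                           trials=100, atol=1e-2, rtol=1e-2):
|         for _ in range(trials):
|             inputs = [torch.randn(s) for s in shapes]
|             if not torch.allclose(candidate(*inputs),
|                                   reference(*inputs),
|                                   atol=atol, rtol=rtol):
|                 return False
|         return True
|
|     ok = empirically_equal(candidate_matmul, torch.matmul,
|                            shapes=[(256, 256), (256, 256)])
|
| Passing such a test over random inputs is evidence of correctness,
| not a proof; inputs outside the sampled distribution can still
| break the kernel.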
| gotoeleven wrote:
| How else would you do the verification?
| constantcrying wrote:
| See: https://www.nature.com/articles/s41586-022-05172-4
|
| IIRC there was another paper recently, with similar
| methodology about computing xAx. These papers produce
| algorithms which aren't merely empirically correct, but provably
| correct. They do this by operating on a graph data structure,
| which describes the algorithm and then verifying the
| algebraic equality to the correct result.
|
| There is a substantial difference here. And I think utilizing
| algorithms which are only empirically correct can be
| dangerous.
| ekelsen wrote:
| "the reference code is in the default FP32, and given a tolerance
| threshold (1e-02)"
|
| that's a huge tolerance and allows them to use fp16 operations to
| replace the "fp32" kernel.
| unignorant wrote:
| yeah, it seems likely the underlying task here (one reasoning
| step away) was: replace as many fp32 operations as possible in
| this kernel with fp16. i'm not sure exactly how challenging a
| port like that is, but intuitively seems a bit less impressive
|
| maybe this intuition is wrong but would be great for the work
| to address it explicitly if so!
| AlotOfReading wrote:
| Only seems to have done that in a couple places, like the
| MatMul. The softmax kernel
| (https://github.com/ScalingIntelligence/good-
| kernels/blob/mai...) seems to be entirely bog-standard, and
| the layernorm kernels are only slightly more interesting.
| constantcrying wrote:
| This means the results are useless. Did they even check the
| _relative_ error at all?
|
| Replacing float32 operations with float16 is also pointless.
| There is nothing to be gained by doing this, as it removes the
| actual accuracy advantage of float32s, which would be the single
| most important reason to use that version of the algorithm.
| threeducks wrote:
| I ran their matrix multiplication code from GitHub
| (https://github.com/ScalingIntelligence/good-
| kernels/blob/mai...) and got a mean squared error of
| approximately 0.056 for two 4096x4096 matrices containing
| random values between 0 and 1.
|
| I think this error is large enough that referring to it as
| FP32 is misleading.
|
| Also, the performance gains do not translate to my RTX 3060M
| GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it
| lacks the optimized hardware for half precision.
|
| But on the plus side, the single file was very easy to adapt
| and the code is quite readable. I have seen much uglier
| kernels.
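|
| The same kind of measurement can be reproduced in a few lines
| (a sketch against PyTorch's own precision handling rather than
| the generated kernel, so the exact numbers will differ):
|
|     import torch
|
|     n = 4096
|     a = torch.rand(n, n)   # uniform values in [0, 1)
|     b = torch.rand(n, n)
|
|     ref = a.double() @ b.double()   # high-precision reference
|     f32 = (a @ b).double()
|     # simulate feeding fp16-rounded inputs to a "fp32" kernel
|     f16 = (a.half().float() @ b.half().float()).double()
|
|     print("fp32 MSE:      ", ((f32 - ref) ** 2).mean().item())
|     print("fp16-input MSE:", ((f16 - ref) ** 2).mean().item())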
| beyonddream wrote:
| Why do you think it is a huge tolerance ? (Just curious since
| it is not clear to me if that will lead to too much of
| reduction in numerical accuracy compared to the speedup)
| JSR_FDED wrote:
| Could this be used to create kernels for OpenCL, ROCm, etc?
| vessenes wrote:
| By far the most interesting part (after the 400% speed up in some
| cases) is the methodology: rather than hill climb on operations,
| they forced a language reasoning step between iterations to
| encourage diversity of search. This seems to have worked. Very
| very interesting.
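|
| In outline, the loop reads roughly like the sketch below. The
| llm() and compile_and_benchmark() helpers are hypothetical
| stand-ins (in the real pipeline the candidates are evaluated in
| parallel, and the prompts and branching factor differ):
|
|     def optimize_kernel(seed_src, rounds=5, width=4):
|         best = compile_and_benchmark(seed_src)
|         for _ in range(rounds):
|             # 1) reasoning step: ideas in natural language
|             ideas = llm(f"Propose {width} distinct optimization "
|                         f"ideas for this kernel:\n{best.source}")
|             # 2) turn each idea into a concrete candidate kernel
|             cands = []
|             for idea in ideas.splitlines():
|                 if not idea.strip():
|                     continue
|                 src = llm(f"Rewrite the kernel applying: {idea}\n"
|                           f"{best.source}")
|                 cands.append(compile_and_benchmark(src))
|             # 3) keep the fastest candidate that passes checks
|             best = min([best] + [c for c in cands if c.correct],
|                        key=lambda k: k.runtime)
|         return best
|
| Forcing step 1 into prose rather than letting the model patch code
| directly is what pushes the search toward genuinely different
| strategies instead of small local edits.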
| lucidrains wrote:
| oh wow, I was looking for use of islands or MAP-Elites and
| missed this.. thought it was the blandest mimetic evolution
| possible
| vessenes wrote:
| Just anecdotally I feel like hill climbing on operations is
| just so slow; I'm not saying it doesn't work, but it always
| feels one step away from brute force search. I really like
| the idea of just throwing stuff at the LLM and giving it
| access to old strong variants in context.
| david-gpu wrote:
| Disclaimer: This used to be my bread and butter, but I'm _really_
| rusty after five years of not working on this sort of stuff.
|
| That said, after quickly skimming the example AI-generated kernel
| I am not seeing anything novel there. While working at nVidia I
| did see a handful of techniques that, frankly, blew my mind.
|
| Thus, I wonder what makes this AI-generated kernel faster than
| the standard pyTorch kernel, which I presume is simply delegating
| all the heavy lifting onto cuDNN. My guess, and it's just a
| guess, is that they are comparing the fastest AI-generated kernel
| they produced for a very particular set of parameters against
| whatever kernel cuDNN is picking for that same scenario, and
| perhaps the subsystem inside cuDNN that picks which kernel to
| execute out of the very large database it manages chose a
| suboptimal candidate. Researchers tend to completely ignore this
| issue and assume that cuDNN is always able to choose the very
| best kernel in every possible scenario, something that is just
| not realistic.
|
| Maybe there is something else going on, but these sorts of _"we
| have beaten this heavily optimized proprietary library"_ claims
| always seem to miss this very important point.
|
| Kind regards to any NVidia insiders who may read this. You guys
| are the brightest people I've ever met.
| zahlman wrote:
| > Thus, I wonder what makes this AI-generated kernel faster
| than the standard pyTorch kernel
|
| All of this stuff is way outside my wheelhouse, but maybe "the
| standard pyTorch kernel" is just a low bar?
| (https://news.ycombinator.com/item?id=44144346)
| MangoToupe wrote:
| > Our results are benchmarked on an Nvidia L40S
|
| At the very least they could have used consumer hardware. I don't
| even know how to parse that model it's so consumer-alien.
| Maxious wrote:
| It's the AD102 chipset ie. RTX 4090 with perfect binning (like
| the never released 4090 Ti would have had) and 48GB of VRAM
| soldered on https://www.techpowerup.com/gpu-
| specs/?architecture=Ada%20Lo...
| klingenm wrote:
| This sounds more like using AI (llm) as one small step, where the
| randomness in the output is used to implement a Genetic
| Algorithm, than being "AI-generated" (admittedly technically
| correct).
|
| (Edit, typo)
| FL33TW00D wrote:
| Tried a replication here. The LayerNorm kernel is not numerically
| stable so cannot be counted as valid. They only test with zero
| mean and unit std, so the catastrophic cancellation doesn't show
| up until after.
|
| EDIT: looks like they've since generated another one that is
| numerically stable! great work
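|
| The failure mode is easy to reproduce in plain PyTorch (a sketch
| of the usual one-pass-variance pitfall, not the generated kernel
| itself): E[x^2] - E[x]^2 cancels catastrophically once the mean
| is large relative to the spread, which zero-mean, unit-std test
| inputs never exercise.
|
|     import torch
|     import torch.nn.functional as F
|
|     def naive_layernorm(x, eps=1e-5):
|         mean = x.mean(dim=-1, keepdim=True)
|         mean_sq = (x * x).mean(dim=-1, keepdim=True)
|         var = mean_sq - mean * mean      # cancellation-prone
|         return (x - mean) / torch.sqrt(var + eps)
|
|     def max_err(x):
|         ref = F.layer_norm(x, x.shape[-1:])
|         return (naive_layernorm(x) - ref).abs().max().item()
|
|     print(max_err(torch.randn(8, 4096)))          # zero mean: tiny
|     print(max_err(torch.randn(8, 4096) + 100.0))  # shifted: large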
| userbinator wrote:
| Am I the only one who was enticed into this article by thinking
| they had AI generate an OS kernel?
| dgfitz wrote:
| Nope, I was as well.
| miki123211 wrote:
| I think how the authors of this post think about "AI agents" is
| really interesting.
|
| Most people think of agents like they think of human employees.
| They set up a limited number of agents to run in parallel (often
| just one), with each agent running in a loop and doing one task
| at a time. They're still in a world where you have a fixed (on
| the timescale of hours or days) number of employees, each
| employee can only do one thing at a time, and transferring tasks
| between employees is slow and costly.
|
| LLMs don't really work like that. You effectively have an
| infinite number of agents that you can conjure out of thin air at
| any time. There's no cost advantage to performing LLM requests in
| series rather than in parallel.
|
| If you realize this, the pattern of each agent fanning out and
| forking itself into as many sub-agents as are needed to fulfill
| the task becomes obvious. This is exactly what the authors have
| done.
|
| I think a better way to think of agents is as "tasks" or "jobs",
| like those you might find in Celery or Sidekiq, and apply the
| learnings from those.
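|
| In code, the fan-out pattern is tiny (a sketch; call_llm is a
| hypothetical async wrapper around whatever LLM API you use, not a
| specific library call):
|
|     import asyncio
|
|     async def call_llm(prompt: str) -> str:
|         ...  # hypothetical: send prompt to a model, return reply
|
|     async def sub_agent(task: str) -> str:
|         return await call_llm(f"Complete this subtask:\n{task}")
|
|     async def agent(goal: str) -> str:
|         plan = await call_llm(f"List subtasks for:\n{goal}")
|         tasks = [t for t in plan.splitlines() if t.strip()]
|         # fan out: sub-agents run concurrently; cost is per token,
|         # not per "employee", so parallelism is essentially free
|         results = await asyncio.gather(*map(sub_agent, tasks))
|         # join: one call merges the partial results
|         return await call_llm("Combine these:\n" + "\n".join(results))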
| neom wrote:
| For fun last month I decided to see if i could build a fully
| functional business of agents. It's 175 python files
| (employees) built up of internal roles within those files
| (tasks). So what I have is 175 employees who are able to pass
| output around each other, understand the work, complete the
| work, understand where to send the output. The whole system has
| the ability to do around 275 base processes (same as a business
| at > 100MM arr) I started on a Friday afternoon and slept a
| little bit and finished on Monday afternoon. After I had it
| running I sent it to a VC friend to show them and they sent
| back the deck of a startup that is in stealth with $25MM doing
| it the _exact_ same way. With 1 month and a designer and an
| engineer, I could have it mvp functional for anyone to use
| ($40k?). Times are changing. Here is kinda how it looks:
| https://s.h4x.club/9ZuO4XQR / https://s.h4x.club/jkuB8ZED (I've
| evolved it a little since this, and if you're an engineer and
| look at my files and think, this guy is a moron: I know!:))
| literalAardvark wrote:
| Engineers who would judge someone's frontier MVP like that
| are not even worth worrying about.
|
| This is epic work. Would love to see more of it but I guess
| you're gonna take it the startup route since you have
| connections. Best of luck.
| neom wrote:
| Thanks!!! I decided not to build it, that space is already
| too busy, there is a startup with $25MM in stealth, who
| else is in stealth? On top of that, this method will get
| stale very very quickly, foundation model businesses are
| just too hard to work around right now, it's a silly way to
| do business. My magic is I've built a startup from scratch
| to over 400 people and watched what they do, it won't be
| long till that isn't worth much.
| wslh wrote:
| Your message doesn't make it clear what those 175 employees
| can realistically accomplish on their own.
|
| For instance, you might have an SEO expert on the team, but
| that alone won't guarantee top search engine rankings. There
| are countless SEO professionals and tools (human or AI-
| powered), and even having the best one doesn't eliminate the
| underlying challenge: business competition. LLMs, like any
| other tool, don't solve that fundamental problem.
| neom wrote:
| No employees accomplish anything on their own in the real
| world, all employees are part of a team. That's why I
| designed a business strategy and analysis layer (over half
| the system, in fact), with web tools and connections to all
| of the insights systems (like Mixpanel). I built the
| _exact_ same thing I built at DigitalOcean but instead of
| humans I defined them with code; DigitalOcean runs just
| fine, and so does my LLM system. The whole system I built is
| self-learning, insight gathering and refinement.
| Competition is for losers, the best teams win via the best
| insights.
| vasco wrote:
| Why 175? Why not 5 billion employees? Why not 20000
| companies in parallel? Why not simulate 5 earth's worth
| of history and setup a full universe of worlds full of
| startups?
|
| This sounds like those guys in social media that one up
| each other with their bed times and end up saying they
| wake up every day at 2am to meditate and work out
| neom wrote:
| Because that was the scope of the project. When we got to
| 400 employees at DigitalOcean I thought the real need was
| only about half that. Originally I just set out to make the
| marketing and strategy team, but got a bit carried away;
| the FP&A team was the only group I really struggled with,
| my CFO skills are very meh.
| vasco wrote:
| 1 single agent with a good model is going to beat that
| approach every single time. The same way Whatsapp needed
| only 55 people (and probably the last hires were not
| needed for the outcome) to sell for $19B.
|
| And other companies have existed for hundreds of years
| and had thousands of people work for them and never even
| made $100M.
| neom wrote:
| I'm confused what you're saying. There are loads of
| markets, loads of segments, loads of ways to do unit
| economics, yes, but business is business, it's
| prescriptive at its core. I'm using a single model, it's
| just OpenAI calls using the role function.
| yusina wrote:
| > understand the work
|
| LLMs don't _understand_. It's mind-boggling to me that large
| parts of the tech industry think that.
|
| Don't ascribe to them what they don't have. They are
| fantastic at _faking_ understanding. Don't get me wrong, for
| many tasks, that's good enough. But there is a fundamental
| limit to what all this can do. Don't get fooled into
| believing there isn't.
| neom wrote:
| What is the limit my system will reach?
| rzz3 wrote:
| That's an interesting word to pick on. Understanding still
| means something here in a relative sense.
| zenburnmyface wrote:
| meh. I feel this is just a linguistic shortcut, similar to
| how _trained_ biologists can talk about a species or
| organism evolving some trait. Of course the organism isn't
| _really_ evolving with any goal in mind, but that's clear
| to the speaker and audience. Whether or not LLMs understand
| (very unlikely), it's clear what we mean by an LLM
| "understanding": has the context + prior training to make
| reasonable predictions. But no one wants to write that each
| time.
| yusina wrote:
| That's an interesting take and in fact one I could get
| behind.
|
| But I'm afraid that most folks using the term mean it
| more literally than you describe.
| philistine wrote:
| Exactly. The whole point of all the LLM companies is to
| get grandma to use it. If you say _understand_ about a
| technology with the desired appeal of Facebook, then
| you're talking to everyone and words matter extra hard.
| motorest wrote:
| > LLMs don't understand. It's mind-boggling to me that
| large parts of the tech industry think that.
|
| I think you might be tied to a definition of
| "understanding" that doesn't really apply.
|
| If you prompt an LLM with ambiguous instructions, it
| requests you to clarify (i.e., extend the prompt to provide
| more context), and once you do the LLM outputs something
| that exactly meets the goals of the initial prompt, does it
| count as understanding?
|
| If it walks like a duck and quacks like a duck, it's a
| duck, or something so close to a duck that we'd be better
| off calling it that.
| acchow wrote:
| > If you prompt a LLM with ambiguous instructions, it
| requests you to clarify (i.e., extend prompt to provide
| more context)
|
| It does not understand that it needs clarification. This
| behavior is a replicated pattern.
| AlecSchueler wrote:
| What is the difference? What would actual understanding
| look like?
| accCer wrote:
| It depends on which human feedback was used to train the
| model. For humans, there are various communication models
| like the four-sides model. If the dataset has annotations
| for the specific facets of the communication model, then
| an LLM trained on this dataset will have specific
| probabilities that replicate that communication model.
| You may call this understanding what the prompter says,
| but it's just replication for me.
| bandrami wrote:
| The difference comes when it receives novel input
| Spivak wrote:
| So you have two prompts, one is ambiguous and the second
| is the same prompt but with the ambiguity resolved.
|
| In the first prompt the replicated pattern is to ask for
| clarification, in the second prompt the replicated
| pattern is to perform the work. The machine might
| understand nothing but does it matter when it responds
| appropriately to the different cases?
|
| I don't really care whether it understands anything at
| all, I care that the machine behaves as though it did
| have understanding.
| vasco wrote:
| Asking a short question but in a serious way: so what?
| yusina wrote:
| You are asking why it is meaningful to use terms for what
| they mean instead of making up things?
|
| Well, I prefer it that way, but the spirit of "AI" seems
| to go in another direction, and the leadership of US
| government also does, so maybe times are just changing.
| hayst4ck wrote:
| Nearly every argument like this has the same fatal flaw,
| and it's generally not the critique of the AI, but the
| critique reflected back on to humans.
|
| _Humans also don't understand_ and are frequently faking
| understanding, which for many tasks is good enough. There
| are fundamental limits to what _humans_ can do.
|
| The AI of a few months ago before OpenAI's sycophancy was
| quite impressive, less so now which means it is being
| artificially stunted so more can be charged later. It means
| privately it is much better than what is public. I can't
| say it "understands," but I can say it outclasses many many
| humans. There are already numbers of tasks based around
| understanding where I would already choose an LLM over a
| human.
|
| It's worth looking at bloom's taxonomy
| (https://en.wikipedia.org/wiki/Bloom%27s_taxonomy): _In the
| 2001 revised edition of Bloom's taxonomy, the levels were
| renamed and reordered: Remember, Understand, Apply,
| Analyze, Evaluate, and Create._ In my opinion it is at
| least human competitive for everything but create.
|
| I used to be very bearish on AI, but if you haven't had a
| "wow" moment when using one, then I don't think you've
| tried to explore what it can do or tested its limits with
| your own special expertise/domain knowledge, or if you have
| then I'm not sure we're using the same LLMs. Then compare
| that experience to normal people, not your peer groups.
| Compare an LLM to people into astrology, crystal healing,
| or homeopathy and ask which has more "understanding."
| yusina wrote:
| Um, moving the goal post?
|
| The claim was LLMs understand things.
|
| The counter was, nope, they don't. They can fake it well
| though.
|
| Your argument now is, well humans also often fake it.
| Kinda implying that it means it's ok to claim that LLMs
| have understanding?
|
| They may outclass people in a bunch of things. That's
| great! My pocket calculator 20 years ago also did, and it's
| also great. Neither understands what they are doing
| though.
| neom wrote:
| It's fun to talk about, but personally the whole
| "understanding" debate is a red herring, imo what we
| actually care about when we talk about intelligence is
| the capacity and depth of: second order thinking,
| regardless of the underlying mechanism. I think
| personally key question isn't "do LLMs understand?" but,
| "can LLMs engage in second order thinking?" The answer
| seems to be yes - they can reason about reasoning, plan
| their approaches, critique their own outputs, and adapt
| their strategies, o1 has shown us that with RL and
| reasoning tokens you can include it in a single system,
| but our brains have multiple systems we can control and
| that can be combined in multiple ways at any given
| moment: emotions, feelings, thoughts combined into user
| space; 3 core systems: input, memory, output. The nuance
| is that, for various reasons (nature + nurture),
| various humans appear to have varying levels of meta
| control over the multiple reasoning systems.
| perching_aix wrote:
| Why are you pretending to be participating in a debate?
| You mention things like "moving the goalpost",
| "counter[arguments]", and "arguments", as if you did
| anything more than just assert your opinion in the first
| place.
|
| This is what you wrote:
|
| > LLMs don't understand.
|
| That's it. An assertion of opinion with nothing else
| included. I understand it sucks when people feel
| otherwise, but that's just kinda how this goes. And
| before you bring up how there were more sentences in your
| comment, I'd say they are squarely irrelevant, but sure,
| let's review those too:
|
| > It's mind-boggling to me that large parts of the tech
| industry think that.
|
| This is just a personal reporting of your own feelings.
| Zero argumentative value.
|
| > Don't ascribe to them what they don't have.
|
| A call for action, combined with the same assertion of
| opinion as before, just rehashed. Again, zero
| argumentative value.
|
| > They are fantastic at faking understanding.
|
| Opinion, loaded with the previous assertion of opinion.
| No value add.
|
| > Don't get me wrong, for many tasks, that's good enough.
|
| More opinion. Still no arguments or verifiable facts
| presented or referenced. Also a call for action.
|
| > But there is a fundamental limit to what all this can
| do.
|
| Opinion, and a vague one at that. Still nothing.
|
| > Don't get fooled into believing there isn't.
|
| Call for action + assertion of opinion again. Nope, still
| nothing.
|
| It's basically the kind of comment I wish I could just
| have filtered out before it ever reached me. Zero
| substance, maximum emotion. This is no way to discuss
| anything, let alone something you or others likely feel
| strongly about.
| roryirvine wrote:
| I do agree with you - but the big difference is that
| humans-who-are-faking-it tend to learn as they go so
| might, with a bit of effort, be expected to understand
| eventually.
|
| Does that actually matter? Probably not for many everyday
| tasks...
| squidbeak wrote:
| Excellently put.
| bobxmax wrote:
| How do you know?
| yusina wrote:
| Extraordinary claim requires extraordinary proof. I don't
| know, but I'm also not the one claiming something.
|
| (Besides, we _know_ what LLMs do, and none of those
| things indicate understanding. Just statistics.)
| shawabawa3 wrote:
| You can create a new game with new rules never seen
| before
|
| You can explain this to an LLM
|
| The LLM can then play the game following the rules
|
| How can you say it hasn't understood the game?
| yusina wrote:
| The LLM is only capable of doing so if it has encountered
| something similar before as part of its training.
|
| Claiming anything else requires a proof.
| neom wrote:
| the extraordinary claim would be that LLMs can only do
| things they've seen before exactly, given the
| compositional and emergent capabilities we observe. The
| evidence suggests they can generalize beyond their
| training in meaningful ways, even if imperfectly...if a
| human came out living but with a brain that had zero
| electrical activity, that would be extraordinary, we
| normally come out with a baseline of pre-programming. I
| sometimes think this debate happens because humans don't
| want to admit we're nothing more than LLMs programmed by
| nature and nurture; humans seem to want to be especially
| special.
|
| https://arxiv.org/abs/2206.07682
|
| https://towardsdatascience.com/enhanced-large-language-
| model...
|
| https://arxiv.org/abs/2308.00304
|
| (and if MoRA is moving the goal posts, fine: RL/RT)
| yusina wrote:
| >if a human came out living but with a brain that had
| zero electrical activity, that would be extraordinary, we
| normally come out with a baseline of pre-programming.
|
| That statement reveals deep deficiencies in your
| understanding of biological neural networks. "electrical
| activity" is very different from "pre-programming".
| Synapses fire all the time, no matter if meaningfully
| pre-programmed or not. In fact, electrical activity
| decreases over time in a human brain. So, if anything,
| programming over time reduces electrical activity (though
| there is no established causal link).
|
| > I sometimes think this debate happens because humans
| don't want to admit we're nothing more than LLMs
| programmed by nature and nurture, human seem to want to
| be especially special.
|
| It's not specific to humans. But indeed, we don't fully
| understand how brains of humans, apes, pigs, cats and
| other animals really work. We have some idea of synapses,
| but there is still a lot unclear. It's like thinking just
| because an internal combustion engine is made of atoms,
| and we mostly know how atom physics and chemistry work,
| that anybody with this basic knowledge of atom physics
| can understand and even build an ICE. Good luck trying.
| It's similar with a brain. Yes, synapses play a role. But
| that doesn't mean a brain is "nothing more than an LLM".
| neom wrote:
| Neural activity begins around 6 weeks gestation,
| electrical patterns help establish basic neural circuits,
| activity dependent neural development shapes connectivity
| before any sensory input, critical periods where
| electrical activity literally sculpts brain architecture.
| Motor patterns get programmed before birth (why babies
| can suck, grasp, etc.), language processing areas develop
| structural biases before hearing language, visual cortex
| develops orientation maps before seeing anything, basic
| learning algorithms get "wired in" through developmental
| processes. If a human emerged, was able to function in
| the world, do things, but had zero electrical activity in
| the brain, that would be... normal? No: extraordinary.
|
| Humans arrive out of the VJJ with innate neural
| architectures to be filled and developed - not literal
| blank slates, there is an OS. The electrical activity
| during development is literally the biological process
| that creates our "base programming." LLMs have
| architectural inductive biases (attention mechanisms,
| etc.), human brains have evolved architectural biases
| established through fetal development. We're both "pre-
| programmed" systems, just through different mechanisms.
|
| Your response about "electrical activity decreases over
| time" is irrelevant - you weren't talking about adult
| brain activity, you were talking about the developmental
| process that creates our initial neural architecture.
|
| tbh: I can't tell if you're engaging in good faith or
| not.
| GaggiX wrote:
| They understand tho, it's different than how it's done in
| our brain but they solve tasks that would be impossible to
| do without understanding. I would even say that they can
| now reason through problems thanks to powerful reasoning
| models like Gemini 2.5 Pro and o3.
| GoatInGrey wrote:
| I don't believe the user meant "understand" in the
| classical biological and philosophical sense, or were
| otherwise attempting to anthropomorphize the systems. They
| were speaking from the practical experience of "this thing
| takes a somewhat ambiguous input with unique constraints
| and implements the ask more-or-less as intended".
| squidbeak wrote:
| They understand. Anything able to reason about any
| arbitrary request and form a plan tailored to that request
| understands well enough to qualify for the verb. The
| mechanism behind it may feel hollow or fake. But if its
| responses reliably show understanding, the LLM understands
| - by any ordinary measure.
| holoduke wrote:
| The definition of understanding is based on connecting
| relations. If there is one thing an LLM can do, it's
| connecting relations. So I am not sure why you say LLMs are
| not understanding.
| immibis wrote:
| Does this experiment do anything useful or does it just soak
| up investor money? Not that there's anything wrong with the
| latter.
| neom wrote:
| The only investor is me. I built it on my own over a
| weekend. I just wanted to confirm it can be done and
| therefore will exist, that is all. Personally, I decided
| not to pursue it because I am old and lazy and don't want
| to compete against a16z and Sequoia funded Adderall filled
| teenagers.
| immibis wrote:
| I meant the one that investors are paying for.
| acchow wrote:
| > The whole system has the ability to do around 275 base
| processes
|
| It's incredibly easy to get LLMs to do a lot of _stuff_ that
| seems convincing.
|
| They are literally trained for plausibility.
| robbomacrae wrote:
| Is anyone else annoyed that VC's are out there sharing decks
| of startups in stealth with potential competitors? How often
| does this happen?
| eterm wrote:
| I would be annoyed along with you if I thought the post was
| true.
| IncreasePosts wrote:
| It's not a lie, it is just vibe posting
| iammrpayments wrote:
| Sounds really interesting but I have no idea what the
| purpose of having 175 "employees" here is. Maybe it is a smart
| way to sell the idea you're going to replace 175 people if
| you buy the product? Could just buy chatgpt instead I guess,
| but a chatbot doesn't sound as cool as 175 employees.
| neom wrote:
| I would love to know how to do it another way if you have
| any ideas, I'm sadly not experienced or intelligent enough
| to think of another way to do it.
| jprokay13 wrote:
| I've been floating around a similar set of ideas and it's
| been very fun (if not all that useful yet) to build. Did you
| try taking it one step further where a "recruiter" has to
| hire the engineers after a screening process? I wonder if
| this could get you even better AI engineers
| catlifeonmars wrote:
| This really sounds like a "faster horse" scenario and totally
| misses the point of the GPs comment: why shackle yourself to
| modeling the way humans work?
| mucha wrote:
| Cool. What goods/services does your business provide to
| customers?
| londons_explore wrote:
| > forking itself into as many sub-agents as are needed to
| fulfill the task
|
| The forking is free. Running the sub-agents is linear cost, but
| the expensive bit is joining the agents responses back together
| again.
|
| If a task has 6 subtasks and an agent is spawned for each, at
| some point some 'joiner' agent needs to parse and summarize the
| findings of the sub agents and feed it back to the parent. That
| step necessarily involves information loss, and uses
| computation that a single linear agent design would not.
| neom wrote:
| I designed something for a business and found I needed 4
| major sub-systems (like a real business) - insight/data,
| cognition, meta cognition and execution, and if you don't
| define all 4, the system is junk.
| motorest wrote:
| > I designed something for a business and found I needed 4
| major sub-systems (like a real business) - insight/data,
| cognition, meta cognition and execution, and if you don't
| define all 4, the system is junk.
|
| Might it be just another realization of Conway's law?
|
| https://en.wikipedia.org/wiki/Conway%27s_law
|
| Might it be possible that the only reason you're assuming a
| system is junk is just that it doesn't resemble the systems
| you know and expect? There are so many ways to skin a cat,
| and certainly no business process represents the optimal
| process.
| viraptor wrote:
| > They set up a limited number of agents to run in parallel
| (often just one),
|
| Most of what people use agents for daily can often be one-
| shotted though and even collating/rating 10 results would be
| costly.
|
| If I had a harness for evaluating the results and VC level
| money, I'd be throwing an army at well defined experimental
| tasks as well.
| yusina wrote:
| > You effectively have an infinite number of agents
|
| You don't.
|
| Sincerely, Your Electricity Bill
| TimPC wrote:
| The challenge with fan out is constructing a linear
| conversation that makes sense and captures previous history.
| In any context where the LLM needs that information, linear
| loops often perform better than trying to splice together
| conversations from multiple parallel processes.
| kposehn wrote:
| This is similar to something we've been doing for a while.
| Instead of individual agents we are creating many iterations
| and sub-iterations of spawned agents that are largely
| autonomous. A lot of the human-centric paradigms just don't
| really apply to LLMs/AI but people are used to approaching them
| that way.
| bgwalter wrote:
| > They are performing close to or in some cases even beating the
| standard expert-optimized production kernels shipped in PyTorch.
|
| The PyTorch code base is NOT written by performance experts in
| any way. This is the wrong baseline. Nothing about that code
| base is clean or hand optimized.
|
| The "AI" generation methodology seems to give many instructions
| and even descends into instruction trees, manually throwing away
| results etc. So it requires, as usual, extreme guidance.
| poltomo wrote:
| Beating pytorch and tensorflow kernels has been easy to do with
| ml compilers since ~2018. You typically train and evaluate your
| model in one of these frameworks then hand off the computation
| graph to a compiler like Apache TVM or your hardware vendor's
| proprietary one. They should test their kernels against those
| kernels.
|
| ML guided heuristic search over compute schedules is as old as
| 2013 (Halide for image processing)
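|
| As one readily available point of comparison (torch.compile here
| as a stand-in for TVM or a vendor compiler, so only an analogous
| baseline), the hand-off looks like this:
|
|     import time
|     import torch
|
|     def fused_block(x, w1, w2):
|         return torch.relu(x @ w1) @ w2   # gemm + relu + gemm
|
|     compiled = torch.compile(fused_block)
|
|     x, w1, w2 = (torch.randn(1024, 1024) for _ in range(3))
|
|     def bench(fn, iters=20):
|         fn(x, w1, w2)                    # warm-up / compilation
|         t0 = time.perf_counter()
|         for _ in range(iters):
|             fn(x, w1, w2)
|         return (time.perf_counter() - t0) / iters
|
|     print("eager:   ", bench(fused_block))
|     print("compiled:", bench(compiled))
|
| (On a GPU you would also need torch.cuda.synchronize() around the
| timers; this sketch times the CPU path for simplicity.)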
___________________________________________________________________
(page generated 2025-05-31 23:00 UTC)