[HN Gopher] Surprisingly fast AI-generated kernels we didn't mea...
       ___________________________________________________________________
        
       Surprisingly fast AI-generated kernels we didn't mean to publish
       yet
        
       Author : mfiguiere
       Score  : 371 points
       Date   : 2025-05-30 20:03 UTC (1 day ago)
        
 (HTM) web link (crfm.stanford.edu)
 (TXT) w3m dump (crfm.stanford.edu)
        
       | yahoozoo wrote:
       | Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they
       | don't mention which one produced the better kernels.
        
       | reliabilityguy wrote:
       | Is my understanding correct that they assumed a fixed size of the
       | input?
       | 
       | If so, why is it surprising that generic implementations in
       | PyTorch are worse?
        
         | GaggiX wrote:
         | PyTorch uses different kernels depending on the input size.
         | There is a reason why it's so massive to download.
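         | 
         | A rough way to see this for yourself (a sketch; assumes a CUDA
         | GPU, and the exact kernel names you get depend on your build):
         | 
         |     import torch
         |     from torch.profiler import profile, ProfilerActivity
         | 
         |     # profile the same op at two shapes; the CUDA kernels that
         |     # show up in the table are generally not the same ones
         |     for n in (128, 4096):
         |         a = torch.randn(n, n, device="cuda")
         |         b = torch.randn(n, n, device="cuda")
         |         with profile(activities=[ProfilerActivity.CPU,
         |                                  ProfilerActivity.CUDA]) as prof:
         |             torch.mm(a, b)
         |             torch.cuda.synchronize()
         |         print(prof.key_averages().table(
         |             sort_by="cuda_time_total", row_limit=5))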
        
           | reliabilityguy wrote:
           | Sure, some degree of customization is expected. However, I
           | doubt that PyTorch implements _every_ input size separately.
        
       | Workaccount2 wrote:
       | Very fascinating result, and it seems they wrote this blog post
       | out of pure excitement to share their findings, and maybe to have
       | someone throw cold water on it before publishing, ha.
       | 
       | Who knows if this is the actual fabled path of "self
       | improvement", but results like this are what we expect to find on
       | such a path.
        
         | suddenlybananas wrote:
         | > Who knows if this is the actual fabled path of "self
         | improvement"
         | 
         | Seems doubtful as this works only on an extremely well-defined
         | evaluation function.
        
           | EMIRELADERO wrote:
           | That may be true, but this is the first example I've seen
           | where the concept is successfully implemented in a noticeable
           | way.
           | 
           | It's just like image generation: the first iteration is the
           | worst it will ever be.
        
           | observationist wrote:
           | Each time you define another task well enough for the system
           | to work, you generalize the system just a little bit - repeat
           | enough times and you can start to expand, develop taxonomies
           | of functions, precisely define function spaces and metrics
           | for improvement. This might not be a bootstrap for recursive
           | self improvement generally, but it could definitely inform
           | the theory or design of a system that does bootstrap rsi.
        
             | suddenlybananas wrote:
             | That's an entirely different idea that may or may not work.
             | This is not evidence of that.
        
               | observationist wrote:
               | The structure of their research - the process, the
               | specific task, and the data they generate - will help
               | inform how other research gets performed. Instead of GPU
               | kernels, maybe the next task is something like neuron
               | modules, looking for structures that improve on attention
               | blocks, or things like that - each time you run through
               | an experiment like this, you're creating foundational
               | data upon which other experiments can be run and
               | improved. Once you've done enough of them, you can
               | generalize.
               | 
               | It could be that the end result is the knowledge of
               | strict boundaries of LLM capabilities, that they can only
               | operate in specific domains, or only improve to a certain
               | extent, and some currently unspecified defect limits the
               | level of improvement.
               | 
               | The underlying idea of specifying a domain and task
               | conditions, then letting an LLM run thousands of
               | experiments, is a great search technique. The hope is
               | that there is no implicit defect and that the methodology
               | will extend and generalize - it's not too complex a
               | notion to think that you could have an LLM create a broad
               | range of individual tasks, with a meta-goal of
               | identifying better and more general recursive improvement
               | processes and algorithms.
        
               | suddenlybananas wrote:
               | >The hope is that there is no implicit defect and that
               | the methodology will extend and generalize - it's not too
               | complex a notion to think that you could have an LLM
               | create a broad range of individual tasks, with a meta-
               | goal of identifying better and more general recursive
               | improvement processes and algorithms
               | 
               | Again, entirely different idea that doesn't have a
               | straightforward evaluation function. As it stands, this
               | is more akin to genetic programming with a very good
               | mutation function.
        
       | thorum wrote:
       | My takeaway - from this article, from Google's AlphaEvolve [1],
       | and the recent announcement about o3 finding a zero day in the
       | Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular
       | have reached a new level of capability where these ideas that
       | were tried unsuccessfully with other models, suddenly just work.
       | 
       | [1] https://deepmind.google/discover/blog/alphaevolve-a-
       | gemini-p...
       | 
       | [2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-
       | cve-...
        
         | jiggawatts wrote:
         | Gemini Pro 2.5 is the first AI that I can productively use for
         | anything other than human language translation, but it's just
         | _barely_ crossed that threshold. Sometimes I get success hit
         | rates below 20%.
         | 
         | When 3.0 comes out, that... that's going to start getting a
         | little scary.
        
           | jacob019 wrote:
           | What domain?
        
             | jiggawatts wrote:
             | SRE / DevOps / coding mostly in the Azure and .NET
             | ecosystems.
             | 
             | The problems I have to solve tend to be the horrible ones
             | that nobody has answers to, anywhere on the Internet, so
             | unsurprisingly the AIs aren't good at it either.
             | 
             | The trick has been to use the AIs for what they _are_ good
             | at, which used to be "nothing" for me at least, but now
             | I can use them productively for certain "spot" tasks.
             | 
             | Random examples:
             | 
             | - Cross-language and cross-platform benchmarking of a bunch
             | of different database clients to see how they stack up. I
             | gave the AI a working example in one language and got it to
             | whip up a series of equivalents with other DB drivers and
             | languages. Sure, it's trivial, but it's way faster than
             | doing it myself!
             | 
             | - Crash dump analysis using WinDbg. I read somewhere that
             | "vibe debugging" of kernel dumps totally works, so when I
             | had an actual crash I gave it a go for laughs. With AI help
             | I managed to extract the name of the specific file that had
             | NTFS corruption and was crashing the server. Deleted the
             | file, restored it from backups, and the server was good to
             | go again!
             | 
             | - If you ever watch the top mechanical engineers on
             | YouTube, they _all make their own tools_ instead of just
             | buying them. Jigs, extenders, unusual sizes, etc... IT work
             | is the same. As a recent example, I got Gemini to make me a
             | code-AST rewriter for a specific issue I wanted to clean up
             | in bulk across a huge code base. Using the Roslyn compiler
             | SDK is a bit fiddly, but it spat out a working tool for me
             | in under an hour. (This is not something you can solve with
             | a script full of regex, it needed a proper parser to handle
             | commented-out blocks and the like.)
        
               | jacob019 wrote:
               | Sounds like interesting work, thanks for sharing! "Vibe
               | debugging", hah, I like that one. The latest crop of
             | models is definitely unlocking new capabilities, and I
               | totally get the desire to make your own tools. I do that
               | to a fault sometimes, but it's nice to have a simple tool
               | that does exactly one thing, exactly the way you want it.
               | 
               | I've been pair programming with the models for a while,
               | and wrote some "agents" before I knew to call it that
               | back in the dark days of GPT-3.5, but only recently with
               | the latest models unlocking capabilities beyond what I
               | could achieve with handwritten code.
        
               | mholm wrote:
               | > Sure, it's trivial, but it's way faster than doing it
               | myself
               | 
               | That's the clincher for me. So much software work is just
             | executing on a design, not inventing anything new. Being
               | able to do 5x the trivial work in an hour is life
               | changing, and it lets me pull my head out of that work to
               | see how I can make larger process improvements. AI
               | doesn't need to rewrite the linux kernel in Rust to be
               | extremely valuable to the average developer
        
           | manmal wrote:
           | o3 is in my experience often even better, but too slow and
           | too rate limited to use it all the time.
        
         | zozbot234 wrote:
         | Wait, what are you saying? These have nothing to do with the
         | Linux kernel whatsoever, they are "kernels" in the GPU
         | programming sense. Did you just hallucinate this whole comment
         | or what?
        
           | None4U wrote:
           | There was a post on HN a bit ago from someone who used o3 to
           | find a vulnerability in the Linux kernel's SMB server, which
           | this person is just saying should've been tried earlier and
           | probably recently became possible
        
           | thorum wrote:
           | Sorry, I added links! Just a week ago someone built a system
           | that used o3 to find novel zero days in the Linux kernel's
           | SMB implementation.
        
             | stefan_ wrote:
             | There are zero-days in obscure parts of the kernel nobody
             | uses every other day. (It also of course found 100 other
             | things that were not zero-days or vulnerabilities, yet
             | professed they were, which is why this trash, even on
             | Gemini 9000 Pro, keeps spamming security mails.)
        
         | therealpygon wrote:
         | In my opinion, I wouldn't say so much that they are suddenly
         | working. Rather, we've reached a point where they can iterate
         | and test significantly faster than humans are capable of, and
         | can call on significantly more immediately available
         | information that they can make sense of. As a result, the
         | combination of information, advancement, and intelligently
         | applied brute force seems to be having success in certain
         | applications.
        
           | thorum wrote:
           | Good points. I suspect that o3 is able to reason more deeply
           | about different paths through a codebase than earlier models,
           | though, which might make it better at this kind of work in
           | particular.
        
             | therealpygon wrote:
             | Very likely. Larger context is significantly beneficial to
             | the LLMs when they can maintain attention, which was part
             | of my point. Imagine being able to hold the word-for-word
             | text of your required reading book while you are taking a
             | test, while older models could hold more like a couple of
             | chapters' worth of text. _Two_ years ago.
        
             | westoncb wrote:
             | I was blown away by some debugging results I got from o3
             | early on and have been using it heavily since. The early
             | results that caught my attention were from a couple cases
             | where it tracked down some problematic cause through
             | several indirect layers of effects in a way where you'd
             | typically be tediously tracing step-by-step through a
             | debugger. I think whatever's behind this capability has
             | some overlap with really solid work it'll do in abstract
             | system design, particularly in having it think through
             | distant implications of design choices.
        
               | notyourwork wrote:
               | I'm interested in learning more about how you use o3 for
               | debugging.
        
               | westoncb wrote:
               | The main trick is in how you build up its context for
               | the problem. What I do is think of it like a colleague
               | I'm trying to explain the bug to: the overall structure
               | is conversational, but I interleave both relevant source
               | chunks and _detailed_ / _complete_ observational info
               | from what I've observed about anomalous program
               | behavior. I typically will send a first message building
               | up context about the program/source, and then build up
               | the narrative context for the particular bug in a second
               | message. This sets it up with basically perfect context
               | to infer the problem, and sets you up for easy reuse: you
               | can back up, clear that second message and ask something
               | else, reusing the detailed program context given by the
               | first message.
               | 
               | Using it on the architectural side you can follow a
               | similar procedure but instead of describing a bug you're
               | describing architectural revisions you've gone through,
               | what your experience with each was, what your objectives
               | with a potential refactor are, where your thinking's at
               | as far as candidate reformulations, and so on. Then
               | finish with a question that doesn't overly constrain the
               | model; you might retry from that conversation/context
               | point with a few variants, e.g.: "what are your thoughts
               | on all this?" or "can you think of better primitives to
               | express the system through?"
               | 
               | I think there are two key points to doing this
               | effectively:
               | 
               | 1) Give it full, detailed context with nothing
               | superfluous, and express it within the narrative of your
               | real world situation.
               | 
               | 2) Be careful not to "over-prescribe" what it says back
               | to you. They are very "genie-like" where it'll often give
               | exactly what you ask for in a rather literal sense, in
               | incredibly dumb-seeming ways if you're not careful.
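               | 
               | For concreteness, a minimal sketch of that two-message
               | setup with the OpenAI Python client (the client usage
               | and model name here are illustrative, not prescriptive):
               | 
               |     from openai import OpenAI
               | 
               |     client = OpenAI()
               | 
               |     # message 1: program/source context (reused)
               |     ctx = "What the program does, key source chunks..."
               |     # message 2: detailed narrative of the bug
               |     bug = "Exact observations of the anomalous behavior..."
               | 
               |     def ask(question):
               |         resp = client.chat.completions.create(
               |             model="o3",  # illustrative
               |             messages=[
               |                 {"role": "user", "content": ctx},
               |                 {"role": "user", "content": bug},
               |                 {"role": "user", "content": question},
               |             ],
               |         )
               |         return resp.choices[0].message.content
               | 
               |     # swap out `bug` and re-ask to reuse the same program
               |     # context for a different question
               |     print(ask("What is the most likely root cause?"))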
        
             | MangoToupe wrote:
             | In the context of LLMs, what do you mean by "reason"? What
             | does reasoning look like in LLMs and how do you recognize
             | it, and more importantly, how do you invoke it? I haven't
             | had much success in getting LLMs to solve, well, basically
             | any problem that involves logic.
             | 
             | Chain of thought at least introduces some skepticism, but
             | that's not exactly reasoning. It makes me wonder what
             | people refer to when they say "reason".
        
               | suddenlybananas wrote:
               | People think an approximation of a thing is the thing.
        
               | therealpygon wrote:
               | As best as I have understood, the LLM's output is directly
               | related to the state of the network as a result of the
               | context. Thinking is the way we use intermediate
               | predictions to help steer the network toward what is
               | expected to be a better result through learned patterns.
               | Reasoning is a set of strategies for shaping that process
               | to produce even more accurate output, generally having a
               | cumulative effect on the accuracy of predictions.
        
               | MangoToupe wrote:
               | > Reasoning are strategies for shaping that process to
               | produce even more accurate output
               | 
               | How can it evaluate accuracy if it can't even detect
               | contradictions reliably?
        
               | therealpygon wrote:
               | It doesn't? Reasoning is not an analysis; it is the
               | application of learned patterns for a given set of
               | parameters that results in higher accuracy.
               | 
               | Permit my likely inaccurate illustration: You're pretty
               | sure 2 + 2 is 4, but there are several questions you
               | could ask: are any of the numbers negative, are they
               | decimals, were any numbers left out? Most of those
               | questions are things you've learned to ask automatically,
               | without thinking about it, because you know they're
               | important. But because the answer matters, you check your
               | work by writing out the equation. Then, maybe you verify
               | it with more math; 4 / 2 = 2. Now you're more confident
               | the answer is right.
               | 
               | An LLM doesn't understand math per se. If you type "2 + 2
               | =", the model isn't doing math... it's predicting that
               | "4" is the next most likely token based on patterns in
               | its training data.
               | 
               | "Thinking" in an LLM is like the model shifting mode and
               | it starts generating a list of question-and-answer pairs.
               | These are again the next most likely tokens based on the
               | whole context so far. "Reasoning" is above that: a
               | controlling pattern that steers those question-and-answer
               | sequences, injecting logic to help guide the model toward
               | a hopefully more correct next token.
        
         | geraneum wrote:
         | It's true that there are similarities between what you
         | mentioned and what's happening in this case. From the article:
         | 
         | > The result is a test-time loop that looks less like "chat
         | with a compiler" in the case of sequential revision, and more
         | like structured exploratory search, guided by explicit
         | optimization hypotheses and aggressively parallel evaluation.
         | 
         | My conclusion would be that we've now learned to apply LLMs'
         | capabilities to shrink the solution space where we have a clear
         | evaluation function, as well as existing solutions to problems
         | that follow similar patterns. This applies in this case as well.
         | 
         | IMO, it's not about model X gaining on other models, or model Y
         | being able to reason about the solutions, etc., in a way that
         | other models couldn't.
        
         | MangoToupe wrote:
         | Interesting. Do you have stronger evidence to support your
         | claim? A sample size of one is pretty unconvincing.
        
       | brrrrrm wrote:
       | what's going to be interesting is to see the large space of fused
       | kernels being tackled by AI generated code. that might include
       | gemm + relu + gemm + a norm of some kind - which would be
       | annoyingly exhaustive to 1. sweep with a tuner and 2. handwrite
       | as a human
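       | 
       | for reference, that op chain written out in plain PyTorch (a
       | sketch, not from TFA) - today this runs as several separate
       | kernels, which is exactly what a fusion search would try to
       | collapse into one:
       | 
       |     import torch
       |     import torch.nn.functional as F
       | 
       |     def gemm_relu_gemm_norm(x, w1, w2):
       |         h = F.relu(x @ w1)                     # gemm + relu
       |         y = h @ w2                             # gemm
       |         return F.layer_norm(y, y.shape[-1:])   # a norm of some kind
       | 
       |     x = torch.randn(64, 512)
       |     out = gemm_relu_gemm_norm(x, torch.randn(512, 1024),
       |                               torch.randn(1024, 512))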
        
         | AtlasBarfed wrote:
         | Uh, what is a "kernel" in the sense of AI? Because it sure
         | looks like this isn't an OS kernel.
        
           | philipkglass wrote:
           | This is GPU terminology:
           | 
           | https://cvw.cac.cornell.edu/gpu-architecture/gpu-
           | characteris...
           | 
           |  _A function that is meant to be executed in parallel on an
           | attached GPU is called a kernel. In CUDA, a kernel is usually
           | identified by the presence of the __global__ specifier in
           | front of an otherwise normal-looking C++ function
           | declaration._
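           | 
           | A minimal illustration of the concept, sketched here in
           | Triton's Python DSL rather than CUDA C++ (assumes a CUDA GPU):
           | the decorated function below is the kernel, and its body runs
           | in parallel across many GPU program instances.
           | 
           |     import torch
           |     import triton
           |     import triton.language as tl
           | 
           |     @triton.jit
           |     def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
           |         pid = tl.program_id(axis=0)
           |         offs = pid * BLOCK + tl.arange(0, BLOCK)
           |         mask = offs < n
           |         x = tl.load(x_ptr + offs, mask=mask)
           |         y = tl.load(y_ptr + offs, mask=mask)
           |         tl.store(out_ptr + offs, x + y, mask=mask)
           | 
           |     x = torch.randn(10_000, device="cuda")
           |     y = torch.randn(10_000, device="cuda")
           |     out = torch.empty_like(x)
           |     grid = (triton.cdiv(x.numel(), 1024),)
           |     add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)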
        
       | ekelsen wrote:
       | "FP32 is less common in modern ML workloads and often less
       | optimized on recent hardware compared to FP16 or BF16, which may
       | partly explain why it's easier to achieve performance gains over
       | PyTorch with FP32 kernels."
       | 
       | People haven't spent time optimizing the fp32 versions of these
       | kernels in years. This will be much more interesting if they can
       | improve the kernels where developer effort has gone and that are
       | actually used.
        
         | suddenlybananas wrote:
         | I wonder if it's using known improvements from the fp16/bf16
         | kernels that are transferable to fp32?
        
         | moralestapia wrote:
         | >People haven't spent time optimizing the fp32 versions of
         | these kernels in years.
         | 
         | Wow, so, you're basically saying the AI created new algos in a
         | domain with no pre-existing solutions? Awesome!
        
           | Aurornis wrote:
           | No one said the AI created new algorithms nor that there
           | weren't pre-existing solutions.
           | 
           | The implication was that the FP32 versions of these kernels
           | have lagged behind the more popular versions. There was
           | opportunity to translate the advancements from other kernels
           | into these. Someone would need to look closely to see exactly
           | what was done, but it's premature to suggest anything like
           | "new algos" or "no pre-existing solutions"
           | 
           | This is a great use case for LLMs, though. I often do
           | something similar where I make improvements to something I
           | use most frequently and ask an LLM to translate that pattern
           | to other similar parts of the code.
        
             | moralestapia wrote:
             | >The implication was that the FP32 versions of these
             | kernels have lagged behind the more popular versions.
             | 
             | Help me understand this 'cause I'm a bit slow these days
             | ...
             | 
             | Does that mean optimized FP32 versions of these kernels
             | were already there or not?
        
               | almostgotcaught wrote:
               | > Help me understand this 'cause I'm a bit slow these
               | days ...
               | 
               | If I do `sed 's/f32/f16/g' kernel.cu` does this count as
               | AI? Help me understand because I'm a little slow when it
               | comes to all the dumb shit people attribute to LLMs these
               | days...
        
               | moralestapia wrote:
               | Indeed, you're slow on these news.
               | 
               | >sed 's/f32/f16/g' kernel.cu
               | 
               | This is not what's happening here, it's a completely
               | different thing, read TFA.
        
               | imtringued wrote:
               | You are a blatant troll and you know that.
        
               | Dylan16807 wrote:
               | > Does that mean optimized FP32 versions of these kernels
               | were already there or not?
               | 
               | If you're trying to support your original point with that
               | argument, then you're using some pretty awful definitions
               | of the terms "new algos" and "no pre-existing solutions".
        
           | uoaei wrote:
           | The hype cycle in action, folks. Pay heed.
        
           | vlovich123 wrote:
           | The solution not existing in PyTorch does not mean the
           | solution doesn't exist elsewhere on the internet. Remember -
           | PyTorch is largely maintained by employees of companies that
           | have their own priorities for the SW and those priorities may
           | not include hyper optimizing fp32 kernels.
           | 
           | That being said, it is cool if AI is enabling lower cost
           | adoption of better more optimized kernels with less effort.
        
           | imtringued wrote:
           | Read the article before spouting lies. Actually never mind
           | that.
           | 
           | Read the damn comment you're responding to. There have been
           | human written kernels for both fp16 and fp32 for a long time.
           | 
           | Here is the corrected version of your comment:
           | 
           | "Wow, so, you're basically saying the AI created the same but
           | faster algos in a well known domain with established pre-
           | existing solutions, whose overall impact on the runtime of
           | practical workloads is insignificant? Awesome!"
        
         | adrian_b wrote:
         | I believe that these good results are explained at least in
         | part by the fact that NVIDIA does not provide detailed enough
         | documentation for their GPUs.
         | 
         | For a processor with well-documented microarchitecture, for
         | which a programmer or a compiler can deterministically write an
         | optimal program, it is much less likely that applying ML/AI can
         | be successful, except as a substitute for searching already
         | known solutions.
         | 
         | On the other hand, for less documented microarchitectures, like
         | that of the NVIDIA GPUs, finding an optimal program may be
         | impossible other than by doing a random search guided by
         | examples of previous optimized programs, and possibly doing
         | some reverse-engineering work to determine the real behavior of
         | the GPU in some circumstances.
         | 
         | Improving over something like this is likely to be feasible for
         | ML/AI, where training over known good programs may be able to
         | extract some of the undocumented behavior that may be non-
         | obvious for humans reading those examples.
        
           | fulafel wrote:
           | Even with full information, we generally (or practically)
           | aren't able to write optimal programs.
        
           | pca006132 wrote:
           | While it is decidable, people typically never produce optimal
           | programs even for the hot path. It is just intractable and
           | too slow to do right now.
           | 
           | For register allocation and instruction selection, there is
           | hope because they are fixed-parameter tractable (FPT) and there
           | are algorithms to solve them optimally, albeit with a large
           | constant factor, making them impractical to apply in compilers
           | as of today. For instruction scheduling, it is just too hard. If
           | you read literature on scheduling algorithms, it is NP-hard
           | for apparently simple instances, e.g., 2 parallel identical
           | machines with no preemption and bounding completion time
           | (https://www2.informatik.uni-osnabrueck.de/knust/class/),
           | while actual microarchitecture is much more complicated than
           | this...
           | 
           | Needless to say, these are already the simpler problems. The
           | longer the program or the more profiling data you can
           | optimize for, the more tricks you can throw at it, and most
           | of them are NP-hard to optimize optimally.
           | 
           | Being NP-hard doesn't imply that you can't obtain the optimal
           | result, but compilers that I know of do not implement them,
           | because most users are not willing to wait for days for such
           | a compilation to complete. Ideally, one should make something
           | that can run on clusters of CPUs or GPUs to optimize this,
           | and people having those clusters will typically be willing to
           | do this because they want to optimize the program they later
           | run on the clusters. However, to my knowledge, no one is
           | working on this at the moment.
        
           | david-gpu wrote:
           | _> For a processor with well-documented microarchitecture,
           | for which a programmer or a compiler can deterministically
           | write an optimal program_
           | 
           | You severely underestimate the landscape of possible
           | implementations for these kernels. There are _many_ ways of
           | performing a matrix multiplication and predicting which one
           | will perform best without running them all is nontrivial,
           | even with perfect knowledge of the underlying system.
           | 
           | This is just a completely incorrect take, speaking as a
           | former insider.
        
           | mjlee wrote:
           | > For a processor with well-documented microarchitecture, for
           | which a programmer or a compiler can deterministically write
           | an optimal program
           | 
           | We don't even know the optimal algorithms! AlphaEvolve
           | recently found "an algorithm to multiply 4x4 complex-valued
           | matrices using 48 scalar multiplications, improving upon
           | Strassen's 1969 algorithm that was previously known as the
           | best in this setting." -
           | https://www.nature.com/articles/s41586-022-05172-4
        
             | hmry wrote:
             | For those who don't want to read the article: The previous
             | best was 49 scalar multiplications.
        
               | mattkrause wrote:
               | And I believe some humans subsequently knocked it down to
               | 47?
        
       | adityamwagh wrote:
       | Sometimes I think of LLMs as kind of a hive mind. They're trained
       | on the thought processes of so many humans. I think that's why
       | they're able to do these kinds of things, given how much
       | information and context they have compressed in their weights.
        
         | MangoToupe wrote:
         | The market itself is also kind of a hive-mind metaphor. Worth
         | thinking about.
        
           | suddenlybananas wrote:
           | Maybe we could replace it with central planning now that we
           | can distill information.
        
             | MangoToupe wrote:
             | Whoops you just did a communism
        
               | gpm wrote:
               | A "vertical integration" in the capitalist world ;)
        
               | MangoToupe wrote:
               | This got a legitimate chortle out of me
        
               | yieldcrv wrote:
               | a non-human standing committee following the directives
               | of a trust could work
        
               | MangoToupe wrote:
               | What like you want to govern by divining patterns of
               | snake coils or bird guts?
        
       | constantcrying wrote:
       | >and test for correctness by checking the numerical equality of
       | the two outputs over many random inputs.
       | 
       | This is fundamentally different to how any human would approach
       | this problem. And also different to how some recent advances in
       | this area were made, where AI actually came up with superior and
       | correct algorithms.
       | 
       | This approach also seems quite unfortunate and makes many of
       | these results somewhat doubtful.
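       | 
       | For clarity, the check being described amounts to roughly this
       | sketch (`candidate` and `reference` are placeholders); equality is
       | only ever sampled, never proven:
       | 
       |     import torch
       | 
       |     def empirically_equal(candidate, reference,
       |                           trials=100, tol=1e-2):
       |         # passing says nothing about inputs never sampled
       |         for _ in range(trials):
       |             x = torch.randn(256, 4096, device="cuda")
       |             if not torch.allclose(candidate(x), reference(x),
       |                                   atol=tol, rtol=tol):
       |                 return False
       |         return True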
        
         | gotoeleven wrote:
         | How else would you do the verification?
        
           | constantcrying wrote:
           | See: https://www.nature.com/articles/s41586-022-05172-4
           | 
           | IIRC there was another paper recently, with similar
           | methodology, about computing xAx. These papers produce
           | algorithms which aren't merely empirically correct, but provably
           | correct. They do this by operating on a graph data structure,
           | which describes the algorithm and then verifying the
           | algebraic equality to the correct result.
           | 
           | There is a substantial difference here. And I think utilizing
           | algorithms which are only empirically correct can be
           | dangerous.
        
       | ekelsen wrote:
       | "the reference code is in the default FP32, and given a tolerance
       | threshold (1e-02)"
       | 
       | that's a huge tolerance and allows them to use fp16 operations to
       | replace the "fp32" kernel.
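       | 
       | Back-of-the-envelope sketch of why: fp16 keeps roughly three
       | decimal digits, so even a plain round-trip through fp16 sits
       | comfortably inside an absolute 1e-02 threshold (exact numbers
       | depend on the data):
       | 
       |     import torch
       | 
       |     x = torch.rand(1_000_000)                 # values in [0, 1)
       |     err = (x.half().float() - x).abs().max()  # fp16 round trip
       |     print(err.item())                         # on the order of 1e-4
       |     assert err < 1e-2                         # well within tolerance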
        
         | unignorant wrote:
         | yeah, it seems likely the underlying task here (one reasoning
         | step away) was: replace as many fp32 operations as possible in
         | this kernel with fp16. i'm not sure exactly how challenging a
         | port like that is, but intuitively it seems a bit less impressive
         | 
         | maybe this intuition is wrong, but it would be great for the work
         | to address it explicitly if so!
        
           | AlotOfReading wrote:
           | Only seems to have done that in a couple places, like the
           | MatMul. The softmax kernel
           | (https://github.com/ScalingIntelligence/good-
           | kernels/blob/mai...) seems to be entirely bog-standard, and
           | the layernorm kernels are only slightly more interesting.
        
         | constantcrying wrote:
         | This means the results are useless. Did they even check the
         | _relative_ error at all?
         | 
         | Replacing float32 operations with float16 is also pointless.
         | There is nothing to be gained by doing this, as it removes the
         | actual accuracy advantage of float32s, which would be the single
         | most important reason to use that version of the algorithm.
        
           | threeducks wrote:
           | I ran their matrix multiplication code from GitHub
           | (https://github.com/ScalingIntelligence/good-
           | kernels/blob/mai...) and got a mean squared error of
           | approximately 0.056 for two 4096x4096 matrices containing
           | random values between 0 and 1.
           | 
           | I think this error is large enough that referring to it as
           | FP32 is misleading.
           | 
           | Also, the performance gains do not translate to my RTX 3060M
           | GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it
           | lacks the optimized hardware for half precision.
           | 
           | But on the plus side, the single file was very easy to adapt
           | and the code is quite readable. I have seen much uglier
           | kernels.
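           | 
           | For anyone who wants to check a kernel the same way, the
           | measurement is roughly this (a sketch; `custom_matmul` is a
           | hypothetical wrapper around the generated kernel):
           | 
           |     import torch
           | 
           |     a = torch.rand(4096, 4096, device="cuda")
           |     b = torch.rand(4096, 4096, device="cuda")
           |     ref = (a.double() @ b.double()).float()  # fp64 reference
           |     out = custom_matmul(a, b)  # hypothetical wrapper
           |     print(((out - ref) ** 2).mean().item())  # MSE of the kernel
           |     print(((a @ b - ref) ** 2).mean().item())  # PyTorch baseline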
        
         | beyonddream wrote:
         | Why do you think it is a huge tolerance? (Just curious, since
         | it is not clear to me if that will lead to too much of a
         | reduction in numerical accuracy compared to the speedup.)
        
       | JSR_FDED wrote:
       | Could this be used to create kernels for OpenCL, ROCm, etc?
        
       | vessenes wrote:
       | By far the most interesting part (after the 400% speedup in some
       | cases) is the methodology: rather than hill climb on operations,
       | they forced a language reasoning step between iterations to
       | encourage diversity of search. This seems to have worked. Very
       | very interesting.
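       | 
       | As I read it, each round looks roughly like this (a sketch; the
       | function names are mine, not the authors'):
       | 
       |     best = seed_kernel                # hypothetical starting point
       |     for _ in range(num_rounds):
       |         # natural-language reasoning step first, then branch widely
       |         ideas = [propose_idea(best) for _ in range(branch_factor)]
       |         candidates = [write_kernel(best, idea) for idea in ideas]
       |         valid = [k for k in candidates if is_correct(k)]
       |         if valid:
       |             fastest = max(valid, key=speedup)  # parallel evaluation
       |             if speedup(fastest) > speedup(best):
       |                 best = fastest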
        
         | lucidrains wrote:
         | oh wow, I was so busy looking for use of islands or MAP-Elites
         | that I missed this.. thought it was the blandest memetic
         | evolution possible
        
           | vessenes wrote:
           | Just anecdotally I feel like hill climbing on operations is
           | just so slow; I'm not saying it doesn't work, but it always
           | feels one step away from brute force search. I really like
           | the idea of just throwing stuff at the LLM and giving it
           | access to old strong variants in context.
        
       | david-gpu wrote:
       | Disclaimer: This used to be my bread and butter, but I'm _really_
       | rusty after five years of not working on this sort of stuff.
       | 
       | That said, after quickly skimming the example AI-generated kernel
       | I am not seeing anything novel there. While working at nVidia I
       | did see a handful of techniques that, frankly, blew my mind.
       | 
       | Thus, I wonder what makes this AI-generated kernel faster than
       | the standard pyTorch kernel, which I presume is simply delegating
       | all the heavy lifting onto cuDNN. My guess, and it's just a
       | guess, is that they are comparing the fastest AI-generated kernel
       | they produced for a very particular set of parameters against
       | whatever kernel cuDNN is picking for that same scenario, and
       | perhaps the subsystem inside cuDNN that picks which kernel to
       | execute out of the very large database it manages chose a
       | suboptimal candidate. Researchers tend to completely ignore this
       | issue and assume that cuDNN is always able to choose the very
       | best kernel in every possible scenario, something that is just
       | not realistic.
       | 
       | Maybe there is something else going on, but these sorts of _"we
       | have beaten this heavily optimized proprietary library"_ claims
       | always seem to miss this very important point.
       | 
       | Kind regards to any NVidia insiders who may read this. You guys
       | are the brightest people I've ever met.
        
         | zahlman wrote:
         | > Thus, I wonder what makes this AI-generated kernel faster
         | than the standard pyTorch kernel
         | 
         | All of this stuff is way outside my wheelhouse, but maybe "the
         | standard pyTorch kernel" is just a low bar?
         | (https://news.ycombinator.com/item?id=44144346)
        
       | MangoToupe wrote:
       | > Our results are benchmarked on an Nvidia L40S
       | 
       | At the very least they could have used consumer hardware. I don't
       | even know how to parse that model, it's so consumer-alien.
        
         | Maxious wrote:
         | It's the AD102 chipset, i.e. an RTX 4090 with perfect binning
         | (like the never-released 4090 Ti would have had) and 48GB of VRAM
         | soldered on https://www.techpowerup.com/gpu-
         | specs/?architecture=Ada%20Lo...
        
       | klingenm wrote:
       | This sounds more like using AI (an LLM) as one small step, where
       | the randomness in the output is used to implement a genetic
       | algorithm, than being "AI-generated" (admittedly technically
       | correct).
       | 
       | (Edit, typo)
        
       | FL33TW00D wrote:
       | Tried a replication here. The LayerNorm kernel is not numerically
       | stable, so it cannot be counted as valid. They only test with
       | zero mean and unit std, so the catastrophic cancellation doesn't
       | show up until afterwards.
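       | 
       | To illustrate the failure mode (a sketch): with a non-zero mean,
       | the one-pass variance E[x^2] - E[x]^2 cancels catastrophically in
       | fp32, while subtracting the mean first stays accurate:
       | 
       |     import torch
       | 
       |     x = torch.randn(4096) + 1000.0          # non-zero mean input
       |     naive = (x * x).mean() - x.mean() ** 2  # one-pass formula
       |     stable = ((x - x.mean()) ** 2).mean()   # subtract mean first
       |     # true variance is ~1.0; naive is visibly off and can even
       |     # go negative for larger offsets
       |     print(naive.item(), stable.item())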
       | 
       | EDIT: looks like they've since generated another one that is
       | numerically stable! great work
        
       | userbinator wrote:
       | Am I the only one who was enticed into this article by thinking
       | they had AI generate an OS kernel?
        
         | dgfitz wrote:
         | Nope, I was as well.
        
       | miki123211 wrote:
       | I think how the authors of this post think about "AI agents" is
       | really interesting.
       | 
       | Most people think of agents like they think of human employees.
       | They set up a limited number of agents to run in parallel (often
       | just one), with each agent running in a loop and doing one task
       | at a time. They're still in a world where you have a fixed (on
       | the timescale of hours or days) number of employees, each
       | employee can only do one thing at a time, and transferring tasks
       | between employees is slow and costly.
       | 
       | LLMs don't really work like that. You effectively have an
       | infinite number of agents that you can conjure out of thin air at
       | any time. There's no cost advantage to performing LLM requests in
       | series rather than in parallel.
       | 
       | If you realize this, the pattern of each agent fanning out and
       | forking itself into as many sub-agents as are needed to fulfill
       | the task becomes obvious. This is exactly what the authors have
       | done.
       | 
       | I think a better way to think of agents is as "tasks" or "jobs",
       | like those you might find in Celery or Sidekiq, and apply the
       | learnings from those.
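       | 
       | In code, the fan-out looks something like this (a sketch;
       | `call_llm` stands in for whatever client you actually use):
       | 
       |     import asyncio
       | 
       |     async def call_llm(prompt: str) -> str:
       |         await asyncio.sleep(1)   # placeholder for a real API call
       |         return f"result for: {prompt}"
       | 
       |     async def solve(task: str) -> list[str]:
       |         # conjure one "agent" per subtask; they all run
       |         # concurrently, so 32 requests take about as long as one
       |         subtasks = [f"{task} (variant {i})" for i in range(32)]
       |         return await asyncio.gather(*(call_llm(s) for s in subtasks))
       | 
       |     print(asyncio.run(solve("optimize this kernel")))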
        
         | neom wrote:
         | For fun last month I decided to see if I could build a fully
         | functional business of agents. It's 175 Python files
         | (employees) built up of internal roles within those files
         | (tasks). So what I have is 175 employees who are able to pass
         | output around each other, understand the work, complete the
         | work, understand where to send the output. The whole system has
         | the ability to do around 275 base processes (same as a business
         | at >100MM ARR). I started on a Friday afternoon and slept a
         | little bit and finished on Monday afternoon. After I had it
         | running I sent it to a VC friend to show them and they sent
         | back the deck of a startup that is in stealth with $25MM doing
         | it the _exact_ same way. With 1 month and a designer and an
         | engineer, I could have it MVP-functional for anyone to use
         | ($40k?). Times are changing. Here is kinda how it looks:
         | https://s.h4x.club/9ZuO4XQR / https://s.h4x.club/jkuB8ZED (I've
         | evolved it a little since this, and if you're an engineer and
         | look at my files and think, this guy is a moron: I know!:))
        
           | literalAardvark wrote:
           | Engineers who would judge someone's frontier MVP like that
           | are not even worth worrying about.
           | 
           | This is epic work. Would love to see more of it but I guess
           | you're gonna take it the startup route since you have
           | connections. Best of luck.
        
             | neom wrote:
             | Thanks!!! I decided not to build it, that space is already
             | too busy, there is a startup with $25MM in stealth, who
             | else is in stealth? On top of that, this method will get
             | stale very very quickly, foundation model businesses are
             | just too hard to work around right now, it's a silly way to
             | do business. My magic is I've built a startup from scratch
             | to over 400 people and watched what they do, it won't be
             | long till that isn't worth much.
        
           | wslh wrote:
           | Your message doesn't make it clear what those 175 employees
           | can realistically accomplish on their own.
           | 
           | For instance, you might have an SEO expert on the team, but
           | that alone won't guarantee top search engine rankings. There
           | are countless SEO professionals and tools (human or AI-
           | powered), and even having the best one doesn't eliminate the
           | underlying challenge: business competition. LLMs, like any
           | other tool, don't solve that fundamental problem.
        
             | neom wrote:
             | No employees accomplish anything on their own in the real
             | world, all employees are part of a team. That's why I
             | designed a business strategy and analysis layer (over half
             | the system, in fact), with web tools and connections to all
             | of the insights systems (like Mixpanel). I built the
             | _exact_ same thing I built at DigitalOcean, but instead of
             | humans I defined them with code; DigitalOcean runs just
             | fine, and so does my LLM system. The whole system I built is
             | self-learning, insight gathering and refinement.
             | Competition is for losers, the best teams win via the best
             | insights.
        
               | vasco wrote:
               | Why 175? Why not 5 billion employees? Why not 20000
               | companies in parallel? Why not simulate 5 earth's worth
               | of history and setup a full universe of worlds full of
               | startups?
               | 
               | This sounds like those guys in social media that one up
               | each other with their bed times and end up saying they
               | wake up every day at 2am to meditate and work out
        
               | neom wrote:
               | Because that was the scope of the project. When we got to
               | 400 employees at DigitalOcean I noticed I thought it was
               | really half that. Originally I just set out to make the
               | marketing and strategy team, but got a bit carried away;
               | the FP&A team was the only group I really struggled with;
               | my CFO skills are very meh.
        
               | vasco wrote:
               | 1 single agent with a good model is going to beat that
               | approach every single time. The same way WhatsApp needed
               | only 55 people (and probably the last hires were not
               | needed for the outcome) to sell for $19B.
               | 
               | And other companies have existed for hundreds of years
               | and had thousands of people work for them and never even
               | made $100M.
        
               | neom wrote:
               | I'm confused about what you're saying. There are loads
               | of markets, loads of segments, loads of ways to do unit
               | economics, yes, but business is business, it's
               | prescriptive at its core. I'm using a single model, it's
               | just OpenAI calls using the role function.
        
           | yusina wrote:
           | > understand the work
           | 
           | LLMs don't _understand_. It's mind-boggling to me that large
           | parts of the tech industry think that.
           | 
           | Don't ascribe to them what they don't have. They are
           | fantastic at _faking_ understanding. Don't get me wrong, for
           | many tasks, that's good enough. But there is a fundamental
           | limit to what all this can do. Don't get fooled into
           | believing there isn't.
        
             | neom wrote:
             | What is the limit my system will reach?
        
             | rzz3 wrote:
             | That's an interesting word to pick on. Understanding still
             | means something here in a relative sense.
        
             | zenburnmyface wrote:
             | meh. I feel this is just a linguistic shortcut, similar to
             | how _trained_ biologists can talk about a species or
             | organism evolving some trait. Of course the organism isn't
             | _really_ evolving with any goal in mind, but that's clear
             | to the speaker and audience. Whether or not LLMs understand
             | (very unlikely), it's clear what we mean by an LLM
             | "understanding": has the context + prior training to make
             | reasonable predictions. But no one wants to write that each
             | time.
        
               | yusina wrote:
               | That's an interesting take and in fact one I could get
               | behind.
               | 
               | But I'm afraid that most folks using the term mean it
               | more literally than you describe.
        
               | philistine wrote:
               | Exactly. The whole point of all the LLM companies is to
               | get grandma to use it. If you say _understand_ about a
               | technology with the desired appeal of Facebook, then
               | you're talking to everyone and words matter extra hard.
        
             | motorest wrote:
             | > LLMs don't understand. It's mind-boggling to me that
             | large parts of the tech industry think that.
             | 
             | I think you might be tied to a definition of
             | "understanding" that doesn't really apply.
             | 
             | If you prompt an LLM with ambiguous instructions, it
             | requests you to clarify (i.e., extend the prompt to provide
             | more context) and once you do the LLM outputs something
             | that exactly meets the goals of the initial prompt, does it
             | count as understanding?
             | 
             | If it walks like a duck and quacks like a duck, it's a
             | duck, or something so close to a duck that we'd be better
             | off calling it that.
        
               | acchow wrote:
               | > If you prompt a LLM with ambiguous instructions, it
               | requests you to clarify (i.e., extend prompt to provide
               | more context)
               | 
               | It does not understand that it needs clarification. This
               | behavior is a replicated pattern.
        
               | AlecSchueler wrote:
               | What is the difference? What would actual understanding
               | look like?
        
               | accCer wrote:
               | It depends on which human feedback was used to train the
               | model. For humans, there are various communication models
               | like the four-sides model. If the dataset has annotations
               | for the specific facets of the communication model, then
               | an LLM trained on this dataset will have specific
               | probabilities that replicate that communication model.
               | You may call that "understanding" what the prompter says,
               | but to me it's just replication.
        
               | bandrami wrote:
               | The difference comes when it receives novel input
        
               | Spivak wrote:
               | So you have two prompts, one is ambiguous and the second
               | is the same prompt but with the ambiguity resolved.
               | 
               | In the first prompt the replicated pattern is to ask for
               | clarification, in the second prompt the replicated
               | pattern is to perform the work. The machine might
               | understand nothing but does it matter when it responds
               | appropriately to the different cases?
               | 
               | I don't really care whether it understands anything at
               | all, I care that the machine behaves as though it did
               | have understanding.
        
             | vasco wrote:
             | Asking a short question but in a serious way: so what?
        
               | yusina wrote:
               | You are asking why it is meaningful to use terms for what
               | they mean instead of making up things?
               | 
               | Well, I prefer it that way, but the spirit of "AI" seems
               | to go in another direction, and the leadership of US
               | government also does, so maybe times are just changing.
        
             | hayst4ck wrote:
             | Nearly every argument like this has the same fatal flaw,
             | and it's generally not the critique of the AI, but the
             | critique reflected back on to humans.
             | 
             |  _Humans also don't understand_ and are frequently faking
             | understanding, which for many tasks is good enough. There
             | are fundamental limits to what _humans_ can do.
             | 
             | The AI of a few months ago before OpenAI's sycophancy was
             | quite impressive, less so now which means it is being
             | artificially stunted so more can be charged later. It means
             | privately it is much better than what is public. I can't
             | say it "understands," but I can say it outclasses many many
             | humans. There are already numbers of tasks based around
             | understanding where I would already choose an LLM over a
             | human.
             | 
             | It's worth looking at Bloom's taxonomy
             | (https://en.wikipedia.org/wiki/Bloom%27s_taxonomy): _In the
             | 2001 revised edition of Bloom's taxonomy, the levels were
             | renamed and reordered: Remember, Understand, Apply,
             | Analyze, Evaluate, and Create._ In my opinion it is at
             | least human-competitive for everything but Create.
             | 
             | I used to be very bearish on AI, but if you haven't had a
             | "wow" moment when using one, then I don't think you've
             | tried to explore what it can do or tested it's limits with
             | your own special expertise/domain knowledge, or if you have
             | then I'm not sure we're using the same LLMs. Then compare
             | that experience to normal people, not your peer groups.
             | Compare an LLM to people into astrology, crystal healing,
             | or homeopathy and ask which has more "understanding."
        
               | yusina wrote:
               | Um, moving the goal post?
               | 
               | The claim was LLMs understand things.
               | 
               | The counter was, nope, they don't. They can fake it well
               | though.
               | 
               | Your argument now is, well humans also often fake it.
               | Kinda implying that it means it's ok to claim that LLMs
               | have understanding?
               | 
               | They may outclass people in a bunch of things. That's
               | great! My pocket calculator 20 years ago also did, and it's
               | also great. Neither understands what they are doing
               | though.
        
               | neom wrote:
               | It's fun to talk about, but personally I think the whole
               | "understanding" debate is a red herring; imo what we
               | actually care about when we talk about intelligence is
               | the capacity and depth of second-order thinking,
               | regardless of the underlying mechanism. Personally I
               | think the key question isn't "do LLMs understand?" but
               | "can LLMs engage in second-order thinking?" The answer
               | seems to be yes - they can reason about reasoning, plan
               | their approaches, critique their own outputs, and adapt
               | their strategies. o1 has shown us that with RL and
               | reasoning tokens you can include that in a single system,
               | but our brains have multiple systems we can control and
               | that can be combined in multiple ways at any given
               | moment: emotions, feelings, and thoughts combined into
               | user space, plus 3 core systems: input, memory, output.
               | The nuance is that, for various reasons (nature +
               | nurture), different humans appear to have varying levels
               | of meta control over the multiple reasoning systems.
        
               | perching_aix wrote:
               | Why are you pretending to be participating in a debate?
               | You mention things like "moving the goalpost",
               | "counter[arguments]", and "arguments", as if you did
               | anything more than just assert your opinion in the first
               | place.
               | 
               | This is what you wrote:
               | 
               | > LLMs don't understand.
               | 
               | That's it. An assertion of opinion with nothing else
               | included. I understand it sucks when people feel
               | otherwise, but that's just kinda how this goes. And
               | before you bring up how there were more sentences in your
               | comment, I'd say they are squarely irrelevant, but sure,
               | let's review those too:
               | 
               | > It's mind-boggling to me that large parts of the tech
               | industry think that.
               | 
               | This is just a personal reporting of your own feelings.
               | Zero argumentational value.
               | 
               | > Don't ascribe to them what they don't have.
               | 
               | A call for action, combined with the same assertion of
               | opinion as before, just rehashed. Again, zero
               | argumentational value.
               | 
               | > They are fantastic at faking understanding.
               | 
               | Opinion, loaded with the previous assertion of opinion.
               | No value add.
               | 
               | > Don't get me wrong, for many tasks, that's good enough.
               | 
               | More opinion. Still no arguments or verifiable facts
               | presented or referenced. Also a call for action.
               | 
               | > But there is a fundamental limit to what all this can
               | do.
               | 
               | Opinion, and a vague one at that. Still nothing.
               | 
               | > Don't get fooled into believing there isn't.
               | 
               | Call for action + assertion of opinion again. Nope, still
               | nothing.
               | 
               | It's basically the kind of comment I wish I could just
               | have filtered out before it ever reached me. Zero
               | substance, maximum emotion. This is no way to discuss
               | anything, let alone something you or others likely feel
               | strongly about.
        
               | roryirvine wrote:
               | I do agree with you - but the big difference is that
               | humans-who-are-faking-it tend to learn as they go so
               | might, with a bit of effort, be expected to understand
               | eventually.
               | 
               | Does that actually matter? Probably not for many everyday
               | tasks...
        
               | squidbeak wrote:
               | Excellently put.
        
             | bobxmax wrote:
             | How do you know?
        
               | yusina wrote:
                | Extraordinary claims require extraordinary proof. I
                | don't know, but I'm also not the one claiming
                | something.
               | 
               | (Besides, we _know_ what LLMs do, and none of those
               | things indicate understanding. Just statistics.)
        
               | shawabawa3 wrote:
               | You can create a new game with new rules never seen
               | before
               | 
               | You can explain this to an LLM
               | 
               | The LLM can then play the game following the rules
               | 
               | How can you say it hasn't understood the game?
        
               | yusina wrote:
               | The LLM is only capable of doing so if it has encountered
               | something similar before as part of its training.
               | 
               | Claiming anything else requires a proof.
        
               | neom wrote:
                | The extraordinary claim would be that LLMs can only do
                | things they've seen before exactly, given the
                | compositional and emergent capabilities we observe. The
                | evidence suggests they can generalize beyond their
                | training in meaningful ways, even if imperfectly... If
                | a human came out living but with a brain that had zero
                | electrical activity, that would be extraordinary; we
                | normally come out with a baseline of pre-programming. I
                | sometimes think this debate happens because humans
                | don't want to admit we're nothing more than LLMs
                | programmed by nature and nurture; humans seem to want
                | to be especially special.
               | 
               | https://arxiv.org/abs/2206.07682
               | 
               | https://towardsdatascience.com/enhanced-large-language-
               | model...
               | 
               | https://arxiv.org/abs/2308.00304
               | 
               | (and if MoRA is moving the goal posts, fine: RL/RT)
        
               | yusina wrote:
               | >if a human came out living but with a brain that had
               | zero electrical activity, that would be extraordinary, we
               | normally come out with a baseline of pre-programming.
               | 
               | That statement reveals deep deficiencies in your
               | understanding of biological neural networks. "electrical
               | activity" is very different from "pre-programming".
               | Synapses fire all the time, no matter if meaningfully
               | pre-programmed or not. In fact, electrical activity
               | decreases over time in a human brain. So, if anything,
               | programming over time reduces electrical activity (though
               | there is no established causal link).
               | 
               | > I sometimes think this debate happens because humans
               | don't want to admit we're nothing more than LLMs
               | programmed by nature and nurture, human seem to want to
               | be especially special.
               | 
               | It's not specific to humans. But indeed, we don't fully
               | understand how brains of humans, apes, pigs, cats and
               | other animals really work. We have some idea of synapses,
                | but there is still a lot unclear. It's like thinking
                | that just because an internal combustion engine is made
                | of atoms, and we mostly know how atomic physics and
                | chemistry work, anybody with this basic knowledge of
                | atomic physics can understand and even build an ICE.
                | Good luck trying.
               | It's similar with a brain. Yes, synapses play a role. But
               | that doesn't mean a brain is "nothing more than an LLM".
        
               | neom wrote:
                | Neural activity begins around 6 weeks' gestation;
                | electrical patterns help establish basic neural
                | circuits; activity-dependent neural development shapes
                | connectivity before any sensory input; and there are
                | critical periods where electrical activity literally
                | sculpts brain architecture. Motor patterns get
                | programmed before birth (why babies can suck, grasp,
                | etc.), language-processing areas develop structural
                | biases before hearing language, visual cortex develops
                | orientation maps before seeing anything, and basic
                | learning algorithms get "wired in" through
                | developmental processes. If a human emerged, was able
                | to function in the world and do things, but had zero
                | electrical activity in the brain, would that be
                | normal? No: extraordinary.
               | 
                | Humans arrive out of the womb with innate neural
                | architectures to be filled and developed - not literal
                | blank slates; there is an OS. The electrical activity
               | during development is literally the biological process
               | that creates our "base programming." LLMs have
               | architectural inductive biases (attention mechanisms,
               | etc.), human brains have evolved architectural biases
               | established through fetal development. We're both "pre-
               | programmed" systems, just through different mechanisms.
               | 
                | Your response about "electrical activity decreases over
                | time" is irrelevant - the point wasn't about adult
                | brain activity, it was about the developmental process
                | that creates our initial neural architecture.
               | 
               | tbh: I can't tell if you're engaging in good faith or
               | not.
        
             | GaggiX wrote:
             | They do understand, though. It's different from how it's
             | done in our brains, but they solve tasks that would be
             | impossible to do without understanding. I would even say
             | that they can now reason through problems, thanks to
             | powerful reasoning models like Gemini 2.5 Pro and o3.
        
             | GoatInGrey wrote:
             | I don't believe the user meant "understand" in the
             | classical biological and philosophical sense, or were
             | otherwise attempting to anthropomorphize the systems. They
             | were speaking from the practical experience of "this thing
             | takes a somewhat ambiguous input with unique constraints
             | and implements the ask more-or-less as intended".
        
             | squidbeak wrote:
             | They understand. Anything able to reason about any
             | arbitrary request and form a plan tailored to that request
             | understands well enough to qualify for the verb. The
             | mechanism behind it may feel hollow or fake. But if its
             | responses reliably show understanding, the LLM understands
             | - by any ordinary measure.
        
             | holoduke wrote:
             | The definition of understanding is based on connecting
             | relations. If there is one thing an LLM can do, it's
             | connecting relations. So I am not sure why you say LLMs
             | don't understand.
        
           | immibis wrote:
           | Does this experiment do anything useful or does it just soak
           | up investor money? Not that there's anything wrong with the
           | latter.
        
             | neom wrote:
              | The only investor is me. I built it on my own over a
              | weekend. I just wanted to confirm it can be done and
              | therefore will exist, that is all. Personally, I decided
              | not to pursue it because I am old and lazy and don't want
              | to compete against a16z- and Sequoia-funded,
              | Adderall-filled teenagers.
        
               | immibis wrote:
               | I meant the one that investors are paying for.
        
           | acchow wrote:
           | > The whole system has the ability to do around 275 base
           | processes
           | 
           | It's incredibly easy to get LLMs to do a lot of _stuff_ that
           | seems convincing.
           | 
           | They are literally trained for plausibility.
        
           | robbomacrae wrote:
            | Is anyone else annoyed that VCs are out there sharing decks
           | of startups in stealth with potential competitors? How often
           | does this happen?
        
             | eterm wrote:
             | I would be annoyed along with you if I thought the post was
             | true.
        
               | IncreasePosts wrote:
               | It's not a lie, it is just vibe posting
        
           | iammrpayments wrote:
            | Sounds really interesting, but I have no idea what the
            | purpose of having 175 "employees" is here. Maybe it is a
            | smart way to sell the idea that you're going to replace 175
            | people if you buy the product? You could just buy ChatGPT
            | instead, I guess, but a chatbot doesn't sound as cool as
            | 175 employees.
        
             | neom wrote:
             | I would love to know how to do it another way if you have
             | any ideas, I'm sadly not experienced or intelligent enough
             | to think of another way to do it.
        
           | jprokay13 wrote:
            | I've been floating around a similar set of ideas, and it's
            | been very fun (if not all that useful yet) to build. Did
            | you try taking it one step further, where a "recruiter" has
            | to hire the engineers after a screening process? I wonder
            | if this could get you even better AI engineers.
        
           | catlifeonmars wrote:
           | This really sounds like a "faster horse" scenario and totally
            | misses the point of the GP's comment: why shackle yourself
            | to
           | modeling the way humans work?
        
           | mucha wrote:
           | Cool. What goods/services does your business provide to
           | customers?
        
         | londons_explore wrote:
         | > forking itself into as many sub-agents as are needed to
         | fulfill the task
         | 
         | The forking is free. Running the sub-agents is linear cost, but
         | the expensive bit is joining the agents responses back together
         | again.
         | 
          | If a task has 6 subtasks and an agent is spawned for each, at
          | some point some 'joiner' agent needs to parse and summarize
          | the findings of the sub-agents and feed them back to the
          | parent. That step necessarily involves information loss, and
          | it uses extra computation that a single linear agent design
          | would not need.
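          | 
          | As a rough illustration of that fan-out/join shape, here is a
          | minimal Python asyncio sketch; run_agent is a hypothetical
          | stand-in for a real LLM call, not any specific API:
          | 
          |     import asyncio
          | 
          |     async def run_agent(prompt: str) -> str:
          |         # Hypothetical LLM call, stubbed out for illustration.
          |         await asyncio.sleep(0)
          |         return f"result for: {prompt}"
          | 
          |     async def fan_out_join(task: str, subtasks: list) -> str:
          |         # Forking is cheap: spawn one sub-agent per subtask.
          |         results = await asyncio.gather(
          |             *(run_agent(s) for s in subtasks))
          |         # The join is the costly, lossy step: N outputs must
          |         # be compressed back into one context for the parent.
          |         joined = "\n".join(results)
          |         return await run_agent(
          |             f"Summarize for task '{task}':\n{joined}")
          | 
          |     # asyncio.run(fan_out_join("write report",
          |     #     ["research", "outline", "draft"]))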
        
           | neom wrote:
           | I designed something for a business and found I needed 4
           | major sub-systems (like a real business) - insight/data,
           | cognition, meta cognition and execution, and if you don't
           | define all 4, the system is junk.
        
             | motorest wrote:
             | > I designed something for a business and found I needed 4
             | major sub-systems (like a real business) - insight/data,
             | cognition, meta cognition and execution, and if you don't
             | define all 4, the system is junk.
             | 
             | Might it be just another realization of Conway's law?
             | 
             | https://en.wikipedia.org/wiki/Conway%27s_law
             | 
             | Might it be possible that the only reason you're assuming a
             | system is junk is just that it doesn't resemble the systems
             | you know and expect? There are so many ways to skin a cat,
             | and certainly no business process represents the optimal
             | process.
        
         | viraptor wrote:
         | > They set up a limited number of agents to run in parallel
         | (often just one),
         | 
          | Most of what people use agents for daily can often be one-
          | shotted, though, and even collating/rating 10 results would
          | be costly.
         | 
         | If I had a harness for evaluating the results and VC level
         | money, I'd be throwing an army at well defined experimental
         | tasks as well.
        
         | yusina wrote:
         | > You effectively have an infinite number of agents
         | 
         | You don't.
         | 
         | Sincerely, Your Electricity Bill
        
         | TimPC wrote:
          | The challenge with fan-out is constructing a linear
          | conversation that makes sense and captures the previous
          | history. In any context where the LLM needs that information,
          | linear loops often perform better than trying to splice
          | together conversations from multiple parallel processes.
        
         | kposehn wrote:
         | This is similar to something we've been doing for a while.
         | Instead of individual agents we are creating many iterations
         | and sub-iterations of spawned agents that are largely
         | autonomous. A lot of the human-centric paradigms just don't
         | really apply to LLMs/AI but people are used to approaching them
         | that way.
        
       | bgwalter wrote:
       | > They are performing close to or in some cases even beating the
       | standard expert-optimized production kernels shipped in PyTorch.
       | 
        | The PyTorch code base is NOT written by performance experts in
        | any way. This is the wrong baseline. Nothing about that code
        | base is clean or hand-optimized.
        | 
        | The "AI" generation methodology seems to give many instructions
        | and even descends into instruction trees, manually throwing
        | away results, etc. So it requires, as usual, extreme guidance.
        
       | poltomo wrote:
        | Beating PyTorch and TensorFlow kernels has been easy to do with
        | ML compilers since ~2018. You typically train and evaluate your
        | model in one of these frameworks, then hand off the computation
        | graph to a compiler like Apache TVM or your hardware vendor's
        | proprietary one. They should test their kernels against those
        | kernels.
        | 
        | ML-guided heuristic search over compute schedules is as old as
        | 2013 (Halide, for image processing).
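        | 
        | As a rough sketch of that hand-off (not the article's method),
        | the flow with TVM's Relay frontend looks roughly like the
        | following; the toy model, input name/shape, and "llvm" target
        | are illustrative assumptions, and API details vary across TVM
        | versions:
        | 
        |     import torch
        |     import tvm
        |     from tvm import relay
        |     from tvm.contrib import graph_executor
        | 
        |     # Trace a toy PyTorch model so TVM can import its graph.
        |     model = torch.nn.Sequential(
        |         torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
        |     example = torch.randn(1, 64)
        |     scripted = torch.jit.trace(model, example)
        | 
        |     # Convert to Relay IR and compile with optimizations on.
        |     mod, params = relay.frontend.from_pytorch(
        |         scripted, [("input0", (1, 64))])
        |     with tvm.transform.PassContext(opt_level=3):
        |         lib = relay.build(mod, target="llvm", params=params)
        | 
        |     # Run the TVM-compiled kernels instead of PyTorch's own.
        |     dev = tvm.device("llvm", 0)
        |     rt = graph_executor.GraphModule(lib["default"](dev))
        |     rt.set_input("input0", example.numpy())
        |     rt.run()
        |     out = rt.get_output(0).numpy()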
        
       ___________________________________________________________________
       (page generated 2025-05-31 23:00 UTC)