[HN Gopher] FP8 is ~100 tflops faster when the kernel name has "...
       ___________________________________________________________________
        
       FP8 is ~100 tflops faster when the kernel name has "cutlass" in it
        
       Author : limoce
       Score  : 220 points
       Date   : 2025-07-11 10:36 UTC (12 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | KomoD wrote:
       | actual link: https://github.com/triton-lang/triton/pull/7298
        
         | bede wrote:
         | Thank you, perhaps the parent can be edited to use this URL
         | instead
        
       | orlp wrote:
       | GenuineIntel moment.
        
         | hofrogs wrote:
         | I'm interested in that story, what are you referring to with
         | "GenuineIntel"?
        
           | orlp wrote:
           | Intel's C++ compiler is known to add branches in its
           | generated code checking if the CPU is "GenuineIntel" and if
           | not use a worse routine:
           | https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler#Support....
        
             | pieterbreed wrote:
             | Is this for the runtime of the compiled code or for the
             | compiling machine? Do they generate slow code if the
             | compiler is running on non-intel?
        
               | SSLy wrote:
               | the runtime. patching cpuid makes the code go faster
        
               | kstrauser wrote:
               | For the compiled code. Its output deliberately runs
               | slower on non-Intel CPUs.
        
               | Uvix wrote:
               | Runtime of the compiled code. The ostensible intent is so
               | that new processors can use new features like SIMD, while
               | offering a fallback for older ones. In practice, they're
               | detecting an Intel processor, not just the specific
               | feature.
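               | 
               | A minimal sketch of that distinction, with hypothetical
               | function names and a made-up feature set (this is not
               | Intel's actual dispatcher): one picker gates the fast
               | path on the vendor string, the other only on the
               | feature being present.

```python
# Hypothetical dispatchers: names and feature flags are illustrative only.
def pick_by_vendor(vendor: str, features: set[str]) -> str:
    """Vendor-gated: the fast path also requires an Intel vendor string."""
    if vendor == "GenuineIntel" and "avx2" in features:
        return "avx2"
    return "baseline"


def pick_by_feature(vendor: str, features: set[str]) -> str:
    """Feature-gated: any CPU that advertises AVX2 gets the fast path."""
    if "avx2" in features:
        return "avx2"
    return "baseline"


amd = ("AuthenticAMD", {"sse2", "avx2"})
print(pick_by_vendor(*amd))   # baseline
print(pick_by_feature(*amd))  # avx2
```

               | The same AVX2-capable non-Intel CPU lands on the slow
               | routine in the first picker and the fast one in the
               | second.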
        
             | danieldk wrote:
             | Also MKL:
             | 
             | https://danieldk.eu/Intel-MKL-on-AMD-Zen
        
             | bayindirh wrote:
             | Even in the middle of that turmoil, we managed to compile
             | some code with Intel's ICC and make it go faster on AMD
             | Opterons, breaking Intel's own numbers.
             | 
             | When my colleague said that they managed to go faster than
             | intel with icc with some hand tuned parameters, I remember
             | answering "youdidwat?".
             | 
             | Good times.
        
         | reitzensteinm wrote:
         | Or maybe Quack III: Arena. https://m.slashdot.org/story/21054
        
           | iforgotpassword wrote:
           | I think that was the first case (to go public), but I
           | remember reading about this in game magazines a couple times
           | after this, for both ATI and nvidia.
        
           | 42lux wrote:
           | Now I want a Quake shooter but with ducks.
        
             | carlos22 wrote:
             | Not ducks, but chickens, was very popular in Germany back
             | in the day: https://en.wikipedia.org/wiki/Crazy_Chicken
        
               | avhception wrote:
               | Oh wow, that was a blast from the past. The Moorhuhn
               | craze!
               | 
               | Many people, including me, didn't have an internet
               | connection back in the day. The Sneakernet went into
               | overdrive to get everyone a copy!
        
             | supportengineer wrote:
             | A Duck Hunt, if you will...
        
           | dahauns wrote:
           | Aah, that brings back memories...
           | 
           | Interestingly, most benchmark controversies back in the day
           | are now expected behaviour, i.e. game-specific optimizations
           | with no (well, in this age of upscalers and other lossy
           | optimization techniques, probably even somewhat) visible
           | image degradation. A gaming-specific driver with no game-
           | specific improvements in its changelog would be considered
           | strange, and it very much works with executable detection.
           | 
           | Back in the day, there was still the argument that drivers
           | should not optimize for benchmarks even when visually
           | identical, because it wouldn't show the hardware's real world
           | potential. Kinda cute from today's perspective. :)
           | 
           | But of course there were the obvious cases...
           | 
           | The Quack3 lowering filtering quality as shown above, of
           | course (at least that one was put into the driver as a
           | togglable setting later on).
           | 
           | But the most cheeky one has to be nVidia's 3dmark03
           | "optimizations", where they blatantly put static clip planes
           | into the scenes so that everything outside the predefined
           | camera path from the benchmark sequence would simply be cut
           | from the scene early (which e.g. fully broke the freelook
           | patched into 3dmark and would generally break any interactive
           | application)
        
             | bayindirh wrote:
             | You beat me to it. _Grrr..._
             | 
             | Just kidding, nice to see another person who remembers
             | these things. Want some root beer?
        
           | bayindirh wrote:
           | Ooh, I remember this, but actually the thing is older than
           | it.
           | 
           | First, nVidia and ATI used executable names for detecting
           | games, then they started to add heuristics.
           | 
           | If you think they stopped the practice, you're _very_
           | mistaken. Every AMD and nVidia driver has game and app
           | specific _fixes and optimizations_.
           | 
           | nVidia cheated in 3D Mark that way, so they patched/changed
           | their benchmark to prevent it. Also, again nVidia, patched
           | their drivers so some of the more expensive but visually
           | invisible calls like scene flushes in a particular game are
           | batched (e.g. do all 50 flushes at the 50th call) to prevent
           | the game becoming a slide show on expensive hardware.
           | 
           | This is also why AMD's and Intel's open source drivers under
           | Linux are a success, because they are _vanilla_ drivers written
           | from scratch per spec, and if your code calls OpenGL /Vulkan
           | to spec, then you're golden.
           | 
           | Even some companies cross compile AMD's Linux drivers for
           | windows on embedded systems since they're free from useless
           | optimizations from them.
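           | 
           | The flush batching mentioned above could look roughly like
           | this. It is a toy sketch: the class name and the batch size
           | of 50 are assumptions taken from the example, not real
           | driver code.

```python
class BatchingDriver:
    """Toy driver shim: performs one real flush per `batch_size` requests."""

    def __init__(self, batch_size: int = 50):
        self.batch_size = batch_size
        self.pending = 0        # flush requests seen since the last real flush
        self.real_flushes = 0   # flushes actually performed

    def flush(self) -> None:
        self.pending += 1
        if self.pending >= self.batch_size:
            self.real_flushes += 1
            self.pending = 0


driver = BatchingDriver(batch_size=50)
for _ in range(120):
    driver.flush()
print(driver.real_flushes, driver.pending)  # 2 20
```

           | The game still issues 120 flush calls, but only two of them
           | cost anything.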
        
       | koakuma-chan wrote:
       | is 100 tflops a lot?
        
         | brightmood wrote:
         | yea
        
         | saagarjha wrote:
         | It's like 5-10% here
        
           | irrelative wrote:
           | Correct, this is the actual headline too. 100 tflops sure
           | seems like it'd be more than that, but here we are.
           | 
           | If the headline was "FP8 is ~7% faster when kernel name has
           | 'cutlass' in it...", it wouldn't seem sensational.
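           | 
           | For scale, the arithmetic behind those percentages, as a
           | small sketch. The helper name is made up; the only inputs
           | are the ~100 tflops gain and the 5-10% range quoted in this
           | thread.

```python
def implied_baseline(gain_tflops: float, fraction: float) -> float:
    """Baseline throughput for which `gain_tflops` is `fraction` of the total."""
    return gain_tflops / fraction


for fraction in (0.05, 0.07, 0.10):
    baseline = implied_baseline(100, fraction)
    print(f"{fraction:.0%} -> roughly {baseline:.0f} tflops baseline")
```

           | So a 100 tflops gain at 5-10% implies a baseline somewhere
           | between about 1000 and 2000 tflops.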
        
             | saagarjha wrote:
             | I think the interesting part is that it improves
             | performance measurably at all, not the actual number. These
             | people are trying to hit 90+% MFU (though most don't reach
             | it) so this does actually translate to many millions of
             | dollars for them.
        
         | progx wrote:
         | 5060 ti +~15%
        
         | HideousKojima wrote:
         | According to Terminator 3 Skynet used a mere 60 TFLOPS
        
           | IAmBroom wrote:
           | How much is that in jiggawatts per parsec?
        
       | nolok wrote:
       | Intel's quest to move from "trusted by default / the reference"
       | to "check for scam" is getting worse every release. And it's 100%
       | self inflicted. How weird.
        
         | pkhuong wrote:
         | NVIDIA-inflicted in this case.
        
         | aleph_minus_one wrote:
         | In my understanding of the PR, it rather seems that
         | _NVidia_ is the company that is cheating. :-)
        
       | hvenev wrote:
       | In `libnvidia-nvvm.so` the string `cutlass` appears right after
       | `Memory Dependence Analysis` and `memdep`. Perhaps it acts as an
       | optimization attribute of some sort, where the compiler is
       | allowed to make assumptions about the kernel's behavior that are
       | not valid in general?
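       | 
       | A rough way to reproduce that kind of check, sketched in Python
       | and similar in spirit to running `strings` on the library. The
       | helper names are made up, and the byte blob is a synthetic
       | stand-in, not the real contents of `libnvidia-nvvm.so`.

```python
import re


def printable_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII in file order, like `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]


def neighbours(strings: list[str], needle: str, radius: int = 2) -> list[str]:
    """Return the strings around the first one that contains `needle`."""
    for i, s in enumerate(strings):
        if needle in s:
            return strings[max(0, i - radius): i + radius + 1]
    return []


# Synthetic stand-in for a library's string table, not real file contents.
blob = b"\x00Memory Dependence Analysis\x00memdep\x00cutlass\x00\x01\x02unrelated\x00"
print(neighbours(printable_strings(blob), "cutlass"))
```

       | For a real file, the blob would come from reading the library
       | in binary mode. Adjacency in the string table is only a hint
       | about which pass uses the string, not proof.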
        
         | high_na_euv wrote:
         | That's very likely imo
        
         | jdright wrote:
         | yes, that is a very usual way (known practices) of vendors
         | applying specific optimizations for known things.
         | 
         | It is also part of the benchmarks game they play against each
         | other.
        
           | MichaelZuo wrote:
           | It's really strange for established companies to waste their
           | credibility on games like that...
        
             | MangoToupe wrote:
             | I was pretty young at the time, but I recall the market for
             | graphics being a _lot_ wider open at the time Quake was
             | released. Remember 3dfx? They produced the Voodoo series of
             | graphics cards. They're barely a distant memory now.
             | 
             | Quake was also _the_ standard for a game that was willing
             | to fully exploit the hardware of the time.
        
             | IAmBroom wrote:
             | Never underestimate how much human ego will control
             | actions.
        
           | MBCook wrote:
           | The link is long dead and the Wayback machine doesn't have a
           | copy.
           | 
           | But in 2001 ATI was caught applying optimizations to Quake 3
           | when someone realized if you renamed the executable from
           | "quake" to "quack" the score dropped a ton. It was a big
           | scandal.
           | 
           | I know that's common now but that wasn't a thing that was
           | done at the time.
        
             | hinkley wrote:
             | There are bugs that certain games rely on and features that
             | some don't use. I'm currently trying to optimize a library
             | out of spite. (I want it to work better than the competitor
             | that caused me a lot of problems on a recent project). The
             | amount of conditional logic around what is essentially a
             | function to increment a value is breathtaking.
        
               | gatlin wrote:
               | Do you have any kind of example you're able to share? I
               | don't mean to take your IP but I want to see this
               | breathtaking vista.
        
               | hinkley wrote:
               | To avoid doxxing myself: In a deep call stack it's
               | possible to end up sanitizing inputs multiple times and
               | in different ways.
               | 
               | A frequent example I've encountered is web frameworks
               | that have to keep checking for escaped text because they
               | didn't write it in horizontal layers where you know for
               | sure that all inputs have been scrubbed when they reach
               | this function but not that one. So the same functions get
               | called with data that comes from your team and from
               | customers. Reuse is tricky.
        
               | hamburglar wrote:
               | "Checking for escaped text" is the sort of nonsense that
               | tells you you're dealing with amateur developers.
        
               | amiga386 wrote:
               | A simple example would be that the function
               | glGetString(GL_EXTENSIONS) crashes the original Quake
               | engine and many licensees, because it's expecting no more
               | than a 256 character string.
               | 
               | The driver looks to see if a known old game is calling
               | it, and if it's one known to crash, it returns no more
               | than 256 characters, and likely also puts all the _old_
               | extensions that the game is likely to know and react to
               | in the string.
               | 
               | There are also all sorts of games that called APIs in a
               | particular order or set particular options, because they
               | represented a "fast path" at the time, and now they
               | don't, but if you're that program, then yes they do.
               | 
               | Ultimately, this clutter is what led to the development
               | of the Vulkan API, to stop games second-guessing graphics
               | APIs which themselves second-guess the games.
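               | 
               | A sketch of the extension-string workaround described
               | above. Every name here is hypothetical (the app list,
               | the helper, the GL_EXAMPLE_* extensions), and whether
               | the 256 limit includes a terminator is not modelled.

```python
# Hypothetical names throughout: this is not real driver code.
LEGACY_APPS = {"quake.exe", "glquake.exe"}
LEGACY_LIMIT = 256  # longest extension string the old engines tolerate


def extension_string(exe_name, extensions, legacy_first):
    """Return the space-separated extension list a driver might report."""
    if exe_name.lower() not in LEGACY_APPS:
        return " ".join(extensions)
    # Known old game: list the old extensions first, then keep adding
    # whole names only while the result stays within the limit.
    ordered = legacy_first + [e for e in extensions if e not in legacy_first]
    kept, length = [], 0
    for ext in ordered:
        extra = len(ext) + (1 if kept else 0)  # +1 for the separating space
        if length + extra > LEGACY_LIMIT:
            break
        kept.append(ext)
        length += extra
    return " ".join(kept)


modern = [f"GL_EXAMPLE_extension_{i}" for i in range(40)]
old = ["GL_ARB_multitexture", "GL_EXT_texture_env_add"]
everything = modern + old

print(len(extension_string("game2025.exe", everything, old)))  # full list
print(len(extension_string("quake.exe", everything, old)))     # at most 256
```

               | Unknown programs get the full list; a recognised old
               | one gets a short list with the extensions it is likely
               | to understand at the front.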
        
             | IAmBroom wrote:
             | In at least one past version of Windows (circa 1990s), if
             | you tried to replace the default web browser of IE with
             | another choice you were given an Open File dialog window to
             | choose the executable.
             | 
             | Funny quirk, though: that particular window wouldn't show
             | files named firefox.exe. It would accept that as typed
             | input, if you were at the correct folder, but the file
             | listing omitted that particular file.
             | 
             | Maybe it was mozilla.exe; it was a long time ago. But that
             | was the discovery that pushed me off IE forever.
        
               | lstamour wrote:
               | I vaguely remember that being the start of the browser
               | prompts to set your current browser as the default. It
               | was so hard to just configure that they had to build a
               | way to set it within the browser.
               | 
               | You saw that again in more modern times when Microsoft
               | removed support for the APIs they provided to set browser
               | defaults, forcing browser makers to write step by step
               | instructions on what to click to set the default browser.
               | 
               | I believe they walked that back, but it left such a bad
               | taste that I switched my installation of Windows from
               | default mode to EU mode in order to avoid it. And come to
               | think of it, I haven't used my windows machine for much
               | outside of AI in about 6 months.
               | 
               | But Microsoft is not alone in these sort of defaults
               | games - every OS or browser maker, Apple, Google,
               | Firefox, wants to create moats so they can more easily
               | monetize your usage of a product. I never thought I'd
               | prefer the business model of free to play games, where
               | they just outright ask you for money and have to keep
               | finding new ways to entertain instead of relying on hard
               | to change defaults and selling your data.
        
               | charcircuit wrote:
               | An app being able to set itself as the default browser
               | sounds like such a dangerous API, especially if it can be
               | done silently without the user realizing it.
        
             | atomicnumber3 wrote:
             | Was it a scandal at the time? My understanding of how per-
             | game card-driver optimizations work today is:
             | 
             | 1. AAAA Game Studio shits out another unoptimized clunker
             | 
             | 2. nvidia considers it a reputational risk if games run at
             | 30 FPS on a 5090
             | 
             | 3. They go in, look at the perverse ways the game misuses
             | rendering primitives, and then hacks shit in to make
             | whatever bad things they're doing less bad.
             | 
             | As a gamer, this seems fine to me and i generally blame the
             | AAAA devs for being bad at their jobs or AAAA studio leads
             | for being ok shipping unoptimized messes.
        
               | btbuilder wrote:
               | I believe the driver silently swapped the textures to
               | lower quality ones that looked worse but gave a
               | performance boost.
        
               | antonvs wrote:
               | > As a gamer, this seems fine to me
               | 
               | As a software developer, it almost certainly has a bad
               | effect on the ecosystem long term. "Hacks shit in" is the
               | very definition of technical debt, and that has a cost
               | that someone, somewhere is going to have to pay in some
               | form.
        
               | monkpit wrote:
               | You're looking as a dev, but the reality is that a
               | consumer cannot see technical debt. If the studio churns
               | out a game, the vendor sprinkles on some optimizations,
               | people play it and move on, then the tech debt just
               | vaporizes into the void. It's not real at that point.
        
               | wtetzner wrote:
               | Just because a consumer can't see technical debt doesn't
               | mean they aren't paying for it. Most game studios
               | continue to re-use code, so it doesn't just "vaporize"
               | into the void.
        
               | 8n4vidtmkvmk wrote:
               | I'm pretty sure I pay this debt with lost FPS and every
               | time I glitch through the floor into the nether.
        
               | RHSeeger wrote:
               | I can't reply to the person that replied to you, so
               | 
               | > You're looking as a dev, but the reality is that a
               | consumer cannot see technical debt.
               | 
               | The consumer can't _see_ technical debt, but they sure as
               | heck can be impacted by it.
               | 
               | - Technical debt means the code base is harder to work
               | with later. So fixes/enhancements take longer to make it
               | into the code (and sometimes never can)
               | 
               | - This particular type of technical debt means the code
               | by the game developers sets precedent, and the next
               | developer may use it as an example. So the amount of code
               | incorrectly using the api grows faster over time
        
               | strbean wrote:
               | For some reason HN sometimes hides the reply button on
               | leaf comments. I think this only happens for very new
               | comments.
               | 
               | You can click the timestamp ("X minutes ago") to view the
               | comment without context, and reply from there.
        
               | monocasa wrote:
               | I think it's an anti-flamewar tactic to put the brakes on
               | quick replies.
        
               | charcircuit wrote:
               | >the next developer may use it as an example
               | 
               | These hacks are game specific, so another developer
               | wouldn't get them.
        
               | lsaferite wrote:
               | The next developer at that company that uses or
               | references the crappy code for another project would
               | still have the issue, but not get the benefit of the
               | down-stream GPU vendor hacks to fix the buggy game.
        
               | RHSeeger wrote:
               | The way the API was used incorrectly "worked", and the
               | game didn't see the negative impact of it because it was
               | "fixed away". And then the incorrect usage is used again
               | on another game and doesn't get the "fixed away" benefit.
               | And the same incorrect usage could happen over and over
               | because "it works".
        
               | SideQuark wrote:
               | > technical debt, and that has a cost that someone,
               | somewhere is going to have to pay in some form
               | 
               | There is no reason anyone has to pay each and every iota
               | of technical debt. Plenty of things with technical debt
               | hit end of life and no one ever looks in that code again.
               | I suspect most technical debt goes this way - in program,
               | program never updates (or minor updates), then dies.
               | 
               | Your claim would require every piece of technical debt in
               | anything ever (code, buildings, cars, anywhere) has to be
               | removed before the thing goes end of life or goes into a
               | mode where it never is changed. That seems ludicrous to
               | me.
        
               | cyanydeez wrote:
               | Does anyone talk about how technical debt often just gets
               | thrown into the garbage so we can buy fancy new technical
               | crap, and it's what pays for most of y'all's jobs.
        
               | itsTyrion wrote:
               | it rendered in lower quality, IIRC lower textures / much
               | more aggressive mipmapping and/or LOD
        
               | gmueckl wrote:
               | Except that if a developer has that kind of market pull,
               | nVidia will gladly help those devs with getting it
               | right. They are excellent at maintaining developer
               | relations.
        
               | mcculley wrote:
               | I was surprised to see "AAAA". I didn't know there were 4
               | As now.
               | 
               | "AAAA Game Studio shits out another unoptimized clunker"
               | seems a paradoxical statement to me. I would have thought
               | "AAAA" meant "highly resourced" game company. Does it
               | just mean high revenue? Lots of players?
        
               | mwpmaybe wrote:
               | High price...
        
               | vinceguidry wrote:
               | The more money you throw at an effort, the more gets
               | flushed out as waste, and the harder it is to maintain
               | quality. Pretty universal across business.
        
               | wtetzner wrote:
               | AAA/AAAA just means "how much money was spent developing
               | the game". High cost doesn't automatically equal high
               | quality. In fact, it seems after a certain point to mean
               | the opposite.
        
               | bigfishrunning wrote:
               | AAAA isn't a real thing, it's a memey joke based on a
               | press release by a microsoft studio that was closed
               | before ever releasing a single game
        
               | toast0 wrote:
               | > Was it a scandal at the time?
               | 
               | Yes. My understanding was it was optimized by reducing
               | precision or something to a visibly apparent degree.
               | 
               | It's different if the driver changes things in ways such
               | that rendered output is the same or at least
               | imperceptibly different. I think there's also a lot more
               | communication between gpu makers and game/engine
               | developers these days; plus a lot more frequent updates.
        
               | KronisLV wrote:
               | > My understanding was it was optimized by reducing
               | precision or something to a visibly apparent degree.
               | 
               | If only we had that sort of a control over rendering for
               | every game ourselves - since projects like OptiScaler at
               | least let us claw back control over sometimes proprietary
               | upscaling and even framegen, but it's not quite enough:
               | https://github.com/optiscaler/OptiScaler
               | 
               | I'd also mention Lossless Scaling here, though it still
               | only works on upscaling and framegen and with worse
               | methods, but at least works for most games out there:
               | https://store.steampowered.com/app/993090/Lossless_Scaling/
               | 
               | I want to be able to freely toggle between different
               | types of AA and SSAO and reflections and lighting and LOD
               | systems and various shader effects (especially things
               | like chromatic aberration or motion blur) and ray tracing
               | and all that, instead of having to hope that the console
               | port that's offered to me has those abilities in the
               | graphics menu and that whoever is making the decisions
               | hasn't decided that actually "low" graphics (that would
               | at least run smoothly) would look too bad for the game's
               | brand image or something.
        
       | PLenz wrote:
       | The Volkswagen emissions testing model
        
       | rowanG077 wrote:
       | Let's hope for Nvidia this is an innocent optimization only valid
       | for internal kernels that cannot be applied in general.
        
         | jagrsw wrote:
         | In which case checking for a string inside an arbitrary name is
         | sloppy (a bug).
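         | 
         | To illustrate why, a small sketch comparing a substring test
         | with an exact allowlist. The kernel names are made up, and
         | nothing here claims to be how the NVIDIA compiler actually
         | decides.

```python
# Hypothetical kernel names; "cutlass" is the substring from the thread.
KNOWN_KERNELS = frozenset({"cutlass_fp8_gemm"})


def substring_match(kernel_name: str) -> bool:
    """Fires for any name that merely contains the substring."""
    return "cutlass" in kernel_name


def allowlist_match(kernel_name: str) -> bool:
    """Fires only for names the vendor explicitly listed."""
    return kernel_name in KNOWN_KERNELS


for name in ("cutlass_fp8_gemm", "my_attention_not_cutlass", "attention_fwd"):
    print(name, substring_match(name), allowlist_match(name))
```

         | The substring test also fires for the unrelated second name,
         | which is exactly the accidental opt-in at issue.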
        
       | high_na_euv wrote:
       | I have a little experience with compilers and LLVM, but you'd be
       | shocked how many things rely on names and parsing names
       | 
       | If you have hundreds of passes that are complex and rely on
       | various "contracts" like type names or some shit, then really
       | crazy things like this can happen unintentionally and not
       | maliciously
        
         | diggan wrote:
         | Web-developers are well aware of this too. Sincerely,
         | Mozilla/5.0 (X11; Linux x86_64; rv:139.0) Gecko/20100101
         | Firefox/139.0
        
           | bravesoul2 wrote:
           | Funny we send a browser wars tombstone in every request!
        
             | antonvs wrote:
             | Let's have a moment of silence for Gecko/20100101
        
         | halJordan wrote:
         | Why would i be shocked that a name is informative. Like... are
         | you surprised that wrought iron is wrought? Or cast iron is
         | made from a cast?
        
           | IAmBroom wrote:
           | Dog piles are often neither composed of dogs, nor actual
           | piles.
           | 
           | Names can be both informative, and misdirecting, at the same
           | time.
        
         | the8472 wrote:
         | Some names are standardized items, like memcpy. Matching those
         | is ok, nothing sneaky going on there. Matching something
         | vendor-specific in a general-purpose API is a different story.
        
       | giingyui wrote:
       | And what's the downside of using that kernel name? It can't just
       | be that it's faster and nothing else. Unless they included lots
       | of sleep(x) calls.
        
         | samus wrote:
         | There might be optimizations that are only safe for the code
         | that this was intended for.
        
           | bialpio wrote:
           | Seems like a bad idea to rely on a name for deciding this
           | then, unless it's documented somewhere that using names
           | containing certain substrings may trigger unsafe
           | optimizations...
        
       | Arch-TK wrote:
       | I wish people either learned how to use git or just wholesale
       | stopped using it.
        
       | tempaway43563 wrote:
       | So, what is Cutlass, can someone explain whether checking for
       | kernel names makes sense here or is a form of cheating?
       | 
       | https://docs.nvidia.com/cutlass/index.html
        
         | rurban wrote:
         | That's strange because the cutlass docs explicitly do NOT
         | mention fp8 support. So it looks like it can be used
         | nevertheless with fp8 by using the name hack.
        
           | mlazos wrote:
           | It supports e5m2 and e4m3 right in the doc linked.
        
         | gpm wrote:
         | Github version: https://github.com/NVIDIA/cutlass
         | 
         | I wonder if we search the comments if we can find something
         | referencing this.
        
       | zahlman wrote:
       | This tweet appears to be taking the original material out of
       | context to misrepresent it:
       | 
       | > Rewrite the attention kernel to be persistent. This gives
       | better performance at low-contexts. However, fp16 at large
       | context has suffered a bit due to a ptxas instruction scheduling
       | issue in the softmax partition. fp8 is ~100 tflops faster when
       | the kernel name has "cutlass" in it.
       | 
       | The charitable reading is that, on certain kernels, _using fp8
       | rather than fp16 values_ gives better performance. (Although I
       | can't even see how the numbers relate to a "~100 tflops faster"
       | claim in any respect, nor does it even list any kernel names or
       | suggest a control kernel!) But this is being presented as if
       | someone has uncovered evidence of cheating on benchmarks.
        
         | saagarjha wrote:
         | I think you're the one doing that to the tweet, actually.
        
           | zahlman wrote:
           | What are you talking about? When I view the tweet, the _only_
           | text I see is:
           | 
           | > > fp8 is 100 tflops faster when the kernel name has
           | "cutlass" in it
           | 
           | > kms
        
             | saagarjha wrote:
             | And it includes a link to show that this is the context it
             | came from.
        
               | zahlman wrote:
               | And when I look at the link, the part I quoted is the
               | relevant text I see.
               | 
               | In order to get to the part that you're trying to hold me
               | accountable for, I would furthermore have to click onto
               | the commits tab and search through a 93-commit PR.
               | 
               | I thought today I was using a site where trying to think
               | the best of people and propose that someone had taken
               | something out of context, based on the immediately
               | available context having a simpler explanation, would not
               | get me treated like a corporate shill (for a company I
               | don't even care about). Apparently I was wrong.
        
               | saagarjha wrote:
               | I don't think you are a corporate shill. I do think that
               | you immediately going "clearly the tweet is wrong"
               | without doing any research whatsoever was unwarranted,
               | though. You also keep bringing up that it's 93 commits,
               | but with it all getting squashed, all you have to do is
               | search for "cutlass" to find out what is going on. I think
               | you're obligated to do at least that when you call it out
               | for being wrong.
        
         | zettabomb wrote:
         | No, that sentence is separate from the rest. Take a look at the
         | pull request:
         | 
         |     # Up to 150 TFLOPS faster for fp8!
         |     if specialization.constants["dtype"] == gl.float8e5:
         |         name = "cutlass_" + name
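         | 
         | The same renaming trick as a self-contained sketch. The
         | helper, the base kernel name and the string standing in for
         | `gl.float8e5` are assumptions for illustration, not Triton's
         | API.

```python
FLOAT8E5 = "float8e5"  # plain-string stand-in for gl.float8e5


def kernel_name(base_name: str, dtype: str) -> str:
    """Prefix the kernel name with "cutlass_" only for the fp8 dtype."""
    if dtype == FLOAT8E5:
        return "cutlass_" + base_name
    return base_name


print(kernel_name("attention_fwd", "float8e5"))  # cutlass_attention_fwd
print(kernel_name("attention_fwd", "float16"))   # attention_fwd
```

         | Only the name changes; the kernel body is the same either way.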
        
           | zahlman wrote:
           | The tweet is quoting from the first message in the
           | "conversation" on the PR. There are 93 commits in the PR and
           | GitHub doesn't even default to that tab. I looked at the
           | obvious text and drew the conclusion that was obvious to me.
        
         | imtringued wrote:
         | https://github.com/triton-lang/triton/pull/7298/commits/a5e2...
         | 
         | It's literally in the code.
        
           | zahlman wrote:
           | I already had to deal with Twitter and a link shortening
           | service just to get to GitHub and then it still only pointed
           | to the facing page of a 93-commit PR.
        
       | spoaceman7777 wrote:
       | Seems this is likely due to ongoing work on FP8 support on
       | nvidia/cutlass. From my reading, the alternative code path was
       | likely added recently for testing by external contributors to the
       | cutlass project, and other involved parties. (Rather than
       | attempting to distribute custom packaged internal builds of
       | cuda.)
       | 
       | This ticket is a good starting place to see the chain of issues
       | around the ongoing work:
       | https://github.com/NVIDIA/cutlass/pull/2037
        
       ___________________________________________________________________
       (page generated 2025-07-11 23:01 UTC)